Predictive Hacks

Arrays in PySpark

In PySpark data frames, we can have columns with arrays. Let’s see an example of an array column. First, we will load the CSV file from S3.

# read the data from S3 (assumes an existing SparkSession named spark)
df = spark.read.csv("s3://my-bucket/my_folder/my_file.csv", header=True)

# select the Row_Number and Category columns
df.select(['Row_Number', 'Category']).show(5)

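Assume that we want to create a new column called ‘Categories’ that holds all the categories of a row in an array. We can easily achieve that by using the split() function from pyspark.sql.functions to split the Category string on the pipe character. The second argument of split() is a regular expression, so the pipe has to be escaped.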

from pyspark.sql import functions as F

# the raw string keeps the backslash that escapes the pipe in the regex
df_new = df.withColumn('Categories', F.split(df.Category, r'\|'))
df_new = df_new.select(['Row_Number', 'Category', 'Categories'])

By calling df_new.printSchema(), we can confirm that the “Categories” column has the “array” data type.

 |-- Row_Number: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Categories: array (nullable = true)
 |    |-- element: string (containsNull = true)

Get the First Element of an Array

Let’s see some cool things that we can do with arrays, like getting the first element. Array positions start at 0, so we call the getItem() method of the column with index 0 as follows:

df_new.withColumn('First_Item', df_new.Categories.getItem(0)).show(5)

Get the Number of Elements of an Array

We can get the size of an array using the size() function.

df_new.withColumn('Elements', F.size('Categories')).show(5)

Get the Last Element of an Array

We can get the last element of the array by combining the getItem() and size() functions. Since array positions start at 0, the last element sits at position size - 1:

df_new.withColumn('Last_Item', df_new.Categories.getItem(F.size('Categories')-1)).show(5)
