In PySpark DataFrames, we can have columns that contain arrays. Let's look at an example of an array column. First, we will load a CSV file from S3.
# read the data from S3
df = spark.read.options(header=True).csv("s3://my-bucket/my_folder/my_file.csv")

# select the Row_Number and Category columns
df.select(['Row_Number', 'Category']).show(5)
Assume that we want to create a new column called 'Categories' where all the categories appear in an array. We can easily achieve that with the split() function from pyspark.sql.functions. Note that split() expects a regular expression, which is why the pipe delimiter is escaped below.
from pyspark.sql import functions as F

# split() takes a regex, so the pipe delimiter must be escaped
df_new = df.withColumn('Categories', F.split(df.Category, r'\|'))
df_new = df_new.select(['Row_Number', 'Category', 'Categories'])
df_new.show(5)
We can confirm that the "Categories" column has an "array" data type.
df_new.printSchema()
root
|-- Row_Number: string (nullable = true)
|-- Category: string (nullable = true)
|-- Categories: array (nullable = true)
| |-- element: string (containsNull = true)
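If we would rather check the type programmatically than read the printed schema, one quick option (a minimal sketch, reusing the df_new DataFrame from above) is the dtypes attribute:

# dtypes returns (column name, type) pairs;
# 'Categories' should show up as array<string>
print(df_new.dtypes)
# [('Row_Number', 'string'), ('Category', 'string'), ('Categories', 'array<string>')]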
Get the First Element of an Array
Let's see some cool things that we can do with arrays, like getting the first element. We will need the getItem() method of the column, as follows:
df_new.withColumn('First_Item', df_new.Categories.getItem(0)).show(5)
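Note that getItem() uses zero-based indexing. As an alternative sketch, on Spark 2.4+ the element_at() function does the same thing with one-based indexing:

# element_at() is one-based, so index 1 returns the first element
df_new.withColumn('First_Item', F.element_at('Categories', 1)).show(5)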
Get the Number of Elements of an Array
We can get the size of an array using the size() function.
df_new.withColumn('Elements', F.size('Categories')).show(5)
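The size() function is also handy beyond adding a column. For example, here is a quick sketch of filtering rows by the number of categories (the threshold of 2 is an arbitrary choice for illustration):

# keep only the rows whose Categories array has more than 2 elements
df_new.filter(F.size('Categories') > 2).show(5)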
Get the Last Element of an Array
We can get the last element of the array by combining the getItem() method and the size() function as follows:
df_new.withColumn('Last_Item', df_new.Categories.getItem(F.size('Categories')-1)).show(5)
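On Spark 2.4+, a shorter alternative (one sketch among several possible approaches) is element_at() with a negative index, which counts from the end of the array:

# element_at() with index -1 returns the last element
df_new.withColumn('Last_Item', F.element_at('Categories', -1)).show(5)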