Predictive Hacks

How to add new columns to PySpark Data Frames

In this example, we will show you alternative ways to add new columns to PySpark Data Frames. We will start our exampling by loading a CSV files from an S3 bucket.

df = spark.read.options(header=True).csv("s3://my-bucket/my_data_folder/")
df.printSchema()

root
 |-- Row_Number: string (nullable = true)
 |-- Hash_Name: string (nullable = true)
 |-- Event_Date: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Class_A: string (nullable = true)
 |-- Class_B: string (nullable = true)
 |-- Class_C: string (nullable = true)
 |-- File_Path: string (nullable = true)

Add a Column with withColumn()

The most common way to add columns in PySpark is with the withColumn() method. Le’s create a new column, called “Double_Rating” which is the “Rating” times 2.

new_df = df.withColumn('Double_Rating', df.Rating*2)
new_df.select('Double_Rating', 'Rating').show(5)

We can create a constant column using the lit() function. We will create a column called “Constant_5” that takes the 5 value.

from pyspark.sql.functions import lit

new_df = df.withColumn('Constant_5', lit(5))

new_df.select('Constant_5', 'Rating').show(5)

Finally, we can create an empty column as follows:

new_df = df.withColumn('Empty', lit(None))

new_df.select('Empty', 'Rating').show(5)

Add a Column with select()

We can create a new column name using the select statement as follows:

df.select('Rating', (df.Rating*2).alias('Double_Rating')).show(5)

Add a Column with SQL Statements

Finally, there is an alternative way to create columns by running SQL statements. In this case, we will need to create a temporary view first and then run the SQL select statement.

# create a temp table called mytable
df.createOrReplaceTempView('mytable')
spark.sql('SELECT Rating, Rating*2 as Double_Rating FROM mytable').show(5)

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

Python

Image Captioning with HuggingFace

Image captioning with AI is a fascinating application of artificial intelligence (AI) that involves generating textual descriptions for images automatically.

Python

Intro to Chatbots with HuggingFace

In this tutorial, we will show you how to use the Transformers library from HuggingFace to build chatbot pipelines. Let’s