In this example, we will show you alternative ways to add new columns to PySpark DataFrames. We will start our example by loading CSV files from an S3 bucket.
df = spark.read.options(header=True).csv("s3://my-bucket/my_data_folder/")
df.printSchema()
root
|-- Row_Number: string (nullable = true)
|-- Hash_Name: string (nullable = true)
|-- Event_Date: string (nullable = true)
|-- Rating: string (nullable = true)
|-- Category: string (nullable = true)
|-- Class_A: string (nullable = true)
|-- Class_B: string (nullable = true)
|-- Class_C: string (nullable = true)
|-- File_Path: string (nullable = true)
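Notice that every column was read as a string. That happens because we only set the header option; if you also want Spark to guess the column types, you can pass inferSchema=True. A sketch of the same read with that option added:

df = spark.read.options(header=True, inferSchema=True).csv("s3://my-bucket/my_data_folder/")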
Add a Column with withColumn()
The most common way to add columns in PySpark is with the withColumn() method. Let's create a new column called "Double_Rating", which is "Rating" times 2.
new_df = df.withColumn('Double_Rating', df.Rating*2)
new_df.select('Double_Rating', 'Rating').show(5)
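Since "Rating" was loaded as a string, Spark casts it implicitly before multiplying. If you prefer to make the cast explicit, a minimal sketch:

from pyspark.sql.functions import col

# cast the string column to double before doubling it
new_df = df.withColumn('Double_Rating', col('Rating').cast('double') * 2)
new_df.select('Double_Rating', 'Rating').show(5)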
We can create a constant column using the lit() function. We will create a column called "Constant_5" that takes the value 5.
from pyspark.sql.functions import lit

new_df = df.withColumn('Constant_5', lit(5))
new_df.select('Constant_5', 'Rating').show(5)
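As a side note, if you are on Spark 3.3 or newer, the withColumns() method can add several columns in one call instead of chaining withColumn(). A sketch, assuming that version and an illustrative second column name:

from pyspark.sql.functions import lit

# add two constant columns in a single pass (requires Spark 3.3+)
new_df = df.withColumns({'Constant_5': lit(5), 'Constant_Text': lit('hello')})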
Finally, we can create an empty column as follows:
new_df = df.withColumn('Empty', lit(None))
new_df.select('Empty', 'Rating').show(5)
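One caveat: lit(None) gives the new column the NullType, which some downstream steps (for example, writing to certain file formats) may reject. A common workaround is to cast the null to a concrete type; a minimal sketch:

from pyspark.sql.functions import lit

# give the empty column an explicit string type instead of NullType
new_df = df.withColumn('Empty', lit(None).cast('string'))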
Add a Column with select()
We can create a new column using the select() method as follows:
df.select('Rating', (df.Rating*2).alias('Double_Rating')).show(5)
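A closely related option is selectExpr(), which accepts SQL expression strings directly. A sketch of the same query:

# same result, expressed as SQL expression strings
df.selectExpr('Rating', 'Rating*2 as Double_Rating').show(5)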
Add a Column with SQL Statements
Finally, there is an alternative way to create columns by running SQL statements. In this case, we need to create a temporary view first and then run a SQL SELECT statement against it.
# create a temp view called mytable
df.createOrReplaceTempView('mytable')
spark.sql('SELECT Rating, Rating*2 AS Double_Rating FROM mytable').show(5)
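Note that the SELECT above returns only the two listed columns. To add the new column while keeping every existing one, you can select * alongside the expression; a minimal sketch:

# keep all original columns and append the computed one
spark.sql('SELECT *, Rating*2 AS Double_Rating FROM mytable').show(5)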