In this short tutorial, we will show you different ways to select columns in PySpark. To begin, we will load a CSV file from S3.
df = spark.read.options(header=True).csv("s3://my-bucket/pyspark_examples/folder1/")

df.show()
Or if we want to get a list of the column names:
df.columns
['Row_Number', 'Hash_Name', 'Event_Date', 'Rating', 'Category', 'Class_A', 'Class_B', 'Class_C', 'File_Path']
Or using the schema info:
df.schema.names
['Row_Number', 'Hash_Name', 'Event_Date', 'Rating', 'Category', 'Class_A', 'Class_B', 'Class_C', 'File_Path']
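Relatedly, if we also want the data types along with the names, the df.dtypes attribute returns a list of (name, type) pairs:

df.dtypes
# [('Row_Number', 'string'), ('Hash_Name', 'string'), ...]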
Or the full schema:
df.printSchema()
root
|-- Row_Number: string (nullable = true)
|-- Hash_Name: string (nullable = true)
|-- Event_Date: string (nullable = true)
|-- Rating: string (nullable = true)
|-- Category: string (nullable = true)
|-- Class_A: string (nullable = true)
|-- Class_B: string (nullable = true)
|-- Class_C: string (nullable = true)
|-- File_Path: string (nullable = true)
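Notice that every column came back as a string, because we did not ask Spark to infer types. As a minimal sketch using the CSV reader’s built-in inferSchema option, we could reload with type inference if typed columns matter later:

df = spark.read.options(header=True, inferSchema=True).csv("s3://my-bucket/pyspark_examples/folder1/")
df.printSchema()  # numeric-looking columns such as Rating may now come back typed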
Select Columns By Name
Let’s say that we want to select two columns, the “Row_Number” and the “Category”. Let’s see how we can do it.
df.select(df.Row_Number, df.Category).show(5)
Or we could pass the column names in brackets as follows (this is useful when the column names contain white spaces, special characters, etc.):
df.select(df['Row_Number'], df['Category']).show(5)
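For instance, if a (hypothetical) column were named “Event Date” with a space, the bracket syntax would still work, while dot notation would not:

# hypothetical column name containing a space; df.Event Date would be a syntax error
df.select(df['Event Date'], df['Category']).show(5)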
Select Columns from a List
We can also select columns from a list of column names. For example:
df.select(['Row_Number','Category']).show(5)
Keep in mind that it also works without a list. For example:
df.select('Row_Number','Category').show(5)
Select Columns with the col() Function
We can use the col() function from pyspark.sql.functions. For example:
from pyspark.sql.functions import col

df.select(col('Row_Number'), col('Category')).show(5)
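One advantage of col() is that it returns a Column expression you can transform inline. A small sketch (assuming the Rating column holds numeric strings, and using a made-up alias Rating_Num) that casts and renames while selecting:

from pyspark.sql.functions import col

df.select(
    col('Row_Number'),
    col('Rating').cast('double').alias('Rating_Num')  # cast the string column and rename it
).show(5)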
Select Columns with Regular Expressions
We can select columns using regular expressions. Let’s say that we want to select all the columns that contain the string “Class” plus the “Row_Number”.
df.select(df.colRegex("`Class.*`"), df["Row_Number"]).show(5)
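Note that colRegex() expects the pattern wrapped in backticks. If you prefer plain Python over regular expressions, a roughly equivalent sketch filters df.columns with a list comprehension:

# pick the Class_* columns plus Row_Number without using colRegex
df.select([c for c in df.columns if c.startswith('Class')] + ['Row_Number']).show(5)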
Select Columns based on the Columns’ Index
PySpark does not offer a direct selector for column positions, but we can get there indirectly through df.columns. For example, let’s say that we want to select the 1st and 3rd columns.
mylist = df.columns
idx = [0, 2]

df.select([mylist[i] for i in idx]).show(5)
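Since df.columns is an ordinary Python list, slicing works as well. For example, a quick sketch selecting the first three columns:

df.select(df.columns[:3]).show(5)  # Row_Number, Hash_Name, Event_Date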
Select Columns with SQL Statements
Finally, there is an alternative way to select columns by running SQL statements. In this case, we will need to create a temporary view first and then run the SQL select statement.
df.createOrReplaceTempView('mytable')

spark.sql('SELECT Row_Number, Event_Date FROM mytable').show(5)
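If we only need SQL-style expressions rather than a full query, the selectExpr() method is a shortcut that skips creating a view. A minimal sketch:

df.selectExpr('Row_Number', 'Event_Date').show(5)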