Predictive Hacks

How to Select Columns in PySpark

In this short tutorial, we will show you different ways to select columns in PySpark. For the beginning, we will load a CSV file from S3.

df = spark.read.options(header=True).csv("s3://my-bucket/pyspark_examples/folder1/")
df.show()

Or if we want to get a list of the column names:

df.columns
['Row_Number', 'Hash_Name', 'Event_Date', 'Rating', 'Category', 'Class_A', 'Class_B', 'Class_C', 'File_Path']

Or using the schema info:

df.schema.names
['Row_Number', 'Hash_Name', 'Event_Date', 'Rating', 'Category', 'Class_A', 'Class_B', 'Class_C', 'File_Path']

Or the full schema:

df.printSchema()
root
 |-- Row_Number: string (nullable = true)
 |-- Hash_Name: string (nullable = true)
 |-- Event_Date: string (nullable = true)
 |-- Rating: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Class_A: string (nullable = true)
 |-- Class_B: string (nullable = true)
 |-- Class_C: string (nullable = true)
 |-- File_Path: string (nullable = true)

Select Columns By Name

Let’s say that we want to select two columns, the “Row_Number” and the “Category“. Let’s see how we can do it.

df.select(df.Row_Number, df.Category).show(5)

Or we could pass the column names as follows (this is useful when the column names contain white spaces etc)

df.select(df['Row_Number'], df['Category']).show(5)

Select Columns from a List

We can also select columns from a list of column names. For example:

df.select(['Row_Number','Category']).show(5)

Keep in mind that it works also without a list. For example:

df.select('Row_Number','Category').show(5)

Select Columns with col function

We can use the col() function from functions. For example:

from pyspark.sql.functions import col
df.select(col('Row_Number'),col('Category')).show(5)

Select Columns with Regular Expressions

We can select columns using regular expressions. Let’s say that we want to select all the columns that contain the string “Class” plus the “Row_Number“.

df.select(df.colRegex("`Class.*`"), df["Row_Number"]).show(5)

Select Columns based on the Columns’ Index

Indirectly, we can select columns based on the columns’ index. For example, let’s say that I want to select the 1st and 3rd column.

mylist = df.columns
idx = [0,2]
df.select([mylist[i] for i in idx]).show(5)

Select Columns with SQL Statements

Finally, there is an alternative way to select columns by running SQL statements. In this case, we will need to create a temporary view first and then run the SQL select statement.

df.createOrReplaceTempView('mytable')

spark.sql('SELECT Row_Number, Event_Date FROM mytable').show(5)

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

Python

Image Captioning with HuggingFace

Image captioning with AI is a fascinating application of artificial intelligence (AI) that involves generating textual descriptions for images automatically.

Python

Intro to Chatbots with HuggingFace

In this tutorial, we will show you how to use the Transformers library from HuggingFace to build chatbot pipelines. Let’s