Predictive Hacks

How to Read Multiple CSV Files in PySpark

PySpark offers several ways to read data, and the exact commands differ between Spark versions. Below, we will show you how to read multiple compressed CSV files that are stored in S3 using PySpark.

Assume that we are dealing with the following 4 .gz files.

aws s3 ls s3://my-bucket/pyspark_examples/flights/ --human-readable

2021-11-25 18:26:50  668.3 KiB AA_DFW_2014_Departures_Short.csv.gz
2021-11-25 18:26:50  631.6 KiB AA_DFW_2015_Departures_Short.csv.gz
2021-11-25 18:26:50  615.8 KiB AA_DFW_2016_Departures_Short.csv.gz
2021-11-25 18:26:50  612.1 KiB AA_DFW_2017_Departures_Short.csv.gz

Note that all files have headers. For this example, we will work with Spark 3.1.1.

df = spark.read.csv("s3://my-bucket/pyspark_examples/flights/", header=True)
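Reading a whole folder assumes that a `spark` session already exists, as it does in the PySpark shell or in a notebook. For a standalone script, a minimal sketch could look like the following. The names `read_flights`, `main` and the app name `read-multiple-csv` are illustrative and not part of the original example, and S3 access is assumed to be already configured for the cluster.

```python
FLIGHTS_PATH = "s3://my-bucket/pyspark_examples/flights/"


def read_flights(spark, path=FLIGHTS_PATH):
    """Read every CSV file under `path` into a single DataFrame."""
    # header=True takes the column names from the first line of each file.
    # Spark picks the gzip codec from the .gz extension, so no extra option is needed.
    return spark.read.csv(path, header=True)


def main():
    # Imported lazily so that read_flights can be imported without Spark installed.
    from pyspark.sql import SparkSession

    # Hypothetical app name, chosen only for this sketch.
    spark = SparkSession.builder.appName("read-multiple-csv").getOrCreate()
    df = read_flights(spark)
    df.show(5)  # preview the first rows
```

Calling `main()` from a script would print the first five rows. Because no schema is inferred here, every column is read as a string.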

If we want to get the schema of the data frame, we can run:

df.schema

StructType(List(StructField(Date (MM/DD/YYYY),StringType,true),StructField(Flight Number,StringType,true),StructField(Destination Airport,StringType,true),StructField(Actual elapsed time (Minutes),StringType,true)))

Or, in a more readable form:

df.printSchema()

root
 |-- Date (MM/DD/YYYY): string (nullable = true)
 |-- Flight Number: string (nullable = true)
 |-- Destination Airport: string (nullable = true)
 |-- Actual elapsed time (Minutes): string (nullable = true)

Define Specific Files

Note that you can explicitly list the files that you want to load. In the previous example, we loaded all the files under the folder. Let’s say that we want to load only two files. We can do it by passing a list of paths:

path = ['s3://my-bucket/pyspark_examples/flights/AA_DFW_2014_Departures_Short.csv.gz',

df = spark.read.csv(path, header=True)
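`spark.read.csv` accepts a list of paths as well as a single string. As a sketch, the list could be built from the years of interest rather than typed out by hand. Here `flight_paths` and `read_selected` are hypothetical helpers, and the years 2014 and 2015 are chosen only for illustration, following the file naming pattern in the listing above.

```python
BASE = "s3://my-bucket/pyspark_examples/flights/"


def flight_paths(years, base=BASE):
    """Build the full S3 path for each requested year."""
    return [f"{base}AA_DFW_{year}_Departures_Short.csv.gz" for year in years]


def read_selected(spark, years):
    """Read only the files for the given years into one DataFrame."""
    # Passing a list makes Spark read exactly these files, not the whole folder.
    return spark.read.csv(flight_paths(years), header=True)


# Illustrative choice of two years; any subset of the available files works.
paths = flight_paths([2014, 2015])
```

With an active session, `read_selected(spark, [2014, 2015])` would return a data frame containing the rows of those two files only.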
