PySpark offers several ways to read data, and the exact commands differ depending on the Spark version. Below, we show how to read multiple compressed CSV files stored in S3 using PySpark.
Assume that we are dealing with the following four .gz files.
aws s3 ls s3://my-bucket/pyspark_examples/flights/ --human-readable

2021-11-25 18:26:50  668.3 KiB AA_DFW_2014_Departures_Short.csv.gz
2021-11-25 18:26:50  631.6 KiB AA_DFW_2015_Departures_Short.csv.gz
2021-11-25 18:26:50  615.8 KiB AA_DFW_2016_Departures_Short.csv.gz
2021-11-25 18:26:50  612.1 KiB AA_DFW_2017_Departures_Short.csv.gz
Note that all files have headers. For this example, we will work with Spark 3.1.1; you can check the running version with:
sc.version
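The snippets below assume that a SparkSession (and its SparkContext, sc) is already available, as it is in a notebook on EMR, Glue, or Databricks. If you are running PySpark locally instead, a minimal sketch of creating the session yourself could look like the following; the application name is arbitrary, and reading from S3 additionally requires the hadoop-aws package and valid AWS credentials in your environment:

from pyspark.sql import SparkSession

# Minimal sketch: create a SparkSession for local experiments.
# To read from S3 you also need the hadoop-aws dependency and
# AWS credentials configured on the machine running the job.
spark = SparkSession.builder \
    .appName("read-flights-csv") \
    .getOrCreate()

sc = spark.sparkContext  # handy for checks such as sc.version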
df = spark.read.options(header=True).csv("s3://my-bucket/pyspark_examples/flights/")
df.show()
Finally, if we want to get the schema of the data frame, we can run:
df.schema
StructType(List(StructField(Date (MM/DD/YYYY),StringType,true),StructField(Flight Number,StringType,true),StructField(Destination Airport,StringType,true),StructField(Actual elapsed time (Minutes),StringType,true)))
Or, in a more compact form:
df.printSchema()
root
 |-- Date (MM/DD/YYYY): string (nullable = true)
 |-- Flight Number: string (nullable = true)
 |-- Destination Airport: string (nullable = true)
 |-- Actual elapsed time (Minutes): string (nullable = true)
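As the output shows, every column comes back as a string, because CSV files carry no type information. If you need typed columns, one option is to let Spark infer the schema (at the cost of an extra pass over the data); another is to declare the schema up front. A sketch of both follows; the integer type chosen for the elapsed-time column is an assumption based on its name, not something stated in the files:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Option 1: let Spark infer the column types (scans the data an extra time).
df_inferred = spark.read.options(header=True, inferSchema=True) \
    .csv("s3://my-bucket/pyspark_examples/flights/")

# Option 2: declare the schema explicitly (single pass, predictable types).
schema = StructType([
    StructField("Date (MM/DD/YYYY)", StringType(), True),
    StructField("Flight Number", StringType(), True),
    StructField("Destination Airport", StringType(), True),
    StructField("Actual elapsed time (Minutes)", IntegerType(), True),
])
df_typed = spark.read.options(header=True).schema(schema) \
    .csv("s3://my-bucket/pyspark_examples/flights/")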
Define Specific Files
Note that you can explicitly list the files that you want to load. In the previous example, we loaded all the files under the folder. Let’s say that we want to load only two of them. We can do it as follows:
path = ['s3://my-bucket/pyspark_examples/flights/AA_DFW_2014_Departures_Short.csv.gz',
        's3://my-bucket/pyspark_examples/flights/AA_DFW_2015_Departures_Short.csv.gz']
df = spark.read.options(header=True).csv(path)
df.show()
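Instead of spelling out each key, you can also pass a glob pattern, since Spark expands wildcards when listing files. A sketch that would match the same two 2014 and 2015 files, assuming the bucket layout shown above:

# The character class [45] in the path matches only the 2014 and 2015 files.
df = spark.read.options(header=True) \
    .csv("s3://my-bucket/pyspark_examples/flights/AA_DFW_201[45]_Departures_Short.csv.gz")
df.show()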