In PySpark, when we read CSV data, the default is inferSchema = False, which loads every column as a string; setting inferSchema = True makes Spark scan the data and guess the column types. Let's see how we can instead define a schema ourselves and use it when we load the data.
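As a quick illustration of the two read modes (assuming an active SparkSession named spark and a hypothetical CSV path filename_path):

# Default: every column is read as a string (inferSchema=False)
df_strings = spark.read.csv(filename_path, header=True)

# inferSchema=True: Spark guesses the types, at the cost of an extra pass over the data
df_inferred = spark.read.csv(filename_path, header=True, inferSchema=True)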
Create a Schema
We need to import pyspark.sql.types, and then we can create the schema as follows:
from pyspark.sql.types import *

# Define the schema
my_schema = StructType([
    # One StructField per column: (name, type, nullable)
    StructField('name', StringType(), False),
    StructField('surname', StringType(), False),
    StructField('age', IntegerType(), False),
    StructField('date', DateType(), False),
    StructField('department', StringType(), False)
])
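As a side note, PySpark also accepts a DDL-formatted string in place of a StructType, which can be a more compact way to express the same schema:

# The same schema written as a DDL string; this can be passed
# directly to the schema parameter of spark.read.csv
my_schema_ddl = "name STRING, surname STRING, age INT, date DATE, department STRING"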
And then we can load the data:
df = spark.read.csv(filename_path,
                    header=True,
                    nullValue='NA',
                    schema=my_schema)
Here, the nullValue='NA' option tells Spark that null values in our file appear as 'NA'.
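To confirm that the schema and the null handling were applied, we can inspect the loaded DataFrame (a quick check, assuming the columns defined above):

# Verify that the columns have the declared types
df.printSchema()

# Count how many 'NA' entries were converted to nulls in each column
from pyspark.sql.functions import col, sum as spark_sum
df.select([spark_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns]).show()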