Predictive Hacks

Define Schema and Load Data in PySpark

When we read data in PySpark, the default is to treat every column as a string; only if we pass inferSchema = True will Spark scan the data and guess the column types. A cleaner (and faster) alternative is to define a schema ourselves. Let's see how we can define one and how to use it later when we load the data.

Create a Schema

We will need to import the types from pyspark.sql.types, and then we can create the schema as follows:

from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DateType
)

# Define the schema: one StructField per column.
# The third argument of StructField is `nullable`.
my_schema = StructType([
  StructField('name', StringType(), False),
  StructField('surname', StringType(), False),
  StructField('age', IntegerType(), False),
  StructField('date', DateType(), False),
  StructField('department', StringType(), False)
])

And then we can load the data:

df = spark.read.csv(filename_path, header=True, nullValue='NA', schema=my_schema)

Here, the nullValue='NA' option tells Spark that null values appear as 'NA' in our file. Note that when reading from file-based sources such as CSV, Spark treats every column as nullable regardless of the nullable flag in the schema.
