Predictive Hacks

How to deal with get_dummies in Train and Test Dataset

When we work with Pandas and Machine Learning models, it is common to use the get_dummies() function to convert categorical variables to dummy variables. The problem is that, in the real world, the new data we receive for predictions will not necessarily contain the same levels of each variable, and a model that is fed missing or unknown columns will usually fail with an error. Let's see how we can deal with this case.

Assume that our initial data is the following:

import pandas as pd

df = pd.DataFrame({'Gender':['m','f','m','m','f'],
                   'AgeGroup': ['<18', '18-25','25-35', '35-45', '45-65']})

df

And we apply the get_dummies() function:

X_train = pd.get_dummies(df)

X_train

Now suppose we get new data, the so-called test dataset, which contains a new level of AgeGroup, the 65+. If we apply get_dummies() to the test dataset, we lose the levels that appear only in the train dataset, and we also get a new column that the model has never seen. For example:

df_new = pd.DataFrame({'Gender':['m','f','m','m','f'],
                   'AgeGroup': ['<18', '18-25','25-35', '35-45', '65+']})

X_test = pd.get_dummies(df_new)

X_test
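To check the mismatch without reading the two tables by eye, one option is to compare the column sets directly. This is only a small sketch that rebuilds the two example DataFrames from above; the names missing_in_test and unseen_in_train are introduced here for illustration.

```python
import pandas as pd

# Same example data as in the post
df = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                   'AgeGroup': ['<18', '18-25', '25-35', '35-45', '45-65']})
df_new = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                       'AgeGroup': ['<18', '18-25', '25-35', '35-45', '65+']})

X_train = pd.get_dummies(df)
X_test = pd.get_dummies(df_new)

# Columns the model was trained on but that the new data did not produce
missing_in_test = sorted(set(X_train.columns) - set(X_test.columns))

# Columns produced by the new data that the model has never seen
unseen_in_train = sorted(set(X_test.columns) - set(X_train.columns))

print(missing_in_test)   # ['AgeGroup_45-65']
print(unseen_in_train)   # ['AgeGroup_65+']
```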

As we can see, the AgeGroup_45-65 column is missing and we got a new column, the AgeGroup_65+. But we want to feed our model the same columns as the train dataset. The trick is to use the reindex function.

X_test = X_test.reindex(columns = X_train.columns, fill_value=0)

X_test

As we can see, it removed the AgeGroup_65+ column and filled the AgeGroup_45-65 column with 0!
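The two steps can also be wrapped in a small helper so that every new batch of data is encoded the same way. The sketch below is one possible way to do it, not the only one: the function name encode_like_train and the printed warning are assumptions made for illustration, and dtype=int is passed to get_dummies() so that the original dummy columns and the zero-filled ones share the same integer type.

```python
import pandas as pd

def encode_like_train(new_df, train_columns):
    """One-hot encode new_df and align it to the training columns.

    Hypothetical helper: the name and the warning message are illustrative.
    """
    dummies = pd.get_dummies(new_df, dtype=int)

    # Report levels that the model never saw during training
    unseen = dummies.columns.difference(train_columns)
    if len(unseen) > 0:
        print(f"Dropping columns not seen in training: {list(unseen)}")

    # Keep the training columns in their original order, fill the gaps with 0
    return dummies.reindex(columns=train_columns, fill_value=0)


df = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                   'AgeGroup': ['<18', '18-25', '25-35', '35-45', '45-65']})
df_new = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                       'AgeGroup': ['<18', '18-25', '25-35', '35-45', '65+']})

X_train = pd.get_dummies(df, dtype=int)
X_test = encode_like_train(df_new, X_train.columns)

print(X_test)
```

Keep in mind that the last row, whose AgeGroup is 65+, ends up with 0 in every AgeGroup column, so the model treats it as belonging to no age group at all. Whether that is acceptable depends on the model and the data.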
