How to deal with get_dummies in Train and Test Dataset

When we work with Pandas and machine learning models, it is common to use the get_dummies() function to convert categorical variables to dummy (one-hot) columns. The problem is that, in the real world, the new data we receive for predictions will not necessarily contain the same levels of each variable. If we feed the model a different set of columns from the one it was trained on, it will typically crash or produce wrong predictions. Let’s see how we can deal with this case.

Assume that our initial data were the following:

import pandas as pd

df = pd.DataFrame({'Gender':['m','f','m','m','f'],
                   'AgeGroup': ['<18', '18-25','25-35', '35-45', '45-65']})


And we apply the get_dummies() function:

X_train = pd.get_dummies(df)
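
To see which columns the model will later expect, it can help to list the columns that get_dummies() created. The sketch below rebuilds the same DataFrame so that it runs on its own; the variable name train_columns is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                   'AgeGroup': ['<18', '18-25', '25-35', '35-45', '45-65']})

X_train = pd.get_dummies(df)

# One dummy column per level: 2 for Gender and 5 for AgeGroup
train_columns = list(X_train.columns)
print(train_columns)
print(len(train_columns))
```

These column names are what a model fitted on X_train will expect at prediction time.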


Now we get new data, the so-called test dataset, which contains a new level in AgeGroup, 65+, and no longer contains the 45-65 level. If we apply get_dummies() to the test dataset, we will lose the columns for levels that appear only in the train dataset, and we will get a new, unknown column. For example:

df_new = pd.DataFrame({'Gender':['m','f','m','m','f'],
                   'AgeGroup': ['<18', '18-25','25-35', '35-45', '65+']})

X_test = pd.get_dummies(df_new)
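
One way to make the mismatch explicit is to compare the two sets of column names. This is a minimal sketch that repeats the data above so it is self-contained; the names missing_in_test and unexpected_in_test are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                   'AgeGroup': ['<18', '18-25', '25-35', '35-45', '45-65']})
df_new = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                       'AgeGroup': ['<18', '18-25', '25-35', '35-45', '65+']})

X_train = pd.get_dummies(df)
X_test = pd.get_dummies(df_new)

# Columns the model was trained on but that the new data did not produce
missing_in_test = X_train.columns.difference(X_test.columns)
# Columns produced by the new data that the model has never seen
unexpected_in_test = X_test.columns.difference(X_train.columns)

print(list(missing_in_test))     # ['AgeGroup_45-65']
print(list(unexpected_in_test))  # ['AgeGroup_65+']
```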


The test dataset is missing the AgeGroup_45-65 column and has gained a new column, AgeGroup_65+. But we want to feed our model the same columns as the train dataset. The trick is to use the reindex function.

X_test = X_test.reindex(columns = X_train.columns, fill_value=0)


The reindex call removed the AgeGroup_65+ column and added the AgeGroup_45-65 column filled with 0, so X_test now has exactly the same columns, in the same order, as X_train!
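
For repeated use, the two steps can be wrapped in a small helper. The sketch below is one possible way to do it, not a standard Pandas function: the name encode_like_train is hypothetical, and it assumes that the training columns are still available at prediction time (for example, saved alongside the model).

```python
import pandas as pd

def encode_like_train(df_new, train_columns):
    """One-hot encode df_new and align the result to the training columns.

    Hypothetical helper: unseen levels are dropped and levels missing
    from df_new come back as columns filled with 0.
    """
    dummies = pd.get_dummies(df_new)
    return dummies.reindex(columns=train_columns, fill_value=0)

df = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                   'AgeGroup': ['<18', '18-25', '25-35', '35-45', '45-65']})
df_new = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                       'AgeGroup': ['<18', '18-25', '25-35', '35-45', '65+']})

X_train = pd.get_dummies(df)
X_test = encode_like_train(df_new, X_train.columns)

# The aligned test set has the training columns, in the same order
same_columns = list(X_test.columns) == list(X_train.columns)

# The re-added column contains only zeros
all_zero = bool((X_test['AgeGroup_45-65'] == 0).all())

# Count the active AgeGroup dummies per row
age_counts = X_test.filter(like='AgeGroup_').astype(int).sum(axis=1).tolist()

print(same_columns, all_zero, age_counts)
```

Note that the row with the unseen level 65+ ends up with 0 in every AgeGroup column, so the model receives no age information for it. Whether that is acceptable depends on the use case.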

