When we work with Pandas and Machine Learning models, it is common to use the get_dummies() function to convert categorical variables to dummy variables. The problem is that in the real world, the new data we receive for making predictions will not necessarily have the same variable levels, and if you feed the model a column name it has never seen, the prediction will typically fail. Let’s see how we can deal with this case.
Assume that our initial data were the following:
import pandas as pd

df = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                   'AgeGroup': ['<18', '18-25', '25-35', '35-45', '45-65']})
df
And we apply the get_dummies() function:
X_train = pd.get_dummies(df)
X_train
Now we get new data, the so-called test dataset, where there is a new level in AgeGroup: 65+. If we apply get_dummies() to the test dataset, we will lose the levels of the train dataset that do not appear in the new data, and we will also get a new, unknown column. For example:
df_new = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                       'AgeGroup': ['<18', '18-25', '25-35', '35-45', '65+']})
X_test = pd.get_dummies(df_new)
X_test
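To see the mismatch explicitly rather than by eye, one option is to compare the two column indexes. The sketch below rebuilds the same toy data and uses Index.difference in both directions; the names missing_in_test and unseen_in_train are only illustrative.

```python
import pandas as pd

# Same toy data as in the post.
df = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                   'AgeGroup': ['<18', '18-25', '25-35', '35-45', '45-65']})
df_new = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                       'AgeGroup': ['<18', '18-25', '25-35', '35-45', '65+']})

X_train = pd.get_dummies(df)
X_test = pd.get_dummies(df_new)

# Columns the model was trained on that the new data does not produce.
missing_in_test = X_train.columns.difference(X_test.columns)

# Columns created by levels that never appeared in the training data.
unseen_in_train = X_test.columns.difference(X_train.columns)

print(list(missing_in_test))
print(list(unseen_in_train))
```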
As we can see, the column AgeGroup_45-65 is missing and we got a new column, AgeGroup_65+. But we want to feed our model with the same columns as the train dataset. The trick is to use the reindex function.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
X_test
As we can see, it removed the AgeGroup_65+ column and filled the AgeGroup_45-65 column with 0!
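If this step is needed in more than one place, it could be wrapped in a small helper. The following is a minimal sketch, not a definitive implementation: the function name align_to_training_columns is hypothetical, and it assumes the training column index is kept around (for example, saved next to the model).

```python
import pandas as pd


def align_to_training_columns(X_new, training_columns):
    """Hypothetical helper: make X_new expose exactly the training columns.

    Columns missing from X_new are added and filled with 0;
    columns unknown to the training set are dropped.
    """
    return X_new.reindex(columns=training_columns, fill_value=0)


# Same toy data as in the post.
df = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                   'AgeGroup': ['<18', '18-25', '25-35', '35-45', '45-65']})
df_new = pd.DataFrame({'Gender': ['m', 'f', 'm', 'm', 'f'],
                       'AgeGroup': ['<18', '18-25', '25-35', '35-45', '65+']})

X_train = pd.get_dummies(df)
X_test = align_to_training_columns(pd.get_dummies(df_new), X_train.columns)

# The aligned test set has the same columns, in the same order, as X_train.
same_columns = list(X_test.columns) == list(X_train.columns)
print(same_columns)
```

Note that the row whose AgeGroup was 65+ now has 0 in every AgeGroup column, so the model sees it as belonging to no age group at all. Whether that is acceptable depends on the model and the problem.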