When running an Exploratory Data Analysis (EDA), it is common to compute summary statistics using “group by” operations. The tricky and error-prone part is that, by default, the Pandas “group by” ignores the NAs. Let’s see some examples.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3, np.nan, 3],
                   'B': ['a', 'a', 'a', 'b', np.nan, 'b', 'b', np.nan],
                   'C': [10, 20, 30, 10, 20, 30, 10, 20]})
df
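For reference, the constructed DataFrame looks as follows, with NaN values in both A and B:
     A    B   C
0  1.0    a  10
1  1.0    a  20
2  2.0    a  30
3  2.0    b  10
4  3.0  NaN  20
5  3.0    b  30
6  NaN    b  10
7  3.0  NaN  20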
Let’s run some group by operations.
df.groupby(['A'])['C'].mean()
A
1.0 15.000000
2.0 20.000000
3.0 23.333333
Name: C, dtype: float64
As we can see, the NaN values were ignored. We can easily include them in the output by passing dropna=False (available since pandas 1.1) to the group by.
df.groupby(['A'], dropna=False)['C'].mean()
A
1.0 15.000000
2.0 20.000000
3.0 23.333333
NaN 10.000000
Name: C, dtype: float64
As we can see, the NaN appeared in the output. Another approach could be to fill the NAs with a placeholder value, such as the string “Unknown”, and then run the group by. Finally, another approach could be to set the data type of the grouping variable to string. For example:
Fill NAs Approach
# create a copy of the initial df
df1 = df.copy()

# fill the NAs with the "Unknown" string
df1['A'] = df1['A'].fillna("Unknown")

df1.groupby(['A'])['C'].mean()
A
1.0 15.000000
2.0 20.000000
3.0 23.333333
Unknown 10.000000
Name: C, dtype: float64
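Keep in mind that filling a numeric column with a string leaves A as a mixed-type object column, so numeric operations on it will no longer work. A quick check:
# column A now mixes floats with the "Unknown" string
df1['A'].dtype  # dtype('O'), i.e. object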
Change Data Type Approach
# create a copy of the initial df
df2 = df.copy()

# set the grouping variable as string
df2['A'] = df2['A'].astype('str')

df2.groupby(['A'])['C'].mean()
A
1.0 15.000000
2.0 20.000000
3.0 23.333333
nan 10.000000
Name: C, dtype: float64
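Note that astype('str') converts the missing values to the literal string “nan”, which is why nan appears as a group above. Finally, the dropna=False approach also works when grouping by more than one column; a minimal sketch, assuming pandas 1.1 or later:
# keep the NaN groups for both grouping keys
df.groupby(['A', 'B'], dropna=False)['C'].mean()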