This post is a short tutorial in Pandas GroupBy. As always we will work with examples. Let’s create a dummy DataFrame for demonstration purposes.
df = pd.DataFrame({'Gender':['m','m','m','f','f','f','f', 'm','f','f'], 'Type':['a','b','c','a','b','c','c', 'c','c','b'], 'ColA':[10,20,30,40,50,60,70,80,90,100], 'ColB':[0,5,10,15,25,30,50,10,20,30]}) df
Tip: How to return results without Index
In many cases, we do not want the column(s) of the group by operations to appear as indexes. For that reason, we use to add the reset_index() at the end. For example, let’s say that we want to get the average of ColA group by Gender
df.groupby('Gender')['ColA'].mean()
Output:
Now, if we want to remove the Gender as index we add the reset_index() command as the end:
df.groupby('Gender')['ColA'].mean().reset_index()
Output:
Tip: Instead of typing the rest_index() command, you can add the as_index=False in the groupby
and you will get the same output. For example:
df.groupby('Gender', as_index=False)['ColA'].mean()
This can be extended to more columns. For example, let’s say that we group by Gender
and Type
With index:
df.groupby(['Gender', 'Type'])['ColA'].mean()
Output:
Without index:
df.groupby(['Gender', 'Type'], as_index=False)['ColA'].mean()
Output:
Tip: How to get the groups
Once we group our data frame, we can show and get them. For example, let’s assume that we group our DataFrame by Type
grouped = df.groupby('Type')
How to iterate over groups?
We can iterate over groups as follows:
for g in grouped: print(g)
How to get a group?
We can get a specific group using the command get_group
. For example, let’s say that we want to get the group of Type “b“.
grouped.get_group('b')
Tip: How to apply multiple functions
Let’s say that we want for ColA to calculate the mean
and var
and for ColB to calculate the min
and max
, group by Gender.
df.groupby('Gender').agg({'ColA':['mean', 'var'], 'ColB':['min', 'max'] })
Tip: How to change the names of the aggregated columns
We can change the names by passing a tuple as follows:
df.groupby('Gender').agg({'ColA':[('ColA_Mean','mean'), ('ColA_Var', 'var')], 'ColB':[('ColB_Min','min'), ('ColB_Max', 'max')] })
Tip: How to add a custom function
Let’s say that we want to add a custom calculation which is the range, i.e max-min for ColA.
df.groupby('Gender').agg({'ColA':[('ColA_Mean','mean'), ('ColA_Var', 'var'), ('CustomFunction', lambda x: x.max() - x.min())], 'ColB':[('ColB_Min','min'), ('ColB_Max', 'max')] })
Tip: Dealing with Multiple Indexes
Let’s create a grouped DataFrame with multiple indexes. You can find more details at Pandas Documentation
ex = df.groupby(['Gender', 'Type']).agg({'ColA':['mean'], 'ColB':['min', 'max'] }) ex
Let’s say now that I want to get the row where Gender==’f’ and Type==’c’. We can just use the .loc
and pass the values as follows:
ex.loc[('f','c')]
Let’s say that I want to run the same query but this time to get the data only for ColA:
ex.loc[('f','c'), 'ColA']
Let’s say now that I want to get the results of ColB for all indexes:
ex.loc[:, 'ColB']
Tip: Slicers with Multiple Indexes
Let’s say that I want to get all the levels from Gender and the levels ‘a’ and ‘b’ from Type. I can use the slicers as follows:
ex.loc[(slice(None), slice('a','b')), :]
Notice: You can use slice(None)
to select all the contents of that level. You do not need to specify all the deeper levels, they will be implied as slice(None)
.
Tip: Reset a column’s MultiIndex levels
As we see here in our example DataFrame called ‘ex‘, we have Multiple Indexes even in columns. Let’s see how we can reset them.
ex.columns = ex.columns.droplevel(0) ex = ex.rename_axis(None, axis=1) ex
Finally, if we want to reset also the row indexes we can use the command reset_index()
ex.reset_index()
Notice that we could reset only one level of the indexes as follows:
ex.reset_index(level=0)