In earlier posts, we showed how you can Resample Data By Groups in Python and how you can do Undersampling by Groups in R. In this post, we show an efficient way to create balanced datasets that takes more than one variable into account. Let’s start by creating our “unbalanced” dataset with the following characteristics:
- 1000 observations
- A Category column with 3 levels, “A”, “B” and “C”, drawn with probabilities 30%, 50% and 20% respectively.
- A Sentiment column with 2 levels, “0” and “1”, drawn with probabilities 35% and 65% respectively.
- A Gender column with 2 levels, “M” and “F”, drawn with probabilities 70% and 30% respectively.
import numpy as np
import pandas as pd

# Draw each column independently with the stated probabilities
df = pd.DataFrame({
    'Category': np.random.choice(['A', 'B', 'C'], size=1000, replace=True, p=[0.3, 0.5, 0.2]),
    'Sentiment': np.random.choice([0, 1], size=1000, replace=True, p=[0.35, 0.65]),
    'Gender': np.random.choice(['M', 'F'], size=1000, replace=True, p=[0.70, 0.30]),
})
df
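Because the columns are drawn at random, the realized shares will only be close to the target percentages, not equal to them. The sketch below is one way to check this. It assumes a seeded NumPy generator (the seed value 42 and the names `rng`, `n` and `shares` are our own choices for illustration, not part of the original example) so that the numbers are reproducible.

```python
import numpy as np
import pandas as pd

# Seeded generator so the draw is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(42)
n = 1000

df = pd.DataFrame({
    "Category": rng.choice(["A", "B", "C"], size=n, p=[0.3, 0.5, 0.2]),
    "Sentiment": rng.choice([0, 1], size=n, p=[0.35, 0.65]),
    "Gender": rng.choice(["M", "F"], size=n, p=[0.70, 0.30]),
})

# Realized share of each level per column; these should be close to,
# but usually not exactly, the probabilities passed to rng.choice.
shares = {col: df[col].value_counts(normalize=True).sort_index() for col in df.columns}
for col, share in shares.items():
    print(col, share.round(3).to_dict())
```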
Create a Balanced Dataset based on Sentiment
Let’s say that we want a new dataset with as many positive Sentiment observations as negative ones. Let’s see how we can easily achieve that.
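df_grouped_by = df.groupby(['Sentiment'])

# Sample as many rows from each group as the smallest group contains
df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

# Drop the group key that apply() added to the index
df_balanced = df_balanced.droplevel(['Sentiment'])
df_balanced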
Let’s verify that the dataset is balanced.
df_balanced.groupby(['Sentiment']).size()
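As an alternative to the `apply` call above, the same undersampling can be sketched with `DataFrameGroupBy.sample`, which is available in pandas 1.1 and later. It keeps the original row index and leaves the grouping columns in place, which may be convenient because newer pandas releases have changed how `apply` treats the grouping columns. The helper name `balance_by` and its `seed` argument are hypothetical, introduced here only for illustration.

```python
import numpy as np
import pandas as pd


def balance_by(data, cols, seed=None):
    """Undersample so every combination of `cols` keeps the same number of rows."""
    grouped = data.groupby(cols)
    n_min = int(grouped.size().min())  # size of the smallest group
    # Draw n_min rows without replacement from every group.
    return grouped.sample(n=n_min, random_state=seed)


# Small stand-in for the unbalanced dataset (seed chosen arbitrarily).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Category": rng.choice(["A", "B", "C"], size=1000, p=[0.3, 0.5, 0.2]),
    "Sentiment": rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
})

balanced = balance_by(df, ["Sentiment"], seed=0)
counts = balanced.groupby("Sentiment").size()
print(counts)
```

Passing `["Category", "Sentiment"]` as `cols` balances every combination of the two columns in the same way.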
Create a Balanced Dataset based on Category and Sentiment
Let’s say that we want a balanced dataset that takes both the Category and the Sentiment into account, so that every Category and Sentiment combination has the same number of observations.
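df_grouped_by = df.groupby(['Category', 'Sentiment'])

# Sample as many rows from each (Category, Sentiment) group as the smallest group contains
df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

# Drop the group keys that apply() added to the index
df_balanced = df_balanced.droplevel(['Category', 'Sentiment'])
df_balanced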
Let’s verify that the dataset is balanced.
df_balanced.groupby(['Category', 'Sentiment']).size()
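Balancing on two variables can discard a lot of data, because every combination is cut down to the size of the rarest one. With the probabilities used here, the rarest combination is Category “C” with Sentiment “0”, expected to hold about 0.2 × 0.35 × 1000 = 70 rows, so we would expect roughly 6 × 70 = 420 of the 1000 rows to survive. The sketch below computes the actual figures for one random draw; the seed and the variable names (`cells`, `n_min`, `n_groups`, `balanced_rows`) are our own assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in for the unbalanced dataset (seed chosen arbitrarily).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Category": rng.choice(["A", "B", "C"], size=1000, p=[0.3, 0.5, 0.2]),
    "Sentiment": rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
})

# Rows per (Category, Sentiment) combination.
cells = pd.crosstab(df["Category"], df["Sentiment"])

n_min = int(cells.to_numpy().min())  # size of the rarest combination
n_groups = cells.size                # number of combinations (3 x 2 here)
balanced_rows = n_min * n_groups     # rows left after undersampling

print(cells)
print(f"smallest cell: {n_min}, rows kept: {balanced_rows} of {len(df)}")
```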
Create a Balanced Dataset based on Sentiment within each Category
Let’s say that we want the Sentiment classes to be balanced within each Category, while the categories themselves are allowed to differ in size. This is how we can do it:
balanced_parts = []
for i in df.Category.unique():
    # Balance the Sentiment classes inside this Category only
    df_grouped_by = df.loc[df.Category == i].groupby(['Sentiment'])
    tmp = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))
    balanced_parts.append(tmp)

# Concatenate once at the end, then drop the group key from the index
df_balanced = pd.concat(balanced_parts).droplevel(['Sentiment'])
df_balanced
Let’s confirm that the Sentiment classes are balanced within each Category.
df_balanced.groupby(['Category', 'Sentiment']).size()
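The same within-Category balancing can also be sketched without `apply`: first compute the size of the smallest Sentiment class in each Category, then sample that many rows from each Category and Sentiment combination. This is only one possible variant, and the names `cell_sizes`, `n_per_category` and `parts`, as well as the seed values, are our own assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in for the unbalanced dataset (seed chosen arbitrarily).
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "Category": rng.choice(["A", "B", "C"], size=1000, p=[0.3, 0.5, 0.2]),
    "Sentiment": rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
})

# Size of the smallest Sentiment class inside each Category.
cell_sizes = df.groupby(["Category", "Sentiment"]).size()
n_per_category = cell_sizes.groupby(level="Category").min()

# Sample that many rows from every (Category, Sentiment) combination.
parts = [
    cell.sample(n=int(n_per_category[category]), random_state=0)
    for (category, _sentiment), cell in df.groupby(["Category", "Sentiment"])
]
balanced = pd.concat(parts)

# One row per Category, one column per Sentiment class.
counts = balanced.groupby(["Category", "Sentiment"]).size().unstack("Sentiment")
print(counts)
```

Each row of `counts` should show equal values for the two Sentiment classes, while different Categories may keep different numbers of rows.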
The Takeaway
Many Data Science pipelines need undersampling to deal with the bias introduced by unbalanced classes and features. In this tutorial, we showed an efficient way to create balanced datasets with a few lines of code.