Predictive Hacks

How to Transform an Imbalanced Dataset into a Balanced One

In earlier posts we showed how to Resample Data By Groups in Python and how to do Undersampling by Groups in R. In this post, we show a compact way to create balanced datasets by undersampling on one or more variables at once. Let’s start by creating our “unbalanced” dataset with the following characteristics:

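  • 1000 observations
  • A Category column with 3 levels, “A”, “B” and “C”, drawn with probabilities 30%, 50% and 20% respectively.
  • A Sentiment column with 2 levels, “0” and “1”, drawn with probabilities 35% and 65% respectively.
  • A Gender column with 2 levels, “M” and “F”, drawn with probabilities 70% and 30% respectively.

Because the values are drawn at random, the observed shares will be close to these percentages but not exactly equal to them.
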
import numpy as np
import pandas as pd

# Each column is drawn independently with the probabilities listed above
df = pd.DataFrame({'Category': np.random.choice(['A','B','C'], size=1000, replace=True, p=[0.3, 0.5, 0.2]),
                   'Sentiment': np.random.choice([0,1], size=1000, replace=True, p=[0.35, 0.65]),
                   'Gender': np.random.choice(['M','F'], size=1000, replace=True, p=[0.70, 0.30])})

df
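Since the dataset is random, every run produces slightly different counts. The following is a minimal sketch of a repeatable variant that also prints the observed share of each level. The seed value `42` and the name `df_seeded` are assumptions chosen for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (42) is used only to make this sketch repeatable.
rng = np.random.default_rng(42)

df_seeded = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C'], size=1000, p=[0.3, 0.5, 0.2]),
    'Sentiment': rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
    'Gender': rng.choice(['M', 'F'], size=1000, p=[0.70, 0.30]),
})

# Observed share of each level; expect values near the targets, not exact matches.
for column in ['Category', 'Sentiment', 'Gender']:
    print(df_seeded[column].value_counts(normalize=True).sort_index())
```

Checking the shares up front makes it easier to see how many rows each undersampling step is likely to keep.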

Create a Balanced Dataset based on Sentiment

Let’s say that we want a new dataset with as many positive Sentiment rows as negative ones. We group by Sentiment and randomly sample from each group as many rows as the smallest group contains.

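df_grouped_by = df.groupby(['Sentiment'])

# Sample from every group as many rows as the smallest group has
df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

# Drop the group level that apply() added to the index
df_balanced = df_balanced.droplevel(['Sentiment'])
df_balanced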

Let’s verify that the dataset is balanced.

df_balanced.groupby(['Sentiment']).size()
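As an alternative to `apply` with a lambda, pandas (1.1 and later) offers a `sample` method directly on grouped objects. The sketch below shows the same single-column undersampling with it. The stand-in frame `demo`, the seed, and the name `df_balanced_alt` are assumptions for illustration; this is not the method used in the post.

```python
import numpy as np
import pandas as pd

# Stand-in data with the same Sentiment split as in the post (seed is an assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame({'Sentiment': rng.choice([0, 1], size=1000, p=[0.35, 0.65])})

# Size of the smallest Sentiment class.
n_min = int(demo.groupby('Sentiment').size().min())

# Draw n_min rows from every class; rows keep their original index labels.
df_balanced_alt = demo.groupby('Sentiment').sample(n=n_min, random_state=0)

print(df_balanced_alt['Sentiment'].value_counts())
```

Because the sampled rows keep their original index labels, no `droplevel` step is needed and each row can still be traced back to the source DataFrame.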

Create a Balanced Dataset based on Category and Sentiment

Now let’s balance on Category and Sentiment together, so that each of the 6 combinations of Category and Sentiment ends up with the same number of rows.

df_grouped_by = df.groupby(['Category', 'Sentiment'])

df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

df_balanced = df_balanced.droplevel(['Category', 'Sentiment'])
df_balanced

Let’s verify that the dataset is balanced.

df_balanced.groupby(['Category', 'Sentiment']).size()
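Balancing on several columns keeps only as many rows per combination as the rarest combination has, so the result can be much smaller than the input. Here is a rough sketch that wraps the idea in a reusable function and reports how many rows survive. The helper name `balance_by`, the stand-in frame `demo`, and the seed are hypothetical and introduced only for this example.

```python
import numpy as np
import pandas as pd

def balance_by(frame, columns, random_state=None):
    """Undersample so every observed combination of `columns` has equal size."""
    grouped = frame.groupby(columns)
    n_min = int(grouped.size().min())  # size of the rarest combination
    return grouped.sample(n=n_min, random_state=random_state)

# Stand-in data mirroring the post's setup (seed is an assumption).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C'], size=1000, p=[0.3, 0.5, 0.2]),
    'Sentiment': rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
})

balanced = balance_by(demo, ['Category', 'Sentiment'], random_state=1)
n_groups = demo.groupby(['Category', 'Sentiment']).ngroups

# The balanced frame has n_groups times the size of the rarest combination.
print(len(demo), '->', len(balanced), 'rows across', n_groups, 'groups')
```

Note that only combinations that actually occur in the data are balanced. A combination with zero rows is simply absent from the result.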

Create a Balanced Dataset based on Sentiment within each Category

Let’s say that we want the Sentiment classes to be balanced within each Category, while the categories themselves may keep different sizes. This is how we can do it:

balanced_parts = []

for category in df.Category.unique():
    # Balance the Sentiment classes inside this Category only
    df_grouped_by = df.loc[df.Category == category].groupby(['Sentiment'])
    tmp = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))
    balanced_parts.append(tmp)

# Concatenate once at the end instead of growing a DataFrame inside the loop
df_balanced = pd.concat(balanced_parts).droplevel(['Sentiment'])

df_balanced

Let’s confirm that the Sentiment classes are balanced within each Category.

df_balanced.groupby(['Category', 'Sentiment']).size()
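The per-category balancing can also be sketched as a small function built on the grouped `sample` method. The helper name `balance_within`, its parameters, the stand-in frame `demo`, and the seed are hypothetical and serve only as an illustration under those assumptions.

```python
import numpy as np
import pandas as pd

def balance_within(frame, outer, inner, random_state=None):
    """Balance the `inner` classes separately inside each `outer` group."""
    parts = []
    for _, group in frame.groupby(outer):
        by_inner = group.groupby(inner)
        n_min = int(by_inner.size().min())  # rarest inner class in this group
        parts.append(by_inner.sample(n=n_min, random_state=random_state))
    return pd.concat(parts)

# Stand-in data mirroring the post's setup (seed is an assumption).
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C'], size=1000, p=[0.3, 0.5, 0.2]),
    'Sentiment': rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
})

balanced = balance_within(demo, 'Category', 'Sentiment', random_state=2)
print(balanced.groupby(['Category', 'Sentiment']).size())
```

Each Category keeps twice the size of its own rarest Sentiment class, so larger categories retain more rows than they would under joint balancing. If a Category contains only one Sentiment class, this sketch keeps that class as is, so it is worth checking the counts first.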

The Takeaway

Many Data Science pipelines need undersampling to reduce the bias introduced by imbalanced classes and features. In this tutorial, we showed how to create balanced datasets on one or more variables with a few lines of code.
