Predictive Hacks

# How to Transform an Imbalanced Dataset into a Balanced One

We have previously shown how to Resample Data By Groups in Python and how to do Undersampling by Groups in R. In this post, we show a concise way to create balanced datasets by undersampling on one or more variables at once. Let’s start by creating an “unbalanced” dataset with the following characteristics:

• 1000 observations
• A Category column with 3 levels, “A”, “B” and “C”, drawn with probabilities 30%, 50% and 20% respectively.
• A Sentiment column with 2 levels, “0” and “1”, drawn with probabilities 35% and 65% respectively.
• A Gender column with 2 levels, “M” and “F”, drawn with probabilities 70% and 30% respectively.

```python
import numpy as np
import pandas as pd

# Each column is drawn independently with the probabilities listed above
df = pd.DataFrame({
    'Category': np.random.choice(['A', 'B', 'C'], size=1000, replace=True, p=[0.3, 0.5, 0.2]),
    'Sentiment': np.random.choice([0, 1], size=1000, replace=True, p=[0.35, 0.65]),
    'Gender': np.random.choice(['M', 'F'], size=1000, replace=True, p=[0.70, 0.30]),
})

df
```
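
Because the columns are sampled at random, the observed shares will only be close to the target probabilities, not equal to them. A minimal sketch of how this could be checked, assuming a seeded NumPy generator (the seed value `42` is an arbitrary choice and is not part of the original example):

```python
import numpy as np
import pandas as pd

# Seeded generator so the "random" dataset is reproducible (seed is arbitrary)
rng = np.random.default_rng(42)

df = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C'], size=1000, p=[0.3, 0.5, 0.2]),
    'Sentiment': rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
    'Gender': rng.choice(['M', 'F'], size=1000, p=[0.70, 0.30]),
})

# Observed share of each level, to compare against the target probabilities
for col in ['Category', 'Sentiment', 'Gender']:
    print(df[col].value_counts(normalize=True).sort_index())
```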

## Create a Balanced Dataset based on Sentiment

Let’s say that we want a new dataset with as many positive Sentiment rows as negative ones. We can get it by sampling, from each Sentiment group, as many rows as the smallest group contains.

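```python
df_grouped_by = df.groupby(['Sentiment'])

# Number of rows in the smallest Sentiment group
n_min = df_grouped_by.size().min()

# Sample n_min rows from every group, which undersamples the larger ones
df_balanced = df_grouped_by.apply(lambda x: x.sample(n_min).reset_index(drop=True))

# Drop the group key that apply() added to the index
df_balanced = df_balanced.droplevel(['Sentiment'])
df_balanced
```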

Let’s verify that the dataset is balanced.

```python
df_balanced.groupby(['Sentiment']).size()
```
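
As an alternative, the same undersampling can probably be expressed with `DataFrameGroupBy.sample`, which pandas has offered since version 1.1 and which avoids the `apply` plus `droplevel` steps. The sketch below is a variation, not the approach used in this post; the seed and `random_state` values are arbitrary assumptions added for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
df = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C'], size=1000, p=[0.3, 0.5, 0.2]),
    'Sentiment': rng.choice([0, 1], size=1000, p=[0.35, 0.65]),
})

# Size of the smallest Sentiment class
n_min = int(df['Sentiment'].value_counts().min())

# Draw n_min rows (without replacement) from every Sentiment group
df_balanced = (
    df.groupby('Sentiment')
      .sample(n=n_min, random_state=0)
      .reset_index(drop=True)
)

print(df_balanced['Sentiment'].value_counts())
```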

## Create a Balanced Dataset based on Category and Sentiment

Let’s say that we want a balanced dataset in which every combination of Category and Sentiment has the same number of rows.

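```python
df_grouped_by = df.groupby(['Category', 'Sentiment'])

# Number of rows in the smallest (Category, Sentiment) group
n_min = df_grouped_by.size().min()

# Sample n_min rows from every (Category, Sentiment) group
df_balanced = df_grouped_by.apply(lambda x: x.sample(n_min).reset_index(drop=True))

# Drop the group keys that apply() added to the index
df_balanced = df_balanced.droplevel(['Category', 'Sentiment'])
df_balanced
```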

Let’s verify that the dataset is balanced.

```python
df_balanced.groupby(['Category', 'Sentiment']).size()
```
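
Since the one-variable and two-variable cases differ only in the list of grouping columns, the pattern can be wrapped in a small helper. The function below is a hypothetical sketch (the name `balance_by` and its parameters are made up for illustration), built on `DataFrameGroupBy.sample` rather than on the `apply` approach shown above, and tried on a tiny hand-made frame so the group sizes are easy to count.

```python
import pandas as pd

def balance_by(df, cols, random_state=None):
    """Undersample df so every observed combination of `cols` keeps the same number of rows."""
    # Size of the smallest group across all observed combinations
    n_min = int(df.groupby(cols).size().min())
    return (
        df.groupby(cols)
          .sample(n=n_min, random_state=random_state)
          .reset_index(drop=True)
    )

# Toy data: group sizes are (A,0)=3, (A,1)=2, (B,0)=1, (B,1)=2
toy = pd.DataFrame({
    'Category':  ['A'] * 5 + ['B'] * 3,
    'Sentiment': [0, 0, 0, 1, 1, 0, 1, 1],
})

balanced = balance_by(toy, ['Category', 'Sentiment'], random_state=0)
print(balanced.groupby(['Category', 'Sentiment']).size())
```

Here every one of the four combinations is cut down to the size of the smallest group, which has a single row. The balanced dataset therefore has (number of groups) × (smallest group size) rows, so one rare combination can shrink the result considerably.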

## Create a Balanced Dataset based on Sentiment within each Category

Let’s say that we want, within each category, the Sentiment classes to be balanced. This is how we can do it:

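```python
balanced_parts = []

for i in df.Category.unique():
    # Balance the Sentiment classes inside the current Category only
    df_grouped_by = df.loc[df.Category == i].groupby(['Sentiment'])
    n_min = df_grouped_by.size().min()
    tmp = df_grouped_by.apply(lambda x: x.sample(n_min).reset_index(drop=True))
    balanced_parts.append(tmp)

# Stack the per-Category results and drop the group key from the index
df_balanced = pd.concat(balanced_parts)
df_balanced = df_balanced.droplevel(['Sentiment'])

df_balanced
```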

Let’s confirm that the Sentiment classes are balanced within each Category:

```python
df_balanced.groupby(['Category', 'Sentiment']).size()
```
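
Unlike the previous section, the target size here differs per Category: each Category keeps twice the size of its own smallest Sentiment class. A minimal sketch of the same idea without filtering the DataFrame once per Category, assuming a small hand-made frame (`toy`) and an arbitrary `random_state`, could look like this:

```python
import pandas as pd

# Toy data: Category A has Sentiment counts 0 -> 4, 1 -> 2; Category B has 0 -> 1, 1 -> 4
toy = pd.DataFrame({
    'Category':  ['A'] * 6 + ['B'] * 5,
    'Sentiment': [0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1],
})

# Rows per (Category, Sentiment), then the smallest Sentiment class inside each Category
sizes = toy.groupby(['Category', 'Sentiment']).size()
n_per_category = sizes.groupby(level='Category').min()

# Sample each (Category, Sentiment) group down to its own Category's minimum
parts = [
    group.sample(n=int(n_per_category[category]), random_state=0)
    for (category, _sentiment), group in toy.groupby(['Category', 'Sentiment'])
]
df_balanced = pd.concat(parts, ignore_index=True)

print(df_balanced.groupby(['Category', 'Sentiment']).size())
```

In this toy example Category A keeps 2 rows per Sentiment and Category B keeps 1 row per Sentiment, so the classes are balanced inside each Category while the Categories themselves keep different sizes.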

## The Takeaway

Many Data Science pipelines need undersampling to deal with the bias introduced by unbalanced classes and features. In this tutorial, we showed how to create balanced datasets on one or more variables with a few lines of pandas code.
