Predictive Hacks

# How to Resample Data by Group In Pandas

In machine learning projects, some factors can have a huge impact on model performance even though they are not manageable or structured. One solution is to remove their effect from our data by sampling based on the factor we want to normalize.

Let’s create some example data. Suppose that we have a factor called Campaign with the following groups:

• Campaign 1: Smartphone related
• Campaign 2: Camera related
• Campaign 3: Computer related
```python
import pandas as pd
import numpy as np

# Each campaign gets 200 simulated rows with its own click probability
Smartphone = pd.DataFrame(
    {
        "Gender": np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        "Age": np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
        "Add_Color": np.random.choice(["Blue", "Red"], 200, p=[0.6, 0.4]),
        "Campaign": ["Smartphone"] * 200,
        "Click": np.random.binomial(1, size=200, p=0.6),
    }
)

Camera = pd.DataFrame(
    {
        "Gender": np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        "Age": np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
        "Add_Color": np.random.choice(["Blue", "Red"], 200, p=[0.3, 0.7]),
        "Campaign": ["Camera"] * 200,
        "Click": np.random.binomial(1, size=200, p=0.2),
    }
)

Computer = pd.DataFrame(
    {
        "Gender": np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        "Age": np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
        "Add_Color": np.random.choice(["Blue", "Red"], 200, p=[0.25, 0.75]),
        "Campaign": ["Computer"] * 200,
        "Click": np.random.binomial(1, size=200, p=0.25),
    }
)

# Stack the three campaigns; each one keeps its own 0-199 index labels
df = pd.concat([Smartphone, Camera, Computer])

df.sample(10)
```
```
    Gender      Age Add_Color    Campaign  Click
115      m    [<30]       Red  Smartphone      1
12       m  [30-65]       Red    Computer      0
112      m  [30-65]       Red    Computer      0
148      m    [<30]       Red    Computer      1
127      m    [<30]      Blue    Computer      0
83       f    [<30]       Red  Smartphone      1
168      f  [30-65]       Red    Computer      0
80       f  [30-65]       Red    Computer      0
25       m    [<30]       Red      Camera      0
11       f  [30-65]       Red  Smartphone      0
```

Below we can see that every campaign has a different click rate. That would be fine if we wanted to use Campaign as a feature in our model. However, if we can’t use it (for example, because future campaigns may be different and we want a universal model), we have to remove its effect somehow; otherwise we will end up with a biased model.

```python
# Select the numeric Click column explicitly; the other columns are text
print(df.groupby('Campaign')[['Click']].mean())
```
```
            Click
Campaign
Camera       0.21
Computer     0.22
Smartphone   0.59
```

What we want to achieve is to resample every group so that all of them end up with an equal click rate.

## Resample Data by Group

In our example, we are working with clicks, so we have two classes, 0 and 1. We want an equal number of rows from each class in every campaign, so that the click rate becomes 0.5. We will use the Pandas sample method: given a DataFrame and a number n, it returns n randomly selected rows, without replacement by default.
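As a quick illustration of how sample behaves, here is a minimal sketch. The toy frame and its row_id column are made up for this example and are not part of the campaign data.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate DataFrame.sample
toy = pd.DataFrame({"row_id": range(10), "Click": [0, 1] * 5})

# Draw 4 rows at random; random_state makes the draw repeatable
subset = toy.sample(n=4, random_state=7)

print(len(subset))             # 4 rows are returned
print(subset.index.is_unique)  # True: no row is drawn twice (replace=False is the default)
```

Because the draw is without replacement, asking for more rows than the frame contains raises an error unless replace=True is passed. That is why, in the approach below, we always sample the larger class down to the size of the smaller one.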

The tricky part here is that we have to identify the minority and the majority class separately for every campaign: as we saw above, the minority class for the Smartphone campaign is 0, while for Computer and Camera it is 1.
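One way to check which class is the minority in each campaign is to count the rows per (Campaign, Click) pair. The sketch below uses a small made-up frame with opposite imbalances in two campaigns; the names toy, counts, minority_class and rows_kept are only for illustration.

```python
import pandas as pd

# Hypothetical toy data: two campaigns with opposite class imbalance
toy = pd.DataFrame({
    "Campaign": ["Smartphone"] * 5 + ["Camera"] * 5,
    "Click":    [1, 1, 1, 0, 0] + [0, 0, 0, 0, 1],
})

# Rows per (Campaign, Click) pair, with one column per class
counts = toy.groupby(["Campaign", "Click"]).size().unstack(fill_value=0)

# The minority class is the column with the smallest count in each row
minority_class = counts.idxmin(axis=1)

# A balanced sample keeps the minority rows plus as many majority rows
rows_kept = 2 * counts.min(axis=1)

print(counts)
print(minority_class)
print(rows_kept)
```

The same arithmetic applies to the click rates shown earlier: with 200 rows and a click rate of 0.21, Camera has 42 clicks, so a balanced sample keeps 84 of its 200 rows.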

```python
# get the unique campaigns
campaigns = df.Campaign.unique()

parts = []
for i in campaigns:
    print(i)
    # keep the campaign we want to sample
    z = df.query(f'Campaign=="{i}"')

    A = z[z['Click'] == 0][['Click']]
    B = z[z['Click'] == 1][['Click']]

    # find out which is the minority and which is the majority class
    if len(A) > len(B):
        majority = A
        minority = B
    else:
        majority = B
        minority = A

    # sample as many majority rows as there are minority rows
    indexes = majority.sample(len(minority), random_state=7).index
    # get the majority indexes that are NOT in the sample above,
    # so that we can remove them from our dataframe z
    indexes = majority.loc[~majority.index.isin(indexes)].index

    # index labels are unique within one campaign, so this drops only those rows
    z = z.loc[~z.index.isin(indexes)]
    parts.append(z)

sampled = pd.concat(parts)

print(sampled.groupby('Campaign')[['Click']].mean())
```
```
            Click
Campaign
Camera        0.5
Computer      0.5
Smartphone    0.5
```
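The same idea can be written more compactly with the sample method on grouped data, which to the best of our knowledge is available in pandas 1.1 and later. This is a sketch of an alternative, not the code used above: the helper name balance_by_group and the toy data are hypothetical, and it will not pick exactly the same rows as the loop.

```python
import pandas as pd

def balance_by_group(data, group_col, target_col, random_state=7):
    """Downsample every class to the minority size within each group."""
    parts = []
    for _, group in data.groupby(group_col):
        # size of the smallest class inside this group
        n = group[target_col].value_counts().min()
        # draw n rows from every class of this group, without replacement
        parts.append(group.groupby(target_col).sample(n=n, random_state=random_state))
    return pd.concat(parts)

# Hypothetical toy data: two campaigns with opposite class imbalance
toy = pd.DataFrame({
    "Campaign": ["Smartphone"] * 5 + ["Camera"] * 5,
    "Click":    [1, 1, 1, 0, 0] + [0, 0, 0, 0, 1],
})

balanced = balance_by_group(toy, "Campaign", "Click")
print(balanced.groupby("Campaign")["Click"].mean())
```

This version does not rely on index labels to decide which rows to drop. Note that a group containing only one class is returned unchanged, so it would stay unbalanced.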

## Final Remarks

We lost some of our data, but what remains is more meaningful for our model. This is not the only way to deal with this kind of problem, but it is a simple solution that we use. Biased data is one of the most common problems in machine learning and can lead to major issues, so this is a nice tool to add to your toolset.