Sometimes when we are working on machine learning projects, there are factors that have a huge impact on performance yet are neither manageable nor structured. One solution is to remove their effect from our data by resampling based on the factor we want to normalize.
Let’s create the data for our example. Suppose we have a factor called Campaigns with the following groups:
- Campaign 1: Smartphone related
- Campaign 2: Camera related
- Campaign 3: Computer related
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import plotly.express as px

Smartphone = pd.DataFrame(
    {
        "Gender": np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        "Age": np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
        "Add_Color": np.random.choice(["Blue", "Red"], 200, p=[0.6, 0.4]),
        "Campaign": ["Smartphone"] * 200,
        "Click": np.random.binomial(1, size=200, p=0.6),
    }
)

Camera = pd.DataFrame(
    {
        "Gender": np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        "Age": np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
        "Add_Color": np.random.choice(["Blue", "Red"], 200, p=[0.3, 0.7]),
        "Campaign": ["Camera"] * 200,
        "Click": np.random.binomial(1, size=200, p=0.2),
    }
)

Computer = pd.DataFrame(
    {
        "Gender": np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        "Age": np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
        "Add_Color": np.random.choice(["Blue", "Red"], 200, p=[0.25, 0.75]),
        "Campaign": ["Computer"] * 200,
        "Click": np.random.binomial(1, size=200, p=0.25),
    }
)

df = pd.concat([Smartphone, Camera, Computer])
df.sample(10)
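Note that the snippet above does not set a random seed, so the rows you generate (and the exact click rates below) will differ between runs. If you want a reproducible run, you can seed NumPy before creating the data; the seed value here is just an arbitrary choice:

np.random.seed(42)  # any fixed seed makes the generated data reproducible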
Gender Age Add_Color Campaign Click
115 m [<30] Red Smartphone 1
12 m [30-65] Red Computer 0
112 m [30-65] Red Computer 0
148 m [<30] Red Computer 1
127 m [<30] Blue Computer 0
83 f [<30] Red Smartphone 1
168 f [30-65] Red Computer 0
80 f [30-65] Red Computer 0
25 m [<30] Red Camera 0
11 f [30-65] Red Smartphone 0
Below we can see that every group has a different click rate. That would be fine if we wanted to use this feature in our model. However, if we can’t use it (perhaps because we may run different campaigns in the future and want a universal model), we have to remove this effect somehow; otherwise we will end up with a biased model.
print(df.groupby('Campaign')[['Click']].mean())
Click
Campaign
Camera 0.21
Computer 0.22
Smartphone 0.59
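To see why this bias matters, here is a quick check you could run; it is not part of the original walkthrough, but it only uses the statsmodels import from above. Because the Camera and Computer campaigns both use Red more often and have lower click rates, a model that cannot see Campaign will wrongly attribute that effect to Add_Color:

# fit a simple logistic regression of Click on the ad color alone
model = smf.logit("Click ~ Add_Color", data=df).fit()
# Red picks up a negative coefficient that really belongs to Campaign
print(model.params)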
What we want to achieve is to resample every group so that each one ends up with an equal click rate.
Resample Data by Group
In our example, we are working with clicks, so we have two classes, 0 and 1. What we want is an equal amount of each class for every campaign, so that the click rate becomes 0.5. We will use the pandas sample method: given a DataFrame and a number n, it draws n random rows without replacement.
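As a minimal illustration of the call we are about to use (random_state is only there to make the draw reproducible):

# draw 5 random rows without replacement (replace=False is the default)
df.sample(5, random_state=7)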
The tricky part is that we have to identify the minority and the majority class for every campaign: as we can see, the minority class for the Smartphone campaign is class 0, while for Camera and Computer it is class 1.
# get the unique campaigns
campaigns = df.Campaign.unique()
sampled = pd.DataFrame()

for i in campaigns:
    print(i)
    # keep only the campaign we want to sample
    z = df.query(f'Campaign=="{i}"')
    A = z[z['Click'] == 0][['Click']]
    B = z[z['Click'] == 1][['Click']]
    # find out which is the majority and which the minority class
    if len(A) > len(B):
        majority, minority = A, B
    else:
        majority, minority = B, A
    # downsample the majority class to the size of the minority class
    keep = majority.sample(len(minority), random_state=7).index
    # the majority rows NOT drawn above are the ones we remove from z
    drop = majority.index.difference(keep)
    z = z.loc[~z.index.isin(drop)]
    sampled = pd.concat([sampled, z])

sampled.groupby('Campaign')[['Click']].mean()
Click
Campaign
Camera 0.5
Computer 0.5
Smartphone 0.5
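As a quick sanity check (not shown in the original output), we can confirm that both classes now have the same number of rows within every campaign:

# count rows per campaign and class; the two counts should match within each campaign
print(sampled.groupby(['Campaign', 'Click']).size())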
Final Remarks
We lost some of our data, but we end up with more meaningful data to feed our model. This is not the only way to deal with this kind of problem, just a simple solution that we use. Biased data is one of the most common problems in machine learning and can lead to major issues, so this is a nice tool to add to your toolset.