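Following up on our post about Logistic Regression on Aggregated Data in R, we will show you how to deal with grouped data when you want to perform a Logistic Regression in Python. Let us first create some dummy data. Because the data are drawn at random and no seed is set, your numbers will differ from the outputs shown in this post.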
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame(
    {
        'Gender': np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        'Age': np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
        "Response": np.random.binomial(1, size=200, p=0.2),
    }
)
df.head()
  Gender      Age  Response
0      f  [30-65]         0
1      m  [30-65]         0
2      m    [<30]         0
3      f  [30-65]         1
4      f    [65+]         0
Logistic Regression on Non-Aggregate Data
First, we will run a Logistic Regression model on the non-aggregated data. We will use statsmodels because it is also the library we will use for the aggregated data, which makes the models easier to compare. statsmodels also gives us a model summary in the classic statistical style familiar from R.
Tip: If you don't want to convert your categorical data into binary columns by hand before running a Logistic Regression, you can use the statsmodels formula interface instead of scikit-learn.
model = smf.logit('Response ~ Gender + Age', data=df)
result = model.fit()
print(result.summary())
Logit Regression Results
Dep. Variable:    Response           No. Observations:   200
Model:            Logit              Df Residuals:       196
Method:           MLE                Df Model:           3
Date:             Mon, 22 Feb 2021   Pseudo R-squ.:      0.02765
Time:             18:09:11           Log-Likelihood:     -85.502
converged:        True               LL-Null:            -87.934
Covariance Type:  nonrobust          LLR p-value:        0.1821

                  coef   std err         z     P>|z|    [0.025    0.975]
Intercept      -2.1741     0.396    -5.494     0.000    -2.950    -1.399
Gender[T.m]     0.8042     0.439     1.831     0.067    -0.057     1.665
Age[T.[65+]]   -0.7301     0.786    -0.929     0.353    -2.270     0.810
Age[T.[<30]]    0.1541     0.432     0.357     0.721    -0.693     1.001
Logistic Regression on Aggregate Data
Below are 3 methods we used to deal with aggregated data.
1. Logistic Regression using Responders and Non-Responders
In the following code, we group our data and create columns for the responders (Yes) and the non-responders (No).
grouped = (
    df.groupby(['Gender', 'Age'])
    .agg({'Response': ['sum', 'count']})  # responders and group size per cell
    .droplevel(0, axis=1)
    .rename(columns={'sum': 'Yes', 'count': 'Impressions'})
    .eval('No = Impressions - Yes')
)
grouped.reset_index(inplace=True)
grouped
  Gender      Age  Yes  Impressions  No
0      f  [30-65]    9           38  29
1      f    [65+]    2            7   5
2      f    [<30]    8           25  17
3      m  [30-65]   17           79  62
4      m    [65+]    2           12  10
5      m    [<30]    9           39  30
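The same aggregation can also be written with pandas named aggregation, which avoids the droplevel and rename steps. This is only an alternative sketch; the helper aggregate_responses and the tiny toy frame are hypothetical names used for illustration.

```python
import pandas as pd

def aggregate_responses(df):
    """Collapse row-level 0/1 responses into one row per Gender x Age cell."""
    out = (
        df.groupby(["Gender", "Age"], as_index=False)
        .agg(Yes=("Response", "sum"), Impressions=("Response", "count"))
    )
    # Non-responders are whatever is left after removing the responders.
    out["No"] = out["Impressions"] - out["Yes"]
    return out

# Tiny made-up data set, just to show the shape of the result.
toy = pd.DataFrame({
    "Gender": ["f", "f", "m", "m", "m"],
    "Age": ["[<30]", "[<30]", "[<30]", "[65+]", "[65+]"],
    "Response": [1, 0, 0, 1, 1],
})
toy_grouped = aggregate_responses(toy)
print(toy_grouped)
```

A quick sanity check after any aggregation of this kind is that Yes + No equals Impressions in every row and that Impressions sums to the number of original rows.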
glm_binom = smf.glm('Yes + No ~ Age + Gender', grouped, family=sm.families.Binomial())
result_grouped = glm_binom.fit()
print(result_grouped.summary())
Generalized Linear Model Regression Results
Dep. Variable:    ['Yes', 'No']      No. Observations:   6
Model:            GLM                Df Residuals:       2
Model Family:     Binomial           Df Model:           3
Link Function:    logit              Scale:              1.0000
Method:           IRLS               Log-Likelihood:     -8.9211
Date:             Mon, 22 Feb 2021   Deviance:           1.2641
Time:             18:15:15           Pearson chi2:       0.929
No. Iterations:   5
Covariance Type:  nonrobust

                  coef   std err         z     P>|z|    [0.025    0.975]
Intercept      -2.1741     0.396    -5.494     0.000    -2.950    -1.399
Age[T.[65+]]   -0.7301     0.786    -0.929     0.353    -2.270     0.810
Age[T.[<30]]    0.1541     0.432     0.357     0.721    -0.693     1.001
Gender[T.m]     0.8042     0.439     1.831     0.067    -0.057     1.665
2. Logistic Regression with Weights
For this method, we need to create a new column with the response rate (RR) of every group and pass the group sizes (Impressions) to the model as frequency weights.
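grouped['RR'] = grouped['Yes'] / grouped['Impressions']  # response rate per group
glm = smf.glm('RR ~ Age + Gender', data=grouped, family=sm.families.Binomial(), freq_weights=np.asarray(grouped['Impressions']))
result_grouped2 = glm.fit()
print(result_grouped2.summary())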
Generalized Linear Model Regression Results
Dep. Variable:    RR                 No. Observations:   6
Model:            GLM                Df Residuals:       196
Model Family:     Binomial           Df Model:           3
Link Function:    logit              Scale:              1.0000
Method:           IRLS               Log-Likelihood:     -59.807
Date:             Mon, 22 Feb 2021   Deviance:           1.2641
Time:             18:18:16           Pearson chi2:       0.929
No. Iterations:   5
Covariance Type:  nonrobust

                  coef   std err         z     P>|z|    [0.025    0.975]
Intercept      -2.1741     0.396    -5.494     0.000    -2.950    -1.399
Age[T.[65+]]   -0.7301     0.786    -0.929     0.353    -2.270     0.810
Age[T.[<30]]    0.1541     0.432     0.357     0.721    -0.693     1.001
Gender[T.m]     0.8042     0.439     1.831     0.067    -0.057     1.665
3. Expand the Aggregate Data
Lastly, we can "ungroup" our data and turn the dependent variable back into a binary column, so that we can perform a Logistic Regression as usual.
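# Turn each count into a list of that many 0s (No) or 1s (Yes).
grouped['No'] = grouped['No'].apply(lambda x: [0] * x)
grouped['Yes'] = grouped['Yes'].apply(lambda x: [1] * x)
# Concatenate the two lists per group, then create one row per list element.
grouped['Response'] = grouped['Yes'] + grouped['No']
expanded = grouped.explode("Response")[['Gender', 'Age', 'Response']]
expanded['Response'] = expanded['Response'].astype(int)
expanded.head()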
  Gender      Age  Response
0      f  [30-65]         1
0      f  [30-65]         1
0      f  [30-65]         1
0      f  [30-65]         1
0      f  [30-65]         1
model = smf.logit('Response ~ Gender + Age', data=expanded)
result = model.fit()
print(result.summary())
Logit Regression Results
Dep. Variable:    Response           No. Observations:   200
Model:            Logit              Df Residuals:       196
Method:           MLE                Df Model:           3
Date:             Mon, 22 Feb 2021   Pseudo R-squ.:      0.02765
Time:             18:29:33           Log-Likelihood:     -85.502
converged:        True               LL-Null:            -87.934
Covariance Type:  nonrobust          LLR p-value:        0.1821

                  coef   std err         z     P>|z|    [0.025    0.975]
Intercept      -2.1741     0.396    -5.494     0.000    -2.950    -1.399
Gender[T.m]     0.8042     0.439     1.831     0.067    -0.057     1.665
Age[T.[65+]]   -0.7301     0.786    -0.929     0.353    -2.270     0.810
Age[T.[<30]]    0.1541     0.432     0.357     0.721    -0.693     1.001
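An alternative way to expand the data, which leaves the Yes and No columns untouched, is to repeat each row as many times as its count. The sketch below assumes a grouped table whose Yes and No columns still hold integer counts (that is, before they are replaced by lists as in the code above); expand_counts is a hypothetical helper name. It also aggregates the expanded rows again to confirm that nothing was lost.

```python
import pandas as pd

def expand_counts(grouped):
    """Return one row per impression with a 0/1 Response column."""
    keys = ["Gender", "Age"]
    # Repeat each group's row once per responder / non-responder.
    yes = grouped.loc[grouped.index.repeat(grouped["Yes"].to_numpy()), keys].assign(Response=1)
    no = grouped.loc[grouped.index.repeat(grouped["No"].to_numpy()), keys].assign(Response=0)
    return pd.concat([yes, no], ignore_index=True)

# Cell counts copied from the grouped table in method 1.
counts = pd.DataFrame({
    "Gender": ["f", "f", "f", "m", "m", "m"],
    "Age": ["[30-65]", "[65+]", "[<30]", "[30-65]", "[65+]", "[<30]"],
    "Yes": [9, 2, 8, 17, 2, 9],
    "No": [29, 5, 17, 62, 10, 30],
})

expanded_alt = expand_counts(counts)

# Round trip: aggregating the expanded rows should give back the counts.
round_trip = expanded_alt.groupby(["Gender", "Age"])["Response"].agg(["sum", "count"])
print(expanded_alt.shape)
print(round_trip)
```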