Predictive Hacks

How to run Chi-Square Test in Python


We will provide a practical example of how to run a Chi-Square Test in Python. Assume that we want to test whether there is a statistically significant difference in the gender (M, F) distribution between Smokers and Non-Smokers. Let’s generate some sample data to work with.

Sample Data

import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
 
df = pd.DataFrame({'Gender' : ['M', 'M', 'M', 'F', 'F'] * 10,
                   'isSmoker' : ['Smoker', 'Smoker', 'Non-Smoker', 'Non-Smoker', 'Smoker'] * 10
                  })
df.head()
 
	Gender	isSmoker
0	M	Smoker
1	M	Smoker
2	M	Non-Smoker
3	F	Non-Smoker
4	F	Smoker
 

Contingency Table

To run the Chi-Square Test, the easiest way is to convert the data into a contingency table of frequencies. We will use the crosstab function from pandas.

contigency= pd.crosstab(df['Gender'], df['isSmoker'])
contigency
 
isSmoker	Non-Smoker	Smoker
Gender
F	10	10
M	10	20

Let’s say that we want to get the percentages by Gender (row):

contigency_pct = pd.crosstab(df['Gender'], df['isSmoker'], normalize='index')
contigency_pct
 
isSmoker	Non-Smoker	Smoker
Gender
F	0.500000	0.500000
M	0.333333	0.666667

If we want the percentages by column, we should write normalize='columns', and if we want each cell as a percentage of the grand total, we should write normalize='all'.
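As a quick illustration of the other two normalization options, here is a minimal sketch that rebuilds the same sample data and normalizes by column and by grand total. The variable names by_column and overall are only illustrative.

```python
import pandas as pd

# Same sample data as above (illustrative rebuild so the snippet runs on its own).
df = pd.DataFrame({
    'Gender': ['M', 'M', 'M', 'F', 'F'] * 10,
    'isSmoker': ['Smoker', 'Smoker', 'Non-Smoker', 'Non-Smoker', 'Smoker'] * 10,
})

# Share of each gender within every smoking group (each column sums to 1).
by_column = pd.crosstab(df['Gender'], df['isSmoker'], normalize='columns')

# Share of each cell in the grand total (all cells together sum to 1).
overall = pd.crosstab(df['Gender'], df['isSmoker'], normalize='all')

print(by_column)
print(overall)
```

Note that pandas expects the plural 'columns' here; the singular form is not a valid value for normalize.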


Heatmaps

An easy way to visualize a contingency table is a heatmap.

plt.figure(figsize=(12,8))
sns.heatmap(contigency, annot=True, cmap="YlGnBu")
 

Chi-Square Test

Now that we have built the contingency table, we can pass it to the chi2_contingency function from the scipy package, which returns:

  • chi2: The test statistic
  • p: The p-value of the test
  • dof: Degrees of freedom
  • expected: The expected frequencies, based on the marginal sums of the table

# Chi-square test of independence.
c, p, dof, expected = chi2_contingency(contigency)
p
 
0.3767591178115821
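To see where this p-value comes from, the sketch below recomputes the test by hand from the observed counts and compares it with scipy. It relies on the fact that, for a 2×2 table, chi2_contingency applies Yates' continuity correction by default (correction=True). The names with a _manual or _raw suffix are illustrative, and the manual correction is written for this table, where every |observed - expected| is larger than 0.5.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from the contingency table above
# (rows: F, M; columns: non-smokers, smokers).
observed = np.array([[10, 10],
                     [10, 20]])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Yates' continuity correction: shrink each |observed - expected| by 0.5.
diff = np.abs(observed - expected_manual) - 0.5
stat_manual = (diff ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(stat_manual, dof_manual)

# Compare with scipy, with the default correction and without it.
stat, p, dof, expected = chi2_contingency(observed)
stat_raw, p_raw, _, _ = chi2_contingency(observed, correction=False)

print(expected_manual)        # expected counts: [[8, 12], [12, 18]]
print(stat_manual, p_manual)  # manual result, should match the default scipy call
print(stat, p)                # default scipy call (with correction)
print(stat_raw, p_raw)        # uncorrected Pearson chi-square
```

The corrected statistic is the more conservative of the two, so the uncorrected call yields a smaller p-value on the same table.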

Inference

The p-value is 37.68%, which means that we do not reject the null hypothesis at the 5% significance level (95% confidence level). The null hypothesis was that smoking status and Gender are independent. In this example, the contingency table was 2×2, so we could have applied a z-test for proportions instead of the Chi-Square test. Notice that the Chi-Square test can be extended to m × n contingency tables.
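For the 2×2 case, the z-test alternative can be sketched as follows. This is a hand-rolled two-sided, pooled two-proportion z-test written with numpy and scipy.stats.norm, not a call to a dedicated library function, and the variable names are illustrative. Its squared statistic equals the Chi-Square statistic computed without the continuity correction, so the two approaches give the same p-value only when correction=False is passed.

```python
import numpy as np
from scipy.stats import norm, chi2_contingency

# Smokers and group sizes per gender, taken from the contingency table (order: F, M).
smokers = np.array([10, 20])
totals = np.array([20, 30])

# Sample proportions of smokers in each group.
p_f, p_m = smokers / totals

# Pooled proportion under the null hypothesis of equal proportions.
pooled = smokers.sum() / totals.sum()

# Standard error of the difference under the null, then the z statistic.
se = np.sqrt(pooled * (1 - pooled) * (1 / totals[0] + 1 / totals[1]))
z = (p_f - p_m) / se

# Two-sided p-value from the standard normal distribution.
p_value_z = 2 * norm.sf(abs(z))

# Chi-Square test without Yates' correction for comparison.
stat_raw, p_raw, _, _ = chi2_contingency([[10, 10], [10, 20]], correction=False)

print(z, p_value_z)
print(stat_raw, p_raw)
```

Either way, the conclusion for this sample is unchanged: the difference in smoking rates between genders is not statistically significant at the 5% level.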
