Predictive Hacks

How to run Chi-Square Test in Python

chi-square test

We will provide a practical example of how we can run a Chi-Square Test in Python. Assume that we want to test if there is a statistically significant difference in Genders (M, F) population between Smokers and Non-Smokers. Let’s generate some sample data to work on it.

Sample Data

import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
 
df = pd.DataFrame({'Gender' : ['M', 'M', 'M', 'F', 'F'] * 10,
                   'isSmoker' : ['Smoker', 'Smoker', 'Non-Smpoker', 'Non-Smpoker', 'Smoker'] * 10
                  })
df.head()
 
	Gender	isSmoker
0	M	Smoker
1	M	Smoker
2	M	Non-Smpoker
3	F	Non-Smpoker
4	F	Smoker
 

Contingency Table

To run the Chi-Square Test, the easiest way is to convert the data into a contingency table with frequencies. We will use the crosstab command from pandas.

contigency= pd.crosstab(df['Gender'], df['isSmoker'])
contigency
 
isSmokerNon-SmpokerSmoker
Gender
F1010
M1020

Let’s say that we want to get the percentages by Gender (row)

contigency_pct = pd.crosstab(df['Gender'], df['isSmoker'], normalize='index')
contigency_pct
 
isSmokerNon-SmpokerSmoker
Gender
F0.5000000.500000
M0.3333330.666667

If we want the percentages by column, then we should write normalize=’column’ and if we want the total percentage then we should write normalize=’all’


Heatmaps

An easy way to see visually the contingency tables are the heatmaps.

plt.figure(figsize=(12,8))
sns.heatmap(contigency, annot=True, cmap="YlGnBu")
 
chi-square

Chi-Square Test

Now that we have built the contingency table we can pass it to chi2_contingency function from the scipy package which returns the:

  • chi2: The test statistic
  • p: The p-value of the test
  • dof: Degrees of freedom
  • expected: The expected frequencies, based on the marginal sums of the table

# Chi-square test of independence.
c, p, dof, expected = chi2_contingency(contigency)
p
 
0.3767591178115821

Inference

The p-value is 37.67% which means that we do not reject the null hypothesis at 95% level of confidence. The null hypothesis was that Smokers and Gender are independent. In this example, the contingency table was 2×2. We could have applied z-test for proportions instead of Chi-Square test. Notice that the Chi-Square test can be extended to m x n contingency tables.

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

Leave a Comment

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

Python

Get Started with Hugging Face Auto Train

Hugging Face has launched the auto train, which is a new way to automatically train, evaluate and deploy state-of-the-art Machine