
A Complete Guide on How to Choose and Apply the Right Statistical Test in Python


We often underestimate statistics and its importance, but from my own experience I can say with certainty that it is the most important part of Data Science. One of the most useful areas of statistics in Data Science is Statistical Tests and Hypothesis Testing. We need to know which Statistical Test to use and when, so that we can draw the appropriate conclusions for a hypothesis.

There are 3 main steps in hypothesis testing:

  1. State your null (Ho) and alternative (H1) hypothesis.
  2. Perform an appropriate statistical test.
  3. Decide whether the null hypothesis is rejected or not.

To decide whether the null hypothesis is rejected or not, we need to set a threshold for the p-value, called the alpha value (significance level). Setting alpha to 0.05 means we reject the null hypothesis whenever the p-value is below 0.05, i.e. we accept a 5% risk of rejecting a null hypothesis that is actually true; this is commonly described as a 95% level of confidence. For the following statistical tests, alpha is set to 0.05.
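
As a minimal illustration of this decision rule (using a hypothetical p-value, not one produced by the tests below), the logic we apply after every test looks like this:

alpha = 0.05      # significance level
p_value = 0.03    # hypothetical p-value returned by a statistical test

if p_value < alpha:
    print("Reject the null hypothesis (Ho)")
else:
    print("Fail to reject the null hypothesis (Ho)")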

A dataset mainly contains two types of variables: categorical variables, like Gender, and numeric variables, like Weight and Height.

Let’s create a dataset with two categorical and two numeric variables.

import pandas as pd
import numpy as np
import random


# Two categorical variables: Gender and Age_Group, sampled with fixed weights
df = pd.DataFrame({'Gender': random.choices(["M", 'F'], weights=(0.4, 0.6), k=1000),
                   'Age_Group': random.choices(["18-35", '35-45', '45-80'], weights=(0.2, 0.5, 0.3), k=1000)})

# Two numeric variables: Weight and Height, drawn from different normal distributions per gender
df['Weight'] = np.where(df['Gender'] == "F",
                        np.random.normal(loc=55, scale=5, size=1000),
                        np.random.normal(loc=70, scale=5, size=1000))
df['Height'] = np.where(df['Gender'] == "F",
                        np.random.normal(loc=160, scale=5, size=1000),
                        np.random.normal(loc=172, scale=5, size=1000))
df['Weight'] = df['Weight'].astype(int)
df['Height'] = df['Height'].astype(int)

df.head()

Test About One Categorical Variable

Sample Question: Is there a difference in the number of men and women in the population?

For a single categorical variable, when we want to check whether there is a difference between the counts of its values, we use a one-proportion Z-test. Let’s state the hypotheses:

  • Ho: there is no difference between the number of men and women
  • H1: there is a difference between the number of men and women

We need to clarify that this is a two-sided test because we are checking whether the proportion of men Pm is different from the proportion of women Pw. If we wanted to check whether Pm>Pw or Pm<Pw, we would have a one-tailed test.

from statsmodels.stats.proportion import proportions_ztest

count = 592   # number of females
nobs = 1000   # number of rows | or trials
value = 0.5   # the value under the null hypothesis: proportion of men = proportion of women = 0.5

# we are using alternative='two-sided' because we are checking Pm ≠ Pw.
# for Pw > Pm we have to set it to "larger" and for Pw < Pm to "smaller"

stat, pval = proportions_ztest(count, nobs, value, alternative='two-sided')

print("p_value: ",round(pval,3))
p_value:  0.0

The p-value is less than 0.05; hence, we reject the null hypothesis at a 95% level of confidence. That means that there is a difference in the number of men and women in the population.
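
As a side note, instead of hard-coding the number of females (592), the count could be derived from the dataframe itself; here is a minimal sketch (the exact p-value will vary with the randomly generated sample):

count = (df['Gender'] == "F").sum()   # number of females in the generated sample
nobs = len(df)                        # number of rows | or trials

stat, pval = proportions_ztest(count, nobs, value=0.5, alternative='two-sided')
print("p_value: ", round(pval, 3))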

Test About Two Categorical Variables

Sample Question: Does the proportion of males and females differ across age groups?

If we want to check the independence of two categorical values, we will use the Chi-Squared test.

Let’s state the hypothesis:

  • Ho: Gender and Age Groups are Independent
  • H1: Gender and Age Groups are Dependent

from scipy.stats import chi2_contingency

# The easiest way to apply a chi-squared test is to compute the contingency table.
contingency = pd.crosstab(df['Gender'], df['Age_Group'])
contingency

# Chi-square test of independence
c, p, dof, expected = chi2_contingency(contingency)

print("p_value: ",round(p,3))
p_value:  0.579

The p-value is not less than 0.05; hence, we fail to reject the null hypothesis at a 95% level of confidence. That means we cannot conclude that Gender and Age Group are dependent; the data are consistent with them being independent.
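
As an optional aside, chi2_contingency also returns the counts expected under independence, and comparing them with the observed table can help interpret the result; a minimal sketch:

# observed counts vs counts expected if Gender and Age_Group were independent
expected_df = pd.DataFrame(expected, index=contingency.index, columns=contingency.columns)
print(contingency)
print(expected_df.round(1))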

Test About One Categorical and One Numeric Variable

Sample Question: Is there a difference in height between men and women?

In this situation, we will use a t-test (Student’s t-test).

  • Ho: There is no difference in mean height between men and women
  • H1: There is a difference in mean height between men and women

from scipy.stats import ttest_ind

# this is a two-sided test
# you can divide the two-sided p-value by two to get the one-sided one

t_stat, p = ttest_ind(df.query('Gender=="M"')['Height'], df.query('Gender=="F"')['Height'])

print("p_value: ",round(p,3))
p_value:  0.0

The p-value is less than 0.05; hence, we reject the null hypothesis at a 95% level of confidence. That means that there is a difference in height between men and women.
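
Note that ttest_ind assumes equal variances in the two groups by default. As a hedged aside (not part of the original example), Welch’s t-test, which drops that assumption, can be obtained with the equal_var argument:

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p = ttest_ind(df.query('Gender=="M"')['Height'],
                      df.query('Gender=="F"')['Height'],
                      equal_var=False)
print("p_value: ", round(p, 3))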

Test About One Categorical Variable with More Than Two Unique Values and One Numeric Variable

Sample Question: Is there a difference in height between age groups?

Now, we will use the ANOVA (Analysis Of Variance) test.

  • Ho: The group means of height are equal
  • H1: At least one group mean of height differs from the others

import scipy.stats as stats

# stats.f_oneway takes the groups as input and returns the ANOVA F statistic and p-value
fvalue, pvalue = stats.f_oneway(df.query('Age_Group=="18-35"')['Height'],
                                df.query('Age_Group=="35-45"')['Height'],
                                df.query('Age_Group=="45-80"')['Height'])

print("p_value: ",round(pvalue,3))
p_value:  0.141

The p-value is not less than 0.05; hence, we fail to reject the null hypothesis at a 95% level of confidence. That means we cannot conclude that mean height differs between the age groups.
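
If the ANOVA had rejected the null hypothesis, a common follow-up (not covered in the original example) is a post-hoc comparison to see which specific groups differ, for instance Tukey’s HSD from statsmodels; a minimal sketch:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# pairwise comparisons of mean height across the three age groups
tukey = pairwise_tukeyhsd(endog=df['Height'], groups=df['Age_Group'], alpha=0.05)
print(tukey)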

Test About Two Numeric Variables

Sample Question: Is there a relationship between height and weight?

  • Ho: There is no relationship between height and weight
  • H1: There is a relationship between height and weight

We will use a correlation test. A correlation test gives us two things: a correlation coefficient and a p-value. As you may already know, the correlation coefficient is a number that shows how strongly correlated the two variables are. For its p-value, we apply the same principle as before: if it is less than 0.05, we reject the null hypothesis.

import scipy.stats as stats

#for this example we will use the Pearson Correlation.
pearson_coef, p_value = stats.pearsonr(df["Weight"], df["Height"])

print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", round(p_value,3) )
Pearson Correlation Coefficient:  0.6213650837211053 and a P-value of: 0.0

As we can see, the p-value is less than 0.05; hence, we reject the null hypothesis at a 95% level of confidence. That means that there is a relationship between height and weight.
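
Pearson’s correlation measures the strength of a linear relationship. As a hedged aside (not used above), if the relationship were monotonic but not linear, Spearman’s rank correlation would be an alternative:

# Spearman's rank correlation: a non-parametric alternative to Pearson
spearman_coef, spearman_p = stats.spearmanr(df["Weight"], df["Height"])
print("Spearman Correlation Coefficient: ", spearman_coef, "and a P-value of:", round(spearman_p, 3))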

Summing It Up

This was an introduction to Statistical Tests and Hypothesis Testing. We got a basic understanding of when to apply the Z-test, t-test, Chi-Squared test, ANOVA, and correlation test, based on the variable types and some common questions. You can use this post as a statistical test cheat sheet, but I encourage you to read more about these tests because, as I said before, statistics is the most important part of Data Science.
