Statistics is often underestimated, yet from my own experience I can say that it’s the most important part of Data Science. One of its most useful areas is statistical tests and hypothesis testing. We need to know which statistical test to use, and when, so that we can draw the appropriate conclusions about a hypothesis.

There are 3 main steps in hypothesis testing:

- State your null (H₀) and alternate (H₁) hypotheses.
- Perform an appropriate statistical test.
- Decide whether the null hypothesis is rejected or not.

To decide whether the null hypothesis is rejected, we set a threshold for the p-value called the alpha value (the significance level). Setting alpha to 0.05 means that we accept a 5% chance of rejecting a null hypothesis that is actually true, which corresponds to a 95% confidence level. For the following statistical tests, the alpha is set to **0.05**.
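To make the meaning of alpha concrete, here is a minimal simulation sketch using only Python's standard library. It assumes a fair coin (so the null hypothesis is true by construction) and a z-test based on the normal approximation; the function names are my own and the number of experiments is arbitrary. Roughly 5% of the experiments should end up rejecting the true null.

```python
import math
import random


def two_sided_p_value(successes, n, p0=0.5):
    """Two-sided p-value of a one-proportion z-test (normal approximation)."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)  # variance under H0
    z = (p_hat - p0) / standard_error
    # erfc(|z| / sqrt(2)) equals P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(alpha=0.05, n=1000, experiments=2000, seed=0):
    """Share of experiments that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(experiments):
        # H0 is true by construction: every draw is a fair coin flip.
        successes = sum(rng.random() < 0.5 for _ in range(n))
        if two_sided_p_value(successes, n) < alpha:
            rejections += 1
    return rejections / experiments


rate = false_positive_rate()
print("share of true nulls rejected:", rate)
```

The printed share will not be exactly 0.05, since it is itself an estimate from a finite number of simulated experiments.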

A dataset mainly contains two types of variables: **Categorical Variables**, like Gender, and **Numeric** variables, like weight and height.

Let’s create a dataset with two categorical and two numeric variables.

```python
import random

import numpy as np
import pandas as pd

# Two categorical variables, sampled with the given weights.
df = pd.DataFrame({
    'Gender': random.choices(["M", 'F'], weights=(0.4, 0.6), k=1000),
    'Age_Group': random.choices(["18-35", '35-45', '45-80'], weights=(0.2, 0.5, 0.3), k=1000),
})

# Two numeric variables whose means depend on gender.
df['Weight'] = np.where(df['Gender'] == "F",
                        np.random.normal(loc=55, scale=5, size=1000),
                        np.random.normal(loc=70, scale=5, size=1000))
df['Height'] = np.where(df['Gender'] == "F",
                        np.random.normal(loc=160, scale=5, size=1000),
                        np.random.normal(loc=172, scale=5, size=1000))

# Truncate the measurements to whole numbers.
df['Weight'] = df['Weight'].astype(int)
df['Height'] = df['Height'].astype(int)

df.head()
```

## Test About One Categorical Variable

Sample Question: Is there a difference in the number of men and women in the population?

To check whether the values of a single categorical variable occur in different proportions, we will use a **one proportion Z test.** Let’s state the hypothesis:

- H₀: there is no difference between the number of men and women
- H₁: there is a difference between the number of men and women

We need to clarify that this is a two-sided test because we are checking whether the proportion of men **Pm** is different from the proportion of women **Pw**. If we wanted to check whether **Pm>Pw** or **Pm<Pw**, we would have a one-tailed test.

```python
from statsmodels.stats.proportion import proportions_ztest

count = 592  # number of females
nobs = 1000  # number of rows (trials)
value = 0.5  # the value under the null hypothesis: proportion of men = proportion of women = 0.5

# We are using alternative='two-sided' because we are checking Pm≠Pw.
# For Pw>Pm we have to set it to "larger" and for Pw<Pm to "smaller".
stat, pval = proportions_ztest(count, nobs, value, alternative='two-sided')
print("p_value:", round(pval, 3))
```

`p_value: 0.0`

The p-value is less than 0.05; hence, we reject the null hypothesis at a 95% level of confidence. That means that there is a difference in the number of men and women in the population.
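To see what happens to those numbers inside the test, here is a minimal sketch that recomputes it with Python's standard library only. It assumes the normal approximation and estimates the variance from the sample proportion, which, as far as I know, is also what statsmodels does by default. The helper name `one_proportion_ztest` is my own and not part of any library.

```python
import math


def one_proportion_ztest(count, nobs, value=0.5):
    """Two-sided one-proportion z-test via the normal approximation.

    Hypothetical helper for illustration; the variance is estimated from
    the sample proportion (an assumption, see the text above).
    """
    p_hat = count / nobs
    standard_error = math.sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - value) / standard_error
    # Two-sided p-value: P(|Z| >= |z|) for a standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = one_proportion_ztest(592, 1000)
print("z statistic:", round(z, 2))
print("p_value:", p_value)
```

The unrounded p-value is tiny but not exactly zero; a printed `0.0` is only the effect of rounding to three decimals.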

## Test About Two Categorical Variables

Sample Question: Does the proportion of males and females differ across age groups?

If we want to check the independence of two categorical variables, we will use the **Chi-Squared test.**

Let’s state the hypothesis:

- H₀: Gender and Age Groups are Independent
- H₁: Gender and Age Groups are Dependent

```python
from scipy.stats import chi2_contingency

# The easiest way to apply a chi-squared test is to compute the contingency table.
contingency = pd.crosstab(df['Gender'], df['Age_Group'])
contingency

# Chi-square test of independence.
c, p, dof, expected = chi2_contingency(contingency)
print("p_value:", round(p, 3))
```

`p_value: 0.579`

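The p-value is not less than 0.05; hence, we fail to reject the null hypothesis at a 95% level of confidence. That means that we have no evidence that Gender and Age Groups are dependent, which is what we would expect because the two columns were generated independently of each other.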
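For intuition about what the Chi-Squared test measures, here is a small hand computation in plain Python. The counts are made up for illustration and are not the contingency table from the dataset above. The sketch compares the observed counts with the counts we would expect if the two variables were independent, and it relies on the fact that for exactly 2 degrees of freedom (a 2x3 table) the p-value has a simple closed form.

```python
import math

# Hypothetical Gender x Age_Group counts (rows: F, M), made up for illustration.
observed = [
    [120, 300, 172],
    [85, 195, 128],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# For exactly 2 degrees of freedom the chi-squared survival function
# is exp(-x / 2); other table shapes need a statistics library.
if dof != 2:
    raise ValueError("the closed-form p-value below only holds for dof == 2")
p_value = math.exp(-chi2 / 2)

print("chi2:", round(chi2, 3), "dof:", dof, "p_value:", round(p_value, 3))
```

The closer the observed counts are to the expected ones, the smaller the statistic and the larger the p-value.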

## Test About One Categorical and One Numeric Variable

Sample Question: Is there a difference in height between men and women?

In this situation, we will use a T-Test (Student's T-Test), which compares the means of two independent groups.

- H₀: There is no difference
- H₁: There is a difference

```python
from scipy.stats import ttest_ind

# This is a two-sided test.
# For a one-sided test you can divide the two-sided p-value by two, provided
# the t statistic points in the direction of the alternative hypothesis.
t_stat, p = ttest_ind(df.query('Gender=="M"')['Height'], df.query('Gender=="F"')['Height'])
print("p_value:", round(p, 3))
```

`p_value: 0.0`

The p-value is less than 0.05; hence, we reject the null hypothesis at a 95% level of confidence. That means that there is a difference in height between men and women.
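If you are curious where the t statistic comes from, the sketch below computes the pooled (equal-variance) version by hand with the standard library. The heights are freshly simulated with assumed group sizes of 400 and 600, so the numbers will not match the output above. The p-value uses a normal approximation, which is only reasonable here because the sample is large; the exact value comes from the t distribution.

```python
import math
import random
import statistics

# Simulated heights, loosely mirroring the dataset above (assumed group sizes).
rng = random.Random(42)
men = [rng.gauss(172, 5) for _ in range(400)]
women = [rng.gauss(160, 5) for _ in range(600)]

n1, n2 = len(men), len(women)
mean_diff = statistics.mean(men) - statistics.mean(women)

# Pooled variance: Student's t-test assumes both groups share one variance.
dof = n1 + n2 - 2
pooled_var = ((n1 - 1) * statistics.variance(men)
              + (n2 - 1) * statistics.variance(women)) / dof

t_stat = mean_diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# With about 1000 degrees of freedom the t distribution is very close to the
# standard normal, so this p-value is an approximation, not the exact one.
approx_p = math.erfc(abs(t_stat) / math.sqrt(2))

print("t statistic:", round(t_stat, 2), "dof:", dof)
print("approximate p_value:", approx_p)
```

A difference of about 12 cm with a standard deviation of 5 cm is huge relative to the standard error, which is why the statistic is so large and the p-value so small.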

## Test About One Categorical Variable With More Than Two Unique Values and One Numeric Variable

Sample Question: Is there a difference in height between age groups?

Now, we will use the ANOVA (Analysis Of Variance) test.

- H₀: Group means of height are equal
- H₁: At least one group mean of height is different from the other groups

```python
import scipy.stats as stats

# stats.f_oneway takes the groups as input and returns the ANOVA F statistic and p-value.
fvalue, pvalue = stats.f_oneway(
    df.query('Age_Group=="18-35"')['Height'],
    df.query('Age_Group=="35-45"')['Height'],
    df.query('Age_Group=="45-80"')['Height'],
)
print("p_value:", round(pvalue, 3))
```

`p_value: 0.141`

The p-value is not less than 0.05; hence, we fail to reject the null hypothesis at a 95% level of confidence. That means that we have no evidence of a difference in mean height between age groups.
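The F statistic behind ANOVA is the ratio of the variation between group means to the variation within groups. Here is a tiny hand computation with made-up samples (three values per group, chosen so the arithmetic is easy to follow). It only computes the F statistic; turning it into a p-value needs the F distribution, which is what `f_oneway` takes care of.

```python
import statistics

# Tiny made-up height samples, one list per age group (illustration only).
groups = [
    [160, 162, 164],
    [163, 165, 167],
    [166, 168, 170],
]

all_values = [value for group in groups for value in group]
grand_mean = statistics.mean(all_values)

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the values around their own group mean.
ss_within = sum((value - statistics.mean(g)) ** 2 for g in groups for value in g)

dof_between = len(groups) - 1
dof_within = len(all_values) - len(groups)

f_value = (ss_between / dof_between) / (ss_within / dof_within)
print("F statistic:", f_value)
```

A large F means the group means differ by more than the spread inside the groups would explain.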

## Test About Two Numeric Variables

Sample Question: Is there a relationship between height and weight?

- H₀: There is no relationship between height and weight
- H₁: There is a relationship between height and weight

We will use a correlation test. A correlation test gives us two things: a correlation coefficient and a p-value. As you may already know, the correlation coefficient is a number between -1 and 1 that shows how strongly the two variables are linearly related. For its p-value, we apply the same principle as before: if the p-value is less than 0.05, we reject the null hypothesis.

```python
import scipy.stats as stats

# For this example we will use the Pearson correlation.
pearson_coef, p_value = stats.pearsonr(df["Weight"], df["Height"])
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", round(p_value, 3))
```

`Pearson Correlation Coefficient: 0.6213650837211053 and a P-value of: 0.0`

As we can see, the p-value is less than 0.05; hence, we reject the null hypothesis at a 95% level of confidence. That means that there is a relationship between height and weight.
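The Pearson coefficient itself is easy to compute by hand, which helps to demystify it. The sketch below uses five made-up height and weight pairs (not rows from the dataset above) and also derives the t statistic on which the p-value of the correlation test is based. The p-value is left out because the exact one needs the t distribution.

```python
import math

# Made-up paired measurements for illustration only.
heights = [160, 165, 170, 175, 180]
weights = [56, 60, 62, 60, 62]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Sums of cross-products and squared deviations from the means.
s_hw = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
s_hh = sum((h - mean_h) ** 2 for h in heights)
s_ww = sum((w - mean_w) ** 2 for w in weights)

r = s_hw / math.sqrt(s_hh * s_ww)

# Under H0 (no linear relationship) this statistic follows a
# t distribution with n - 2 degrees of freedom.
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

print("r:", round(r, 4), "t statistic:", round(t_stat, 4))
```

Note how the same coefficient becomes more significant as the sample grows, because the t statistic scales with the square root of `n - 2`.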

## Summing It Up

This was an introduction to statistical tests and hypothesis testing. We got a basic understanding of when to apply the Z-test, T-test, Chi-Squared test, ANOVA, and correlation test, based on the variable types and some common questions. You can use this post as a statistical test cheat sheet, but I encourage you to read more about these tests because, as I said before, statistics is the most important part of Data Science.
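If you prefer the cheat sheet in code form, the mapping used in this post can be sketched as a small lookup function. The function `suggest_test` and its arguments are hypothetical names of my own; it only encodes the cases covered above and is no substitute for checking each test's assumptions.

```python
def suggest_test(first, second=None, n_categories=2):
    """Suggest a test from this post for the given variable types.

    `first` and `second` are "categorical" or "numeric"; `n_categories` is
    the number of unique values of the categorical variable when it is
    paired with a numeric one. Hypothetical helper, for illustration only.
    """
    kinds = sorted(kind for kind in (first, second) if kind is not None)
    if kinds == ["categorical"]:
        return "one proportion Z test"
    if kinds == ["categorical", "categorical"]:
        return "Chi-Squared test"
    if kinds == ["categorical", "numeric"]:
        return "T-Test" if n_categories == 2 else "ANOVA"
    if kinds == ["numeric", "numeric"]:
        return "correlation test"
    raise ValueError(f"no test in this post covers: {kinds}")


print(suggest_test("categorical"))
print(suggest_test("categorical", "numeric", n_categories=3))
print(suggest_test("numeric", "numeric"))
```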