Bootstrapping is a method that estimates population characteristics by repeatedly resampling, with replacement, from a representative sample. In this post, we will use the bootstrap in a real-world scenario to estimate a confidence interval for the population mean.
For this example, we are using the hotel-reviews dataset from kaggle.com.
Let’s import the libraries, load the data and draw a sample of 500 rows to use as our representative sample of the “population”.
import pandas as pd
import numpy as np

# hotel reviews
df = pd.read_csv('archive/7282_1.csv')

# draw a sample of 500 rows (fixed seed for reproducibility)
s = df.sample(500, random_state=7)
s.head()
The Scenario
Let’s say that we are given this sample of 500 rows and we want to estimate the 95% confidence interval of the mean. We can start by computing the summary statistics and plotting the histogram of the ratings.
s['reviews.rating'].hist()
s['reviews.rating'].describe()
The histogram of the ratings does not follow a recognizable form such as a normal distribution. The sample mean is 3.67, but the question is what we can say about the true population mean. We cannot build a confidence interval by simulating the sampling distribution, because we don’t know how to describe the distribution the ratings come from.
Using bootstrap, we will estimate the uncertainty of the mean by generating samples from our data and then characterizing the distribution of the mean over these samples.
We will sample our data “with replacement”. That means each rating we draw is put back, so the same rating can be drawn again.
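To make “with replacement” concrete, here is a minimal sketch on a made-up list of five ratings (invented for illustration, not taken from the dataset). Because every draw is put back, a resample can contain some ratings more than once while missing others entirely.

```python
import numpy as np

# Toy ratings, invented for illustration only (not from the hotel-reviews data)
ratings = np.array([1, 2, 3, 4, 5])

rng = np.random.default_rng(7)  # seeded so the draw is repeatable

# One bootstrap resample: same size as the data, drawn with replacement
resample = rng.choice(ratings, size=len(ratings), replace=True)

print(resample)         # may contain repeated values
print(resample.mean())  # the statistic we would record for this resample
```

Repeating this draw many times and recording the mean each time is all the bootstrap does.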
# bootstrapping: mean rating of 1000 resamples of size 500, drawn with replacement
bootstrap = pd.DataFrame({
    'mean_rating': [
        s.sample(500, replace=True)['reviews.rating'].mean()
        for _ in range(1000)
    ]
})
bootstrap
We created a DataFrame holding the mean ratings of 1000 resamples. Plotting the histogram of the bootstrapped means, we can see that it is approximately normal (in line with the central limit theorem).
bootstrap['mean_rating'].hist()
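As a rough sanity check on the central limit theorem remark, the spread of the bootstrapped means should be close to the textbook standard error of the mean, s / sqrt(n). The sketch below illustrates this on synthetic ratings; the rating probabilities are made up and only stand in for s['reviews.rating'], so the numbers will not match the post.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic 1-5 ratings with made-up probabilities (an assumption, not the real data)
sample = rng.choice([1, 2, 3, 4, 5], size=500, p=[0.1, 0.1, 0.15, 0.3, 0.35])

# 1000 bootstrap resamples at once: each row of idx picks one resample of size 500
idx = rng.integers(0, len(sample), size=(1000, len(sample)))
boot_means = sample[idx].mean(axis=1)

boot_se = boot_means.std(ddof=1)                         # spread of the bootstrapped means
formula_se = sample.std(ddof=1) / np.sqrt(len(sample))   # s / sqrt(n)

print(boot_se, formula_se)  # the two values should be close
```

If the two numbers were far apart, that would be a hint that something is off in the resampling code.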
Now, we can extract the quantiles:
(bootstrap['mean_rating'].quantile(0.025),bootstrap['mean_rating'].quantile(0.975))
(3.554730600528383, 3.799829994960948)
The bootstrap-approximated 95% confidence interval of the mean rating is between 3.55 and 3.80. This means that we are 95% confident that the population mean lies between 3.55 and 3.80.
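The resample-then-take-quantiles steps can be wrapped into one reusable function. The following is a minimal sketch using NumPy rather than the pandas approach used in this post; bootstrap_ci and its parameters are hypothetical names chosen for illustration, and the demo ratings are made up.

```python
import numpy as np

def bootstrap_ci(values, n_resamples=1000, level=0.95, seed=None):
    """Percentile bootstrap confidence interval for the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # drop missing ratings, as pandas' mean() does by default
    rng = np.random.default_rng(seed)
    # Each row of idx selects one resample of the same size, with replacement
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    alpha = 1 - level
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Usage on made-up ratings (500 values, not the real dataset)
demo = np.array([5, 4, 4, 3, 5, 2, 4, 5, 1, 4] * 50)
low, high = bootstrap_ci(demo, seed=7)
print(low, high)
```

With the sample from this post, the call would look like bootstrap_ci(s['reviews.rating'], seed=7). The bounds would differ slightly from the ones above because the random draws are different.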
Now, since we have the original data, we can check whether the confidence interval contains the mean of the full dataset.
df['reviews.rating'].mean()
3.7764308131241124
Indeed, the mean of the original data is in the confidence interval that bootstrap approximated.
Summing it up
Bootstrap is a method to estimate population characteristics from a sample. It is easy and straightforward, and in Python it can be applied using only pandas DataFrames. While bootstrapping can be very useful, you should be careful: the sample you use needs to be representative in order to capture the population characteristics adequately.