Predictive Hacks

Bootstrap Sampling using Python

Bootstrapping estimates population characteristics by repeatedly resampling, with replacement, from a representative sample. In this post, we will use the bootstrap in a real-case scenario: estimating a confidence interval for the population mean.

For this example, we are using the hotel-reviews dataset from kaggle.com.

Let’s import the libraries, load the data and draw a sample of 500 rows to use as our representative sample of the “population”.

import pandas as pd
import numpy as np


# hotel reviews
df = pd.read_csv('archive/7282_1.csv')

# draw 500 rows to act as our representative sample
s = df.sample(500, random_state=7)

s.head()

The Scenario

Let’s say that we are given this sample of 500 rows and we want to estimate the 95% confidence interval of the mean. We can start by computing the summary statistics and plotting the histogram of the ratings.

s['reviews.rating'].hist()
s['reviews.rating'].describe()

The histogram of the ratings does not have a recognizable shape such as a normal distribution. The sample mean is 3.67, but what can we say about the true (population) mean? We don’t know which distribution the ratings follow, so we cannot build a confidence interval by simulating the sampling distribution from a known model.

Bootstrapping estimates the uncertainty of the mean by generating new samples from our data and then characterizing the distribution of the mean across those samples.

We will sample our data “with replacement”. That means we draw ratings at random, and a rating that has already been drawn can be drawn again.
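To make the difference concrete, here is a minimal sketch on toy data (five made-up ratings, not the hotel reviews; the seed is arbitrary). It uses NumPy's `Generator.choice` rather than the pandas `sample` method used in the rest of the post.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed, only for reproducibility

ratings = np.array([1, 2, 3, 4, 5])  # toy data, not the hotel reviews

# Without replacement: every original value appears exactly once (a shuffle).
without_replacement = rng.choice(ratings, size=ratings.size, replace=False)

# With replacement: some values may repeat and others may be missing.
with_replacement = rng.choice(ratings, size=ratings.size, replace=True)

print(sorted(without_replacement.tolist()))
print(sorted(with_replacement.tolist()))
```

Sampling without replacement at the full sample size only reorders the data, so its mean never changes. Sampling with replacement is what makes each resample, and therefore each resampled mean, different.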

# bootstrapping: 1000 resamples of 500 rows each, drawn with replacement
bootstrap = pd.DataFrame({'mean_rating': [s.sample(500, replace=True)['reviews.rating'].mean() for i in range(1000)]})

bootstrap

We created a dataframe holding the mean ratings of 1000 resamples. Plotting the histogram of these bootstrapped means, we can see that it approximates a normal distribution (in line with the central limit theorem).

bootstrap['mean_rating'].hist()
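The central limit theorem also predicts the spread of the bootstrapped means: their standard deviation should be close to the usual standard error of the mean, the sample standard deviation divided by the square root of the sample size. The sketch below checks this on a hypothetical stand-in for the sample (random integer ratings from 1 to 5), so the printed numbers will not match the hotel data.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Hypothetical stand-in for the 500 sampled ratings (integers 1 to 5).
sample = rng.integers(1, 6, size=500)

# 1000 bootstrap means, each from a resample of the same size as the sample.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

# Spread of the bootstrapped means vs. the textbook standard error.
bootstrap_se = boot_means.std(ddof=1)
analytic_se = sample.std(ddof=1) / np.sqrt(sample.size)

print(f"bootstrap SE: {bootstrap_se:.4f}")
print(f"analytic SE:  {analytic_se:.4f}")
```

The two values should agree closely. For the mean this is only a sanity check, but the same resampling loop works for statistics that have no simple standard-error formula.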

Now we can extract the 2.5% and 97.5% quantiles, which bound the middle 95% of the bootstrapped means:

(bootstrap['mean_rating'].quantile(0.025),bootstrap['mean_rating'].quantile(0.975))
(3.554730600528383, 3.799829994960948)

The bootstrap-approximated 95% confidence interval of the mean rating is 3.55 to 3.80. This means that we are 95% confident that the population mean lies between 3.55 and 3.80.
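The whole procedure can be wrapped in a small reusable function. This is a sketch of the percentile method under a few assumptions: the function name `bootstrap_mean_ci` and its parameters are hypothetical, it uses NumPy index arrays instead of the pandas `sample` call, and the usage example runs on made-up ratings.

```python
import numpy as np

def bootstrap_mean_ci(values, n_resamples=1000, confidence=0.95, seed=None):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # One row of random indices per resample; indices may repeat,
    # which is what sampling with replacement means.
    idx = rng.integers(0, values.size, size=(n_resamples, values.size))
    means = values[idx].mean(axis=1)
    alpha = 1.0 - confidence
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Toy usage with made-up ratings (integers 1 to 5), not the hotel reviews.
toy = np.random.default_rng(1).integers(1, 6, size=500)
low, high = bootstrap_mean_ci(toy, seed=2)
print(low, high)
```

With the data from this post, the call would look like `bootstrap_mean_ci(s['reviews.rating'].dropna())`. The `dropna()` is a precaution in case the column contains missing ratings, which NumPy's mean would not skip the way pandas does.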

Since we have the full dataset, which plays the role of the population here, we can check whether the confidence interval contains its mean.

df['reviews.rating'].mean()
3.7764308131241124

Indeed, the mean of the full dataset, 3.78, lies inside the confidence interval that the bootstrap approximated.

Summing it up

The bootstrap is a method for estimating population characteristics from a sample. It is simple to apply, and in Python it needs nothing more than pandas DataFrames. While bootstrapping can be very useful, be careful: the sample you resample from must be representative, or it will not capture the population characteristics adequately.
