Predictive Hacks

AB Testing in R

AB Testing

AB testing is an online technique used for comparing a Control version against one or more variants of the original, with the purpose of determining the best-performing one. The variants can be Subject Lines, Email Bodies, Web Pages, App Screens, Banners, etc., and the KPI can be the Open Rate, the Click-Through Rate, the Conversion Rate, etc., depending on the company’s objectives. Our goal is to give a brief overview of AB and ABn testing, focusing mainly on the R part without diving into the mathematical details.

Confidence Intervals

When we test a variant, the observed “Response Rate” \(\hat{p}\) is just an estimate. It is usually better to also report a range for it by applying Confidence Intervals.

\(\hat{p}\pm Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)

where:
\( \hat{p} \) : The Response Rate of the variant.
\( \alpha \) : The Level of Significance, which is usually 5%.
\( Z_{\frac{\alpha}{2}} \) : The critical value of the Standard Normal Distribution.
\( n \) : The sample size of the variant.

Example: Using R, calculate the 95% Confidence Interval of the variant \(V_1\), which has 150 Clicks and 900 Impressions.

CI<-prop.test(x=150,n=900, correct = FALSE, conf.level = 0.95)
CI$conf.int
[1] 0.1437461 0.1924207
attr(,"conf.level")
[1] 0.95
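As a sanity check, we can also compute the interval by hand from the formula above. This is just a sketch; note that `prop.test()` actually reports the Wilson score interval, so its bounds differ slightly from this Wald interval.

```r
# Wald 95% CI computed directly from the formula above.
# Note: prop.test() returns the Wilson score interval instead,
# so the two intervals differ slightly.
x <- 150; n <- 900
p_hat <- x / n                               # observed response rate
z <- qnorm(1 - 0.05 / 2)                     # ~1.96 for alpha = 5%
margin <- z * sqrt(p_hat * (1 - p_hat) / n)
c(lower = p_hat - margin, upper = p_hat + margin)
# roughly 0.1423 and 0.1910
```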

Hypothesis Testing of One Variant

As we saw above, the observed CTR of \(V_1\) was 16.67%. Let’s assume we are asked whether the actual CTR of \(V_1\) could be 17%. In order to answer this type of question, we apply a Hypothesis Test of Proportion. In this example, we apply the two-sided test. The Hypotheses can be written as:

\(H_0: p=0.17\)
\(H_1: p\neq 0.17\)

Example: Using R, test if the actual \(p\) of the variant \(V_1\) could be considered to be 0.17, and then test again for 0.20.

Hypothesis Testing for p=0.17

t1<-prop.test(x=150,n=900, p=0.17, alternative = c("two.sided"), conf.level = 0.95, correct = FALSE)
t1

	1-sample proportions test without continuity correction

data:  150 out of 900, null probability 0.17
X-squared = 0.070872, df = 1, p-value = 0.7901
alternative hypothesis: true p is not equal to 0.17
95 percent confidence interval:
 0.1437461 0.1924207
sample estimates:
        p 
0.1666667 

Hypothesis Testing for p=0.20

t2<-prop.test(x=150,n=900, p=0.20, correct = FALSE)
t2

	1-sample proportions test without continuity correction

data:  150 out of 900, null probability 0.2
X-squared = 6.25, df = 1, p-value = 0.01242
alternative hypothesis: true p is not equal to 0.2
95 percent confidence interval:
 0.1437461 0.1924207
sample estimates:
        p 
0.1666667 
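Under the hood, the X-squared value reported by `prop.test()` is just the square of the one-sample Z statistic \(Z=\frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}\). A quick sketch verifying this against the two outputs above:

```r
# One-sample Z statistic; its square equals prop.test()'s X-squared
# (without continuity correction)
z_stat <- function(x, n, p0) (x / n - p0) / sqrt(p0 * (1 - p0) / n)

z_stat(150, 900, 0.17)^2  # ~0.0709, matching t1's X-squared
z_stat(150, 900, 0.20)^2  # ~6.25,   matching t2's X-squared
```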

Hypothesis Testing of Two Variants (AB Testing)

We can apply the Z-Test of proportions when we want to compare the Response Rates of two variants. Without going into details, we present the formula of the Z statistic (standard Normal) for the difference of two binomial proportions:

\(Z=\frac{\hat{p_1}-\hat{p_2}}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}\)


where \(\hat{p}=\frac{x_1+x_2}{n_1+n_2}\) is the pooled Response Rate of the two variants.
The Hypotheses can be formulated as:

\(H_0: p_1=p_2\)
\(H_1: p_1\neq p_2\)

Example: Using R, compare the variant \(V_1\), which has 120 Clicks and 800 Impressions, with the variant \(V_2\), which has 100 Clicks and 700 Impressions.

# define a vector of the responses
x<-c(120,100)
# define a vector of the impressions
n<-c(800,700)
test1<-prop.test(x,n, correct = FALSE)
test1

	2-sample test for equality of proportions without continuity
	correction

data:  x out of n
X-squared = 0.15219, df = 1, p-value = 0.6964
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.02869299  0.04297870
sample estimates:
   prop 1    prop 2 
0.1500000 0.1428571 

Notice: The z-test comparing two proportions is equivalent to the chi-square test of independence, and the prop.test() procedure formally calculates the chi-square test. The p-value from the z-test for two proportions is equal to the p-value from the chi-square test, and the z-statistic is equal to the square root of the chi-square statistic in this situation.
As we can see, R provides a nice output which shows the proportion of each variant as well as the p-value. In our case, we do not reject the null hypothesis, since the p-value (0.6964) is greater than 5%.
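To see this equivalence concretely, here is a sketch that computes the Z statistic of the formula above by hand; its square matches the X-squared reported by prop.test():

```r
# Pooled two-proportion Z statistic, following the formula above
x1 <- 120; n1 <- 800
x2 <- 100; n2 <- 700
p_pool <- (x1 + x2) / (n1 + n2)   # pooled response rate
z <- (x1 / n1 - x2 / n2) /
  sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z^2                               # ~0.15219, the X-squared above
2 * (1 - pnorm(abs(z)))           # ~0.6964, the p-value above
```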


Hypothesis Testing of k Variants (ABn Testing)

In most cases, we test more than two variants, and someone may ask if all of these variants can be considered equivalent or not (i.e. if their response rates (RR) are equivalent).
In order to answer this question, we can apply the Chi-Square Test \(\chi^2\).
The Null and the Alternative hypothesis can be written as:

\(H_0: p_1=p_2=\ldots=p_k\)
\(H_1: The~RRs~Are~Not~All~Equal\)

Example: Assume that we have 8 variants with the following clicks (80, 85, 90, 95, 100, 105, 110, 115) respectively, and all of them have 1000 impressions. Using R, determine if all these variants can be considered equivalent.

x<-seq(from=80, by=5, length.out=8)
n<-rep(1000,8)
chisqtest<-prop.test(x,n)
chisqtest

	8-sample test for equality of proportions without continuity
	correction

data:  x out of n
X-squared = 11.933, df = 7, p-value = 0.1028
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4 prop 5 prop 6 prop 7 prop 8 
 0.080  0.085  0.090  0.095  0.100  0.105  0.110  0.115 

As we can see, we can consider all these 8 variants as equivalent (p-value: 0.1028). However, if we compare just the first one with the last one, it turns out that their difference is statistically significant. This is due to the effect of “Multiple Comparisons”.
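For instance, a sketch testing only \(V_1\) (80/1000) against \(V_8\) (115/1000) in isolation:

```r
# Compare only the first and the last variant in isolation
prop.test(x = c(80, 115), n = c(1000, 1000))$p.value
# ~0.010: significant at the 5% level, even though the joint
# chi-square test above did not reject equality of all 8 variants
```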


Multiple Pairwise Comparisons Without P-Value Adjustments

Using R we can easily obtain the p-values of all pairwise comparisons. Let’s do it using the data of the example above.

x<-seq(from=80, by=5, length.out=8)
n<-rep(1000,8)
ppt<-pairwise.prop.test(x, n, p.adjust.method = "none")
ppt

	Pairwise comparisons using Pairwise comparison of proportions 

data:  x out of n 

  1     2     3     4     5     6     7    
2 0.745 -     -     -     -     -     -    
3 0.471 0.752 -     -     -     -     -    
4 0.268 0.482 0.758 -     -     -     -    
5 0.138 0.280 0.492 0.763 -     -     -    
6 0.064 0.147 0.291 0.502 0.768 -     -    
7 0.027 0.070 0.157 0.302 0.512 0.773 -    
8 0.010 0.031 0.077 0.166 0.312 0.520 0.777

P value adjustment method: none  

Multiple Pairwise Comparisons With P-Value Adjustments

Since we are dealing with Multiple Comparisons, it is common to apply p-value adjustments. R provides the following methods of p-value adjustment:

{“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”}

Let’s repeat the example above using the False Discovery Rate as the method of adjustment.

x<-seq(from=80, by=5, length.out=8)
n<-rep(1000,8)
ppt<-pairwise.prop.test(x, n, p.adjust.method = "fdr")
ppt
	Pairwise comparisons using Pairwise comparison of proportions 

data:  x out of n 

  1    2    3    4    5    6    7   
2 0.78 -    -    -    -    -    -   
3 0.69 0.78 -    -    -    -    -   
4 0.58 0.69 0.78 -    -    -    -   
5 0.46 0.58 0.69 0.78 -    -    -   
6 0.36 0.46 0.58 0.69 0.78 -    -   
7 0.29 0.36 0.46 0.58 0.69 0.78 -   
8 0.29 0.29 0.36 0.46 0.58 0.69 0.78

P value adjustment method: fdr 

As we can see now, none of the pairs can be considered statistically significantly different.


Multiple Pairwise Comparisons of Control Variant With P-Value Adjustments

Sometimes we want to compare only the Control versus the rest of the variants. In this case, we take the p-values of the Control versus the rest of the variants with no adjustment, and then apply the p-value adjustments to this smaller set of comparisons.
Again, let’s use the same data, assuming that the Control is \(V_1\).

x<-seq(from=80, by=5, length.out=8)
n<-rep(1000,8)
ppt<-pairwise.prop.test(x, n, p.adjust.method = "none")
# this vector is the p-values of variant 1 versus the rest 7 variants without adjustments
pvalue_vector<-ppt$p.value[,1]
pvalue_vector
         2          3          4          5          6          7          8 
0.74510651 0.47052922 0.26791381 0.13766142 0.06398843 0.02699772 0.01037907 
# now apply the pvalue adjustment to the vector of pvalues
p.adjust(pvalue_vector, method = "fdr")
         2          3          4          5          6          7          8 
0.74510651 0.54895076 0.37507933 0.24090748 0.14930633 0.09449203 0.07265352 

Multiple Comparisons applying TukeyHSD Test

We can also fit a Logistic Regression and apply the Tukey Test for the multiple comparisons. Let’s apply it:

library(multcomp)

dataset<-data.frame(x=seq(from=80, by=5, length.out=8), n=rep(1000,8), ID=factor(c(1:8)))
dataset
model1<- glm(formula = cbind(x, n-x) ~ ID, family = binomial(link = "logit"), data=dataset)

# Tukey multiple comparisons
summary(glht(model1, mcp(ID="Tukey")))

    x    n ID
1  80 1000  1
2  85 1000  2
3  90 1000  3
4  95 1000  4
5 100 1000  5
6 105 1000  6
7 110 1000  7
8 115 1000  8
	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: glm(formula = cbind(x, n - x) ~ ID, family = binomial(link = "logit"), 
    data = dataset)

Linear Hypotheses:
           Estimate Std. Error z value Pr(>|z|)
2 - 1 == 0  0.06607    0.16262   0.406    1.000
3 - 1 == 0  0.12871    0.16061   0.801    0.993
4 - 1 == 0  0.18829    0.15880   1.186    0.936
5 - 1 == 0  0.24512    0.15716   1.560    0.774
6 - 1 == 0  0.29948    0.15565   1.924    0.534
7 - 1 == 0  0.35161    0.15428   2.279    0.306
8 - 1 == 0  0.40169    0.15301   2.625    0.146
3 - 2 == 0  0.06264    0.15833   0.396    1.000
4 - 2 == 0  0.12221    0.15649   0.781    0.994
5 - 2 == 0  0.17905    0.15482   1.157    0.944
6 - 2 == 0  0.23341    0.15329   1.523    0.795
7 - 2 == 0  0.28553    0.15190   1.880    0.564
8 - 2 == 0  0.33562    0.15061   2.228    0.334
4 - 3 == 0  0.05958    0.15441   0.386    1.000
5 - 3 == 0  0.11641    0.15271   0.762    0.995
6 - 3 == 0  0.17077    0.15117   1.130    0.950
7 - 3 == 0  0.22289    0.14975   1.488    0.814
8 - 3 == 0  0.27298    0.14844   1.839    0.593
5 - 4 == 0  0.05683    0.15081   0.377    1.000
6 - 4 == 0  0.11119    0.14924   0.745    0.996
7 - 4 == 0  0.16332    0.14780   1.105    0.956
8 - 4 == 0  0.21340    0.14648   1.457    0.830
6 - 5 == 0  0.05436    0.14749   0.369    1.000
7 - 5 == 0  0.10648    0.14603   0.729    0.996
8 - 5 == 0  0.15657    0.14470   1.082    0.961
7 - 6 == 0  0.05212    0.14441   0.361    1.000
8 - 6 == 0  0.10221    0.14306   0.714    0.997
8 - 7 == 0  0.05009    0.14156   0.354    1.000
(Adjusted p values reported -- single-step method)
