AB testing is an online technique for comparing a Control against one or more variants of the original, with the purpose of determining the best-performing one. The variants can be Subject Lines, Email Bodies, Web Pages, App Screens, Banners, etc., and the KPI can be the Open Rate, the Click-Through Rate, the Conversion Rate, etc., depending on the company's objectives. Our goal is to give a brief view of AB and ABn testing, focusing mainly on the R implementation without diving into the mathematical details.
Confidence Intervals
When we test a variant we get the observed "Response Rate" \(\hat{p}\), which is just an estimate. It is usually better to also report a range for it by applying Confidence Intervals:
\(\hat{p}\pm Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
where:
\( \hat{p} \) : The observed Response Rate of the variant.
\( \alpha \) : The Level of Significance, which is usually 5%.
\( Z_{\frac{\alpha}{2}} \) : The critical value of the Standard Normal Distribution.
\( n \) : The sample size of the variant.
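Before using a built-in function, note that the interval can be computed directly from the formula above. Here is a minimal sketch in base R, using the data of the example that follows (150 Clicks out of 900 Impressions); the variable names are illustrative:

z <- qnorm(1 - 0.05 / 2)                         # critical value for alpha = 5%
p_hat <- 150 / 900                               # observed Response Rate
n <- 900                                         # sample size
p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

This Wald approximation gives roughly (0.1423, 0.1910); the prop.test() call below returns the slightly different Wilson score interval, which behaves better for small samples or extreme rates.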
Example: Using R, calculate the 95% Confidence Interval of the variant \(V_1\), which has 150 Clicks and 900 Impressions.
CI <- prop.test(x = 150, n = 900, correct = FALSE, conf.level = 0.95)
CI$conf.int
0.143746060287615 0.192420704199088
Hypothesis Testing of One Variant
As we saw above, the observed CTR of \(V_1\) was 16.67%. Let's assume we are asked whether the actual CTR of \(V_1\) could be 17%. To answer this type of question, we apply Hypothesis Testing of a Proportion. In this example, we apply the two-sided test. The Hypotheses can be written as:
\(H_0: p=0.17\)
\(H_1: p\neq 0.17\)
Example: Using R, test whether the actual \(p\) of the variant \(V_1\) could be considered to be 0.17, and then test again for 0.20.
Hypothesis Testing for p=0.17
t1 <- prop.test(x = 150, n = 900, p = 0.17, alternative = "two.sided", conf.level = 0.95, correct = FALSE)
t1
1-sample proportions test without continuity correction
data: 150 out of 900, null probability 0.17
X-squared = 0.070872, df = 1, p-value = 0.7901
alternative hypothesis: true p is not equal to 0.17
95 percent confidence interval:
0.1437461 0.1924207
sample estimates:
p
0.1666667
Hypothesis Testing for p=0.20
t2 <- prop.test(x = 150, n = 900, p = 0.20, correct = FALSE)
t2
1-sample proportions test without continuity correction
data: 150 out of 900, null probability 0.2
X-squared = 6.25, df = 1, p-value = 0.01242
alternative hypothesis: true p is not equal to 0.2
95 percent confidence interval:
0.1437461 0.1924207
sample estimates:
p
0.1666667
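For \(p_0=0.17\) the p-value (0.7901) is greater than 5%, so we do not reject \(H_0\), while for \(p_0=0.20\) the p-value (0.01242) is below 5%, so we reject it. Note also that the X-squared statistic reported by prop.test() is simply the square of the one-sample z statistic \(z=\frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}\); a quick check (variable names are illustrative):

p_hat <- 150 / 900
n <- 900
z_17 <- (p_hat - 0.17) / sqrt(0.17 * (1 - 0.17) / n)  # z statistic under p0 = 0.17
z_20 <- (p_hat - 0.20) / sqrt(0.20 * (1 - 0.20) / n)  # z statistic under p0 = 0.20
c(z_17^2, z_20^2)                                     # reproduces X-squared: 0.070872 and 6.25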
Hypothesis Testing of Two Variants (AB Testing)
We can apply the Z-Test of proportions when we want to compare two variants on their Response Rates. Without going into details, we present the formula of the Z statistic for the difference of two binomial proportions:
\(Z=\frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}\)
where \(\hat{p}=\frac{x_1+x_2}{n_1+n_2}\) is the pooled Response Rate of the two variants.
The Hypothesis can be formulated as:
\(H_0: p_1=p_2\)
\(H_1: p_1\neq p_2\)
Example: Using R, compare the variant \(V_1\), which has 120 Clicks and 800 Impressions, with the variant \(V_2\), which has 100 Clicks and 700 Impressions.
# define a vector of the responses
x <- c(120, 100)
# define a vector of the impressions
n <- c(800, 700)
test1 <- prop.test(x, n, correct = FALSE)
test1
2-sample test for equality of proportions without continuity
correction
data: x out of n
X-squared = 0.15219, df = 1, p-value = 0.6964
alternative hypothesis: two.sided
95 percent confidence interval:
-0.02869299 0.04297870
sample estimates:
prop 1 prop 2
0.1500000 0.1428571
Notice: The z-test comparing two proportions is equivalent to the chi-square test of independence, and the prop.test() procedure formally calculates the chi-square test. The p-value from the z-test for two proportions is equal to the p-value from the chi-square test, and the z-statistic is equal to the square root of the chi-square statistic in this situation.
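We can verify this relationship on our data; a quick check (variable names are illustrative):

p_pooled <- (120 + 100) / (800 + 700)   # pooled Response Rate
z <- (120 / 800 - 100 / 700) /
  sqrt(p_pooled * (1 - p_pooled) * (1 / 800 + 1 / 700))
c(z^2, test1$statistic)                 # both are approximately 0.15219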
As we can see, R provides us with a nice output which shows the proportion of each variant as well as the p-value. In our case, we do not reject the null hypothesis, since the p-value (0.6964) is greater than 5%.
Hypothesis Testing of k Variants (ABn Testing)
In most cases, we test more than two variants, and one may ask whether all of these variants can be considered equivalent (i.e. whether their Response Rates (RR) are equal).
In order to answer this question, we can apply the Chi-Square Test \(\chi^2\).
The Null and the Alternative hypothesis can be written as:
\(H_0: p_1=p_2=\dots=p_K\)
\(H_1: The~RRs~Are~Not~All~Equal\)
Example: Assume that we have 8 variants with the following clicks (80, 85, 90, 95, 100, 105, 110, 115) respectively, and all of them have 1000 impressions. Using R, determine if all these variants can be considered equivalent.
x <- seq(from = 80, by = 5, length.out = 8)
n <- rep(1000, 8)
chisqtest <- prop.test(x, n)
chisqtest
8-sample test for equality of proportions without continuity
correction
data: x out of n
X-squared = 11.933, df = 7, p-value = 0.1028
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4 prop 5 prop 6 prop 7 prop 8
0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115
As we can see, we can consider all these 8 variants as equivalent (p-value: 0.1028). However, if we compare just the first one with the last one, it turns out that their difference is statistically significant. This is due to the effect of “Multiple Comparisons”.
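To see this effect, we can compare \(V_1\) and \(V_8\) in isolation; a minimal sketch:

# compare only the first and the last variant
prop.test(x = c(80, 115), n = c(1000, 1000))

The resulting p-value is about 0.01 (it reappears as the unadjusted entry for the pair 8 vs 1 in the table below), well below 5%.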
Multiple Pairwise Comparisons Without P-Value Adjustments
Using R we can easily obtain the p-values of all pairwise comparisons. Let's do it using the data of the above example:
x <- seq(from = 80, by = 5, length.out = 8)
n <- rep(1000, 8)
ppt <- pairwise.prop.test(x, n, p.adjust.method = "none")
ppt
Pairwise comparisons using Pairwise comparison of proportions
data: x out of n
1 2 3 4 5 6 7
2 0.745 - - - - - -
3 0.471 0.752 - - - - -
4 0.268 0.482 0.758 - - - -
5 0.138 0.280 0.492 0.763 - - -
6 0.064 0.147 0.291 0.502 0.768 - -
7 0.027 0.070 0.157 0.302 0.512 0.773 -
8 0.010 0.031 0.077 0.166 0.312 0.520 0.777
P value adjustment method: none
Multiple Pairwise Comparisons With P-Value Adjustments
Since we are dealing with Multiple Comparisons, it is common to apply p-value adjustments. R provides us with the following methods of p-value adjustment:
{"holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"}
Let's apply the above example using the False Discovery Rate as the adjustment method.
x <- seq(from = 80, by = 5, length.out = 8)
n <- rep(1000, 8)
ppt <- pairwise.prop.test(x, n, p.adjust.method = "fdr")
ppt
Pairwise comparisons using Pairwise comparison of proportions
data: x out of n
1 2 3 4 5 6 7
2 0.78 - - - - - -
3 0.69 0.78 - - - - -
4 0.58 0.69 0.78 - - - -
5 0.46 0.58 0.69 0.78 - - -
6 0.36 0.46 0.58 0.69 0.78 - -
7 0.29 0.36 0.46 0.58 0.69 0.78 -
8 0.29 0.29 0.36 0.46 0.58 0.69 0.78
P value adjustment method: fdr
As we can see now, none of the pairs can be considered statistically significantly different.
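The choice of method matters: "bonferroni" is the most conservative, while "fdr" (Benjamini-Hochberg) controls the expected proportion of false discoveries. As a sketch, the same comparison using the Bonferroni correction (with x and n as defined above):

pairwise.prop.test(x, n, p.adjust.method = "bonferroni")

Since Bonferroni multiplies each p-value by the number of comparisons (capping at 1), it is even less likely to flag any pair here.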
Multiple Pairwise Comparisons of Control Variant With P-Value Adjustments
Sometimes we want to compare only the Control versus the rest of the variants. In this case, we take the p-values of the Control versus the remaining variants without adjustment and then apply the p-value adjustments to that vector.
Again, let's use the same data, assuming that the Control is \(V_1\):
x <- seq(from = 80, by = 5, length.out = 8)
n <- rep(1000, 8)
ppt <- pairwise.prop.test(x, n, p.adjust.method = "none")
# this vector contains the p-values of variant 1 versus the remaining 7 variants, without adjustments
pvalue_vector <- ppt$p.value[, 1]
pvalue_vector
2 3 4 5 6 7 8
0.74510651 0.47052922 0.26791381 0.13766142 0.06398843 0.02699772 0.01037907
# now apply the p-value adjustment to the vector of p-values
p.adjust(pvalue_vector, method = "fdr")
2 3 4 5 6 7 8
0.74510651 0.54895076 0.37507933 0.24090748 0.14930633 0.09449203 0.07265352
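After the adjustment, none of the seven comparisons against the Control falls below 5%. A standard alternative for control-versus-all comparisons is Dunnett's test, supported by the multcomp package that we also use in the next section; a minimal sketch:

library(multcomp)
dataset <- data.frame(x = seq(from = 80, by = 5, length.out = 8),
                      n = rep(1000, 8),
                      ID = factor(1:8))
model1 <- glm(cbind(x, n - x) ~ ID, family = binomial(link = "logit"), data = dataset)
# Dunnett contrasts: each variant versus the Control (level 1)
summary(glht(model1, mcp(ID = "Dunnett")))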
Multiple Comparisons Applying the Tukey Test
We can also fit a Logistic Regression and apply the Tukey Test to all pairwise contrasts. Let's apply it:
library(multcomp)
dataset <- data.frame(x = seq(from = 80, by = 5, length.out = 8),
                      n = rep(1000, 8),
                      ID = factor(c(1:8)))
dataset
model1 <- glm(formula = cbind(x, n - x) ~ ID, family = binomial(link = "logit"), data = dataset)
# Tukey multiple comparisons
summary(glht(model1, mcp(ID = "Tukey")))
x n ID
80 1000 1
85 1000 2
90 1000 3
95 1000 4
100 1000 5
105 1000 6
110 1000 7
115 1000 8
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: glm(formula = cbind(x, n - x) ~ ID, family = binomial(link = "logit"),
data = dataset)
Linear Hypotheses:
Estimate Std. Error z value Pr(>|z|)
2 - 1 == 0 0.06607 0.16262 0.406 1.000
3 - 1 == 0 0.12871 0.16061 0.801 0.993
4 - 1 == 0 0.18829 0.15880 1.186 0.936
5 - 1 == 0 0.24512 0.15716 1.560 0.774
6 - 1 == 0 0.29948 0.15565 1.924 0.534
7 - 1 == 0 0.35161 0.15428 2.279 0.306
8 - 1 == 0 0.40169 0.15301 2.625 0.146
3 - 2 == 0 0.06264 0.15833 0.396 1.000
4 - 2 == 0 0.12221 0.15649 0.781 0.994
5 - 2 == 0 0.17905 0.15482 1.157 0.944
6 - 2 == 0 0.23341 0.15329 1.523 0.795
7 - 2 == 0 0.28553 0.15190 1.880 0.564
8 - 2 == 0 0.33562 0.15061 2.228 0.334
4 - 3 == 0 0.05958 0.15441 0.386 1.000
5 - 3 == 0 0.11641 0.15271 0.762 0.995
6 - 3 == 0 0.17077 0.15117 1.130 0.950
7 - 3 == 0 0.22289 0.14975 1.488 0.814
8 - 3 == 0 0.27298 0.14844 1.839 0.593
5 - 4 == 0 0.05683 0.15081 0.377 1.000
6 - 4 == 0 0.11119 0.14924 0.745 0.996
7 - 4 == 0 0.16332 0.14780 1.105 0.956
8 - 4 == 0 0.21340 0.14648 1.457 0.830
6 - 5 == 0 0.05436 0.14749 0.369 1.000
7 - 5 == 0 0.10648 0.14603 0.729 0.996
8 - 5 == 0 0.15657 0.14470 1.082 0.961
7 - 6 == 0 0.05212 0.14441 0.361 1.000
8 - 6 == 0 0.10221 0.14306 0.714 0.997
8 - 7 == 0 0.05009 0.14156 0.354 1.000
(Adjusted p values reported -- single-step method)
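All Tukey-adjusted p-values are above 5% (the smallest, for the contrast 8 - 1, is 0.146), so this method also finds no significantly different pair, in line with the overall Chi-Square Test. If needed, the adjusted p-values can be extracted programmatically; a minimal sketch:

# extract the vector of adjusted p-values from the glht summary
tukey_summary <- summary(glht(model1, mcp(ID = "Tukey")))
tukey_summary$test$pvalues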