Predictive Hacks

# AB Testing in R

AB testing is an online technique for comparing a Control with one or more variants of the original, with the purpose of determining the best-performing one. The variants can be Subject Lines, Email Bodies, Web Pages, App Screens, Banners, etc., and the KPI can be the Open Rate, the Click Through Rate, the Conversion Rate, etc., depending on the company’s objectives. Our goal is to give a brief overview of AB and ABn testing, focusing mainly on the R part without diving into the mathematical details.

### Confidence Intervals

When we test a variant we get the observed “Response Rate” $$\hat{p}$$, which is just an estimate. Usually it is better to also report a range around it by computing a Confidence Interval.

$$\hat{p}\pm Z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where:
$$\hat{p}$$ : The Response Rate of the variant.
$$\alpha$$ : The Level of Significance, which is usually 5%.
$$Z_{\frac{\alpha}{2}}$$ : The critical value of the Standard Normal Distribution.
$$n$$ : The sample size of the variant.
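
As a rough illustration, the formula above can be evaluated directly in base R. This is only a sketch: the counts (150 responses out of 900) are illustrative inputs, and the variable names are assumptions made for this example.

```r
# Illustrative inputs (assumed for this sketch): 150 responses out of 900
x <- 150
n <- 900
alpha <- 0.05

p_hat <- x / n                        # observed response rate
z <- qnorm(1 - alpha / 2)             # critical value Z_{alpha/2}
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat

# interval from the formula: p_hat +/- z * se
wald_ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)
print(round(wald_ci, 4))
```

With these inputs the interval is roughly 0.1423 to 0.1910.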

Example: Using R, calculate the 95% Confidence Interval of the variant $$V_1$$, which has 150 Clicks and 900 Impressions.

CI<-prop.test(x=150,n=900, correct = FALSE, conf.level = 0.95)
CI$conf.int

0.143746060287615 0.192420704199088

Note that prop.test reports the Wilson score interval, so the result differs slightly from what the formula above gives (approximately 0.1423 to 0.1910).

### Hypothesis Testing of One Variant

As we saw above, the observed CTR of $$V_1$$ was 16.67%. Let’s assume that we are asked whether the actual CTR of $$V_1$$ could be 17%. To answer this type of question, we apply Hypothesis Testing of a Proportion. In this example, we apply the two-sided test. The Hypotheses can be written as:

$$H_0: p=0.17$$
$$H_1: p\neq 0.17$$

Example: Using R, test whether the actual $$p$$ of the variant $$V_1$$ could be considered to be 0.17, and then test again for 0.20.

Hypothesis Testing for p=0.17

t1<-prop.test(x=150,n=900, p=0.17, alternative = c("two.sided"), conf.level = 0.95, correct = FALSE)
t1

1-sample proportions test without continuity correction

data:  150 out of 900, null probability 0.17
X-squared = 0.070872, df = 1, p-value = 0.7901
alternative hypothesis: true p is not equal to 0.17
95 percent confidence interval:
 0.1437461 0.1924207
sample estimates:
        p
0.1666667

Hypothesis Testing for p=0.20

t2<-prop.test(x=150,n=900, p=0.20, correct = FALSE)
t2

1-sample proportions test without continuity correction

data:  150 out of 900, null probability 0.2
X-squared = 6.25, df = 1, p-value = 0.01242
alternative hypothesis: true p is not equal to 0.2
95 percent confidence interval:
 0.1437461 0.1924207
sample estimates:
        p
0.1666667

At the 5% Level of Significance, we do not reject $$H_0$$ for p=0.17 (p-value 0.7901), but we reject it for p=0.20 (p-value 0.01242).

### Hypothesis Testing of Two Variants (AB Testing)

We can apply the Z-Test of proportions when we want to compare the Response Rates of two variants. Without going into details, we present the formula of the Z statistic for the difference of two binomial proportions, which follows the Standard Normal Distribution under the null hypothesis.

$$Z=\frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}$$

where $$\hat{p}=\frac{x_1+x_2}{n_1+n_2}$$ is the pooled Response Rate of the two variants.
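
As a rough sketch, the Z statistic can be computed by hand in base R and compared with what prop.test returns. The counts used here (120 of 800 and 100 of 700) are illustrative inputs, and the variable names are assumptions made for this example.

```r
# Illustrative inputs (assumed for this sketch)
x <- c(120, 100)   # responses of the two variants
n <- c(800, 700)   # impressions of the two variants

p1 <- x[1] / n[1]
p2 <- x[2] / n[2]
p_pool <- sum(x) / sum(n)   # pooled response rate

# Z statistic for the difference of two proportions
z <- (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n[1] + 1 / n[2]))

# two-sided p-value from the Standard Normal Distribution
p_value <- 2 * pnorm(-abs(z))

# prop.test reports the chi-square statistic, which equals z^2 here
chisq <- prop.test(x, n, correct = FALSE)
print(c(z = z, z_squared = z^2, p_value = p_value))
print(unname(chisq$statistic))
```

The squared Z statistic and the p-value should agree with the X-squared and p-value reported by prop.test when the continuity correction is switched off.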
The Hypotheses can be formulated as:

$$H_0: p_1=p_2$$
$$H_1: p_1\neq p_2$$

Example: Using R, compare the variant $$V_1$$, which has 120 Clicks and 800 Impressions, with the variant $$V_2$$, which has 100 Clicks and 700 Impressions.

# define a vector of the responses
x<-c(120,100)
# define a vector of the impressions
n<-c(800,700)
test1<-prop.test(x,n, correct = FALSE)
test1

2-sample test for equality of proportions without continuity correction

data:  x out of n
X-squared = 0.15219, df = 1, p-value = 0.6964
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.02869299  0.04297870
sample estimates:
   prop 1    prop 2
0.1500000 0.1428571

Notice: The z-test comparing two proportions is equivalent to the chi-square test of independence, and the prop.test() procedure formally calculates the chi-square test. The p-value from the z-test for two proportions is equal to the p-value from the chi-square test, and the absolute value of the z-statistic is equal to the square root of the chi-square statistic in this situation.

As we can see, R provides a nice output which shows the proportion of each variant as well as the p-value. In our case, we do not reject the null hypothesis since the p-value (0.6964) is greater than 5%.

### Hypothesis Testing of k Variants (ABn Testing)

In most cases, we test more than two variants, and someone may ask whether all of these variants can be considered equivalent (i.e. whether their Response Rates (RR) are equal). To answer this question, we can apply the Chi-Square Test $$\chi^2$$. The Null and the Alternative hypotheses can be written as:

$$H_0: p_1=p_2=\dots=p_k$$
$$H_1: \text{The RRs are not all equal}$$

Example: Assume that we have 8 variants with the following clicks (80, 85, 90, 95, 100, 105, 110, 115) respectively, and that each of them has 1000 impressions. Using R, determine whether all these variants can be considered equivalent.
x<-seq(from=80, by=5, length.out=8)
n<-rep(1000,8)
chisqtest<-prop.test(x,n)
chisqtest

8-sample test for equality of proportions without continuity correction

data:  x out of n
X-squared = 11.933, df = 7, p-value = 0.1028
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4 prop 5 prop 6 prop 7 prop 8
 0.080  0.085  0.090  0.095  0.100  0.105  0.110  0.115

As we can see, we do not reject the hypothesis that all 8 variants are equivalent (p-value: 0.1028). However, if we compare just the first one with the last one, their difference comes out as statistically significant. This is due to the effect of “Multiple Comparisons”: the more pairs we test, the more likely it is that some of them appear significant by chance.

### Multiple Pairwise Comparisons Without P-Value Adjustments

Using R, we can easily obtain the p-values of all pairwise comparisons. Let’s do it using the data of the above example.

x<-seq(from=80, by=5, length.out=8)
n<-rep(1000,8)
ppt<-pairwise.prop.test(x, n, p.adjust.method = "none")
ppt

Pairwise comparisons using Pairwise comparison of proportions

data:  x out of n

  1     2     3     4     5     6     7
2 0.745 -     -     -     -     -     -
3 0.471 0.752 -     -     -     -     -
4 0.268 0.482 0.758 -     -     -     -
5 0.138 0.280 0.492 0.763 -     -     -
6 0.064 0.147 0.291 0.502 0.768 -     -
7 0.027 0.070 0.157 0.302 0.512 0.773 -
8 0.010 0.031 0.077 0.166 0.312 0.520 0.777

P value adjustment method: none

### Multiple Pairwise Comparisons With P-Value Adjustments

Since we are dealing with Multiple Comparisons, it is common to apply p-value adjustments. R provides the following methods of p-value adjustment:

{“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”}

Let’s repeat the above example using the False Discovery Rate as the method of adjustment.
x<-seq(from=80, by=5, length.out=8)
n<-rep(1000,8)
ppt<-pairwise.prop.test(x, n, p.adjust.method = "fdr")
ppt

Pairwise comparisons using Pairwise comparison of proportions

data:  x out of n

  1    2    3    4    5    6    7
2 0.78 -    -    -    -    -    -
3 0.69 0.78 -    -    -    -    -
4 0.58 0.69 0.78 -    -    -    -
5 0.46 0.58 0.69 0.78 -    -    -
6 0.36 0.46 0.58 0.69 0.78 -    -
7 0.29 0.36 0.46 0.58 0.69 0.78 -
8 0.29 0.29 0.36 0.46 0.58 0.69 0.78

P value adjustment method: fdr

As we can see now, none of the pairs can be considered statistically significantly different.

### Multiple Pairwise Comparisons of Control Variant With P-Value Adjustments

Sometimes we want to compare only the Control with the rest of the variants. In this case, we take the p-values of the Control versus each of the other variants without any adjustment, and then apply the p-value adjustment to those p-values only. Again, let’s use the same data, assuming that the Control is $$V_1$$.

x<-seq(from=80, by=5, length.out=8)
n<-rep(1000,8)
ppt<-pairwise.prop.test(x, n, p.adjust.method = "none")
# this vector holds the p-values of variant 1 versus the other 7 variants without adjustments
pvalue_vector<-ppt$p.value[,1]
pvalue_vector

         2          3          4          5          6          7          8
0.74510651 0.47052922 0.26791381 0.13766142 0.06398843 0.02699772 0.01037907 
# now apply the pvalue adjustment to the vector of pvalues
p.adjust(pvalue_vector, method = "fdr")

         2          3          4          5          6          7          8
0.74510651 0.54895076 0.37507933 0.24090748 0.14930633 0.09449203 0.07265352 
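
To make the adjustment less of a black box, here is a minimal sketch of how the False Discovery Rate (Benjamini-Hochberg) adjustment can be reproduced by hand. The p-values are copied from the unadjusted output above. The variable names are assumptions made for this example, and the hand computation is only an illustration checked against p.adjust, where "fdr" is an alias of "BH".

```r
# Unadjusted p-values of the Control versus variants 2 to 8 (copied from the output above)
p <- c(0.74510651, 0.47052922, 0.26791381, 0.13766142,
       0.06398843, 0.02699772, 0.01037907)
m <- length(p)

# Sort from largest to smallest; the largest p-value has rank m, the smallest rank 1
ord <- order(p, decreasing = TRUE)
scaled <- p[ord] * m / (m:1)

# Enforce monotonicity with a running minimum and cap at 1
adj_sorted <- pmin(1, cummin(scaled))

# Put the adjusted values back in the original order
adj <- numeric(m)
adj[ord] <- adj_sorted

print(round(adj, 4))
print(round(p.adjust(p, method = "BH"), 4))
```

Each p-value is multiplied by the number of comparisons divided by its rank, so the smallest p-value (0.0104) becomes about 0.0727, which is why the comparison of the Control with variant 8 is no longer significant at the 5% level.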

### Multiple Comparisons applying TukeyHSD Test

We can also fit a Logistic Regression and apply Tukey contrasts to compare all pairs of variants. Let’s apply it using the multcomp package.

library(multcomp)

dataset<-data.frame(x=seq(from=80, by=5, length.out=8), n=rep(1000,8), ID=factor(c(1:8)))
dataset
model1<- glm(formula = cbind(x, n-x) ~ ID, family = binomial(link = "logit"), data=dataset)

# Tukey multiple comparisons
summary(glht(model1, mcp(ID="Tukey")))


x	n	ID
80	1000	1
85	1000	2
90	1000	3
95	1000	4
100	1000	5
105	1000	6
110	1000	7
115	1000	8
	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: glm(formula = cbind(x, n - x) ~ ID, family = binomial(link = "logit"),
data = dataset)

Linear Hypotheses:
Estimate Std. Error z value Pr(>|z|)
2 - 1 == 0  0.06607    0.16262   0.406    1.000
3 - 1 == 0  0.12871    0.16061   0.801    0.993
4 - 1 == 0  0.18829    0.15880   1.186    0.936
5 - 1 == 0  0.24512    0.15716   1.560    0.774
6 - 1 == 0  0.29948    0.15565   1.924    0.534
7 - 1 == 0  0.35161    0.15428   2.279    0.306
8 - 1 == 0  0.40169    0.15301   2.625    0.146
3 - 2 == 0  0.06264    0.15833   0.396    1.000
4 - 2 == 0  0.12221    0.15649   0.781    0.994
5 - 2 == 0  0.17905    0.15482   1.157    0.944
6 - 2 == 0  0.23341    0.15329   1.523    0.795
7 - 2 == 0  0.28553    0.15190   1.880    0.564
8 - 2 == 0  0.33562    0.15061   2.228    0.334
4 - 3 == 0  0.05958    0.15441   0.386    1.000
5 - 3 == 0  0.11641    0.15271   0.762    0.995
6 - 3 == 0  0.17077    0.15117   1.130    0.950
7 - 3 == 0  0.22289    0.14975   1.488    0.814
8 - 3 == 0  0.27298    0.14844   1.839    0.593
5 - 4 == 0  0.05683    0.15081   0.377    1.000
6 - 4 == 0  0.11119    0.14924   0.745    0.996
7 - 4 == 0  0.16332    0.14780   1.105    0.956
8 - 4 == 0  0.21340    0.14648   1.457    0.830
6 - 5 == 0  0.05436    0.14749   0.369    1.000
7 - 5 == 0  0.10648    0.14603   0.729    0.996
8 - 5 == 0  0.15657    0.14470   1.082    0.961
7 - 6 == 0  0.05212    0.14441   0.361    1.000
8 - 6 == 0  0.10221    0.14306   0.714    0.997
8 - 7 == 0  0.05009    0.14156   0.354    1.000
(Adjusted p values reported -- single-step method)
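
To help read the table above, the following sketch reproduces the Estimate, Std. Error and z value of the first row (2 - 1) in base R, assuming the same data. Each estimate is a difference of log-odds between two variants, and the standard error shown is the usual Wald standard error for such a difference. The variable names are assumptions made for this example, and the adjusted p-values in the last column are not reproduced here, since they come from the single-step method of multcomp.

```r
x <- seq(from = 80, by = 5, length.out = 8)   # clicks per variant
n <- rep(1000, 8)                             # impressions per variant

# log-odds of a click for each variant
log_odds <- log(x / (n - x))

# row "2 - 1": difference of log-odds between variant 2 and variant 1
est_2_vs_1 <- log_odds[2] - log_odds[1]

# Wald standard error of the difference of two log-odds
se_2_vs_1 <- sqrt(1 / x[2] + 1 / (n[2] - x[2]) + 1 / x[1] + 1 / (n[1] - x[1]))

z_2_vs_1 <- est_2_vs_1 / se_2_vs_1
print(round(c(estimate = est_2_vs_1, std_error = se_2_vs_1, z = z_2_vs_1), 5))
```

Exponentiating the estimate gives the odds ratio between the two variants, here about 1.07.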
