In many cases, there is a need to split a userbase into 2 or more buckets. For example:
- UCG: To quantify and evaluate the performance of promotional campaigns, many companies create a Universal Control Group (UCG), a random sample of the userbase that does not receive any offer or message.
- Bucketize: For testing purposes, it is common to split the userbase into buckets so that they can be compared over the long term.
- Samples for Machine Learning: A userbase can be too large to train a machine learning model on in full, so it is common to work with random samples.
The requirements
For the cases that we mentioned above, the splitting algorithm must satisfy the following two requirements:
- There should be a mapping function that assigns an existing user to the same group every time we encounter them. For instance, if the UserID 152514 was initially assigned to the UCG, then it must always be assigned to the UCG.
- There should be a mapping function that assigns every new user to a group.
We can fulfill the requirements above by applying the modulo operation.
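To make the two requirements concrete, here is a minimal sketch for plain numeric user IDs. The helper name `assign_group` and the sample IDs are hypothetical; the modulus 997 and the cutoff 200 are the values used later in this post.

```r
# Hypothetical helper: maps a numeric user ID to a group via its remainder
assign_group <- function(user_id, modulus = 997, ucg_cutoff = 200) {
  ifelse(user_id %% modulus < ucg_cutoff, "UCG", "Control")
}

# Requirement 1: an existing user gets the same group on every run
existing_users <- c(152514, 48213, 7)
first_run  <- assign_group(existing_users)
second_run <- assign_group(existing_users)
identical(first_run, second_run)

# Requirement 2: a user ID seen for the first time is covered by the same rule
assign_group(900001)
```

Because the group depends only on the ID itself, no lookup table of past assignments has to be stored.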
Example of Splitting the Userbase with Modulo
Let’s see how we can split the userbase into two buckets. Let’s say that we want 20% of the users to be in the UCG and the remaining 80% to be in Control. Usually, the UserIDs will be hashed for privacy reasons, for example to comply with GDPR. Below we generate some random data:
library(tidyverse)
library(digest)
library(Rmpfr)

set.seed(5)

# One row per user
df <- tibble(Row_Number = seq(1, 100000))

# For each row, hash a random 10-letter string with MD5 and draw a random 2019 timestamp
df <- df %>%
  rowwise() %>%
  mutate(
    Hash_Name = digest(paste(sample(LETTERS, 10, replace = TRUE), collapse = ""),
                       algo = "md5", serialize = FALSE),
    Event_Date = lubridate::as_datetime(runif(1, 1546290000, 1577739600))
  )

head(df)
Output:
# A tibble: 6 x 3
# Rowwise:
  Row_Number Hash_Name                        Event_Date
       <int> <chr>                            <dttm>
1          1 275db34231203750f10adb24c76b9619 2019-06-10 06:15:33
2          2 9a449c58ac6baed3b3648f0f3b5f8084 2019-03-27 21:38:34
3          3 e28e89ab554739a982c862cccf024464 2019-12-02 15:43:48
4          4 45b9aea890d3b98419cae72bb497e94b 2019-10-18 18:58:23
5          5 c4ce7434621d08f5195fbd1bfc1c20c2 2019-08-09 06:14:45
6          6 0b8a304be1015cacfcf31dd40ef6a381 2019-04-10 08:07:28
To get remainders that behave like random numbers, it is generally preferable to use a prime number as the divisor in the modulo operation. For this example, we will use 997, which is prime. We also need to convert the hexadecimal MD5 hash to a number. An MD5 hash has 128 bits, which is far more than a regular double can represent exactly, so we use the arbitrary-precision arithmetic of the Rmpfr library in R. To sum up:
- We will convert the MD5 hash to a number
- We will divide that number by 997 and keep the remainder
df$Remainder <- as.numeric(mpfr(df$Hash_Name, base=16) %% 997)
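As an alternative sketch that does not rely on Rmpfr, the remainder of a hexadecimal string can be computed in base R by processing one hex digit at a time (Horner's rule), reducing modulo 997 at every step so that intermediate values stay small. The function name `hex_mod` is hypothetical and not part of any package; the two hashes are taken from the sample output above.

```r
# Hypothetical helper: remainder of a hexadecimal string modulo a small integer.
# Works digit by digit, so it never needs to hold the full 128-bit number.
hex_mod <- function(hex, modulus = 997L) {
  modulus <- as.integer(modulus)
  digits <- strtoi(strsplit(tolower(hex), "")[[1]], base = 16L)
  remainder <- 0L
  for (d in digits) {
    # Shift the running value by one hex digit, add the new digit, reduce
    remainder <- (remainder * 16L + d) %% modulus
  }
  remainder
}

hex_mod("3e5")  # 0x3e5 is 997, so the remainder is 0

# Apply it to a vector of MD5 hashes
hashes <- c("275db34231203750f10adb24c76b9619",
            "9a449c58ac6baed3b3648f0f3b5f8084")
remainders <- vapply(hashes, hex_mod, integer(1), USE.NAMES = FALSE)
remainders
```

The loop is slower than the vectorised Rmpfr call, so this is mainly useful when installing Rmpfr is not an option or when the same logic has to be ported to another language.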
Is it Random?
This approach generates pseudo-random numbers. Let’s check whether the remainders (from 0 to 996) are uniformly distributed, starting with a histogram.
hist(df$Remainder)
We can apply a Chi-Square test too.
chisq.test(table(df$Remainder))
Output:
Chi-squared test for given probabilities
data: table(df$Remainder)
X-squared = 995.2, df = 996, p-value = 0.5012
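The p-value is 0.5012, so we cannot reject the hypothesis that the remainders are uniformly distributed. In other words, the data show no evidence against treating the generated numbers as random.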
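To see what the test is doing, the statistic can be reproduced by hand. The sketch below uses simulated remainders as a stand-in for df$Remainder (an assumption, so that the snippet runs on its own) and compares the manual calculation with chisq.test.

```r
set.seed(1)
modulus <- 997

# Simulated remainders stand in for df$Remainder (an assumption for this sketch)
remainders <- sample(0:(modulus - 1), size = 100000, replace = TRUE)

# Count every possible remainder, including any value that never occurs
observed <- table(factor(remainders, levels = 0:(modulus - 1)))

# Under uniformity, each remainder is expected equally often
expected <- length(remainders) / modulus

# Pearson statistic: sum over all bins of (observed - expected)^2 / expected
statistic <- sum((observed - expected)^2 / expected)
p_value <- pchisq(statistic, df = modulus - 1, lower.tail = FALSE)

# The built-in test should give the same statistic and p-value
builtin <- chisq.test(observed)
c(manual = statistic, builtin = unname(builtin$statistic))
c(manual = p_value, builtin = builtin$p.value)
```

With 100,000 users and 997 possible remainders, each remainder is expected about 100 times, which is comfortably large enough for the chi-square approximation.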
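Now, we can split our userbase into UCG and Control as follows:
If the remainder is less than 200, the user goes to UCG, otherwise to Control. Note that 200 of the 997 possible remainders correspond to about 20.06% of the users, which is as close to 20% as this modulus allows.
df$Group <- ifelse(df$Remainder < 200, 'UCG', 'Control')
df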
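The same idea extends to more than two buckets by cutting the remainder range into consecutive intervals. The following is a minimal sketch under stated assumptions: the helper `assign_bucket`, the bucket names and the shares are all hypothetical, and the cut-points are rounded to whole remainders, so the realised shares are only approximately the requested ones.

```r
# Hypothetical helper: map remainders in 0..(modulus - 1) to named buckets
assign_bucket <- function(remainder, shares, modulus = 997) {
  stopifnot(abs(sum(shares) - 1) < 1e-8, !is.null(names(shares)))
  # Upper cut-points on the remainder scale; the last one is forced to the modulus
  upper <- round(cumsum(shares) * modulus)
  upper[length(upper)] <- modulus
  # Intervals are closed on the left: [0, u1), [u1, u2), ..., [u_k-1, modulus)
  as.character(cut(remainder, breaks = c(0, upper),
                   labels = names(shares), right = FALSE))
}

# Illustrative split: 20% UCG and two 40% test buckets
shares <- c(UCG = 0.2, A = 0.4, B = 0.4)
assign_bucket(c(0, 198, 199, 597, 598, 996), shares)
```

As long as the shares and their order stay fixed, a given remainder always lands in the same bucket, so the assignment remains reproducible.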
Check the Proportions
Finally, we want to make sure that the proportions are close to 80% and 20% for Control and UCG respectively.
prop.table(table(df$Group))
Output:
Control UCG
0.80002 0.19998
Conclusion
We can use the modulo operation on hashed user IDs to split a userbase in a reproducible and efficient way.