Predictive Hacks

How to Randomly Split a Userbase Using Modulo

In many cases, there is a need to split a userbase into 2 or more buckets. For example:

  • UCG: To quantify and evaluate the performance of promotional campaigns, many companies create a Universal Control Group (UCG): a random sample of the userbase that does not receive any offer or message.
  • Bucketize: For testing purposes, it is common to split the userbase into buckets so that they can be compared over the long term.
  • Samples for Machine Learning: A userbase can become too large to train a machine learning model on, so it is common to work with random samples.

The requirements

For the cases that we mentioned above, the splitting algorithm must satisfy the following two requirements:

  1. The mapping function must be deterministic, so that an existing user is assigned to the same group every time we encounter them. For instance, if UserID 152514 was initially assigned to the UCG, it will always remain in the UCG.
  2. The mapping function must also cover new users, so that every new user is assigned to a group.

We can fulfill the requirements above by applying the modulo operation.
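As a minimal sketch of why modulo meets both requirements, consider plain numeric UserIDs. The helper name assign_group, the modulus 997 and the cut-off 200 are assumptions chosen only for illustration:

```r
# Hypothetical helper: map a numeric UserID to a group via modulo.
# The modulus (997) and the cut-off (200) are illustrative assumptions.
assign_group <- function(user_id, modulus = 997, cutoff = 200) {
  ifelse(user_id %% modulus < cutoff, "UCG", "Control")
}

existing_users <- c(1001, 2500, 42)

# Requirement 1: the same user always lands in the same group
first_run  <- assign_group(existing_users)
second_run <- assign_group(existing_users)
identical(first_run, second_run)

# Requirement 2: a brand-new UserID is covered by the same rule,
# without any lookup table of previous assignments
assign_group(987654)
```

Because the group depends only on the ID itself, nothing needs to be stored to keep assignments stable over time.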

Example of Splitting the Userbase with Modulo

Let’s see how we can split the userbase into two buckets. Say that we want 20% of the users to be in the UCG and the remaining 80% to be in Control. Usually the UserIDs are hashed for GDPR compliance, so below we generate some random data with MD5-hashed identifiers:

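library(tidyverse)
library(digest)
library(Rmpfr)
set.seed(5)

# One row per user
df <- tibble(Row_Number = seq(1, 100000))

# For each row, hash a random 10-letter string with MD5 (a stand-in for a
# hashed UserID) and draw a random event timestamp
df <- df %>%
  rowwise() %>%
  mutate(Hash_Name = digest(paste(sample(LETTERS, 10, replace = TRUE), collapse = ""),
                            algo = "md5", serialize = FALSE),
         Event_Date = lubridate::as_datetime(runif(1, 1546290000, 1577739600)))

head(df)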

Output:

# A tibble: 6 x 3
# Rowwise: 
  Row_Number Hash_Name                        Event_Date         
       <int> <chr>                            <dttm>             
1          1 275db34231203750f10adb24c76b9619 2019-06-10 06:15:33
2          2 9a449c58ac6baed3b3648f0f3b5f8084 2019-03-27 21:38:34
3          3 e28e89ab554739a982c862cccf024464 2019-12-02 15:43:48
4          4 45b9aea890d3b98419cae72bb497e94b 2019-10-18 18:58:23
5          5 c4ce7434621d08f5195fbd1bfc1c20c2 2019-08-09 06:14:45
6          6 0b8a304be1015cacfcf31dd40ef6a381 2019-04-10 08:07:28

To get well-mixed pseudo-random remainders, it is better to choose a prime number as the modulus. For this example we take 997, which is prime. We also need to convert the MD5 hash, which is a hexadecimal string, to a number. An MD5 hash has 128 bits, so it does not fit in a regular double without losing precision; we therefore use the arbitrary-precision Rmpfr library in R. To sum up:

  • We will convert the MD5 hash to a number
  • We will divide that number by 997 and store the remainder

df$Remainder <- as.numeric(mpfr(df$Hash_Name, base=16) %% 997)
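To make the two steps above more concrete, here is a base R sketch that computes the same kind of remainder without Rmpfr. It reads the hexadecimal string one digit at a time and reduces modulo 997 after every digit, so the full 128-bit number is never built. The helper name hex_mod is hypothetical and this is an alternative illustration, not the method used in the post:

```r
# Hypothetical helper: remainder of a hexadecimal string modulo a small integer,
# computed digit by digit so the full 128-bit number is never materialised.
hex_mod <- function(hex, modulus = 997) {
  vapply(hex, function(h) {
    digits <- strtoi(strsplit(h, "")[[1]], base = 16L)
    rem <- 0
    # Appending a hex digit multiplies the number by 16 and adds the digit
    for (d in digits) rem <- (rem * 16 + d) %% modulus
    rem
  }, numeric(1), USE.NAMES = FALSE)
}

hex_mod("3e5")  # 0x3e5 is 997, so the remainder is 0
hex_mod("3e6")  # one more than 997, so the remainder is 1
hex_mod("275db34231203750f10adb24c76b9619")  # some value between 0 and 996
```

The function is vectorised over its first argument, so under these assumptions it could be applied to a whole column of hashes.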

Is it Random?

This approach generates pseudo-random numbers. Let’s check whether the remainders (from 0 to 996) are uniformly distributed.

hist(df$Remainder)

[Figure: histogram of df$Remainder]

We can apply a Chi-Square test too.

chisq.test(table(df$Remainder))

Output:

	Chi-squared test for given probabilities

data:  table(df$Remainder)
X-squared = 995.2, df = 996, p-value = 0.5012

The p-value is 0.5012, so we cannot reject the hypothesis that the remainders are uniformly distributed. In other words, the generated numbers can be treated as random for our purposes.
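For readers who want to see what the test computes, the sketch below reproduces the Pearson statistic by hand. It uses simulated remainders as a stand-in for df$Remainder, so the numbers will differ from the output above. Fixing the factor levels to 0:996 is an extra precaution (an assumption on our side) so that a remainder that never occurs still counts as a bin with zero observations:

```r
set.seed(1)
# Simulated stand-in for df$Remainder: 100,000 draws from 0..996
remainders <- sample(0:996, 100000, replace = TRUE)

# Fix the levels so that every possible remainder has a bin, even if empty
observed <- table(factor(remainders, levels = 0:996))
expected <- length(remainders) / 997

# Pearson statistic: sum over bins of (observed - expected)^2 / expected
stat <- sum((observed - expected)^2 / expected)
p_value <- pchisq(stat, df = 997 - 1, lower.tail = FALSE)

# The hand-computed statistic should agree with the built-in test
all.equal(stat, unname(chisq.test(observed)$statistic))
```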

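Now we can split our userbase into UCG and Control as follows:

If the remainder is less than 200, the user goes to UCG, otherwise to Control. Since 200 of the 997 possible remainders fall below the cut-off, the expected UCG share is 200/997 ≈ 20.06%.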

df$Group <- ifelse(df$Remainder<200, 'UCG', 'Control')

df

[Figure: the data frame with the added Remainder and Group columns]
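The same idea extends to more than two buckets by cutting the range of remainders at several points. The sketch below uses simulated remainders, and the bucket names and cut-offs are assumptions picked only for illustration:

```r
set.seed(2)
# Simulated stand-in for the remainders computed earlier
remainders <- sample(0:996, 100000, replace = TRUE)

# Illustrative cut-offs: [0, 200) -> UCG, [200, 600) -> Test_A, [600, 997) -> Test_B
breaks <- c(0, 200, 600, 997)
labels <- c("UCG", "Test_A", "Test_B")

# right = FALSE makes each interval closed on the left and open on the right
group <- cut(remainders, breaks = breaks, labels = labels, right = FALSE)

prop.table(table(group))
```

Each bucket's expected share is the width of its interval divided by 997.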

Check the Proportions

Finally, we want to make sure that the proportion is 80% vs 20% for Control and UCG respectively.

prop.table(table(df$Group))

Output:

Control     UCG 
0.80002 0.19998 
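Note that a cut-off of 200 on a modulus of 997 targets 200/997 rather than exactly 20%. As a rough sanity check, the sketch below compares the observed UCG count with that expected share using an exact binomial test. The count of 19998 is derived from the proportion printed above (0.19998 of 100,000 rows), and the variable names are hypothetical:

```r
# Share of possible remainders (0..996) that fall below the cut-off of 200
expected_ucg <- 200 / 997
round(expected_ucg, 4)

# Observed UCG count implied by the proportions above
observed_ucg <- 19998

# Exact binomial test of the observed count against the expected share;
# a large p-value means the gap is compatible with sampling noise
p_value <- binom.test(observed_ucg, 100000, p = expected_ucg)$p.value
p_value
```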

Conclusion

We can use the modulo function to split a userbase in a reproducible and efficient way.
