In many cases, there is a need to split a userbase into 2 or more buckets. For example:

- **UCG**: Many companies that run promotional campaigns create a Universal Control Group (UCG), a random sample of the userbase that does not receive any offer or message, in order to quantify and evaluate the performance of the campaigns.
- **Bucketize**: For testing purposes, it is common to split the userbase into buckets so that they can be compared over the long term.
- **Samples for Machine Learning**: A userbase can become too large for a machine learning model to run on, so it is common to work with random samples.

## The requirements

For the cases that we mentioned above, the splitting algorithm must satisfy the following two requirements:

- There should be a mapping function that assigns every **existing user** to the same group each time we encounter them. For instance, if the **UserID** `152514` was initially assigned to the UCG, then it will always belong to the UCG.
- There should be a mapping function that assigns every **new user** to a group.

We can fulfill the requirements above by applying the modulo operation.
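As a minimal sketch of both requirements, assume plain numeric UserIDs and a hypothetical helper `assign_group` (the modulus 997 and the cutoff 200 are the values used later in this post). The group depends only on the ID, so an existing user always maps to the same group and a new user gets a group without any lookup table:

```r
# Hypothetical helper: the group is a pure function of the UserID
assign_group <- function(user_id, modulus = 997, cutoff = 200) {
  ifelse(user_id %% modulus < cutoff, "UCG", "Control")
}

existing_ids <- c(152514, 100, 998)

# Running the mapping twice gives the same groups for the same users
first_run  <- assign_group(existing_ids)
second_run <- assign_group(existing_ids)
identical(first_run, second_run)

# A user that was never seen before is assigned on the spot
assign_group(2000001)
```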

## Example of Splitting the Userbase with Modulo

Let’s see how we can split the userbase into two buckets. Say that we want 20% of the users in the UCG and the remaining 80% in Control. UserIDs are usually hashed for GDPR compliance, so below we generate some random data with MD5-hashed IDs:

```r
library(tidyverse)
library(digest)
library(Rmpfr)

set.seed(5)

df <- tibble(Row_Number = seq(1, 100000))

# For each row, hash a random 10-letter string with MD5 and draw a random 2019 timestamp
df <- df %>%
  rowwise() %>%
  mutate(
    Hash_Name = digest(paste(sample(LETTERS, 10, replace = TRUE), collapse = ""),
                       algo = "md5", serialize = FALSE),
    Event_Date = lubridate::as_datetime(runif(1, 1546290000, 1577739600))
  )

head(df)
```

**Output:**

```
# A tibble: 6 x 3
# Rowwise:
Row_Number Hash_Name Event_Date
<int> <chr> <dttm>
1 1 275db34231203750f10adb24c76b9619 2019-06-10 06:15:33
2 2 9a449c58ac6baed3b3648f0f3b5f8084 2019-03-27 21:38:34
3 3 e28e89ab554739a982c862cccf024464 2019-12-02 15:43:48
4 4 45b9aea890d3b98419cae72bb497e94b 2019-10-18 18:58:23
5 5 c4ce7434621d08f5195fbd1bfc1c20c2 2019-08-09 06:14:45
6 6 0b8a304be1015cacfcf31dd40ef6a381 2019-04-10 08:07:28
```

To spread the users evenly across the remainders, a prime number is a common choice for the modulo operation, since a prime shares no factors with regular patterns that may exist in the IDs. For this example, we will use **997**, which is a prime number. We also need to **convert the MD5 hash to a number**, which we can do with the `Rmpfr` library in R. To sum up:

- We will convert the MD5 hash to a number
- We will divide that number by **997** and keep the remainder

```r
# Read each hash as a base-16 number and keep its remainder modulo 997
df$Remainder <- as.numeric(mpfr(df$Hash_Name, base = 16) %% 997)
```
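If installing `Rmpfr` is not an option, the same remainder can be computed in base R by processing the hash one hex digit at a time (Horner's rule), which keeps every intermediate value small. This is a sketch of an alternative, not the method used in this post, and `hex_mod` is a hypothetical helper name:

```r
# Remainder of a hexadecimal string modulo m, using base R only
hex_mod <- function(hex, m = 997) {
  # Split the string into single characters and read each one as a hex digit
  digits <- strtoi(strsplit(tolower(hex), "")[[1]], base = 16L)
  rem <- 0
  for (d in digits) {
    # Shift the running value by one hex digit, add the new digit, reduce
    rem <- (rem * 16 + d) %% m
  }
  rem
}

hex_mod("ff")   # 0xff  = 255, so the remainder is 255
hex_mod("3e5")  # 0x3e5 = 997, so the remainder is 0
hex_mod("275db34231203750f10adb24c76b9619")
```

Applied to the data above, `vapply(df$Hash_Name, hex_mod, numeric(1), USE.NAMES = FALSE)` should give the same remainders as the `Rmpfr` version.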

**Is it Random?**

This approach generates pseudo-random numbers. Let’s check whether the remainders (from 0 to 996) are uniformly distributed.

```r
hist(df$Remainder)
```

We can apply a Chi-Square test too.

```r
chisq.test(table(df$Remainder))
```

**Output:**

```
Chi-squared test for given probabilities
data: table(df$Remainder)
X-squared = 995.2, df = 996, p-value = 0.5012
```

The **p-value is 0.5012**, so we cannot reject the hypothesis that the remainders are uniformly distributed. In that sense, the generated numbers can be considered random.
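To make explicit what the test computes, here is a minimal sketch that reproduces the statistic by hand. It uses simulated uniform draws as a stand-in for `df$Remainder` (an assumption, so the numbers will differ from the output above), and the variable names are illustrative:

```r
set.seed(1)
modulus <- 997

# Stand-in for df$Remainder: uniform draws from 0..996 (assumption)
remainder <- sample(0:(modulus - 1), 100000, replace = TRUE)

# Fixing the factor levels keeps remainders that never occur as empty bins
observed <- table(factor(remainder, levels = 0:(modulus - 1)))

# Under uniformity every remainder is expected equally often
expected <- length(remainder) / modulus

# Pearson chi-square statistic and its upper-tail p-value
statistic <- sum((observed - expected)^2 / expected)
p_value <- pchisq(statistic, df = modulus - 1, lower.tail = FALSE)

c(statistic = statistic, p_value = p_value)
```

Fixing the levels is a defensive choice: with a much smaller userbase some remainders may never appear, and a plain `table()` would then silently test fewer than 997 categories.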

Now, we can split our userbase into **UCG** and **Control** as follows:

**If the remainder is less than 200 then UCG else Control**

```r
df$Group <- ifelse(df$Remainder < 200, 'UCG', 'Control')
df
```

**Check the Proportions**

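Finally, we want to make sure that the proportions are close to 80% vs 20% for Control and UCG respectively. With a cutoff of 200 out of 997 possible remainders, the expected UCG share is 200/997 ≈ 20.06%.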

```r
prop.table(table(df$Group))
```

**Output:**

```
Control UCG
0.80002 0.19998
```
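The same remainders can feed more than two buckets. The sketch below is one possible generalization, assuming a hypothetical helper `assign_bucket` that turns target shares into cutoffs on the 0 to 996 scale and looks up the bucket with `findInterval`. The shares and labels are made up for illustration:

```r
# Hypothetical helper: map remainders to buckets according to target shares
assign_bucket <- function(remainder, shares, labels, modulus = 997) {
  stopifnot(length(shares) == length(labels),
            abs(sum(shares) - 1) < 1e-9)
  # Lower bound of each bucket; ceiling() reproduces the cutoff of 200 for a 20% share
  lower <- c(0, ceiling(cumsum(shares) * modulus))[seq_along(shares)]
  labels[findInterval(remainder, lower)]
}

# Three buckets of roughly 20%, 30% and 50%
buckets <- assign_bucket(c(0, 199, 200, 498, 499, 996),
                         shares = c(0.2, 0.3, 0.5),
                         labels = c("A", "B", "C"))
buckets
```

Because the cutoffs are whole numbers out of 997, the realized shares are only approximately equal to the targets. A larger prime modulus would allow finer steps.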

## Conclusion

We can use the modulo function to split a userbase in a reproducible and efficient way.