A very common task in data processing is transforming numeric variables (continuous, discrete, etc.) into categorical ones by creating bins. For example, it is quite common to convert age to an age group. Let’s see how we can easily do that in R.
We will consider a random variable from the Poisson distribution with parameter λ=20
library(dplyr)

# Generate 1000 observations from the Poisson distribution
# with lambda equal to 20
df <- data.frame(MyContinuous = rpois(1000, 20))

# Get the histogram
hist(df$MyContinuous)
Create specific Bins
Let’s say that you want to create the following bins:
- Bin 1: (-inf, 15]
- Bin 2: (15,25]
- Bin 3: (25, inf)
We can easily do that using the cut function. Let’s start:
df <- df %>%
  mutate(MySpecificBins = cut(MyContinuous, breaks = c(-Inf, 15, 25, Inf)))
head(df, 10)
Let’s have a look at the counts of each bin.
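One way to get these counts is dplyr's count(). The snippet below is a minimal sketch that rebuilds the data so it runs on its own; the seed value is an arbitrary assumption added only for reproducibility, and the name bin_counts is illustrative.

```r
library(dplyr)

set.seed(123)  # arbitrary seed (assumption), only so the example is reproducible
df <- data.frame(MyContinuous = rpois(1000, 20))

# Same bins as above: (-Inf, 15], (15, 25], (25, Inf)
df <- df %>%
  mutate(MySpecificBins = cut(MyContinuous, breaks = c(-Inf, 15, 25, Inf)))

# One row per bin, with the number of observations in column n
bin_counts <- df %>% count(MySpecificBins)
print(bin_counts)
```

Base R's table(df$MySpecificBins) should give the same counts if you prefer not to use dplyr here.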
Notice that you can also define your own labels within the cut function, through its labels argument.
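As a rough illustration of custom labels, the sketch below passes a labels vector to cut. The label names "Low", "Medium" and "High" are hypothetical, as are the seed and the variable names; the only requirement is one label per bin, i.e. one fewer than the number of breaks.

```r
set.seed(123)  # arbitrary seed (assumption), only so the example is reproducible
x <- rpois(1000, 20)

# Three bins need three labels (one fewer than the number of breaks)
labelled_bins <- cut(
  x,
  breaks = c(-Inf, 15, 25, Inf),
  labels = c("Low", "Medium", "High")  # hypothetical label names
)

# Counts per labelled bin
print(table(labelled_bins))
```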
Create Bins based on Quantiles
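Let’s say that you want each bin to have the same number of observations, for example 4 bins with 25% of the observations each. We can easily do it as follows, using the quantiles as break points (unique drops any duplicated quantile values, and include.lowest=TRUE keeps the minimum value in the first bin):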
numbers_of_bins <- 4
df <- df %>%
  mutate(MyQuantileBins = cut(
    MyContinuous,
    breaks = unique(quantile(MyContinuous, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
    include.lowest = TRUE
  ))
head(df, 10)
We can check whether the MyQuantileBins contain the same number of observations, and also look at their ranges:
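A minimal sketch of such a check, assuming the same simulated data (the seed and the summary column names n, min_value and max_value are illustrative choices, not part of the original post):

```r
library(dplyr)

set.seed(123)  # arbitrary seed (assumption), only so the example is reproducible
df <- data.frame(MyContinuous = rpois(1000, 20))

numbers_of_bins <- 4
df <- df %>%
  mutate(MyQuantileBins = cut(
    MyContinuous,
    breaks = unique(quantile(MyContinuous, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
    include.lowest = TRUE
  ))

# Number of observations and observed range within each quantile bin
quantile_summary <- df %>%
  group_by(MyQuantileBins) %>%
  summarise(
    n = n(),
    min_value = min(MyContinuous),
    max_value = max(MyContinuous)
  )
print(quantile_summary)
```

Because Poisson values are integers with many ties, the counts will usually be only approximately equal: all observations sharing a value fall into the same bin.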
Notice that if you want to split your continuous variable into bins with an equal number of observations, you can also use the ntile function of the dplyr package, but it does not create bin labels based on the ranges.
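For comparison, here is a minimal sketch of the ntile approach. The column name MyNtileBins and the seed are assumptions made for illustration.

```r
library(dplyr)

set.seed(123)  # arbitrary seed (assumption), only so the example is reproducible
df <- data.frame(MyContinuous = rpois(1000, 20))

# ntile returns the bin number (1 to 4) rather than a range label
df <- df %>%
  mutate(MyNtileBins = ntile(MyContinuous, 4))

# Number of observations per bin
ntile_counts <- df %>% count(MyNtileBins)
print(ntile_counts)
```

Since ntile assigns bins by rank, the bins come out with equal counts even when there are ties, which means that observations with the same value can land in different bins. Whether that is acceptable depends on the use case.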
Want to Build Bins in Python?
Do you want to create bins in Python? You can have a look at our post!
6 thoughts on “How to Convert Continuous variables into Categorical by Creating Bins”
Thanks for the post
I am getting an error when I tried to use this command
Error: Problem with `mutate()` input `gest_bins`.
x missing value where TRUE/FALSE needed
ℹ Input `gest_bins` is `cut(gest = cut(-Inf, 28, 35, Inf))`.
Run `rlang::last_error()` to see where the error occurred.
Is this due to missing values and in that case, how to solve that?
Please provide me a reproducible example
The issue is how best to decide the number of bins and their width. Check out how cartographers have handled the same issue when choropleth mapping in a brilliant paper:
The Selection of Class Intervals
Author(s): Ian S. Evans
Transactions of the Institute of British Geographers, New Series, Vol. 2, No. 1, Contemporary Cartography (1977), http://www.jstor.org/stable/622195
This is a nice little post. It is similar to the one I did in SAS (not R) a while ago.