Predictive Hacks

Replace Categorical Variables with Mode in R

In data science projects, it is common to replace the missing values of categorical variables with the mode, i.e. the most frequent value. Let’s see the following example:

df<-data.frame(id=seq(1,10), ColumnA=c(10,9,8,7,NA,NA,20,15,12,NA), 
           ColumnB=c("A","B","A","A","","B","A","B","","A"),
           ColumnC=c("","BB","CC","BB","BB","CC","AA","BB","","AA"),
           ColumnD=c(NA,20,18,22,18,17,19,NA,17,23)
)

df

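Note that ColumnB and ColumnC are character columns (since R 4.0.0, data.frame() no longer converts strings to factors by default). Note also that base R has no function for the statistical mode; the built-in mode() returns an object's storage mode instead. So let’s build one: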

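getmode <- function(v){
  # Drop NAs and empty strings so they cannot be picked as the mode
  v <- v[!is.na(v) & nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  # Count each unique value; which.max returns the first one on ties
  uniqv[which.max(tabulate(match(v, uniqv)))]
}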
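As a quick sanity check, the helper can be run on the ColumnB and ColumnC values from the example. This is only a sketch: the helper is repeated so the snippet runs on its own, and the names col_b, col_c, mode_b, mode_c and tab_b are introduced here for illustration.

```r
# getmode helper, repeated so this snippet runs on its own
getmode <- function(v) {
  v <- v[nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# ColumnB and ColumnC values copied from the example data frame
col_b <- c("A", "B", "A", "A", "", "B", "A", "B", "", "A")
col_c <- c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA")

mode_b <- getmode(col_b)  # "A" occurs 5 times, "B" 3 times
mode_c <- getmode(col_c)  # "BB" occurs 4 times, "CC" and "AA" twice each

print(mode_b)
print(mode_c)

# Cross-check ColumnB with base R's table(), ignoring empty strings
tab_b <- table(col_b[col_b != ""])
print(names(tab_b)[which.max(tab_b)])
```

When two values are tied for the highest count, which.max picks the first maximum, so the helper returns whichever tied value appears first in the vector.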

Now let’s replace every empty string in the character columns with that column’s mode, and then convert the character columns to factors.

chr_cols <- sapply(df, is.character)  # character columns only
df[chr_cols] <- lapply(df[chr_cols], function(x) ifelse(x == "", getmode(x), x))  # fill "" with the mode
df[chr_cols] <- lapply(df[chr_cols], as.factor)  # then convert to factors
df

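As we can see, the empty strings in ColumnB and ColumnC were replaced with the corresponding mode ("A" and "BB" respectively). The NA values in the numeric columns ColumnA and ColumnD are left untouched.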
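The example marks missing categories with empty strings, but categorical columns often carry NA as well. Below is a minimal sketch of one way to treat both as missing. The helpers getmode_na and fill_mode and the data frame df2 are hypothetical names, not part of the code above, and the data is ColumnB from the example with one empty string swapped for NA.

```r
# Hypothetical helper: mode of a character vector, ignoring NA and ""
getmode_na <- function(v) {
  v <- v[!is.na(v) & v != ""]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Hypothetical helper: replace both NA and "" with the column mode
fill_mode <- function(x) {
  is_missing <- is.na(x) | x == ""
  x[is_missing] <- getmode_na(x)
  x
}

# ColumnB from the example, with position 5 changed from "" to NA
df2 <- data.frame(
  id = 1:10,
  ColumnB = c("A", "B", "A", "A", NA, "B", "A", "B", "", "A"),
  stringsAsFactors = FALSE
)

chr_cols <- sapply(df2, is.character)
df2[chr_cols] <- lapply(df2[chr_cols], fill_mode)   # positions 5 and 9 become "A"
df2[chr_cols] <- lapply(df2[chr_cols], as.factor)   # convert to factor afterwards

print(df2$ColumnB)
```

This sketch assumes every character column has at least one non-missing value. A column that is entirely NA or empty would need a separate guard, because there is no mode to fill in.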
