In data science projects, it is common to replace the missing values of categorical variables with the mode, i.e. the most frequent value. Let’s look at the following example:
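df <- data.frame(
  id      = seq(1, 10),
  ColumnA = c(10, 9, 8, 7, NA, NA, 20, 15, 12, NA),
  ColumnB = c("A", "B", "A", "A", "", "B", "A", "B", "", "A"),
  ColumnC = c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA"),
  ColumnD = c(NA, 20, 18, 22, 18, 17, 19, NA, 17, 23),
  # Already the default since R 4.0.0; on older versions it keeps the text columns as character
  stringsAsFactors = FALSE
)
df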
Note that ColumnB and ColumnC are character columns. Note also that R has no built-in function for the statistical mode (the base mode() function returns an object's storage mode instead). So let’s build one:
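getmode <- function(v) {
  # Ignore empty strings so that "" can never be returned as the mode
  v <- v[nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  # Count the occurrences of each unique value and return the most frequent one
  uniqv[which.max(tabulate(match(v, uniqv)))]
}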
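As a quick sanity check, here is a minimal sketch of how getmode behaves on two small vectors. The vectors are made up for illustration and are not part of the example data. The second call shows what happens on a tie: because which.max returns the first maximum, the value that appears first in the vector wins.

```r
# Same helper as above, repeated so the snippet runs on its own
getmode <- function(v) {
  v <- v[nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Illustrative vectors (hypothetical, not taken from df)
most_common <- getmode(c("A", "B", "", "A", "B", "B"))
tie_result  <- getmode(c("X", "Y", "Y", "X"))

print(most_common)  # "B": three occurrences against two for "A"; "" is ignored
print(tie_result)   # "X": both values occur twice, the first one seen is returned
```

If a different tie-breaking rule matters for your data, the helper would need to be adapted accordingly.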
Now let’s replace all the empty strings of the character variables with their corresponding column mode. Finally, we should convert the character variables to factors.
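# Replace empty strings in every character column with that column's mode
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)],
                                       function(x) ifelse(x == "", getmode(x), x))

# Convert the character columns to factors
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
df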
As we can see, the empty strings have been replaced with the mode of their column: "A" for ColumnB and "BB" for ColumnC.
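Rather than inspecting the printed data frame by eye, the result can also be checked programmatically. The following is only a sketch: it rebuilds the two character columns from the example so that it runs on its own, and the names char_cols and any_empty are introduced here purely for illustration.

```r
# Helper and data repeated from the example so the snippet is self-contained
getmode <- function(v) {
  v <- v[nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

df <- data.frame(
  ColumnB = c("A", "B", "A", "A", "", "B", "A", "B", "", "A"),
  ColumnC = c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA"),
  stringsAsFactors = FALSE
)

# Impute the mode, then convert to factors (same steps as in the post)
char_cols <- sapply(df, is.character)
df[char_cols] <- lapply(df[char_cols], function(x) ifelse(x == "", getmode(x), x))
df[char_cols] <- lapply(df[char_cols], as.factor)

# TRUE if any column still contains an empty string
any_empty <- any(sapply(df, function(x) any(as.character(x) == "")))

print(any_empty)          # FALSE
print(table(df$ColumnB))  # A: 7, B: 3
print(table(df$ColumnC))  # AA: 2, BB: 6, CC: 2
```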