How to Impute Missing Values in R

Tags: impute missing values, NA

In real-world data, missing values (represented as NA in R) are quite common. When they need to be imputed, the most common approaches are:

Numerical Data: Impute Missing Values with mean or median
Categorical Data: Impute Missing Values with mode
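
As a minimal sketch of these two rules on plain vectors, base R alone is enough. The vectors `x` and `f` are made-up illustration data, not part of the example that follows.

```r
# Toy vectors (hypothetical data, for illustration only)
x <- c(2, 4, NA, 10, 100)
f <- c("a", "b", "a", NA, "a")

# Numerical: replace NA with the mean or the median of the observed values
x_mean   <- replace(x, is.na(x), mean(x, na.rm = TRUE))
x_median <- replace(x, is.na(x), median(x, na.rm = TRUE))

# Categorical: replace NA with the most frequent observed value
# (table() ignores NA by default)
most_common <- names(which.max(table(f)))
f_mode <- replace(f, is.na(f), most_common)

x_mean    # 2 4 29 10 100
x_median  # 2 4 7 10 100
f_mode    # "a" "b" "a" "a" "a"
```

The outlier 100 pulls the mean to 29 while the median stays at 7, which is one reason the median is often preferred for skewed columns.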

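Let’s walk through an example that imputes each column according to its data type. Note that in this example the missing categorical values are stored as empty strings (""), not as NA.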

library(tidyverse)

df <- tibble(id = seq(1, 10),
             ColumnA = c(10, 9, 8, 7, NA, NA, 20, 15, 12, NA),
             ColumnB = factor(c("A", "B", "A", "A", "", "B", "A", "B", "", "A")),
             ColumnC = factor(c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA")),
             ColumnD = c(NA, 20, 18, 22, 18, 17, 19, NA, 17, 23))

df

# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <int>   <dbl> <fct>   <fct>     <dbl>
 1     1      10 "A"     ""           NA
 2     2       9 "B"     "BB"         20
 3     3       8 "A"     "CC"         18
 4     4       7 "A"     "BB"         22
 5     5      NA ""      "BB"         18
 6     6      NA "B"     "CC"         17
 7     7      20 "A"     "AA"         19
 8     8      15 "B"     "BB"         NA
 9     9      12 ""      ""           17
10    10      NA "A"     "AA"         23

For the categorical variables, we will use a “mode” function, which we have to write ourselves: base R has no function for the statistical mode (the built-in mode() returns the storage mode of an object instead).

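getmode <- function(v){
  # Drop empty strings, which mark the missing values in this example
  v <- v[nchar(as.character(v)) > 0]
  # Distinct remaining values
  uniqv <- unique(v)
  # Count each distinct value and return the most frequent one
  uniqv[which.max(tabulate(match(v, uniqv)))]
}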
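
A quick check of the helper on the values of ColumnB may make its behaviour clearer. This is only a sketch: the function is repeated with the same logic so that the block runs on its own, and the tie example at the end uses made-up values.

```r
# Same logic as the helper above, repeated so this block is self-contained
getmode <- function(v) {
  v <- v[nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# The values of ColumnB from the example data
v <- factor(c("A", "B", "A", "A", "", "B", "A", "B", "", "A"))

mode(v)                   # "numeric": the storage mode, not the most frequent value
as.character(getmode(v))  # "A"

# Ties (hypothetical values): which.max() returns the first maximum,
# so the value that appears first in the vector wins
getmode(c("x", "y", "y", "x"))  # "x"
```

Note that the helper only filters out empty strings. A real NA would pass through the filter, so columns that use NA for missing categories need an extra `!is.na(v)` condition.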

Now that we have the “mode” function, we are ready to impute the missing values of a data frame according to the data type of each column: if the column is numeric, we impute with the mean; otherwise, with the mode. Because the script loops over the column names as strings, the “dplyr” package requires a special notation to refer to the columns dynamically: !!cols := on the left-hand side of the assignment and !!rlang::sym(cols) wherever the column is used on the right-hand side. Note that id is numeric too, so it goes through the mean branch; it has no NAs, so its values stay the same, but its type changes from integer to double.
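
To see the notation in isolation before the full loop, here is a minimal sketch on a toy tibble. The names `toy` and `target` are hypothetical and used only for illustration.

```r
library(dplyr)

# Toy tibble and column name (both hypothetical, for illustration only)
toy <- tibble::tibble(score = c(1, NA, 3))
target <- "score"

# rlang::sym() turns the string into a symbol, `!!` unquotes it inside mutate(),
# and `:=` lets the name on the left-hand side come from a variable
toy <- toy %>%
  mutate(!!target := replace(!!rlang::sym(target),
                             is.na(!!rlang::sym(target)),
                             0))

toy$score  # 1 0 3
```

The `.data[[target]]` pronoun is another way to refer to a column by a string on the right-hand side. The loop below applies the `!!` pattern to every column of the data frame.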

for (cols in colnames(df)) {
  if (is.numeric(df[[cols]])) {
    # Numeric column: replace NA with the column mean
    df <- df %>% mutate(!!cols := replace(!!rlang::sym(cols), is.na(!!rlang::sym(cols)), mean(!!rlang::sym(cols), na.rm = TRUE)))
  } else {
    # Any other column: replace "" with the most frequent non-empty value
    df <- df %>% mutate(!!cols := replace(!!rlang::sym(cols), !!rlang::sym(cols) == "", getmode(!!rlang::sym(cols))))
  }
}

df

# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <dbl>   <dbl> <fct>   <fct>     <dbl>
 1     1    10   A       BB         19.2
 2     2     9   B       BB         20  
 3     3     8   A       CC         18  
 4     4     7   A       BB         22  
 5     5    11.6 A       BB         18  
 6     6    11.6 B       CC         17  
 7     7    20   A       AA         19  
 8     8    15   B       BB         19.2
 9     9    12   A       BB         17  
10    10    11.6 A       AA         23

Voilà! The missing values have been imputed!
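
The same result can also be sketched without an explicit loop by using dplyr's across(). This is an alternative, not the approach used above, and it rests on a few assumptions that are additions of this sketch: the data is rebuilt under the name `df2`, the helper gets an extra NA guard, and the id column is excluded from imputation so that it keeps its integer type.

```r
library(dplyr)

# Helper with the same logic as above, plus an NA guard (an addition of this sketch)
getmode <- function(v) {
  v <- v[!is.na(v) & nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Fresh copy of the example data (same values as above)
df2 <- tibble::tibble(
  id = seq(1, 10),
  ColumnA = c(10, 9, 8, 7, NA, NA, 20, 15, 12, NA),
  ColumnB = factor(c("A", "B", "A", "A", "", "B", "A", "B", "", "A")),
  ColumnC = factor(c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA")),
  ColumnD = c(NA, 20, 18, 22, 18, 17, 19, NA, 17, 23)
)

imputed <- df2 %>%
  mutate(
    # Numeric columns except id: NA -> column mean
    across(where(is.numeric) & !id,
           ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))),
    # Factor columns: "" or NA -> most frequent non-empty value
    across(where(is.factor),
           ~ replace(.x, is.na(.x) | .x == "", getmode(.x)))
  )

imputed
```

Swapping mean for median in the first across() call gives median imputation instead.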

3 thoughts on “How to Impute Missing Values in R”

Jason Bryer

September 1, 2020 at 10:46 pm

I would encourage you to look into multiple imputation (the mice package is really good). Mean and mode imputation is almost worse than doing nothing, as it will increase the bias and shrink error estimates.
- George Pipis
  
  September 3, 2020 at 9:11 am
  
  Thank you Jason. I agree that it is risky to impute missing values with the mean or median when it comes to predictive models. Our goal here is to show how it can be done. Regarding the `mice` package: it imputes the missing values based on probabilistic models, which can also be biased.
Ejner Borsting

March 11, 2023 at 10:30 am

Hello.
This gives no error but no imputation either. Can anyone help?
E. Borsting

> library(tidyverse)
> class(df)
[1] "data.frame"
> df
5b c…2 a…3 a…4 b…5 c…6 b…7 c…8 b…9 a…10
1 64 c b c b a c c b c
2 ab4 c b b b c c a
3 3a c b b b b a b a b
4 5c c c b c b b b b c
5 a74 c b b b b a c a c
6 a99 c a c a b b c b b
7 aa8 c a a a c
8 a68 c a b b b c c a c
9 b4b c c b b b a c a c
10 b43 c a a b b b b a a
>
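> getmode <- function(v){
+ v=v[nchar(as.character(v))>0]
+ uniqv <- unique(v)
+ uniqv[which.max(tabulate(match(v, uniqv)))]
+ }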
>
> for (cols in colnames(df)) {
+ if (cols %in% names(df[,sapply(df, is.numeric)])) {
+ df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), is.na(!!rlang::sym(cols)), mean(!!rlang::sym(cols), na.rm=TRUE)))
+
+ }
+ else {
+
+ df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), !!rlang::sym(cols)=="", getmode(!!rlang::sym(cols))))
+
+ }
+ }
>
> df
5b c…2 a…3 a…4 b…5 c…6 b…7 c…8 b…9 a…10
1 64 c b c b a c c b c
2 ab4 c b b b c c a
3 3a c b b b b a b a b
4 5c c c b c b b b b c
5 a74 c b b b b a c a c
6 a99 c a c a b b c b b
7 aa8 c a a a c
8 a68 c a b b b c c a c
9 b4b c c b b b a c a c
10 b43 c a a b b b b a a