Predictive Hacks

How to Remove the Correlated Variables from a Data Frame

When we build predictive models, we usually remove highly correlated variables to reduce multicollinearity. The idea is to keep only one of the two variables in each correlated pair. Let’s see how we can do it in R, taking as an example the independent variables (IVs) of the iris dataset.

Get the correlation matrix of the IVs of the iris dataset:

# Keep the four numeric measurement columns (the IVs); column 5 is Species
df <- iris[, 1:4]

cor(df)

Output:

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

As we can see, some variables are highly correlated. Suppose we want to remove every variable whose absolute correlation with another variable is at or above a threshold, 0.8 in our case. First, we need the correlation of each pair, counting each pair only once.


# Accumulators for the pair table
Var1 <- NULL
Var2 <- NULL
Correlation <- NULL

# Visit each unordered pair exactly once by requiring i > j
for (i in seq_len(ncol(df))) {
  for (j in seq_len(ncol(df))) {
    if (i > j) {
      Var1 <- c(Var1, names(df)[i])
      Var2 <- c(Var2, names(df)[j])
      Correlation <- c(Correlation, cor(df[, i], df[, j]))
    }
  }
}

output <- data.frame(Var1 = Var1, Var2 = Var2, Correlation = Correlation)
output

Output:

          Var1         Var2 Correlation
1  Sepal.Width Sepal.Length  -0.1175698
2 Petal.Length Sepal.Length   0.8717538
3 Petal.Length  Sepal.Width  -0.4284401
4  Petal.Width Sepal.Length   0.8179411
5  Petal.Width  Sepal.Width  -0.3661259
6  Petal.Width Petal.Length   0.9628654
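
The same pair table can also be built without explicit loops. The following is a minimal sketch in base R, assuming the data frame contains only numeric columns; the names `cm`, `idx` and `pair_table` are illustrative and not part of the original code. It reads the pairs off the lower triangle of the correlation matrix, so the rows come out in column-major order rather than in the order produced by the loop.

```r
# Illustrative alternative: read each pair once from the lower triangle
df <- iris[, 1:4]
cm <- cor(df)

# Row/column positions of the cells strictly below the diagonal
idx <- which(lower.tri(cm), arr.ind = TRUE)

pair_table <- data.frame(
  Var1 = rownames(cm)[idx[, "row"]],
  Var2 = colnames(cm)[idx[, "col"]],
  Correlation = cm[idx]  # a two-column index matrix picks one cell per row
)
pair_table
```

Because `Var1` is taken from the row and `Var2` from the column of each cell, the roles of the two columns match the loop version, and the rest of the procedure can be applied to `pair_table` unchanged.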

Let’s remove one of the two variables from each pair whose absolute correlation is at or above 0.8. For each such pair we drop the variable listed in Var2.

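threshold <- 0.8

# For every pair at or above the threshold, mark its Var2 member for removal
exclude <- unique(output[abs(output$Correlation) >= threshold, 'Var2'])

# Keep only the columns that were not marked
reduced <- df[, !names(df) %in% exclude]

head(reduced)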

Output:

  Sepal.Width Petal.Width
1         3.5         0.2
2         3.0         0.2
3         3.2         0.2
4         3.1         0.2
5         3.6         0.2
6         3.9         0.4

Let’s also get the correlation of the reduced data frame.

cor(reduced)

Output:

            Sepal.Width Petal.Width
Sepal.Width   1.0000000  -0.3661259
Petal.Width  -0.3661259   1.0000000

As we can see, we removed the correlated variables and are left with 2 IVs instead of 4.
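
To reuse the procedure on other data, the steps can be wrapped in a function. This is only a sketch under stated assumptions: `remove_correlated` is a hypothetical helper name, the input is assumed to be a data frame with numeric columns only and no missing values, and the rule is the same as above (in each pair at or above the threshold, the variable that appears earlier in the data frame is dropped).

```r
# Hypothetical helper, not part of the original post
remove_correlated <- function(data, threshold = 0.8) {
  cm <- cor(data)
  # Blank out the diagonal and upper triangle so each pair is seen once
  cm[!lower.tri(cm)] <- NA
  # A column is dropped if any later column reaches the threshold with it
  to_drop <- apply(abs(cm) >= threshold, 2, any, na.rm = TRUE)
  data[, !to_drop, drop = FALSE]
}

reduced_fn <- remove_correlated(iris[, 1:4], threshold = 0.8)
names(reduced_fn)

# Sanity check: no remaining pair reaches the threshold
cm_reduced <- cor(reduced_fn)
all(abs(cm_reduced[lower.tri(cm_reduced)]) < 0.8)
```

Using `drop = FALSE` keeps the result a data frame even when only one column survives.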
