Predictive Hacks

# How to Remove the Correlated Variables from a Data Frame

When we build predictive models, we usually remove highly correlated variables to reduce multicollinearity. The idea is to keep only one variable from each correlated pair. Let’s see how to do this in R, using the independent variables (IVs) of the `iris` dataset as an example.

Get the correlation matrix of the IVs of the `iris` dataset:

```
df <- iris[, c(1:4)]

cor(df)
```

Output:

```
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
```

As we can see, some variables are highly correlated. Suppose we want to remove every variable whose absolute correlation with another variable is at least a given threshold, 0.8 in our case. First, we need the correlation of each pair, counting each pair only once.

```
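# Collect each pair of variables once, together with its correlation
Var1 <- NULL
Var2 <- NULL
Correlation <- NULL

for (i in 1:ncol(df)) {
  for (j in 1:ncol(df)) {
    # i > j visits only the lower triangle, so each pair appears once
    if (i > j) {
      Var1 <- c(Var1, names(df)[i])
      Var2 <- c(Var2, names(df)[j])
      Correlation <- c(Correlation, cor(df[, i], df[, j]))
    }
  }
}

output <- data.frame(Var1 = Var1, Var2 = Var2, Correlation = Correlation)
output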
```

Output:

```
          Var1         Var2 Correlation
1  Sepal.Width Sepal.Length  -0.1175698
2 Petal.Length Sepal.Length   0.8717538
3 Petal.Length  Sepal.Width  -0.4284401
4  Petal.Width Sepal.Length   0.8179411
5  Petal.Width  Sepal.Width  -0.3661259
6  Petal.Width Petal.Length   0.9628654
```
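The same pair table can also be built without explicit loops. The following is a minimal sketch using only base R (`lower.tri` and `which(..., arr.ind = TRUE)`); the names `cm`, `idx` and `pairs_df` are illustrative and do not come from the original code. Note that the rows come out in column-major order, so their order differs from the loop version even though the pairs and values are the same.

```
df <- iris[, 1:4]
cm <- cor(df)

# Row/column positions of the strictly lower triangle: each pair appears once
idx <- which(lower.tri(cm), arr.ind = TRUE)

# pairs_df is an illustrative name, not part of the original post
pairs_df <- data.frame(
  Var1 = rownames(cm)[idx[, "row"]],
  Var2 = colnames(cm)[idx[, "col"]],
  Correlation = cm[idx],
  stringsAsFactors = FALSE
)
pairs_df
```

This avoids growing vectors inside a loop, which tends to matter only when the data frame has many columns.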

Let’s remove one of the two variables from each pair whose absolute correlation is at least 0.8. Here we drop the variable listed in `Var2`.

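```
threshold <- 0.8

# Variables to drop: the Var2 member of every pair at or above the threshold
exclude <- unique(output[abs(output$Correlation) >= threshold, 'Var2'])

reduced <- df[, !names(df) %in% exclude]
head(reduced)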
```

Output:

```
  Sepal.Width Petal.Width
1         3.5         0.2
2         3.0         0.2
3         3.2         0.2
4         3.1         0.2
5         3.6         0.2
6         3.9         0.4
```

Let’s also get the correlation of the reduced data frame.

```
cor(reduced)
```

Output:

```
            Sepal.Width Petal.Width
Sepal.Width   1.0000000  -0.3661259
Petal.Width  -0.3661259   1.0000000
```

As we can see, we removed the correlated variables and are left with 2 IVs instead of 4.
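If this needs to be repeated on other data frames, the steps can be wrapped in a small function. The sketch below assumes all columns are numeric and follows the same rule as above (drop the `Var2` member, i.e. the column side of each lower-triangle pair). `drop_correlated` is a hypothetical helper name, not a base R function.

```
# drop_correlated is a hypothetical helper, not part of base R
drop_correlated <- function(data, threshold = 0.8) {
  cm <- abs(cor(data))
  # Keep only the strictly lower triangle so each pair is counted once
  cm[!lower.tri(cm)] <- 0
  # Drop a column if it is the column-side member of any pair at or above the threshold
  exclude <- colnames(cm)[apply(cm >= threshold, 2, any)]
  # drop = FALSE keeps a data frame even if a single column remains
  data[, !names(data) %in% exclude, drop = FALSE]
}

reduced <- drop_correlated(iris[, 1:4], threshold = 0.8)
names(reduced)
# [1] "Sepal.Width" "Petal.Width"
```

Which member of a pair gets dropped depends on column order, so reordering the columns can change the result.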
