Predictive Hacks

Tidyverse Tips


I have found the following commands quite useful during the EDA part of any Data Science project. We will work with the tidyverse package, of which we actually need only dplyr and ggplot2, and with the iris dataset.

select_if | rename_if

The select_if function belongs to dplyr and is very useful when we want to choose columns based on a condition. We can also pass a function that is applied to the names of the selected columns.

Example: Let’s say that I want to choose only the numeric variables and to add the prefix “numeric_” to their column names.

library(tidyverse)

iris %>% select_if(is.numeric, list(~ paste0("numeric_", .))) %>% head()
 


Note that we can also use rename_if in the same way. An important note is that rename_if(), rename_at(), and rename_all() have been superseded by rename_with(). The matching select statements have been superseded by the combination of select() and rename_with().

These functions were superseded because mutate_if() and friends were superseded by across(). select_if() and rename_if() already use tidy selection so they can’t be replaced by across() and instead we need a new function.
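To make the difference concrete, here is a minimal sketch (not from the original post, and assuming dplyr 1.0.0 or later). across() works inside mutate() and changes the values of the selected columns, whereas rename_with() changes only their names. The object names rounded and renamed are ours, used just for illustration.

```r
library(dplyr)

# across() replaces mutate_if(): it changes the *values* of the numeric columns
rounded <- iris %>% mutate(across(where(is.numeric), round))

# rename_with() replaces rename_if(): it changes the *names* of the numeric columns
renamed <- iris %>% rename_with(toupper, where(is.numeric))

names(rounded)  # column names are unchanged
names(renamed)  # numeric columns are now upper case, Species is untouched
```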


where

We can select or rename columns with where(), which picks the variables for which a function returns TRUE. We will work with the same example as above.

Example: Let’s say that I want to choose only the numeric variables and to add the prefix “numeric_” to their column names.

iris %>%
  rename_with(~ paste0("numeric_", .), where(is.numeric)) %>%
  select(where(is.numeric)) %>%
  head()
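The predicate passed to where() does not have to be a built-in function like is.numeric. As a rough sketch, assuming dplyr 1.0.0 or later, we can write our own condition as a formula. The threshold of 3.5 below is arbitrary and chosen only for illustration.

```r
library(dplyr)

# Keep only the numeric columns whose mean is greater than 3.5.
# && short-circuits, so mean() is never called on the factor column Species.
high_mean <- iris %>%
  select(where(~ is.numeric(.x) && mean(.x) > 3.5))

names(high_mean)
```

Any function that takes a column and returns a single TRUE or FALSE can be used in the same way.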
 



everything

In many Data Science projects, we want one particular column (usually the dependent variable y) to appear first or last in the dataset. We can achieve this using everything() from the dplyr package.

Example: Let’s say that I want the column Species to appear first in my dataset.

mydataset <- iris %>% select(Species, everything())
mydataset %>% head()
 



Example: Let’s say that I want the column Species to appear last in my dataset.

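This is a little bit tricky. Have a look below at how we can do it. We will work with mydataset, where the Species column appears first, and move it to the last position. The call works because -Species first drops the column, and everything() then adds back every column that is not yet selected, which places Species at the end.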

mydataset %>% select(-Species, everything()) %>% head()
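If you want to convince yourself that both calls behave as described, a quick check of the column names is enough. This is only a sketch; species_first and species_last are names we made up for the check.

```r
library(dplyr)

# Species moved to the first position
species_first <- iris %>% select(Species, everything())

# Species dropped and then re-added at the end by everything()
species_last <- species_first %>% select(-Species, everything())

names(species_first)
names(species_last)
```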
 



relocate

relocate() was added in dplyr 1.0.0. You can specify exactly where to put the columns with .before or .after.

Example: Let’s say that I want the Petal.Width column to appear right after Sepal.Width.

iris %>% relocate(Petal.Width, .after = Sepal.Width) %>% head()


Notice that we can also place a column after the last column.

Example: Let’s say that I want Petal.Width to be the last column.

iris %>% relocate(Petal.Width, .after = last_col()) %>% head()
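The examples so far use only .after, so here is a small sketch of .before as well, assuming dplyr 1.0.0 or later. It also shows that relocate() moves the columns to the front when neither argument is given. The object names are ours.

```r
library(dplyr)

# Move Species in front of Sepal.Length, i.e. to the first position
species_front <- iris %>% relocate(Species, .before = Sepal.Length)

# Without .before or .after, relocate() moves the columns to the front
species_front_default <- iris %>% relocate(Species)

names(species_front)
names(species_front_default)
```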
 


You can find more info in the tidyverse documentation


pull

When we work with data frames and select a single column, sometimes we want the output to be a vector rather than a data frame. We can achieve this with pull(), which is part of dplyr.

Example: Let’s say that I want to run a t.test on Sepal.Length for setosa versus virginica. Note that the t.test function expects numeric vectors.

setosa_sepal_length <- iris %>% filter(Species == 'setosa') %>% select(Sepal.Length) %>% pull()
virginica_sepal_length <- iris %>% filter(Species == 'virginica') %>% select(Sepal.Length) %>% pull()

t.test(setosa_sepal_length, virginica_sepal_length)
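pull() also accepts the column name directly, so the select() step is optional. A slightly shorter variant of the code above, as a sketch (the name result is ours):

```r
library(dplyr)

# pull(Sepal.Length) extracts the column as a plain numeric vector
setosa_sepal_length <- iris %>%
  filter(Species == "setosa") %>%
  pull(Sepal.Length)

virginica_sepal_length <- iris %>%
  filter(Species == "virginica") %>%
  pull(Sepal.Length)

# Both objects are numeric vectors, which is what t.test() expects
is.numeric(setosa_sepal_length)

result <- t.test(setosa_sepal_length, virginica_sepal_length)
result$p.value
```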
 




reorder

When you work with ggplot2, it is sometimes frustrating to reorder the factor levels based on some condition. Let’s say that we want to show the boxplot of Sepal.Width by Species.

iris %>% ggplot(aes(x = Species, y = Sepal.Width)) + geom_boxplot()
 



Example: Let’s assume that we want to order the boxes by the median Sepal.Width of each Species.

We can do that easily with the reorder() from the stats package.

iris %>%
  ggplot(aes(x = reorder(Species, Sepal.Width, FUN = median), y = Sepal.Width)) +
  geom_boxplot() +
  xlab("Species")
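To see what reorder() does before plotting, we can inspect the factor levels it returns. The sketch below is our own addition and the variable names are made up. Putting a minus sign in front of the numeric variable is one common way to get descending order.

```r
library(ggplot2)

# Levels sorted by the median Sepal.Width of each species (ascending)
by_median <- reorder(iris$Species, iris$Sepal.Width, FUN = median)
levels(by_median)

# Negating the values reverses the order (descending medians)
by_median_desc <- reorder(iris$Species, -iris$Sepal.Width, FUN = median)
levels(by_median_desc)

# The same idea inside ggplot2: boxes ordered from highest to lowest median
p <- ggplot(iris, aes(x = reorder(Species, -Sepal.Width, FUN = median),
                      y = Sepal.Width)) +
  geom_boxplot() +
  xlab("Species")
```

Printing p draws the plot; the x axis then follows the level order shown by levels(by_median_desc).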
 


2 thoughts on “Tidyverse Tips”

  1. Scoped functions (e.g. “select_if()”) have been superseded in dplyr. Check following example:

    library(tidyverse)
    table1 %>% select(where(is.integer))
    table1 %>% rename_with(~str_c("numeric_", .x), where(is.integer))
