10 Tips And Tricks For Data Scientists Vol.6

We have started a series of articles on tips and tricks for data scientists (mainly in Python and R). In case you have missed:

Python

1.How To Get The Mode From A List In Python

Assume that we have the following list:

mylist = [1,1,1,2,2,3,3]

and we want to get the mode, i.e. the most frequent element. We can use the following trick using the max and the lambda key.

max(mylist, key = mylist.count)

And we get 1 since this was the mode in our list. In case where there is a draw in the mode and you want to get the minimum or the maximum number, you can sort the list as follows:

mylist = [1,1,1,2,2,3,3,3]
max(sorted(mylist), key = mylist.count)

And we get 1

mylist = [1,1,1,2,2,3,3,3]
max(sorted(mylist, reverse=True), key = mylist.count)

And we get 3

In the above example, we had two modes, 1 and 3. By adding the sorted function we were able to get the min and the max respectively. Finally, this approach works with string elements too.

mylist = ['a','a','b','b','b']
max(mylist, key = mylist.count)

And we get ‘b’

2.How To Disable All Warnings In Python

You can disable all python warnings by running the following code block.

import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

3.How to Copy Data And Paste Into A Pandas DataFrame

Firstly we need to copy the data-frame like data.

Copy Data and Paste into a Pandas DataFrame 1

Then, run the following:

import pandas as pd
pd.read_clipboard()

Copy Data and Paste into a Pandas DataFrame 2

4.How to Save Pandas Dataframes As Images

You can save Pandas Dataframes as images using the library dataframe-image. Then you can run:

import pandas as pd
import dataframe_image as dfi
 
df = pd.DataFrame({'A': [1,2,3,4],
                   'B':['A','B','C','D']})
 
dfi.export(df, 'dataframe.png')

Running the above, it will create a png image file with the dataframe as the image below.

5.How To Add A Progress Bar Into Pandas Apply

When we are applying a function to a big Data-Frame, we can’t see the progress of the function or an estimate on how long remains for the function to be applied to the whole dataset. We can solve this with the use of the tqdm library.

import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
tqdm.pandas()

#dummy data
df=pd.DataFrame({"Value":np.random.normal(size=1500000)})

Let’s apply a simple function to our data but instead of using apply, we will use the progress_apply function.

df['Value'].progress_apply(lambda x: x**2)

R

6.Avoid Apply() Function In Large Datasets

When we are dealing with large datasets and there is a need to calculate some values like the row/column min/max/rank/mean etc we should avoid the apply function because it takes a lot of time. Instead, we can use the matrixStats package and its corresponding functions. Let’s provide some comparisons.

Example of Minimum value per Row

Assume that we want to get the minimum value of each row from a 500 x 500 matrix. Let’s compare the performance of the apply function from the base package versus the rowMins function from the matrixStats package.

library(matrixStats)
library(microbenchmark)
library(ggplot2)
 
x <- matrix( rnorm(5000 * 5000), ncol = 5000 )
 
tm <- microbenchmark(apply(x,1,min),
                     rowMins(x),
                     times = 100L
                    )
 
tm

And we get:

Unit: milliseconds
             expr      min         lq       mean    median        uq       max neval
 apply(x, 1, min) 981.6283 1034.98050 1078.04485 1065.4163 1107.9962 1327.9284   100
       rowMins(x)  42.1838   43.80065   46.55752   45.2255   47.6249   81.3097   100

As we can see from the output above, the apply function was 23 times slower than the rowMins. Below we represent the violin plot

autoplot(tm)

Avoid apply() function in large datasets 1

7.The Fastest Way To Read And Write Files In R

When we are dealing with large datasets, and we need to write many csv files or when the csv filethat we hand to read is huge, then the speed of the read and write command is important. We will compare the required time to write and read files of the following cases:

base package
data.table
readr

Compare the Write times

We will work with a csv file of 1M rows and 10 columns which is approximately 180MB. Let’s create the sample data frame and write it to the hard disk. We will generate 10M observations from the Normal Distribution.

library(data.table)
library(readr)
library(microbenchmark)
library(ggplot2)
 
# create a 1M X 10 data frame
 
my_df<-data.frame(matrix(rnorm(1000000*10), 1000000,10))
 
 
# base
system.time({ write.csv(my_df, "base.csv", row.names=FALSE) })
 
# data.table
system.time({ fwrite(my_df, "datatable.csv") })
 
# readr
system.time({ write_csv(my_df, "readr.csv") })

The fastest way to Read and Write files in R 1

As we can see from the elapsed time, the fwrite from the data.table is ~70 times faster than the base package and ~7times faster than the readr

Compare the Read Times

Let’s compare also the read times using the microbenchmark package.

tm <- microbenchmark(read.csv("datatable.csv"),
                     fread("datatable.csv"),
                     read_csv("datatable.csv"),
                     times = 10L
)
 
tm
autoplot(tm)

The fastest way to Read and Write files in R 2

As we can see, again the fread from the data.table package is around 40 times faster than the base package and 8.5 times faster than the read_csv from the readr package. So, if you want to read and write files fastly then you should choose the data.table package.

8.How To Compare Objects In R

In 2020, the guru Hadley Wickham, built a new package called waldo for comparing complex R objects and making it easy to detect the key differences. You can find detailed examples in tidyverse.org as well as at the github. Let’s provide a simple example by comparing two data frames in R with waldo.

Compare 2 Data Frames in R

Let’s create the new data frames:

library(waldo)
 
 
df1<-data.frame(X=c(1,2,3), Y=c("a","b","c"), A=c(3,4,5))
df2<-data.frame(X=c(1,2,3,4), Y=c("A","b","c","d"), Z=c("k","l","m","n"), A=c("3","4","5","6"))

The df1:

And the df2:

  X Y Z A
1 1 A k 3
2 2 b l 4
3 3 c m 5
4 4 d n 6

Let’s compare them:

waldo::compare(df1,df2)

And we get:

As we can see it captures all the differences and the output is in a friendly format of different colors depending on the difference. More particularly, it captures that:

The df1 has 3 rows where df2 has 2
The df2 has an extra column name and it shows where is the difference in the order.
It shows the rows names and the differences
It compares each column, X, Y, A, Z
It detected the difference in the data type of column A where in the first is double and in the second is character
It returns that there is a new column called Z.

9.How To Install And Load Packages Dynamically

When we share an R script file with someone else, we assumed that they have already installed the required R packages. However, this is not always the case and for that reason, I strongly suggest adding this piece of code to every shared R script which requires a package. Let’s assume that your code requires the following three packages: “readxl“, “dplyr“, “multcomp” .

The script below checks, if the package exists, and if not, then it installs it and finally it loads it to R

mypackages<-c("readxl", "dplyr", "multcomp")
 
for (p in mypackages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
    library(p, character.only = TRUE)
  }
  }

10.How To Convert All Character Variables To Factors

Let’s say that we want to convert all Character Variables to Factors and we are dealing with a large data frame of many columns which means that is not practical to convert them one by one. Thus, our approach is to detect the “char” variables and to convert them to “Factors”.

Let’s provide a toy example:

df<-data.frame(Gender = c("F", "F", "M","M","F"), 
               Score  = c(80, 70, 65, 85, 95),
               Type = c("A","B","C","B","B"))

Hack: How to Convert all Character Variables to Factors 1

As we can see, the Gender and Type are char variables. Let’s convert them to factors.

df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)

Hack: How to Convert all Character Variables to Factors 2

As we can see, we managed to convert them. Now, you can also rename and relevel the factors. Notice that we could work in the other way around by converting the Factors to Characters. Generally, we can change different data types.

Share This Post

3 thoughts on “10 Tips And Tricks For Data Scientists Vol.6”

Stuart Luppescu

April 26, 2021 at 1:34 am

I think the easier, and probably more efficient way of converting character variables to factors is mutate_if in dplyr:
mutate_if(is.character, as.factor)
Reply
- George Pipis
  
  April 26, 2021 at 11:59 am
  
  Yes I agree, I try to keep it simple without using libraries like dplyr.
  Reply
Natalia Gumińska

May 10, 2021 at 6:36 am

I think the fastest way to read a file in R is using vroom package. It’s much faster than the fread and also memory efficient, as it doesn’t load all of the data into the env at once.
Reply

Get updates and learn from the best

More To Explore

Python

Image Captioning with HuggingFace

Image captioning with AI is a fascinating application of artificial intelligence (AI) that involves generating textual descriptions for images automatically.

George Pipis March 21, 2024

Python

Intro to Chatbots with HuggingFace

In this tutorial, we will show you how to use the Transformers library from HuggingFace to build chatbot pipelines. Let’s

George Pipis March 15, 2024

10 Tips And Tricks For Data Scientists Vol.6

Python

1.How To Get The Mode From A List In Python

2.How To Disable All Warnings In Python

3.How to Copy Data And Paste Into A Pandas DataFrame

4.How to Save Pandas Dataframes As Images

5.How To Add A Progress Bar Into Pandas Apply

R

6.Avoid Apply() Function In Large Datasets

7.The Fastest Way To Read And Write Files In R

8.How To Compare Objects In R

Compare 2 Data Frames in R

9.How To Install And Load Packages Dynamically

10.How To Convert All Character Variables To Factors

Share This Post

3 thoughts on “10 Tips And Tricks For Data Scientists Vol.6”

Leave a Comment Cancel reply

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

Image Captioning with HuggingFace

Intro to Chatbots with HuggingFace

10 Tips And Tricks For Data Scientists Vol.6

Python

1.How To Get The Mode From A List In Python

2.How To Disable All Warnings In Python

3.How to Copy Data And Paste Into A Pandas DataFrame

4.How to Save Pandas Dataframes As Images

5.How To Add A Progress Bar Into Pandas Apply

R

6.Avoid Apply() Function In Large Datasets

7.The Fastest Way To Read And Write Files In R

8.How To Compare Objects In R

Compare 2 Data Frames in R

9.How To Install And Load Packages Dynamically

10.How To Convert All Character Variables To Factors

Share This Post

3 thoughts on “10 Tips And Tricks For Data Scientists Vol.6”

Leave a Comment Cancel reply

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

Image Captioning with HuggingFace

Intro to Chatbots with HuggingFace

#Tag Cloud ☁️