Predictive Hacks

Avoid the apply() function on large datasets

When we work with large datasets and need row-wise or column-wise summaries such as the min, max, rank or mean, we should avoid the apply function, because it calls an R function once per row or column and that takes a lot of time. Instead, we can use the matrixStats package, which provides a dedicated function for each of these operations (for example rowMins, colMaxs, rowRanks). Let’s provide a comparison.
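As a quick sanity check before timing anything, the sketch below compares a few apply calls with their matrixStats counterparts on a small matrix. The matrix size, the seed and the variable names are illustrative assumptions, not part of the original post.

```r
library(matrixStats)

# Small illustrative matrix (size and seed are arbitrary choices)
set.seed(1)
x_small <- matrix(rnorm(20), nrow = 4)

# Row-wise minimum: apply() versus rowMins()
row_min_apply <- apply(x_small, 1, min)
row_min_fast  <- rowMins(x_small)

# Row-wise maximum: apply() versus rowMaxs()
row_max_apply <- apply(x_small, 1, max)
row_max_fast  <- rowMaxs(x_small)

# Column-wise mean: apply() versus colMeans2()
col_mean_apply <- apply(x_small, 2, mean)
col_mean_fast  <- colMeans2(x_small)

# Both approaches should agree up to floating-point tolerance
stopifnot(
  isTRUE(all.equal(row_min_apply, row_min_fast)),
  isTRUE(all.equal(row_max_apply, row_max_fast)),
  isTRUE(all.equal(col_mean_apply, col_mean_fast))
)
```

If the checks pass silently, the two approaches give the same values here, so in this case the choice between them comes down to speed.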

Example of Minimum value per Row

Assume that we want to get the minimum value of each row of a 5000 x 5000 matrix. Let’s compare the performance of the apply function from the base package with the rowMins function from the matrixStats package.

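library(matrixStats)     # rowMins()
library(microbenchmark)  # microbenchmark()
library(ggplot2)         # needed for autoplot()

# 5000 x 5000 matrix of standard normal draws
x <- matrix(rnorm(5000 * 5000), ncol = 5000)

# Time each expression 100 times
tm <- microbenchmark(
  apply(x, 1, min),
  rowMins(x),
  times = 100L
)

tm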
 
Unit: milliseconds
             expr      min         lq       mean    median        uq       max neval
 apply(x, 1, min) 981.6283 1034.98050 1078.04485 1065.4163 1107.9962 1327.9284   100
       rowMins(x)  42.1838   43.80065   46.55752   45.2255   47.6249   81.3097   100

As we can see from the output above, the apply function was about 23 times slower than rowMins (a median of roughly 1065 ms versus 45 ms). The violin plot below shows the distribution of the timings.

autoplot(tm)
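The "about 23 times" figure can be reproduced from the timings reported above. This is a minimal sketch that hard-codes the mean and median values from the printed output; the variable names are illustrative.

```r
# Timings in milliseconds, copied from the benchmark output above
mean_apply     <- 1078.04485
mean_rowMins   <- 46.55752
median_apply   <- 1065.4163
median_rowMins <- 45.2255

# How many times slower apply() was than rowMins()
speedup_mean   <- mean_apply / mean_rowMins
speedup_median <- median_apply / median_rowMins

# Both ratios are a little over 23
round(c(mean = speedup_mean, median = speedup_median), 1)
```

On your own machine the same ratios can be computed from summary(tm), which returns the timing table as a data frame. The exact numbers will vary with hardware and R version.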