Working with huge data sets
- Basic R script for this lesson
- Tutorial for this lesson
- R markdown for this lesson
- Data set to demonstrate match()
Working with huge data sets can be a challenge in R for two reasons:
- processing them is slow, although you can sometimes speed things up via parallelization
- they often don’t fit in the memory of your computer, and R typically loads data into RAM
Comparisons of performance of functions
Please check the tutorial and the script for a comparison of the performance of functions for manipulating huge vectors and tables in R. Overall one can conclude that:
- the data.table framework is very efficient for handling large tables
- dplyr functions are not always faster than base functions; for instance, the base %in% operator is very fast
- for-loops are extremely slow compared with vectorized functions
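A minimal base R sketch of this kind of comparison, using simulated vectors rather than the data from the tutorial. The vector sizes are arbitrary assumptions, and timings depend on your machine, so none are shown here.

```r
# Simulated data: look up 2000 ids in a vector of 100000 ids (sizes are arbitrary)
set.seed(1)
ids    <- sample.int(1e6, 1e5)
lookup <- sample.int(1e6, 2000)

# Vectorized membership test with %in%
t_in <- system.time(found_in <- lookup %in% ids)

# The same test written as a for-loop over the elements
t_loop <- system.time({
  found_loop <- logical(length(lookup))
  for (i in seq_along(lookup)) {
    found_loop[i] <- any(ids == lookup[i])
  }
})

# Both approaches give the same answer; only the elapsed time differs
identical(found_in, found_loop)
c(vectorized = t_in[["elapsed"]], for_loop = t_loop[["elapsed"]])
```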
What is parallelization?
Parallelization means splitting a big task into smaller parts and running those smaller parts simultaneously (= in parallel), instead of one after the other.
As an example we’ll use the processing of 1000 MS (Mass Spectrometry) spectra (big shout out to Kylie Bemis for the package and the code).
Most people will do this analysis serially: one spectrum after the other.
However, if you need to process millions of spectra, serial analysis would be too slow. You would need to process them in parallel.
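The difference between serial and parallel processing can be sketched with the parallel package that ships with R. This is not the code from the lesson: process_one() is a hypothetical stand-in for real spectrum processing, the random vectors stand in for real spectra, and the choice of 2 workers is arbitrary.

```r
library(parallel)

# Hypothetical stand-in for processing one spectrum: a 5-point running mean
process_one <- function(s) {
  as.numeric(stats::filter(s, rep(1 / 5, 5), sides = 2))
}

# 100 simulated "spectra" of 1000 intensity values each
set.seed(42)
spectra <- replicate(100, rnorm(1000), simplify = FALSE)

# Serial: one spectrum after the other
serial_result <- lapply(spectra, process_one)

# Parallel: the same work split over 2 worker processes
cl <- makeCluster(2)
parallel_result <- parLapply(cl, spectra, process_one)
stopCluster(cl)

# Check that both approaches produce the same output
all.equal(serial_result, parallel_result)
```

For a task this small, starting the workers may cost more time than the parallel run saves; the benefit shows up when each part of the task takes a while.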
Default backends in R
To work either serially or in parallel, you need an application that works behind the scenes to interpret your code and make sure all machinery required for serial or parallel computing is activated. Such applications are called backends. There are:
- serial backends that execute one task at a time on a single core (like the one that handles lapply())
- parallel backends, like Snow or Multicore, that spread tasks across multiple cores
A core is a processing unit in your computer. Your computer has a CPU (Central Processing Unit) that contains multiple cores. Each core handles its own tasks independently, so your computer can run several tasks at the same time, one per core. Think of a CPU as a kitchen, and each core as a chef (more chefs = more dishes cooked at the same time).
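To see how many cores are available, you can ask R directly. A small sketch using the parallel package that ships with R; leaving one core free is a common rule of thumb, not a requirement.

```r
library(parallel)

# Number of logical cores R can see on this machine (can be NA on some platforms)
n_cores <- detectCores()
n_cores

# Leave one core free for the rest of the system, but use at least one worker
n_workers <- max(1, n_cores - 1, na.rm = TRUE)
n_workers
```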
There are other backends available:
- Revolution Analytics (formerly known as REvolution Computing) provides packages like foreach, doSNOW and doParallel. The foreach backends registered by these packages can be used via DoparParam()
- The batchtools package provides a backend that allows for parallelization on HPC (High Performance Computing) clusters. The batchtools backend can be created via BatchtoolsParam()
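As an illustration of the first option, here is a minimal sketch that assumes the BiocParallel and doParallel packages are installed. The number of workers and the squaring function are arbitrary choices for the example.

```r
library(BiocParallel)
library(doParallel)   # also attaches foreach

# Register a foreach backend with 2 workers (the number is an arbitrary choice)
registerDoParallel(cores = 2)

# DoparParam() uses whatever foreach backend is currently registered
dopar <- DoparParam()
result <- bplapply(1:4, function(i) i^2, BPPARAM = dopar)
unlist(result)

# Release any workers that doParallel started
stopImplicitCluster()
```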
Using a serial backend in R
To tell R that you want to use a particular backend as the default (for all BiocParallel code that will be run in the session), you register it using the register() function from the BiocParallel package. It will be set as the global default backend.
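A minimal sketch of registering a serial backend, assuming the BiocParallel package is installed. The input 1:3 and the use of sqrt() are placeholders.

```r
library(BiocParallel)

# Make a serial backend the global default for BiocParallel functions
register(SerialParam())

# bpparam() returns the backend that is currently the default
class(bpparam())

# No backend is passed here, so the registered default is used
result <- bplapply(1:3, sqrt)
unlist(result)
```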
The BiocParallel package provides its own version of lapply(), called bplapply(), that allows for parallel execution of the function specified by the FUN argument. You create a backend and then pass it to bplapply() via the BPPARAM argument to use it locally (= only for the function that is specified in the bplapply() call and not for the next steps in the code).
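Using a backend locally could look like the sketch below, which assumes BiocParallel is installed. The squaring function is a placeholder for real processing.

```r
library(BiocParallel)

# Create a serial backend without registering it as the default
serial <- SerialParam()

# The backend applies to this call only; FUN is the function run on each element
squares <- bplapply(1:5, FUN = function(x) x^2, BPPARAM = serial)
unlist(squares)
```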
Using a parallel backend in R
The bplapply() function lets you use either a serial or a parallel backend. If you specify a parallel backend, you have to make sure that the packages your function needs are also loaded in the extra R sessions (the workers) that are started.
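A minimal sketch with a Snow backend, assuming BiocParallel is installed. It is not the spectra code from the lesson: the tools package and toTitleCase() only serve to show one common way of making a package available on the workers, namely loading it inside the function. The choice of 2 workers is arbitrary.

```r
library(BiocParallel)

# Snow backend with 2 workers; each worker is a separate, freshly started R session
snow <- SnowParam(workers = 2)

result <- bplapply(1:4, function(i) {
  # Packages attached in the main session are not attached on the workers,
  # so load what the function needs inside the function itself
  library(tools)
  toTitleCase(paste("spectrum number", i))
}, BPPARAM = snow)
unlist(result)
```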
The code for a Multicore backend is identical, except that you:
- create a Multicore backend
- don’t need to use the require argument if the package is already loaded in the main R session.
# Multicore backend with 8 worker processes
mc <- MulticoreParam(workers = 8)
system.time({
  result <- bplapply(spectra, function(s) {
    s <- smoothNoise(s)
    s <- removeBaseline(s)
    findPeaks(s)
  }, BPPARAM = mc)
})
Running the code will typically be a lot faster, even faster than with the Snow backend, but Multicore relies on forking the R process, so it works on macOS and Linux and not on Windows.
If you run into problems with parallelization, it can help to set the Serial backend again and run tasks serially.
Working with huge data sets
Huge data sets create a second issue because they often don’t fit in the memory of your computer.