Optimal Data Decompression in R

I’ve recently encountered a situation where I am working with very large datasets in a constrained environment. As a result, the only practical option has been to store the datasets in a compressed format, and then load them into R to start on the data analysis.

The problem is that when you're working with datasets in the 10GB+ range on a normal desktop, loading the data into memory (thankfully there is still enough of that available!) becomes quite time consuming, and having to decompress the data first significantly increases this time.

Since I couldn’t find anything on the fastest way to address this problem in R, I decided to run some tests of my own. For anyone who might be interested, my hastily put-together tests are below.

First, let's load up all the libraries we need and make some data to work with.

library(rbenchmark)
library(readr)
library(data.table)
library(R.utils)

# Generate a 2,000,000 x 100 data frame of binomial draws. The 10,000
# random values are recycled to fill the matrix, which keeps generation
# fast; we only care about file size here, not the values themselves.
data <- data.frame(matrix(rbinom(10 * 1000, 50000, .5), nrow = 2000000, ncol = 100))

# Write an uncompressed and a gzip-compressed copy of the same data.
# (row.names = FALSE keeps the two files' contents consistent, since
# fwrite never writes row names.)
data.table::fwrite(data, "~/Downloads/data.csv")
write.csv(data, file = gzfile("~/Downloads/data.csv.gz"), row.names = FALSE)

This will create a synthetic dataset that is about 1GB on disk: big enough to test with, but not so big that it will have your laptop overheating.
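As a quick sanity check, we can confirm the on-disk sizes of the two files we just wrote (a small sketch; the paths assume the files created above):

```r
# Report file sizes in MB; exact numbers will vary slightly between runs
round(file.size(c("~/Downloads/data.csv",
                  "~/Downloads/data.csv.gz")) / 1024^2)
```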

Before starting on decompressing our data, it's worth seeing which packages are the fastest at just reading our CSVs into R.

benchmark("read.csv" = {
  data <- read.csv("~/Downloads/data.csv")
},
"readr" = {
  data <- read_csv("~/Downloads/data.csv")
},
"data.table" = {
  data <- data.table::fread("~/Downloads/data.csv")
},
replications = 3,
columns = c("test", "replications", "elapsed",
            "relative", "user.self", "sys.self")
)
        test replications elapsed relative user.self sys.self
  data.table            1   4.198    1.000     6.822    0.938
    read.csv            1 112.256   26.740    87.441    7.423
       readr            1  38.003    9.053    27.791    3.714

data.table is the clear winner, by a very long way. All three of these packages can also read .gz files using exactly the same commands, and the picture they produce is very similar, although everything is significantly slower.

benchmark("read.csv" = {
  data <- read.csv("~/Downloads/data.csv.gz")
},
"readr" = {
  data <- read_csv("~/Downloads/data.csv.gz")
},
"data.table" = {
  data <- data.table::fread("~/Downloads/data.csv.gz")
},
replications = 3,
columns = c("test", "replications", "elapsed",
            "relative", "user.self", "sys.self")
)
        test replications elapsed relative user.self sys.self
  data.table            3  76.602    1.000    29.596   31.980
    read.csv            3 438.293    5.722   304.780   34.001
       readr            3 274.393    3.582   180.936   42.399

Again data.table comes out on top, although it is probably not worth paying attention to anything other than the relative times, as I wasn't running these benchmarks under ideal test conditions.

But can we go quicker than data.table?

Having looked at the source code for all of the above packages, nothing really differs amongst them in terms of handling .gz files, with everything ultimately calling R.utils' decompressFile function. As a result, most of data.table's comparative speed comes from its very well written fread function rather than from doing anything fancy with the decompression itself.

The natural next step in improving on this time is therefore to speed up the decompression before handing the data over to data.table. I wasn't exhaustive in trying every option, but with a little playing around I found that (for some reason unknown to me) decompressing the file directly with R.utils and then reading it into data.table with fread was faster than letting fread handle the .gz file itself.

Beyond this, using an external program can speed things up. [pigz](https://zlib.net/pigz/) is a command line, multi-processor implementation of the standard zlib library, which we can easily call from R. While pigz can't parallelise the actual decompression (apparently this is a hard limit of the gzip format), it can use separate threads for other tasks in the process, such as reading, writing and checksum calculation. As a result, we see a small performance increase over R.utils.

benchmark("R.utils" = {
  R.utils::decompressFile("~/Downloads/data.csv.gz",
                          "~/Downloads/data.csv",
                          ext = "gz", FUN = gzfile,
                          remove = FALSE,
                          overwrite = TRUE)
  data <- data.table::fread("~/Downloads/data.csv")
},
"data.table" = {
  data <- data.table::fread("~/Downloads/data.csv.gz")
},
"pigz" = {
  system("pigz -d -f -k ~/Downloads/data.csv.gz")
  data <- data.table::fread("~/Downloads/data.csv")
},
replications = 3,
columns = c("test", "replications", "elapsed",
            "relative", "user.self", "sys.self")
)
        test replications elapsed relative user.self sys.self
  data.table            3  69.755    1.520    28.959   30.521
        pigz            3  45.896    1.000    25.068    9.452
     R.utils            3  63.927    1.393    29.042   27.600
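A related trick, which I haven't benchmarked here, is to skip the intermediate file entirely by piping pigz's output straight into fread via its cmd argument. A sketch, assuming pigz is on your PATH:

```r
# Decompress to stdout with pigz and let fread read from the pipe,
# avoiding a second copy of the data on disk
data <- data.table::fread(cmd = "pigz -dc ~/Downloads/data.csv.gz")
```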

Of course, further improvements could be made by switching the compression format, but for now I'm happy enough with the gains these simple changes have made over where I started out with readr.
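For instance, a format like zstd typically decompresses much faster than gzip. A minimal sketch of how that might look, assuming the zstd command line tool is installed (again, not benchmarked here):

```r
# Recompress the CSV with zstd, then decompress to stdout and read it.
# zstd is not built into base R, so we shell out for both steps.
system("zstd -f -k ~/Downloads/data.csv -o ~/Downloads/data.csv.zst")
data <- data.table::fread(cmd = "zstd -dc ~/Downloads/data.csv.zst")
```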

mail@timothymonteath.com
@timothymonteath