5 min read

Learning data.table: helper functions

One of my favorite parts of data.table is that it provides a collection of helper functions that are essential in many use cases. Sometimes I wonder why such basic functions are not provided by base R in the first place.

Following are some of the helper functions I find most useful in my every day use.

fread(), fwrite()

These functions are similar with read.csv() and write.csv() but they are much faster and more memory efficient thanks to the highly optimized implementation in C code, and the use of multi-threading to speed up the process.

The CSV format is a very (maybe the most?) common format for data exchange. It is human readable and language neutral. However, reading CSV properly and efficiently at the same time is not trivial. It is because a random CSV file typically does not have any metadata for the reading program to know the data type of each column. Therefore, the reading program has to guess the data type of each column by reading the first few rows of the file.

I think fread() finds a nice balance between speed and robustness. However, I suggest using a format with metadata whenever possible. For example, you can use fst::read_fst() and fst::write_fst() for great performance, or arrow::read_feather() and arrow::write_feather() for better interoperability with other languages.

rbindlist()

This function is similar with the usage of do.call(rbind, list) which is used to bind a list of data frames into one data table. The difference is that rbindlist() is significantly faster and more memory efficient. Moreover, it can also handle the cases where each data frame has different columns fill argument, and where the order of columns is different via use.names argument.

A typical use case of mine is to read a number of csv files in a directory and bind them into one data table. The following code shows how to do it with rbindlist().

files <- list.files(path = "path/to/csv/files", pattern = "*.csv", full.names = TRUE)
dt <- rbindlist(lapply(files, fread))

The code assumes that each file has exactly the same columns in terms of name and order. If this is not the case, you can use use.names = TRUE to make sure that the columns are matched by name. If some columns are missing in some files, you can use fill = TRUE to fill the missing columns with NA automatically.

Therefore, a more “robust” version of the code above is

files <- list.files(path = "path/to/csv/files", pattern = "*.csv", full.names = TRUE)
dt <- rbindlist(lapply(files, fread), fill = TRUE, use.names = TRUE)

However, this will make the code less safe because it will accept any column name in any file, and silently do everything for you as long as each fread() returns a valid data table. Sometimes, it might be helpful to make the code less “robust” so that you can catch the errors early and fix them in the original data in the first place.

first(), last()

These functions are basically a shortcut for head() and tail(). They are useful when you want to get the first or last element of a vector, or the first or last row of a data frame. It also redirects to xts::first() and xts::last() if the input is an xts object.

shift()

This function is perhaps the most useful function in data.table for me when I do time-series operations on data. It works not only with vectors but also with lists, making it easier and faster to shift multiple columns at the same time.

For example, if you want to shift the column x by one , you can do

library(data.table)
shift(1:10)
##  [1] NA  1  2  3  4  5  6  7  8  9

shift() also works with lists. For example, if you want to shift the columns x and y by two, a combination of shift() and .SD will do the trick. This could also scale easily if the same logic should be applied to more columns.

dt <- data.table(x = 1:10, y = 10:1)
shift_cols <- c("x", "y")
dt[, paste0(shift_cols, "_lag2") := shift(.SD, 2), .SDcols = shift_cols]
dt
##         x     y x_lag2 y_lag2
##     <int> <int>  <int>  <int>
##  1:     1    10     NA     NA
##  2:     2     9     NA     NA
##  3:     3     8      1     10
##  4:     4     7      2      9
##  5:     5     6      3      8
##  6:     6     5      4      7
##  7:     7     4      5      6
##  8:     8     3      6      5
##  9:     9     2      7      4
## 10:    10     1      8      3

Moreover, shift() can also be used to shift a vector by multiple steps. For example, if you want to shift the column x by two steps forward and one step backward:

shift(1:10, c(2, -1))
## [[1]]
##  [1] NA NA  1  2  3  4  5  6  7  8
## 
## [[2]]
##  [1]  2  3  4  5  6  7  8  9 10 NA

frank()

This function is similar with base::rank() but it is much faster and more memory efficient. It is also more flexible in that it can handle ties in different ways. For example, you can choose to assign the average rank to the tied values, or assign the minimum rank to the tied values.

Following is a simple demonstration of the speed difference between base::rank() and data.table::frank().

x <- rnorm(2000000)
system.time(rank(x))
##    user  system elapsed 
##   3.331   0.087   4.159
system.time(frank(x))
##    user  system elapsed 
##   0.765   0.068   0.982

fifelse(), fcase()

fifelse() is a faster version of base::ifelse(), and fcase() is an extended version of fifelse() to handle multiple cases.

In fact, data.table provides a rich collection of f* functions that are similar but much faster than their base R counterparts, which reflects the philosophy of data.table to be fast and memory efficient, and takes performance into account in every aspect of the design.

There are also other important functions such as dcast() and melt() for reshaping a data table. I will cover them in a separate post.