One of my favorite parts of data.table is that it provides a collection of helper functions that are essential in many use cases. Sometimes I wonder why such basic functions are not provided by base R in the first place.
Following are some of the helper functions I find most useful in my every day use.
These functions are similar with
write.csv() but they are much faster and more
memory efficient thanks to the highly optimized implementation in C code, and the use of
multi-threading to speed up the process.
The CSV format is a very (maybe the most?) common format for data exchange. It is human readable and language neutral. However, reading CSV properly and efficiently at the same time is not trivial. It is because a random CSV file typically does not have any metadata for the reading program to know the data type of each column. Therefore, the reading program has to guess the data type of each column by reading the first few rows of the file.
fread() finds a nice balance between speed and robustness. However, I suggest using
a format with metadata whenever possible. For example, you can use
great performance, or
arrow::write_feather() for better
interoperability with other languages.
This function is similar with the usage of
do.call(rbind, list) which is used to bind a list of
data frames into one data table. The difference is that
rbindlist() is significantly faster and more
memory efficient. Moreover, it can also handle the cases where each data frame has different
fill argument, and where the order of columns is different via
A typical use case of mine is to read a number of csv files in a directory and bind them into one
data table. The following code shows how to do it with
files <- list.files(path = "path/to/csv/files", pattern = "*.csv", full.names = TRUE) dt <- rbindlist(lapply(files, fread))
The code assumes that each file has exactly the same columns in terms of name and order. If this is
not the case, you can use
use.names = TRUE to make sure that the columns are matched by name. If some
columns are missing in some files, you can use
fill = TRUE to fill the missing columns with
Therefore, a more “robust” version of the code above is
files <- list.files(path = "path/to/csv/files", pattern = "*.csv", full.names = TRUE) dt <- rbindlist(lapply(files, fread), fill = TRUE, use.names = TRUE)
However, this will make the code less safe because it will accept any column name in any file, and silently
do everything for you as long as each
fread() returns a valid data table. Sometimes, it might be helpful
to make the code less “robust” so that you can catch the errors early and fix them in the original data
in the first place.
These functions are basically a shortcut for
tail(). They are useful when you want to
get the first or last element of a vector, or the first or last row of a data frame. It also
xts::last() if the input is an
This function is perhaps the most useful function in data.table for me when I do time-series operations on data. It works not only with vectors but also with lists, making it easier and faster to shift multiple columns at the same time.
For example, if you want to shift the column
x by one , you can do
##  NA 1 2 3 4 5 6 7 8 9
shift() also works with lists. For example, if you want to shift the columns
y by two, a combination
.SD will do the trick. This could also scale easily if the same logic should
be applied to more columns.
dt <- data.table(x = 1:10, y = 10:1) shift_cols <- c("x", "y") dt[, paste0(shift_cols, "_lag2") := shift(.SD, 2), .SDcols = shift_cols] dt
## x y x_lag2 y_lag2 ## <int> <int> <int> <int> ## 1: 1 10 NA NA ## 2: 2 9 NA NA ## 3: 3 8 1 10 ## 4: 4 7 2 9 ## 5: 5 6 3 8 ## 6: 6 5 4 7 ## 7: 7 4 5 6 ## 8: 8 3 6 5 ## 9: 9 2 7 4 ## 10: 10 1 8 3
shift() can also be used to shift a vector by multiple steps. For example, if you want to
shift the column
x by two steps forward and one step backward:
shift(1:10, c(2, -1))
## [] ##  NA NA 1 2 3 4 5 6 7 8 ## ## [] ##  2 3 4 5 6 7 8 9 10 NA
This function is similar with
base::rank() but it is much faster and more memory efficient. It is also
more flexible in that it can handle ties in different ways. For example, you can choose to assign
the average rank to the tied values, or assign the minimum rank to the tied values.
Following is a simple demonstration of the speed difference between
x <- rnorm(2000000) system.time(rank(x))
## user system elapsed ## 3.331 0.087 4.159
## user system elapsed ## 0.765 0.068 0.982
fifelse() is a faster version of
fcase() is an extended version of
fifelse() to handle multiple cases.
In fact, data.table provides a rich collection of
f* functions that are similar but much faster
than their base R counterparts, which reflects the philosophy of data.table to be fast and memory
efficient, and takes performance into account in every aspect of the design.
There are also other important functions such as
melt() for reshaping a data table.
I will cover them in a separate post.