One of my favorite parts of data.table is that it provides a collection of helper functions that are essential in many use cases. Sometimes I wonder why such basic functions are not provided by base R in the first place.
Following are some of the helper functions I find most useful in my every day use.
fread()
, fwrite()
These functions are similar with read.csv()
and write.csv()
but they are much faster and more
memory efficient thanks to the highly optimized implementation in C code, and the use of
multi-threading to speed up the process.
The CSV format is a very (maybe the most?) common format for data exchange. It is human readable and language neutral. However, reading CSV properly and efficiently at the same time is not trivial. It is because a random CSV file typically does not have any metadata for the reading program to know the data type of each column. Therefore, the reading program has to guess the data type of each column by reading the first few rows of the file.
I think fread()
finds a nice balance between speed and robustness. However, I suggest using
a format with metadata whenever possible. For example, you can use fst::read_fst()
and fst::write_fst()
for
great performance, or arrow::read_feather()
and arrow::write_feather()
for better
interoperability with other languages.
rbindlist()
This function is similar with the usage of do.call(rbind, list)
which is used to bind a list of
data frames into one data table. The difference is that rbindlist()
is significantly faster and more
memory efficient. Moreover, it can also handle the cases where each data frame has different
columns fill
argument, and where the order of columns is different via use.names
argument.
A typical use case of mine is to read a number of csv files in a directory and bind them into one
data table. The following code shows how to do it with rbindlist()
.
files <- list.files(path = "path/to/csv/files", pattern = "*.csv", full.names = TRUE)
dt <- rbindlist(lapply(files, fread))
The code assumes that each file has exactly the same columns in terms of name and order. If this is
not the case, you can use use.names = TRUE
to make sure that the columns are matched by name. If some
columns are missing in some files, you can use fill = TRUE
to fill the missing columns with NA
automatically.
Therefore, a more “robust” version of the code above is
files <- list.files(path = "path/to/csv/files", pattern = "*.csv", full.names = TRUE)
dt <- rbindlist(lapply(files, fread), fill = TRUE, use.names = TRUE)
However, this will make the code less safe because it will accept any column name in any file, and silently
do everything for you as long as each fread()
returns a valid data table. Sometimes, it might be helpful
to make the code less “robust” so that you can catch the errors early and fix them in the original data
in the first place.
first()
, last()
These functions are basically a shortcut for head()
and tail()
. They are useful when you want to
get the first or last element of a vector, or the first or last row of a data frame. It also
redirects to xts::first()
and xts::last()
if the input is an xts
object.
shift()
This function is perhaps the most useful function in data.table for me when I do time-series operations on data. It works not only with vectors but also with lists, making it easier and faster to shift multiple columns at the same time.
For example, if you want to shift the column x
by one , you can do
library(data.table)
shift(1:10)
## [1] NA 1 2 3 4 5 6 7 8 9
shift()
also works with lists. For example, if you want to shift the columns x
and y
by two, a combination
of shift()
and .SD
will do the trick. This could also scale easily if the same logic should
be applied to more columns.
dt <- data.table(x = 1:10, y = 10:1)
shift_cols <- c("x", "y")
dt[, paste0(shift_cols, "_lag2") := shift(.SD, 2), .SDcols = shift_cols]
dt
## x y x_lag2 y_lag2
## <int> <int> <int> <int>
## 1: 1 10 NA NA
## 2: 2 9 NA NA
## 3: 3 8 1 10
## 4: 4 7 2 9
## 5: 5 6 3 8
## 6: 6 5 4 7
## 7: 7 4 5 6
## 8: 8 3 6 5
## 9: 9 2 7 4
## 10: 10 1 8 3
Moreover, shift()
can also be used to shift a vector by multiple steps. For example, if you want to
shift the column x
by two steps forward and one step backward:
shift(1:10, c(2, -1))
## [[1]]
## [1] NA NA 1 2 3 4 5 6 7 8
##
## [[2]]
## [1] 2 3 4 5 6 7 8 9 10 NA
frank()
This function is similar with base::rank()
but it is much faster and more memory efficient. It is also
more flexible in that it can handle ties in different ways. For example, you can choose to assign
the average rank to the tied values, or assign the minimum rank to the tied values.
Following is a simple demonstration of the speed difference between base::rank()
and data.table::frank()
.
x <- rnorm(2000000)
system.time(rank(x))
## user system elapsed
## 3.331 0.087 4.159
system.time(frank(x))
## user system elapsed
## 0.765 0.068 0.982
fifelse()
, fcase()
fifelse()
is a faster version of base::ifelse()
, and fcase()
is an extended version of
fifelse()
to handle multiple cases.
In fact, data.table provides a rich collection of f*
functions that are similar but much faster
than their base R counterparts, which reflects the philosophy of data.table to be fast and memory
efficient, and takes performance into account in every aspect of the design.
There are also other important functions such as dcast()
and melt()
for reshaping a data table.
I will cover them in a separate post.