Learning data.table: the syntax and its design

In my every day work, data.table has been playing a crucial role to work with tabular data for more than 8 years. It has its pros and cons in terms of syntax and extensibility, yet I don’t find a better alternative to it when it comes to handling large in-memory tables with great performance.

For beginners, it might be a bit challenging to learn the syntax of data.table in the first place. In this post, I will try to motivate and explain the syntax and its design in a simple way.

If you are an absolute beginner and only know a little base R syntax, your data wrangling could look like this:

View the first few rows of the built-in mtcars dataset:

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Filter mtcars so that we only keep the cars with mpg greater than its 80% quantile:

mtcars[mtcars$mpg > quantile(mtcars$mpg, 0.8), ]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

Select the first 5 rows and only keep a few columns:

mtcars[1:5, c("mpg", "cyl", "disp", "hp")]

##                    mpg cyl disp  hp
## Mazda RX4         21.0   6  160 110
## Mazda RX4 Wag     21.0   6  160 110
## Datsun 710        22.8   4  108  93
## Hornet 4 Drive    21.4   6  258 110
## Hornet Sportabout 18.7   8  360 175

Sort the rows by mpg in descending order, only keep a few columns, and view the top five rows:

mtcars[order(-mtcars$mpg), c("mpg", "cyl", "disp", "hp")][1:5, ]

##                 mpg cyl disp  hp
## Toyota Corolla 33.9   4 71.1  65
## Fiat 128       32.4   4 78.7  66
## Honda Civic    30.4   4 75.7  52
## Lotus Europa   30.4   4 95.1 113
## Fiat X1-9      27.3   4 79.0  66

The base R syntax data[i, j] is quite intuitive and easy to understand:

i: filter rows by index or condition or reorder rows
j: select columns by index or name

However, if you want to do something more complex, the syntax will be more complicated too.

For example, what is the average mpg of cars with each number of cyl? If only base R functions are allowed, you might write something like this:

tapply(mtcars$mpg, mtcars$cyl, mean)

##        4        6        8 
## 26.66364 19.74286 15.10000

Yes, tapply does the trick of grouping. It discovers that mtcars$cyl has 3 unique values and calculates the average mpg for each group.

What about for each cyl value, find the value of disp of the car with the highest mpg? Again, you might write something like this:

tapply(seq_len(nrow(mtcars)), mtcars$cyl, function(i) {
  mtcars$disp[which.max(mtcars$mpg[i])]
})

##     4     6     8 
## 225.0 108.0 275.8

Maybe there is a better way to do this in base R, but it will certainly take some time for a stranger to understand its intention when the above code is provided. It is hard to write because you have to twist your brain to think in a lower-level way of how vectors involved should be carefully subsetted. As a result, those nested vector operations work but at the expense of readability.

If we use data.table, first we need to attach the library and convert mtcars to a data.table so that data.table syntax can be used:

library(data.table)
mtcars1 <- as.data.table(mtcars)

Now we write the following code to do the same job:

mtcars1[order(-mpg), .(disp = first(disp)), by = cyl]

##      cyl  disp
##    <num> <num>
## 1:     4  71.1
## 2:     6 258.0
## 3:     8 400.0

The syntax can be read: order the data by mpg in descending order, select the first value of disp by each cyl group. This is exactly what we want.

The .() call seems a bit quirky at first glance. It is basically a shortcut for list().

Wait, the results are different? Something must be wrong.

We just made a subtle mistake. In the base R code, which.max(mtcars$mpg[i]) returns a vector of the same length of i as a subset of mtcars row numbers grouped by cyl, we must do the same subsetting with mtcars$disp so that the indexing is consistent.

tapply(seq_len(nrow(mtcars)), mtcars$cyl, function(i) {
  mtcars$disp[i][which.max(mtcars$mpg[i])]
})

##     4     6     8 
##  71.1 258.0 400.0

Now it yields a consistent result. As we write some more verbose and less expressive code, it is much more likely to make mistakes.

Now we can take a closer look at the data.table syntax and see how it is a fundemental improvement over base R syntax. The syntax is as simple as the following:

data[i, j, by]

i: filter or order rows.
j: any expression to be evaluated on the filtered/ordered rows in each by group.
by: group by variables.

In data.table syntax, i, j, and by all are evaluated in the scope of the supplied data so that the variable name of the data itself (e.g. mtcars) is no longer needed. The syntax is simple yet powerful enough to handle many data wrangling tasks in a concise way. In most cases, complicated tasks could be broken down into smaller pieces where each could be handled by easier steps.

Following are some more examples on mtcars1 to demonstrate how flexible the syntax is.

Find the average and sd of mpg for each cyl group:

mtcars1[, .(
  mpg_mean = mean(mpg),
  mpg_sd = sd(mpg)
), by = cyl]

##      cyl mpg_mean   mpg_sd
##    <num>    <num>    <num>
## 1:     6 19.74286 1.453567
## 2:     4 26.66364 4.509828
## 3:     8 15.10000 2.560048

Find the average and sd of top 3 mpg values for each cyl group:

mtcars1[order(-mpg), .(
  mpg_top3_mean = mean(head(mpg, 3)),
  mpg_top3_sd = mean(head(mpg, 3))
), by = cyl]

##      cyl mpg_top3_mean mpg_top3_sd
##    <num>         <num>       <num>
## 1:     4      32.23333    32.23333
## 2:     6      21.13333    21.13333
## 3:     8      18.40000    18.40000

The top 3 values of mpg are calculated twice. We could rewrite the code so that it is only calculated once:

mtcars1[order(-mpg), {
  mpg_top3 <- head(mpg, 3)
  .(
    mpg_top3_mean = mean(mpg_top3),
    mpg_top3_sd = sd(mpg_top3)
  )
}, by = cyl]

##      cyl mpg_top3_mean mpg_top3_sd
##    <num>         <num>       <num>
## 1:     4      32.23333   1.7559423
## 2:     6      21.13333   0.2309401
## 3:     8      18.40000   0.9848858

We actually supplied a customized j expression above. Since by= is used here, data.table will try to return a new data.table where the content of each row comes from the result of j.

To know more about how j is evaluated, we could even insert some logging calls in j.

mtcars1[order(-mpg), {
  cat("cyl ", cyl, ", mpg: ", toString(mpg), "\n", sep = "")
  mpg_top3 <- head(mpg, 3)
  .(
    mpg_top3_mean = mean(mpg_top3),
    mpg_top3_sd = sd(mpg_top3)
  )
}, by = cyl]

## cyl 4, mpg: 33.9, 32.4, 30.4, 30.4, 27.3, 26, 24.4, 22.8, 22.8, 21.5, 21.4
## cyl 6, mpg: 21.4, 21, 21, 19.7, 19.2, 18.1, 17.8
## cyl 8, mpg: 19.2, 18.7, 17.3, 16.4, 15.8, 15.5, 15.2, 15.2, 15, 14.7, 14.3, 13.3, 10.4, 10.4

##      cyl mpg_top3_mean mpg_top3_sd
##    <num>         <num>       <num>
## 1:     4      32.23333   1.7559423
## 2:     6      21.13333   0.2309401
## 3:     8      18.40000   0.9848858

The messages printed above demonstrates the order of execution for each group. In fact, tons of things just happened under the hood, but we don’t need to know those details at the moment. When i and j are supplied, all we need is to make j work for one group.

Do a linear regression (mpg ~ cyl + disp + hp) in each vs group and show the coefficients:

mtcars1[, {
  m <- lm(mpg ~ cyl + disp + hp)
  coef(m)
}, by = vs]

##       vs          V1
##    <num>       <num>
## 1:     0 32.56839973
## 2:     0 -1.22126557
## 3:     0 -0.01520321
## 4:     0 -0.01154542
## 5:     1 38.05114111
## 6:     1  0.50676084
## 7:     1 -0.05396641
## 8:     1 -0.09481889

For each vs group, the result of coef(m) is a simple numeric vector of length 4. The result will be regarded as 4 rows. The names of coef(m) will be discarded as there is no place to hold them.

To make the this row-wise coefficient table look better, we could rewrite the code as following:

mtcars1[, {
  m <- lm(mpg ~ cyl + disp + hp)
  beta <- coef(m)
  .(x = names(beta), beta = beta)
}, by = vs]

##       vs           x        beta
##    <num>      <char>       <num>
## 1:     0 (Intercept) 32.56839973
## 2:     0         cyl -1.22126557
## 3:     0        disp -0.01520321
## 4:     0          hp -0.01154542
## 5:     1 (Intercept) 38.05114111
## 6:     1         cyl  0.50676084
## 7:     1        disp -0.05396641
## 8:     1          hp -0.09481889

An alternative, if not better, way to do this is simply convert the result to a list:

mtcars1[, {
  m <- lm(mpg ~ cyl + disp + hp)
  as.list(coef(m))
}, by = vs]

##       vs (Intercept)        cyl        disp          hp
##    <num>       <num>      <num>       <num>       <num>
## 1:     0    32.56840 -1.2212656 -0.01520321 -0.01154542
## 2:     1    38.05114  0.5067608 -0.05396641 -0.09481889

In this example, we do a linear regression and extract the coefficients. Note that we convert the result of coef(m) from a numeric vector to a list so that the cofficients become columns instead of rows.

We can also extend the result by including some summary statistics reported in summary(m):

mtcars1[, {
  m <- lm(mpg ~ cyl + disp + hp)
  sm <- summary(m)
  c(as.list(coef(m)), r2 = sm$r.squared, adj_r2 = sm$adj.r.squared)
}, by = vs]

##       vs (Intercept)        cyl        disp          hp        r2    adj_r2
##    <num>       <num>      <num>       <num>       <num>     <num>     <num>
## 1:     0    32.56840 -1.2212656 -0.01520321 -0.01154542 0.7535160 0.7006980
## 2:     1    38.05114  0.5067608 -0.05396641 -0.09481889 0.6825405 0.5873026

From the above examples, we can see that data.table syntax is extremely flexible and powerful. In fact, it is only a basic part of the features it has to offer. If we know more about key/index, reshaping, special symbols like .SD, we can do a lot more with data.table.