In my every day work, data.table has been playing a crucial role to work with tabular data for more than 8 years. It has its pros and cons in terms of syntax and extensibility, yet I don’t find a better alternative to it when it comes to handling large in-memory tables with great performance.
For beginners, it might be a bit challenging to learn the syntax of data.table
in the first place.
In this post, I will try to motivate and explain the syntax and its design in a simple way.
If you are an absolute beginner and only know a little base R syntax, your data wrangling could look like this:
- View the first few rows of the built-in
mtcars
dataset:
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
- Filter
mtcars
so that we only keep the cars withmpg
greater than its 80% quantile:
mtcars[mtcars$mpg > quantile(mtcars$mpg, 0.8), ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
- Select the first 5 rows and only keep a few columns:
mtcars[1:5, c("mpg", "cyl", "disp", "hp")]
## mpg cyl disp hp
## Mazda RX4 21.0 6 160 110
## Mazda RX4 Wag 21.0 6 160 110
## Datsun 710 22.8 4 108 93
## Hornet 4 Drive 21.4 6 258 110
## Hornet Sportabout 18.7 8 360 175
- Sort the rows by
mpg
in descending order, only keep a few columns, and view the top five rows:
mtcars[order(-mtcars$mpg), c("mpg", "cyl", "disp", "hp")][1:5, ]
## mpg cyl disp hp
## Toyota Corolla 33.9 4 71.1 65
## Fiat 128 32.4 4 78.7 66
## Honda Civic 30.4 4 75.7 52
## Lotus Europa 30.4 4 95.1 113
## Fiat X1-9 27.3 4 79.0 66
The base R syntax data[i, j]
is quite intuitive and easy to understand:
i
: filter rows by index or condition or reorder rowsj
: select columns by index or name
However, if you want to do something more complex, the syntax will be more complicated too.
For example, what is the average mpg
of cars with each number of cyl
? If only base R
functions are allowed, you might write something like this:
tapply(mtcars$mpg, mtcars$cyl, mean)
## 4 6 8
## 26.66364 19.74286 15.10000
Yes, tapply
does the trick of grouping. It discovers that mtcars$cyl
has 3 unique values and
calculates the average mpg
for each group.
What about for each cyl
value, find the value of disp
of the car with the highest mpg
?
Again, you might write something like this:
tapply(seq_len(nrow(mtcars)), mtcars$cyl, function(i) {
mtcars$disp[which.max(mtcars$mpg[i])]
})
## 4 6 8
## 225.0 108.0 275.8
Maybe there is a better way to do this in base R, but it will certainly take some time for a stranger to understand its intention when the above code is provided. It is hard to write because you have to twist your brain to think in a lower-level way of how vectors involved should be carefully subsetted. As a result, those nested vector operations work but at the expense of readability.
If we use data.table, first we need to attach the library and convert mtcars
to a data.table
so that data.table syntax can be used:
library(data.table)
mtcars1 <- as.data.table(mtcars)
Now we write the following code to do the same job:
mtcars1[order(-mpg), .(disp = first(disp)), by = cyl]
## cyl disp
## <num> <num>
## 1: 4 71.1
## 2: 6 258.0
## 3: 8 400.0
The syntax can be read: order the data by mpg
in descending order,
select the first value of disp
by each cyl
group. This is exactly what we want.
The .()
call seems a bit quirky at first glance. It is basically a shortcut for list()
.
Wait, the results are different? Something must be wrong.
We just made a subtle mistake. In the base R code, which.max(mtcars$mpg[i])
returns a vector of
the same length of i
as a subset of mtcars row numbers grouped by cyl
,
we must do the same subsetting with mtcars$disp
so that the indexing is consistent.
tapply(seq_len(nrow(mtcars)), mtcars$cyl, function(i) {
mtcars$disp[i][which.max(mtcars$mpg[i])]
})
## 4 6 8
## 71.1 258.0 400.0
Now it yields a consistent result. As we write some more verbose and less expressive code, it is much more likely to make mistakes.
Now we can take a closer look at the data.table syntax and see how it is a fundemental improvement over base R syntax. The syntax is as simple as the following:
data[i, j, by]
i
: filter or order rows.j
: any expression to be evaluated on the filtered/ordered rows in eachby
group.by
: group by variables.
In data.table syntax, i
, j
, and by
all are evaluated in the scope of the supplied data so that the
variable name of the data itself (e.g. mtcars
) is no longer needed. The syntax is simple yet powerful
enough to handle many data wrangling tasks in a concise way. In most cases, complicated tasks could
be broken down into smaller pieces where each could be handled by easier steps.
Following are some more examples on mtcars1
to demonstrate how flexible the syntax is.
- Find the average and sd of
mpg
for eachcyl
group:
mtcars1[, .(
mpg_mean = mean(mpg),
mpg_sd = sd(mpg)
), by = cyl]
## cyl mpg_mean mpg_sd
## <num> <num> <num>
## 1: 6 19.74286 1.453567
## 2: 4 26.66364 4.509828
## 3: 8 15.10000 2.560048
- Find the average and sd of top 3
mpg
values for eachcyl
group:
mtcars1[order(-mpg), .(
mpg_top3_mean = mean(head(mpg, 3)),
mpg_top3_sd = mean(head(mpg, 3))
), by = cyl]
## cyl mpg_top3_mean mpg_top3_sd
## <num> <num> <num>
## 1: 4 32.23333 32.23333
## 2: 6 21.13333 21.13333
## 3: 8 18.40000 18.40000
The top 3 values of mpg
are calculated twice. We could rewrite the code so that it is only
calculated once:
mtcars1[order(-mpg), {
mpg_top3 <- head(mpg, 3)
.(
mpg_top3_mean = mean(mpg_top3),
mpg_top3_sd = sd(mpg_top3)
)
}, by = cyl]
## cyl mpg_top3_mean mpg_top3_sd
## <num> <num> <num>
## 1: 4 32.23333 1.7559423
## 2: 6 21.13333 0.2309401
## 3: 8 18.40000 0.9848858
We actually supplied a customized j
expression above. Since by=
is used here, data.table will try
to return a new data.table where the content of each row comes from the result of j
.
To know more about how j
is evaluated, we could even insert some logging calls in j
.
mtcars1[order(-mpg), {
cat("cyl ", cyl, ", mpg: ", toString(mpg), "\n", sep = "")
mpg_top3 <- head(mpg, 3)
.(
mpg_top3_mean = mean(mpg_top3),
mpg_top3_sd = sd(mpg_top3)
)
}, by = cyl]
## cyl 4, mpg: 33.9, 32.4, 30.4, 30.4, 27.3, 26, 24.4, 22.8, 22.8, 21.5, 21.4
## cyl 6, mpg: 21.4, 21, 21, 19.7, 19.2, 18.1, 17.8
## cyl 8, mpg: 19.2, 18.7, 17.3, 16.4, 15.8, 15.5, 15.2, 15.2, 15, 14.7, 14.3, 13.3, 10.4, 10.4
## cyl mpg_top3_mean mpg_top3_sd
## <num> <num> <num>
## 1: 4 32.23333 1.7559423
## 2: 6 21.13333 0.2309401
## 3: 8 18.40000 0.9848858
The messages printed above demonstrates the order of execution for each group. In fact, tons of things
just happened under the hood, but we don’t need to know those details at the moment. When i
and j
are
supplied, all we need is to make j
work for one group.
- Do a linear regression (
mpg ~ cyl + disp + hp
) in eachvs
group and show the coefficients:
mtcars1[, {
m <- lm(mpg ~ cyl + disp + hp)
coef(m)
}, by = vs]
## vs V1
## <num> <num>
## 1: 0 32.56839973
## 2: 0 -1.22126557
## 3: 0 -0.01520321
## 4: 0 -0.01154542
## 5: 1 38.05114111
## 6: 1 0.50676084
## 7: 1 -0.05396641
## 8: 1 -0.09481889
For each vs
group, the result of coef(m)
is a simple numeric vector of length 4. The result
will be regarded as 4 rows. The names of coef(m)
will be discarded as there is no place to hold them.
To make the this row-wise coefficient table look better, we could rewrite the code as following:
mtcars1[, {
m <- lm(mpg ~ cyl + disp + hp)
beta <- coef(m)
.(x = names(beta), beta = beta)
}, by = vs]
## vs x beta
## <num> <char> <num>
## 1: 0 (Intercept) 32.56839973
## 2: 0 cyl -1.22126557
## 3: 0 disp -0.01520321
## 4: 0 hp -0.01154542
## 5: 1 (Intercept) 38.05114111
## 6: 1 cyl 0.50676084
## 7: 1 disp -0.05396641
## 8: 1 hp -0.09481889
An alternative, if not better, way to do this is simply convert the result to a list
:
mtcars1[, {
m <- lm(mpg ~ cyl + disp + hp)
as.list(coef(m))
}, by = vs]
## vs (Intercept) cyl disp hp
## <num> <num> <num> <num> <num>
## 1: 0 32.56840 -1.2212656 -0.01520321 -0.01154542
## 2: 1 38.05114 0.5067608 -0.05396641 -0.09481889
In this example, we do a linear regression and extract the coefficients. Note that we convert the
result of coef(m)
from a numeric vector to a list
so that the cofficients become columns instead of rows.
We can also extend the result by including some summary statistics reported in summary(m)
:
mtcars1[, {
m <- lm(mpg ~ cyl + disp + hp)
sm <- summary(m)
c(as.list(coef(m)), r2 = sm$r.squared, adj_r2 = sm$adj.r.squared)
}, by = vs]
## vs (Intercept) cyl disp hp r2 adj_r2
## <num> <num> <num> <num> <num> <num> <num>
## 1: 0 32.56840 -1.2212656 -0.01520321 -0.01154542 0.7535160 0.7006980
## 2: 1 38.05114 0.5067608 -0.05396641 -0.09481889 0.6825405 0.5873026
From the above examples, we can see that data.table syntax is extremely flexible and powerful.
In fact, it is only a basic part of the features it has to offer. If we know more about key/index,
reshaping, special symbols like .SD
, we can do a lot more with data.table.