In my every day work, data.table has been playing a crucial role to work with tabular data for more than 8 years. It has its pros and cons in terms of syntax and extensibility, yet I don’t find a better alternative to it when it comes to handling large in-memory tables with great performance.

For beginners, it might be a bit challenging to learn the syntax of `data.table`

in the first place.
In this post, I will try to motivate and explain the syntax and its design in a simple way.

If you are an absolute beginner and only know a little base R syntax, your data wrangling could look like this:

- View the first few rows of the built-in
`mtcars`

dataset:

`head(mtcars)`

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
```

- Filter
`mtcars`

so that we only keep the cars with`mpg`

greater than its 80% quantile:

`mtcars[mtcars$mpg > quantile(mtcars$mpg, 0.8), ]`

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
```

- Select the first 5 rows and only keep a few columns:

`mtcars[1:5, c("mpg", "cyl", "disp", "hp")]`

```
## mpg cyl disp hp
## Mazda RX4 21.0 6 160 110
## Mazda RX4 Wag 21.0 6 160 110
## Datsun 710 22.8 4 108 93
## Hornet 4 Drive 21.4 6 258 110
## Hornet Sportabout 18.7 8 360 175
```

- Sort the rows by
`mpg`

in descending order, only keep a few columns, and view the top five rows:

`mtcars[order(-mtcars$mpg), c("mpg", "cyl", "disp", "hp")][1:5, ]`

```
## mpg cyl disp hp
## Toyota Corolla 33.9 4 71.1 65
## Fiat 128 32.4 4 78.7 66
## Honda Civic 30.4 4 75.7 52
## Lotus Europa 30.4 4 95.1 113
## Fiat X1-9 27.3 4 79.0 66
```

The base R syntax `data[i, j]`

is quite intuitive and easy to understand:

`i`

: filter rows by index or condition or reorder rows`j`

: select columns by index or name

However, if you want to do something more complex, the syntax will be more complicated too.

For example, what is the average `mpg`

of cars with each number of `cyl`

? If only base R
functions are allowed, you might write something like this:

`tapply(mtcars$mpg, mtcars$cyl, mean)`

```
## 4 6 8
## 26.66364 19.74286 15.10000
```

Yes, `tapply`

does the trick of grouping. It discovers that `mtcars$cyl`

has 3 unique values and
calculates the average `mpg`

for each group.

What about for each `cyl`

value, find the value of `disp`

of the car with the highest `mpg`

?
Again, you might write something like this:

```
tapply(seq_len(nrow(mtcars)), mtcars$cyl, function(i) {
mtcars$disp[which.max(mtcars$mpg[i])]
})
```

```
## 4 6 8
## 225.0 108.0 275.8
```

Maybe there is a better way to do this in base R, but it will certainly take some time for a stranger to understand its intention when the above code is provided. It is hard to write because you have to twist your brain to think in a lower-level way of how vectors involved should be carefully subsetted. As a result, those nested vector operations work but at the expense of readability.

If we use data.table, first we need to attach the library and convert `mtcars`

to a data.table
so that data.table syntax can be used:

```
library(data.table)
mtcars1 <- as.data.table(mtcars)
```

Now we write the following code to do the same job:

`mtcars1[order(-mpg), .(disp = first(disp)), by = cyl]`

```
## cyl disp
## <num> <num>
## 1: 4 71.1
## 2: 6 258.0
## 3: 8 400.0
```

The syntax can be read: order the data by `mpg`

in descending order,
select the first value of `disp`

by each `cyl`

group. This is exactly what we want.

The `.()`

call seems a bit quirky at first glance. It is basically a shortcut for `list()`

.

Wait, the results are different? Something must be wrong.

We just made a subtle mistake. In the base R code, `which.max(mtcars$mpg[i])`

returns a vector of
the same length of `i`

as a subset of mtcars row numbers grouped by `cyl`

,
we must do the same subsetting with `mtcars$disp`

so that the indexing is consistent.

```
tapply(seq_len(nrow(mtcars)), mtcars$cyl, function(i) {
mtcars$disp[i][which.max(mtcars$mpg[i])]
})
```

```
## 4 6 8
## 71.1 258.0 400.0
```

Now it yields a consistent result. As we write some more verbose and less expressive code, it is much more likely to make mistakes.

Now we can take a closer look at the data.table syntax and see how it is a fundemental improvement over base R syntax. The syntax is as simple as the following:

`data[i, j, by]`

`i`

: filter or order rows.`j`

: any expression to be evaluated on the filtered/ordered rows in each`by`

group.`by`

: group by variables.

In data.table syntax, `i`

, `j`

, and `by`

all are evaluated in the scope of the supplied data so that the
variable name of the data itself (e.g. `mtcars`

) is no longer needed. The syntax is simple yet powerful
enough to handle many data wrangling tasks in a concise way. In most cases, complicated tasks could
be broken down into smaller pieces where each could be handled by easier steps.

Following are some more examples on `mtcars1`

to demonstrate how flexible the syntax is.

- Find the average and sd of
`mpg`

for each`cyl`

group:

```
mtcars1[, .(
mpg_mean = mean(mpg),
mpg_sd = sd(mpg)
), by = cyl]
```

```
## cyl mpg_mean mpg_sd
## <num> <num> <num>
## 1: 6 19.74286 1.453567
## 2: 4 26.66364 4.509828
## 3: 8 15.10000 2.560048
```

- Find the average and sd of top 3
`mpg`

values for each`cyl`

group:

```
mtcars1[order(-mpg), .(
mpg_top3_mean = mean(head(mpg, 3)),
mpg_top3_sd = mean(head(mpg, 3))
), by = cyl]
```

```
## cyl mpg_top3_mean mpg_top3_sd
## <num> <num> <num>
## 1: 4 32.23333 32.23333
## 2: 6 21.13333 21.13333
## 3: 8 18.40000 18.40000
```

The top 3 values of `mpg`

are calculated twice. We could rewrite the code so that it is only
calculated once:

```
mtcars1[order(-mpg), {
mpg_top3 <- head(mpg, 3)
.(
mpg_top3_mean = mean(mpg_top3),
mpg_top3_sd = sd(mpg_top3)
)
}, by = cyl]
```

```
## cyl mpg_top3_mean mpg_top3_sd
## <num> <num> <num>
## 1: 4 32.23333 1.7559423
## 2: 6 21.13333 0.2309401
## 3: 8 18.40000 0.9848858
```

We actually supplied a customized `j`

expression above. Since `by=`

is used here, data.table will try
to return a new data.table where the content of each row comes from the result of `j`

.

To know more about how `j`

is evaluated, we could even insert some logging calls in `j`

.

```
mtcars1[order(-mpg), {
cat("cyl ", cyl, ", mpg: ", toString(mpg), "\n", sep = "")
mpg_top3 <- head(mpg, 3)
.(
mpg_top3_mean = mean(mpg_top3),
mpg_top3_sd = sd(mpg_top3)
)
}, by = cyl]
```

```
## cyl 4, mpg: 33.9, 32.4, 30.4, 30.4, 27.3, 26, 24.4, 22.8, 22.8, 21.5, 21.4
## cyl 6, mpg: 21.4, 21, 21, 19.7, 19.2, 18.1, 17.8
## cyl 8, mpg: 19.2, 18.7, 17.3, 16.4, 15.8, 15.5, 15.2, 15.2, 15, 14.7, 14.3, 13.3, 10.4, 10.4
```

```
## cyl mpg_top3_mean mpg_top3_sd
## <num> <num> <num>
## 1: 4 32.23333 1.7559423
## 2: 6 21.13333 0.2309401
## 3: 8 18.40000 0.9848858
```

The messages printed above demonstrates the order of execution for each group. In fact, tons of things
just happened under the hood, but we don’t need to know those details at the moment. When `i`

and `j`

are
supplied, all we need is to make `j`

work for one group.

- Do a linear regression (
`mpg ~ cyl + disp + hp`

) in each`vs`

group and show the coefficients:

```
mtcars1[, {
m <- lm(mpg ~ cyl + disp + hp)
coef(m)
}, by = vs]
```

```
## vs V1
## <num> <num>
## 1: 0 32.56839973
## 2: 0 -1.22126557
## 3: 0 -0.01520321
## 4: 0 -0.01154542
## 5: 1 38.05114111
## 6: 1 0.50676084
## 7: 1 -0.05396641
## 8: 1 -0.09481889
```

For each `vs`

group, the result of `coef(m)`

is a simple numeric vector of length 4. The result
will be regarded as 4 rows. The names of `coef(m)`

will be discarded as there is no place to hold them.

To make the this row-wise coefficient table look better, we could rewrite the code as following:

```
mtcars1[, {
m <- lm(mpg ~ cyl + disp + hp)
beta <- coef(m)
.(x = names(beta), beta = beta)
}, by = vs]
```

```
## vs x beta
## <num> <char> <num>
## 1: 0 (Intercept) 32.56839973
## 2: 0 cyl -1.22126557
## 3: 0 disp -0.01520321
## 4: 0 hp -0.01154542
## 5: 1 (Intercept) 38.05114111
## 6: 1 cyl 0.50676084
## 7: 1 disp -0.05396641
## 8: 1 hp -0.09481889
```

An alternative, if not better, way to do this is simply convert the result to a `list`

:

```
mtcars1[, {
m <- lm(mpg ~ cyl + disp + hp)
as.list(coef(m))
}, by = vs]
```

```
## vs (Intercept) cyl disp hp
## <num> <num> <num> <num> <num>
## 1: 0 32.56840 -1.2212656 -0.01520321 -0.01154542
## 2: 1 38.05114 0.5067608 -0.05396641 -0.09481889
```

In this example, we do a linear regression and extract the coefficients. Note that we convert the
result of `coef(m)`

from a numeric vector to a `list`

so that the cofficients become columns instead of rows.

We can also extend the result by including some summary statistics reported in `summary(m)`

:

```
mtcars1[, {
m <- lm(mpg ~ cyl + disp + hp)
sm <- summary(m)
c(as.list(coef(m)), r2 = sm$r.squared, adj_r2 = sm$adj.r.squared)
}, by = vs]
```

```
## vs (Intercept) cyl disp hp r2 adj_r2
## <num> <num> <num> <num> <num> <num> <num>
## 1: 0 32.56840 -1.2212656 -0.01520321 -0.01154542 0.7535160 0.7006980
## 2: 1 38.05114 0.5067608 -0.05396641 -0.09481889 0.6825405 0.5873026
```

From the above examples, we can see that data.table syntax is extremely flexible and powerful.
In fact, it is only a basic part of the features it has to offer. If we know more about key/index,
reshaping, special symbols like `.SD`

, we can do a lot more with data.table.