Learning data.table: Reference semantics and its pros and cons

In the previous post, I reviewed the basic syntax and design of data.table. In this post, let’s take a closer look at the := operator and its reference semantics. It is one of the key features that sets data.table apart from other data manipulation packages in R.

data.table has prioritized performance from the very beginning. It is designed to be fast and memory-efficient when handling large data sets. To achieve this, data.table uses reference semantics for the := operator and for some other functions, such as the set* family. This is a very different design from the base R data frame.

In base R, we can use $ to add a column to a data frame.

df <- data.frame(x = 1:3, y = 4:6)
df$z <- 7:9
df

##   x y z
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9

The equivalent code in data.table is

library(data.table)
dt <- data.table(x = 1:3, y = 4:6)
dt[, z := 7:9]
dt

##        x     y     z
##    <int> <int> <int>
## 1:     1     4     7
## 2:     2     5     8
## 3:     3     6     9

Although both add a new column z to the original data, the two methods differ greatly in their semantics. In base R, whether the original data is modified depends on where you perform the operation. In the example above, the new column z is added to the original data frame df. However, if df is passed to a function, then using the same $ operator inside the function will not modify the original data frame; instead, it creates a local (shallow) copy of df and adds the column to that copy.

add_col <- function(df) {
  df$w <- 8:10
  df
}

df2 <- add_col(df)
df

##   x y z
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9

df2

##   x y z  w
## 1 1 4 7  8
## 2 2 5 8  9
## 3 3 6 9 10

This is the so-called copy-on-modify semantics, and it applies throughout base R. As another example, instead of adding a new column, we can modify a value in an existing column.

modify_col <- function(df) {
  df$x[1] <- 0
  df
}

df3 <- modify_col(df)
df

##   x y z
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9

df3

##   x y z
## 1 0 4 7
## 2 2 5 8
## 3 3 6 9

The copy-on-modify mechanism works again: the original data frame df is not modified by the function modify_col(); only a local copy of x inside a local copy of df is modified.
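
One way to see the copy happen is to compare memory addresses before and after the modification. The sketch below is an illustration rather than part of the original example: it assumes data.table's address() helper is an acceptable way to inspect a plain data frame, and the names df_demo and addr_inside_after_modify are hypothetical.

```r
library(data.table)  # used here only for address()

df_demo <- data.frame(x = 1:3, y = 4:6)
addr_before <- address(df_demo)

# Hypothetical helper: modify the argument, then report where the
# modified object lives in memory.
addr_inside_after_modify <- function(d) {
  d$w <- 8:10
  address(d)
}

addr_inside <- addr_inside_after_modify(df_demo)

identical(addr_before, addr_inside)       # FALSE: the function worked on a copy
identical(addr_before, address(df_demo))  # TRUE: the caller's object is untouched
names(df_demo)                            # "x" "y"
```

The first comparison is FALSE because the modified data frame inside the function is a different object, while the caller's df_demo keeps both its address and its original columns.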

In contrast, data.table uses reference semantics for the := operator. This means that the original data is modified in place no matter where we modify the data.table object. This greatly reduces the overhead of copying data structures, which is a major reason why base R operations can be slow on large data sets unless copies are carefully minimized.

To demonstrate the reference semantics, let’s pass the data.table object dt created above to a function that adds a column with :=.

dt_add_col <- function(dt) {
  dt[, w := 8:10]
  dt
}

dt2 <- dt_add_col(dt)
print(dt)

##        x     y     z     w
##    <int> <int> <int> <int>
## 1:     1     4     7     8
## 2:     2     5     8     9
## 3:     3     6     9    10

print(dt2)

##        x     y     z     w
##    <int> <int> <int> <int>
## 1:     1     4     7     8
## 2:     2     5     8     9
## 3:     3     6     9    10

The original data table dt is modified by the function dt_add_col(). This is because the := operator is a special operator that modifies the original data table in place. In other words, dt is not copied when it is passed into the function dt_add_col, nor is it copied when the := operator is used to add a new column w. In fact, dt and dt2 (returned from the function) are referring to exactly the same object.
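
That claim can be checked directly. The following is a minimal sketch, assuming data.table's address() and copy() functions; the names dt_demo, add_w and add_v_safely are made up for illustration. It also shows one way to opt out of the in-place behavior, by taking an explicit copy() inside the function.

```r
library(data.table)

dt_demo <- data.table(x = 1:3, y = 4:6)

# Hypothetical function that adds a column by reference.
add_w <- function(d) {
  d[, w := 8:10]
  d
}

res <- add_w(dt_demo)
identical(address(dt_demo), address(res))  # TRUE: both names point to one object
names(dt_demo)                             # "x" "y" "w"

# Hypothetical variant that works on a deep copy instead.
add_v_safely <- function(d) {
  d <- copy(d)       # the local d no longer shares data with the caller's object
  d[, v := 11:13]
  d
}

res_safe <- add_v_safely(dt_demo)
names(dt_demo)   # "x" "y" "w"      (no v in the caller's table)
names(res_safe)  # "x" "y" "w" "v"
```

The copy() call gives up the memory savings of reference semantics, so it is probably best reserved for functions that must not have side effects on their inputs.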

Besides adding a new column, we can also use := to modify existing columns. We can use i to specify which rows to modify:

dt[1, x := 0]
dt

##        x     y     z     w
##    <int> <int> <int> <int>
## 1:     0     4     7     8
## 2:     2     5     8     9
## 3:     3     6     9    10

or use a logical condition in i.

dt[x <= 2, y := x]
dt

##        x     y     z     w
##    <int> <int> <int> <int>
## 1:     0     0     7     8
## 2:     2     2     8     9
## 3:     3     6     9    10

Consistent with the basic data.table syntax, all columns can be used directly on the right-hand side (RHS) of :=. In addition, := works nicely with by= grouping: the RHS is evaluated within each group, and the result is assigned in place to the left-hand side (LHS) column of the original data.table.

dt <- data.table(group = c("A", "A", "A", "B", "B"), x = c(1, 3, 2, 4, 5), y = 5:1)
dt[, x_std := (x - min(x)) / (max(x) - min(x)), by = group]
dt

##     group     x     y x_std
##    <char> <num> <int> <num>
## 1:      A     1     5   0.0
## 2:      A     3     4   1.0
## 3:      A     2     3   0.5
## 4:      B     4     2   0.0
## 5:      B     5     1   1.0

To summarize, the copy-on-modify semantics in base R preserves the original data and is safer in many cases, but it can be slow and memory-intensive. The reference semantics in data.table provides a more efficient way to update data structures by modifying the data in place, but it can lead to unexpected results if you are not aware of the difference in semantics. The following example illustrates this.

Some users extract a certain column of a data.table and pass it to other functions for further processing.

x_std <- dt$x_std
print(x_std)

## [1] 0.0 1.0 0.5 0.0 1.0

However, if I later modify that column using := with a row condition in i or with grouping in by, the values are overwritten in place, so the change is visible through every symbol that refers to the same data.

dt[, x_std := x / max(x), by = group]
dt

##     group     x     y     x_std
##    <char> <num> <int>     <num>
## 1:      A     1     5 0.3333333
## 2:      A     3     4 1.0000000
## 3:      A     2     3 0.6666667
## 4:      B     4     2 0.8000000
## 5:      B     5     1 1.0000000

Now dt$x_std is modified, and so is the x_std vector we extracted from dt earlier.

x_std

## [1] 0.3333333 1.0000000 0.6666667 0.8000000 1.0000000

This behavior may surprise many users, but it is consistent with the reference semantics of data.table and with the behavior of the assignment operator <- in base R: assignment does not copy but creates another symbol that refers to exactly the same object. If that object is then modified in place, you will observe the change through any of the symbols referring to it.
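
If an independent snapshot of a column is needed, one option is to copy it explicitly at extraction time. The following sketch assumes data.table's copy() function; the names dt_demo and snapshot are hypothetical, and the data mirrors the example above.

```r
library(data.table)

dt_demo <- data.table(group = c("A", "A", "A", "B", "B"), x = c(1, 3, 2, 4, 5))
dt_demo[, x_std := (x - min(x)) / (max(x) - min(x)), by = group]

# copy() makes a deep copy, so snapshot does not share memory with the column.
snapshot <- copy(dt_demo$x_std)

# Grouped update by reference, as in the example above.
dt_demo[, x_std := x / max(x), by = group]

snapshot       # 0.0 1.0 0.5 0.0 1.0  (still the old values)
dt_demo$x_std  # the updated values
```

The extra copy costs memory in proportion to the column length, which is the trade-off the copy-on-modify design makes implicitly.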

The data.table authors try to make the reference semantics explicit and easy to understand, and there is a dedicated vignette that introduces the concept and demonstrates the behavior. Users should be careful about mixing base R and data.table semantics, and preferably avoid it. Whenever you use := or set*, be aware of the reference semantics and of the possibility that all other symbols referring to the same data will observe the change.
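
The set* functions follow the same rule. As a small illustration (not taken from the original examples), the sketch below calls setnames() and set() inside a function and then inspects the caller's object; dt_demo and tidy_up are hypothetical names.

```r
library(data.table)

dt_demo <- data.table(x = 1:3, y = 4:6)

# Hypothetical function: renames a column and overwrites one cell,
# both by reference.
tidy_up <- function(d) {
  setnames(d, "x", "id")
  set(d, i = 1L, j = "y", value = 0L)
  invisible(NULL)  # nothing is returned, yet the caller's data changes
}

tidy_up(dt_demo)
names(dt_demo)  # "id" "y"
dt_demo$y       # 0 5 6
```

Even though tidy_up() returns nothing, the caller's table has a renamed column and a changed value, which is why functions that use := or set* are worth documenting as having side effects.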