In the previous post, I reviewed the basic syntax and design of data.table. In this post, let’s take a closer look at the := operator and its reference semantics. It is one of the key features that sets data.table apart from other data manipulation packages in R.
data.table has prioritized performance from the very beginning. It is designed to be fast and memory-efficient when handling large data sets. To achieve this, data.table uses reference semantics for the := operator and for some other functions such as the set* family. This is a very different design from that of the base R data frame.
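As a minimal sketch of what "by reference" means for the set* family, the following uses set() and setnames(), together with data.table's address() helper, to check that the object stays at the same memory address after each call. The object name dt_demo and its values are made up for illustration.

```r
library(data.table)

# Hypothetical example data, not taken from the post
dt_demo <- data.table(x = 1:3, y = 4:6)
addr_before <- address(dt_demo)

# set() assigns a column by reference, similar to `:=`
set(dt_demo, j = "z", value = 7:9)

# setnames() renames columns by reference
setnames(dt_demo, old = "z", new = "z_new")

addr_after <- address(dt_demo)

# The object was updated in place, so the address should be unchanged
identical(addr_before, addr_after)
names(dt_demo)
```

Neither call uses <- to reassign dt_demo, yet the object gains and renames a column.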
In base R, we can use $ to add a column to a data frame.
df <- data.frame(x = 1:3, y = 4:6)
df$z <- 7:9
df
## x y z
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
The equivalent code in data.table is
library(data.table)
dt <- data.table(x = 1:3, y = 4:6)
dt[, z := 7:9]
dt
## x y z
## <int> <int> <int>
## 1: 1 4 7
## 2: 2 5 8
## 3: 3 6 9
Although both achieve the same goal of adding a new column z to the original data, the two methods have very different semantics. In base R, whether the original data is modified depends on where you perform the operation. In the above example, a new column z is added to the original data frame df. However, if df is passed to a function, then using the same $ operator inside the function will not modify the original data frame; instead, it creates a local (shallow) copy of df and adds the column to that copy.
add_col <- function(df) {
  df$w <- 8:10
  df
}
df2 <- add_col(df)
df
## x y z
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
df2
## x y z w
## 1 1 4 7 8
## 2 2 5 8 9
## 3 3 6 9 10
This is the so-called copy-on-modify semantics, which applies widely across base R functions. Another example is modifying a value of an existing column rather than adding a new column.
modify_col <- function(df) {
  df$x[1] <- 0
  df
}
df3 <- modify_col(df)
df
## x y z
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
df3
## x y z
## 1 0 4 7
## 2 2 5 8
## 3 3 6 9
The copy-on-modify mechanism applies again: the original data frame df is not modified by the function modify_col(); only a local copy of x in a local copy of df is modified.
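One way to observe when the copy happens is to compare memory addresses before and after the modification. The sketch below borrows data.table's address() helper purely as an inspection tool; df_demo and report_addresses() are hypothetical names introduced for illustration.

```r
library(data.table)  # only used here for its address() helper

df_demo <- data.frame(x = 1:3, y = 4:6)

report_addresses <- function(d) {
  on_entry <- address(d)      # no copy yet: `d` is still the caller's object
  d$w <- 8:10                 # the modification triggers a (shallow) copy
  after_modify <- address(d)  # `d` now refers to a new object
  c(on_entry = on_entry, after_modify = after_modify)
}

addrs <- report_addresses(df_demo)

# Passing the data frame to the function did not copy it ...
addrs[["on_entry"]] == address(df_demo)
# ... but modifying it inside the function did
addrs[["after_modify"]] != addrs[["on_entry"]]
# The caller's data frame is left untouched
names(df_demo)
```

So the copy is made lazily, at the point of modification rather than at the function call.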
In contrast, data.table uses reference semantics for the := operator. This means that the original data is modified in place no matter where we modify the data.table object. This greatly reduces the overhead of copying data structures, which is a major reason why base R operations can be slow on large data sets unless copying is carefully minimized.
To demonstrate the reference semantics, let’s pass the data.table object dt created above to a function that adds a column with :=.
dt_add_col <- function(dt) {
  dt[, w := 8:10]
  dt
}
dt2 <- dt_add_col(dt)
print(dt)
## x y z w
## <int> <int> <int> <int>
## 1: 1 4 7 8
## 2: 2 5 8 9
## 3: 3 6 9 10
print(dt2)
## x y z w
## <int> <int> <int> <int>
## 1: 1 4 7 8
## 2: 2 5 8 9
## 3: 3 6 9 10
The original data table dt is modified by the function dt_add_col(). This is because := is a special operator that modifies the original data table in place. In other words, dt is not copied when it is passed into the function dt_add_col(), nor is it copied when the := operator is used to add a new column w. In fact, dt and dt2 (returned from the function) refer to exactly the same object.
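The claim that both symbols refer to the same object can be checked with data.table's address() function, and copy() can be used when a function should not touch its input. The following is a small sketch with hypothetical names (dt_orig, dt_base, add_w) so that it does not interfere with the dt used in this post.

```r
library(data.table)

dt_orig <- data.table(x = 1:3, y = 4:6)

add_w <- function(d) {
  d[, w := 8:10]  # modifies the caller's object in place
  d
}

# Without copy(): the returned object is the very same object
dt_same <- add_w(dt_orig)
identical(address(dt_orig), address(dt_same))

# With copy(): the function works on an independent deep copy
dt_base <- data.table(x = 1:3, y = 4:6)
dt_new <- add_w(copy(dt_base))
names(dt_base)  # still only x and y
names(dt_new)   # x, y and w
```

Calling copy() explicitly gives up the performance benefit for that call, so it is best reserved for cases where the original really must be preserved.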
Besides adding a new column, we can also use := to modify existing columns. We can use i to specify which rows to modify:
dt[1, x := 0]
dt
## x y z w
## <int> <int> <int> <int>
## 1: 0 4 7 8
## 2: 2 5 8 9
## 3: 3 6 9 10
or use a logical condition in i:
dt[x <= 2, y := x]
dt
## x y z w
## <int> <int> <int> <int>
## 1: 0 0 7 8
## 2: 2 2 8 9
## 3: 3 6 9 10
Consistent with the basic data.table syntax, all columns can be used directly on the right-hand side (RHS) of :=. In addition, := works nicely with by= grouping: the RHS of := is calculated within each group, and the result is assigned in place to the left-hand-side (LHS) column of the original data.table.
dt <- data.table(group = c("A", "A", "A", "B", "B"), x = c(1, 3, 2, 4, 5), y = 5:1)
dt[, x_std := (x - min(x)) / (max(x) - min(x)), by = group]
dt
## group x y x_std
## <char> <num> <int> <num>
## 1: A 1 5 0.0
## 2: A 3 4 1.0
## 3: A 2 3 0.5
## 4: B 4 2 0.0
## 5: B 5 1 1.0
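The same grouped update extends to several columns at once. As a brief sketch, the example below shows the two forms data.table offers for this: a character vector of names on the LHS with a list on the RHS, and the functional form of :=. The object dt_grp and the new column names are made up for illustration.

```r
library(data.table)

dt_grp <- data.table(group = c("A", "A", "A", "B", "B"), x = c(1, 3, 2, 4, 5))

# Character vector of names on the LHS, list of values on the RHS
dt_grp[, c("x_min", "x_max") := list(min(x), max(x)), by = group]

# Functional form: `:=`(name = value, ...)
dt_grp[, `:=`(x_range = x_max - x_min, n = .N), by = group]

dt_grp
```

In both forms, each group-level value is recycled to all rows of its group, and all new columns are added by reference.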
To summarize, the copy-on-modify semantics in base R preserve the original data and can be safer in many cases, but they can be slow and memory-intensive. The reference semantics in data.table provide a more efficient way to update data structures by modifying the data in place, which might lead to unexpected results if you are not aware of the difference in semantics. The following is an example:
Some users extract a certain column of a data.table and pass it to other functions for further processing.
x_std <- dt$x_std
print(x_std)
## [1] 0.0 1.0 0.5 0.0 1.0
However, if I modify the original column using := with a condition i or grouping by, the original data is modified, and so are all symbols referring to the same data.
dt[, x_std := x / max(x), by = group]
dt
## group x y x_std
## <char> <num> <int> <num>
## 1: A 1 5 0.3333333
## 2: A 3 4 1.0000000
## 3: A 2 3 0.6666667
## 4: B 4 2 0.8000000
## 5: B 5 1 1.0000000
Now dt$x_std is modified, and so is the x_std we extracted from dt before.
x_std
## [1] 0.3333333 1.0000000 0.6666667 0.8000000 1.0000000
The behavior may be surprising to many users but is consistent with the reference semantics of data.table and the behavior of the assignment operator <- in base R: assignment does not copy but creates another symbol that refers to exactly the same object. If the object is modified in place, you will observe the change via any of the symbols referring to that object.
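If an extracted column needs to stay fixed, one option is to take an explicit copy at extraction time. The sketch below assumes that data.table's copy() is applied to a plain vector (it duplicates the object it is given); dt_demo, x_alias and x_snap are hypothetical names used only here.

```r
library(data.table)

dt_demo <- data.table(group = c("A", "A", "A", "B", "B"), x = c(1, 3, 2, 4, 5))
dt_demo[, x_std := (x - min(x)) / (max(x) - min(x)), by = group]

x_alias <- dt_demo$x_std        # another symbol for the same vector
x_snap  <- copy(dt_demo$x_std)  # an independent snapshot

# A grouped update by reference overwrites the existing column's memory
dt_demo[, x_std := x / max(x), by = group]

x_alias  # follows the update, as in the example above
x_snap   # still holds the values from before the update
```

The snapshot costs one extra allocation, which is usually a small price for a vector that is handed to other code.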
The data.table authors are trying their best to make the reference semantics more explicit and easier to understand, and there is a dedicated vignette that introduces the concept and demonstrates the behavior. Users should be careful about, and preferably avoid, mixing base R and data.table semantics. Whenever you use := or set*, you should be aware of the reference semantics and the possibility that all other symbols referring to the same data will observe the change.