For R beginners, the first operator they use is probably the assignment operator <-
. Google’s R Style Guide suggests the usage of <-
rather than =
even though the equal sign is also allowed in R to do exactly the same thing when we assign a value to a variable. However, you might feel inconvenient because you need to type two characters to represent one symbol, which is different from many other programming languages.
As a result, many users ask Why we should use <-
as the assignment operator?
Here I provide a simple explanation to the subtle difference between <-
and =
in R.
First, let’s look at an example.
x <- rnorm(100)
y <- 2*x + rnorm(100)
lm(formula=y~x)
The above code uses both <-
and =
symbols, but the work they do are different. <-
in the first two lines are used as assignment operator while =
in the third line does not serves as assignment operator but an operator that specifies a named parameter formula
for lm
function.
In other words, <-
evaluates the the expression on its right side (rnorm(100)
) and assign the evaluated value to the symbol (variable) on the left side (x
) in the current environment. =
evaluates the expression on its right side (y~x
) and set the evaluated value to the parameter of the name specified on the left side (formula
) for a certain function.
We know that <-
and =
are perfectly equivalent when they are used as assignment operators.
Therefore, the above code is equivalent to the following code:
x = rnorm(100)
y = 2*x + rnorm(100)
lm(formula=y~x)
Here, we only use =
but for two different purposes: in the first and second lines we use =
as assignment operator and in the third line we use =
as a specifier of named parameter.
Now let’s see what happens if we change all =
symbols to <-
.
x <- rnorm(100)
y <- 2*x + rnorm(100)
lm(formula <- y~x)
If you run this code, you will find that the output are similar. But if you inspect the environment, you will observe the difference: a new variable formula
is defined in the environment whose value is y~x
. So what happens?
Actually, in the third line, two things happened: First, we introduce a new symbol (variable) formula
to the environment and assign it a formula-typed value y~x
. Then, the value of formula
is provided to the first paramter of function lm
rather than, accurately speaking, to the parameter named formula
, although this time they mean the identical parameter of the function.
To test it, we conduct an experiment. This time we first prepare the data.
x <- rnorm(100)
y <- 2*x+rnorm(100)
z <- 3*x+rnorm(100)
data <- data.frame(z,x,y)
rm(x,y,z)
Basically, we just did similar things as before except that we store all vectors in a data frame and clear those numeric vectors from the environment. We know that lm
function accepts a data frame as the data source when a formula is specified.
Standard usage:
lm(formula=z~x+y,data=data)
Working alternative where two named parameters are reordered:
lm(data=data,formula=z~x+y)
Working alternative with side effects that two new variable are defined:
lm(formula <- z~x+y, data <- data)
Nonworking example:
lm(data <- data, formula <- z~x+y)
The reason is exactly what I mentioned previously. We reassign data
to data
and give its value to the first argument (formula
) of lm
which only accepts a formula-typed value. We also try to assign z~x+y
to a new variable formula
and give it to the second argument (data
) of lm
which only accepts a data frame-typed value. Both types of the parameter we provide to lm
are wrong, so we receive the message:
Error in as.data.frame.default(data) :
cannot coerce class ""formula"" to a data.frame
From the above examples and experiments, the bottom line gets clear: to reduce ambiguity, we should use either <-
or =
as assignment operator, and only use =
as named-parameter specifier for functions.
In conclusion, for better readability of R code, I suggest that we only use <-
for assignment and =
for specifying named parameters.