R’s meta-programming capabilities are one of my favorite aspects of R. We can alter a parsed expression like an ordinary object and evaluate it in a customized environment. This is the so-called non-standard evaluation, and it makes R flexible and powerful for data wrangling. However, things can easily get messy if standard and non-standard scoping are not carefully designed. Instead of manipulating expressions and scoping rules in fancy ways, we could analyze R code and see if there are potential issues before running it.
lintr is a good example of useful static code analysis. It provides a collection of linters, each of which suggests improvements to a certain aspect of code style or catches a certain kind of unsafe or unrecommended usage. The linters are mostly based on analyzing the parse data of user code.
languageserver is another example that heavily relies on static code analysis to provide code completion, function signatures, go-to-definition, finding references, renaming symbols, etc. for code editors that support the Language Server Protocol, as introduced in my previous article.
This article is a brief introduction to code analysis based on parse data. To begin with, we consider a simple example:
x <- 1
The above R expression is a simple assignment that assigns a numeric vector (a single number 1) to the symbol x. To look into this expression, we could use parse(), and we will get an R expression object, which is essentially a list of parsed R expressions.
expr <- parse(text = "x <- 1")
expr
## expression(x <- 1)
If we take a closer look at this expr object, we can find that it has some attributes, including links between the parsed expression and the underlying text.
str(expr)
## length 1 expression(x <- 1)
## - attr(*, "srcref")=List of 1
## ..$ : 'srcref' int [1:8] 1 1 1 6 1 6 1 1
## .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f81df2a7500>
## - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f81df2a7500>
## - attr(*, "wholeSrcref")= 'srcref' int [1:8] 1 0 2 0 0 0 1 2
## ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f81df2a7500>
In fact, these attributes are the ranges (start line, start column, end line, end column, etc.) of the parsed tokens in the text. To have a better organized view of the structure of the expression in terms of these token ranges, we could use getParseData() to obtain a data.frame which is easier to read.
pd <- getParseData(expr)
pd
## line1 col1 line2 col2 id parent token terminal text
## 7 1 1 1 6 7 0 expr FALSE
## 1 1 1 1 1 1 3 SYMBOL TRUE x
## 3 1 1 1 1 3 7 expr FALSE
## 2 1 3 1 4 2 7 LEFT_ASSIGN TRUE <-
## 4 1 6 1 6 4 5 NUM_CONST TRUE 1
## 5 1 6 1 6 5 7 expr FALSE
The above data.frame displays the structure of the whole expression. As its name suggests, it is the parse data of the expression, which provides the following information:
- The range of the token: line1, col1, line2 and col2.
- The parent of the token: if multiple tokens have the same parent, they belong to the same expression (see the small illustration after this list).
- The type of the token: each token has a type recognized by the parser.
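To see the parent relation in action, here is a small illustration (a sketch assuming the pd above): the direct children of the top-level expr can be found by matching the parent column against its id.

top_id <- pd[pd$parent == 0, "id"]                        # the top-level expr has parent 0
pd[pd$parent == top_id, c("token", "terminal", "text")]   # its direct children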
To extract all SYMBOL tokens, we could subset the parse data pd using pd$token:
pd[pd$token == "SYMBOL", ]
## line1 col1 line2 col2 id parent token terminal text
## 1 1 1 1 1 1 3 SYMBOL TRUE x
The following is an example with two expressions.
n <- 10
y <- rnorm(n, mean = 1)
It contains two assignments: the first line is a simple assignment, and the second line involves a function call with a symbol argument and a named argument.
Again, we use parse() to parse the text into expressions and getParseData() to obtain a well-organized table of parse data.
expr <- parse(text = "
n <- 10
y <- rnorm(n, mean = 1)
")
pd <- getParseData(expr)
pd
## line1 col1 line2 col2 id parent token terminal text
## 9 2 1 2 7 9 0 expr FALSE
## 3 2 1 2 1 3 5 SYMBOL TRUE n
## 5 2 1 2 1 5 9 expr FALSE
## 4 2 3 2 4 4 9 LEFT_ASSIGN TRUE <-
## 6 2 6 2 7 6 7 NUM_CONST TRUE 10
## 7 2 6 2 7 7 9 expr FALSE
## 34 3 1 3 23 34 0 expr FALSE
## 12 3 1 3 1 12 14 SYMBOL TRUE y
## 14 3 1 3 1 14 34 expr FALSE
## 13 3 3 3 4 13 34 LEFT_ASSIGN TRUE <-
## 32 3 6 3 23 32 34 expr FALSE
## 15 3 6 3 10 15 17 SYMBOL_FUNCTION_CALL TRUE rnorm
## 17 3 6 3 10 17 32 expr FALSE
## 16 3 11 3 11 16 32 '(' TRUE (
## 18 3 12 3 12 18 20 SYMBOL TRUE n
## 20 3 12 3 12 20 32 expr FALSE
## 19 3 13 3 13 19 32 ',' TRUE ,
## 24 3 15 3 18 24 32 SYMBOL_SUB TRUE mean
## 25 3 20 3 20 25 32 EQ_SUB TRUE =
## 26 3 22 3 22 26 27 NUM_CONST TRUE 1
## 27 3 22 3 22 27 32 expr FALSE
## 28 3 23 3 23 28 32 ')' TRUE )
Now the parse data looks much richer, as the data frame above has many more rows than in the previous example. As you could imagine, a typical R script with hundreds of lines of code could probably produce thousands of rows of such parse data.
Again, we could extract all SYMBOL tokens using the same method:
pd[pd$token == "SYMBOL", ]
## line1 col1 line2 col2 id parent token terminal text
## 3 2 1 2 1 3 5 SYMBOL TRUE n
## 12 3 1 3 1 12 14 SYMBOL TRUE y
## 18 3 12 3 12 18 20 SYMBOL TRUE n
Now we get three SYMBOL tokens, each appearing in a different location and playing a different role.
Similarly, if we want to know which functions are called in the parsed expressions, we could simply extract all SYMBOL_FUNCTION_CALL tokens from the parse data:
pd[pd$token == "SYMBOL_FUNCTION_CALL", ]
## line1 col1 line2 col2 id parent token terminal text
## 15 3 6 3 10 15 17 SYMBOL_FUNCTION_CALL TRUE rnorm
The above examples might be the simplest use of the parse data. In practical static code analysis, we usually need to do something more complex than that.
For example, what if we want to know all the symbols defined via the <- assignment operator?
This sounds less straightforward since it cannot be done with a simple subset of the parse data. Instead, we might be able to do it in multiple steps:
- Find all the <- assignment operators.
- Find the SYMBOL on the left of each <-.
But what does “on the left” mean?
It looks like we have to translate these operations into logical relationships between the tokens in terms of their ranges and parents:
- Find all the <- assignment operators (LEFT_ASSIGN as the token type).
- For each <-:
  - Look at its parent to know which expr it belongs to.
  - Find the first SYMBOL whose parent is that expr.
The following code is an attempt to find all assignment expressions:
assign_id <- pd[pd$token == "LEFT_ASSIGN", "parent"]
lapply(assign_id, function(id) {
pd[pd$parent == id, ]
})
## [[1]]
## line1 col1 line2 col2 id parent token terminal text
## 5 2 1 2 1 5 9 expr FALSE
## 4 2 3 2 4 4 9 LEFT_ASSIGN TRUE <-
## 7 2 6 2 7 7 9 expr FALSE
##
## [[2]]
## line1 col1 line2 col2 id parent token terminal text
## 14 3 1 3 1 14 34 expr FALSE
## 13 3 3 3 4 13 34 LEFT_ASSIGN TRUE <-
## 32 3 6 3 23 32 34 expr FALSE
To extract the LHS symbol in these assignment expressions, we could find the id of the first expr in each assignment expression, and see which SYMBOL has a parent of that id.
lapply(assign_id, function(id) {
expr_id <- pd[pd$parent == id, "id"][1]
pd[pd$token == "SYMBOL" & pd$parent == expr_id, ]
})
## [[1]]
## line1 col1 line2 col2 id parent token terminal text
## 3 2 1 2 1 3 5 SYMBOL TRUE n
##
## [[2]]
## line1 col1 line2 col2 id parent token terminal text
## 12 3 1 3 1 12 14 SYMBOL TRUE y
Now we get all the SYMBOL tokens defined in the parsed expressions. The above method should work with most code, but it obviously ignores assignment via -> and =. Also, a for loop could define new symbols too. Just imagine how complex the conditions will become if we want to cover all these cases.
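For a taste of that complexity, here is a minimal sketch of my own (the helper name defined_symbols is hypothetical and the logic is not exhaustive) that extends the data.frame approach to also cover ->, top-level = and for loops:

# a rough sketch: collect symbols defined via <-, =, -> and for loops
defined_symbols <- function(pd) {
  ops <- pd[pd$token %in% c("LEFT_ASSIGN", "EQ_ASSIGN", "RIGHT_ASSIGN"), ]
  assigned <- lapply(seq_len(nrow(ops)), function(i) {
    # the child expr nodes of the assignment, in source order
    siblings <- pd[pd$parent == ops$parent[i] & pd$token == "expr", ]
    siblings <- siblings[order(siblings$line1, siblings$col1), ]
    # for <- and =, the defined symbol sits in the first child expr;
    # for ->, it sits in the last child expr
    expr_id <- if (ops$token[i] == "RIGHT_ASSIGN") {
      siblings$id[nrow(siblings)]
    } else {
      siblings$id[1]
    }
    pd[pd$token == "SYMBOL" & pd$parent == expr_id, ]
  })
  # a for loop defines its loop variable inside the forcond node
  forcond_id <- pd[pd$token == "forcond", "id"]
  for_vars <- pd[pd$token == "SYMBOL" & pd$parent %in% forcond_id, ]
  do.call(rbind, c(assigned, list(for_vars)))
}

Applied to the pd above, defined_symbols(pd) should return the same n and y rows as before.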
The XML approach
The parse data is essentially a table representation of the abstract syntax tree generated by the R parser. XML is a much more powerful representation of such a tree structure, and we could use XPath to concisely express a wide range of logical conditions for selecting one or a set of XML nodes. Both are widely used in web development and data scraping. To learn the basics of XML, the XML Tutorial could be a helpful place to begin.
xmlparsedata is a package that generates XML text from parsed R expressions. We could use this package to transform the R expressions into XML representation and use xml2 to read the XML text and query the XML document easily with XPath expressions.
The following code attempts to reproduce what we have already done above. Let’s see how much easier XML and XPath can make the work.
First, we transform the expressions into XML text:
xml_text <- xmlparsedata::xml_parse_data(expr, pretty = TRUE)
cat(xml_text)
## <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
## <exprlist>
## <expr line1="2" col1="1" line2="2" col2="7" start="49" end="55">
## <expr line1="2" col1="1" line2="2" col2="1" start="49" end="49">
## <SYMBOL line1="2" col1="1" line2="2" col2="1" start="49" end="49">n</SYMBOL>
## </expr>
## <LEFT_ASSIGN line1="2" col1="3" line2="2" col2="4" start="51" end="52"><-</LEFT_ASSIGN>
## <expr line1="2" col1="6" line2="2" col2="7" start="54" end="55">
## <NUM_CONST line1="2" col1="6" line2="2" col2="7" start="54" end="55">10</NUM_CONST>
## </expr>
## </expr>
## <expr line1="3" col1="1" line2="3" col2="23" start="73" end="95">
## <expr line1="3" col1="1" line2="3" col2="1" start="73" end="73">
## <SYMBOL line1="3" col1="1" line2="3" col2="1" start="73" end="73">y</SYMBOL>
## </expr>
## <LEFT_ASSIGN line1="3" col1="3" line2="3" col2="4" start="75" end="76"><-</LEFT_ASSIGN>
## <expr line1="3" col1="6" line2="3" col2="23" start="78" end="95">
## <expr line1="3" col1="6" line2="3" col2="10" start="78" end="82">
## <SYMBOL_FUNCTION_CALL line1="3" col1="6" line2="3" col2="10" start="78" end="82">rnorm</SYMBOL_FUNCTION_CALL>
## </expr>
## <OP-LEFT-PAREN line1="3" col1="11" line2="3" col2="11" start="83" end="83">(</OP-LEFT-PAREN>
## <expr line1="3" col1="12" line2="3" col2="12" start="84" end="84">
## <SYMBOL line1="3" col1="12" line2="3" col2="12" start="84" end="84">n</SYMBOL>
## </expr>
## <OP-COMMA line1="3" col1="13" line2="3" col2="13" start="85" end="85">,</OP-COMMA>
## <SYMBOL_SUB line1="3" col1="15" line2="3" col2="18" start="87" end="90">mean</SYMBOL_SUB>
## <EQ_SUB line1="3" col1="20" line2="3" col2="20" start="92" end="92">=</EQ_SUB>
## <expr line1="3" col1="22" line2="3" col2="22" start="94" end="94">
## <NUM_CONST line1="3" col1="22" line2="3" col2="22" start="94" end="94">1</NUM_CONST>
## </expr>
## <OP-RIGHT-PAREN line1="3" col1="23" line2="3" col2="23" start="95" end="95">)</OP-RIGHT-PAREN>
## </expr>
## </expr>
## </exprlist>
The above output is the XML representation of the parse data. It might look verbose and not very approachable at first, but as you become more familiar with XML syntax, it is easy to see that the XML simply uses a nested structure to represent the syntax tree.
Note that the token type is the name of each XML node and the token ranges are represented by attributes such as line1, col1, line2 and col2. The token text is stored as the value of the XML node.
You might notice that no <- appears in the XML text. This is because characters like < and > are not allowed in XML node values. These characters are escaped, so <- becomes &lt;- in the value of the LEFT_ASSIGN node. For more details about XML escaping, see https://www.w3.org/TR/xml/#syntax.
Also, characters like <, >, ( and { are not allowed in XML node names. As a workaround, xml_parse_data() transforms these characters into valid XML node names, and xmlparsedata::xml_parse_token_map stores the mapping.
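To see how these special characters are mapped, we could take a quick look at the mapping itself (output omitted here):

# peek at the token-name mapping used by xmlparsedata,
# e.g. '(' is represented by the XML-safe node name OP-LEFT-PAREN as seen above
head(xmlparsedata::xml_parse_token_map)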
Next, we use xml2::read_xml() to read the XML text and get a parsed XML document that is ready for querying.
library(xml2)
xml <- read_xml(xml_text)
Then we could use XPath to express our selection conditions. For example, to select all SYMBOL nodes at any level, we use the XPath expression //SYMBOL:
syms <- xml_find_all(xml, "//SYMBOL")
syms
## {xml_nodeset (3)}
## [1] <SYMBOL line1="2" col1="1" line2="2" col2="1" start="49" end="49">n</SYMBOL>
## [2] <SYMBOL line1="3" col1="1" line2="3" col2="1" start="73" end="73">y</SYMBOL>
## [3] <SYMBOL line1="3" col1="12" line2="3" col2="12" start="84" end="84">n</SYMBOL>
Now we get all XML nodes selected by the above XPath expression. Then we use xml_text() to extract their inner texts, which are all the symbols that appear in the code.
xml_text(syms)
## [1] "n" "y" "n"
To get all symbols we defined in the code, a single XPath like the following could do the work:
xml_find_all(xml,
"//expr[LEFT_ASSIGN]/expr[1]/SYMBOL")
## {xml_nodeset (2)}
## [1] <SYMBOL line1="2" col1="1" line2="2" col2="1" start="49" end="49">n</SYMBOL>
## [2] <SYMBOL line1="3" col1="1" line2="3" col2="1" start="73" end="73">y</SYMBOL>
Basically, it selects all exprs with a child node of LEFT_ASSIGN, then selects their first child expr, and then selects the SYMBOL child node.
If anything in this chain is absent, the node won’t appear in the result set. In other words, it selects exactly the nodes that are compatible with this XPath expression, so we don’t have to worry about whether it is safe to access expr[1] and the SYMBOL in it when we write it.
In fact, XPath is so expressive that there could be alternative ways to select the same set of XML nodes in a particular XML document. In this example, the following XPath expression also works:
xml_find_all(xml,
"//expr/LEFT_ASSIGN/preceding-sibling::expr/SYMBOL")
## {xml_nodeset (2)}
## [1] <SYMBOL line1="2" col1="1" line2="2" col2="1" start="49" end="49">n</SYMBOL>
## [2] <SYMBOL line1="3" col1="1" line2="3" col2="1" start="73" end="73">y</SYMBOL>
This XPath expression selects all exprs, then their LEFT_ASSIGN children, then the preceding sibling exprs (the ones before the LEFT_ASSIGN), and finally the SYMBOL nodes inside them. All XML nodes compatible with this XPath expression appear in the result set, and the result set contains only such nodes.
The power of XPath should be obvious now: we can optionally specify predicates at any level (via []), and we can use XPath axes to specify the relationships between the context (the current node) and nodes at nested levels (via e.g. / and //) or at the same level (via e.g. preceding-sibling and following-sibling).
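As another illustration of the axes (a sketch of my own, not from the examples above), we could start from the rnorm call and walk up with the ancestor axis to the assignment expression that contains it:

# from the rnorm call node, select ancestor exprs that have a LEFT_ASSIGN child
xml_find_all(xml, "//SYMBOL_FUNCTION_CALL[text() = 'rnorm']/ancestor::expr[LEFT_ASSIGN]")

This should return the single expr node covering the whole y <- rnorm(n, mean = 1) expression.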
Now consider this problem: find the symbols we defined with an expression in which a function is called.
In the parsed R expressions, the first expression n <- 10 obviously does not satisfy this condition since no function call is involved in the calculation of its value 10. By contrast, the second expression y <- rnorm(n, mean = 1) does satisfy the condition, as rnorm() is the function call we are looking for.
How could we translate such a request into an XPath expression? The following is an example that works:
xml_find_all(xml,
"//expr/LEFT_ASSIGN[following-sibling::expr//SYMBOL_FUNCTION_CALL]
/preceding-sibling::expr[1]/SYMBOL")
## {xml_nodeset (1)}
## [1] <SYMBOL line1="3" col1="1" line2="3" col2="1" start="73" end="73">y</SYMBOL>
Now we have an impression of how static R code analysis can be done by working with the parse data, represented either as a data.frame or as an XML document.
The following are some more practical pieces of code I wrote using XML and XPath to do code analysis:
- missing_package_linter added to lintr to detect library() calls with packages that are not installed.
- namespace_linter added to lintr to detect invalid namespaces and exported/non-exported objects accessed via :: and :::.
- Go-to-definition implemented in languageserver to find the definition of a symbol with scoping rules taken into account.
- Scope completion implemented in languageserver to find the symbols that are accessible at a particular cursor position with scoping rules taken into account.
With some basic knowledge of static analysis of R code, we can write our own linters that check the code style or quality according to our needs or standards.
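As a small illustration, here is a sketch of a toy checker of my own (the function find_t_f and its behavior are assumptions for illustration, not one of the linters above) that flags uses of T and F instead of TRUE and FALSE in a string of R code:

# a toy checker: report locations where T or F is used instead of TRUE/FALSE
library(xml2)

find_t_f <- function(code) {
  expr <- parse(text = code, keep.source = TRUE)
  xml <- read_xml(xmlparsedata::xml_parse_data(expr, pretty = TRUE))
  # T and F are parsed as SYMBOL tokens, unlike the constants TRUE and FALSE
  hits <- xml_find_all(xml, "//SYMBOL[text() = 'T' or text() = 'F']")
  data.frame(
    text = xml_text(hits),
    line = as.integer(xml_attr(hits, "line1")),
    col = as.integer(xml_attr(hits, "col1"))
  )
}

find_t_f("x <- c(T, FALSE)\nif (F) message('skipped')")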
If you are interested in XML and XPath, I recommend that you take a look at the following tutorials or online tools: