Using parse data to analyze R code - Kun Ren's Blog Posts

R’s meta-programming capabilities are one of my favorite aspects of R. We can alter a parsed expression like an ordinary object and evaluate it in a customized environment. This is the so-called non-standard evaluation, and it makes R flexible and powerful in data wrangling. However, things can easily go wrong if the standard and non-standard scoping are not carefully designed. Instead of manipulating expressions and scoping rules in fancy ways, we can analyze R code and see if there are potential issues before running it.

lintr is a good example of useful static code analysis. It provides a collection of linters. Each linter gives suggestions on a certain aspect of code style or catches unsafe or discouraged usages of some kind. The linters are mostly based on analyzing the parse data of user code.

languageserver is another example that heavily relies on static code analysis to provide code completions, function signatures, go-to-definition, finding references, renaming symbols, etc. for code editors that support the Language Server Protocol as introduced in my previous article.

This article is a brief introduction to code analysis based on parse data. To begin with, we consider a simple example:

x <- 1

The above R expression is a simple assignment that assigns a numeric vector (a single number 1) to symbol x. To look into this expression, we could use parse() and we will get an R expression object, which is essentially a list of parsed R expressions.

expr <- parse(text = "x <- 1")
expr

## expression(x <- 1)

If we take a closer look at this expr object, we could find that it has some attributes including some link between the parsed expression and the underlying text.

str(expr)

## length 1 expression(x <- 1)
##  - attr(*, "srcref")=List of 1
##   ..$ : 'srcref' int [1:8] 1 1 1 6 1 6 1 1
##   .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f81df2a7500> 
##  - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f81df2a7500> 
##  - attr(*, "wholeSrcref")= 'srcref' int [1:8] 1 0 2 0 0 0 1 2
##   ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f81df2a7500>

In fact, these attributes are the ranges (start line, start column, end line, end column, etc.) in text of those parsed tokens. To have a better organized view of the structure of the expression in terms of those ranges of tokens, we could use getParseData() to obtain a data.frame which is easier to read.

pd <- getParseData(expr)
pd

##   line1 col1 line2 col2 id parent       token terminal text
## 7     1    1     1    6  7      0        expr    FALSE     
## 1     1    1     1    1  1      3      SYMBOL     TRUE    x
## 3     1    1     1    1  3      7        expr    FALSE     
## 2     1    3     1    4  2      7 LEFT_ASSIGN     TRUE   <-
## 4     1    6     1    6  4      5   NUM_CONST     TRUE    1
## 5     1    6     1    6  5      7        expr    FALSE

The above data.frame displays the structure of the whole expression. As its name suggests, it is the parse data of the expression which provides the following information:

The range of token: line1, col1, line2 and col2.
The parent of token: If multiple tokens have the same parent, they belong to the same expression.
The type of the token: Each token has a type recognized by the parser.
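As a rough sketch of how the parent column ties tokens together, the snippet below lists the direct children of the top-level expression and recovers its source text with utils::getParseText(). Note that keep.source = TRUE is passed explicitly: in a non-interactive session (e.g. Rscript) the keep.source option is FALSE by default, and getParseData() would then return NULL. The variable names root_id and children are only illustrative.

```r
# Parse with source references kept, so that parse data is available
# even in a non-interactive session.
expr <- parse(text = "x <- 1", keep.source = TRUE)
pd <- getParseData(expr)

# The top-level expression is the row whose parent is 0.
root_id <- pd$id[pd$parent == 0]

# Its direct children are the rows whose parent is root_id.
children <- pd[pd$parent == root_id, ]
children$token
# "expr" "LEFT_ASSIGN" "expr"

# Non-terminal rows have empty text; getParseText() rebuilds it from the source.
getParseText(pd, root_id)
# "x <- 1"
```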

To extract all SYMBOL tokens, we could subset the parse data pd using pd$token:

pd[pd$token == "SYMBOL", ]

##   line1 col1 line2 col2 id parent  token terminal text
## 1     1    1     1    1  1      3 SYMBOL     TRUE    x

Following is an example of two expressions.

n <- 10
y <- rnorm(n, mean = 1)

It has two assignment expressions: the first line is a simple assignment and the second line involves a function call with a symbol and a named argument. Again we use parse() to parse the text into parsed expressions and use getParseData() to obtain a well-organized table of parse data.

expr <- parse(text = "
n <- 10
y <- rnorm(n, mean = 1)
")
pd <- getParseData(expr)
pd

##    line1 col1 line2 col2 id parent                token terminal  text
## 9      2    1     2    7  9      0                 expr    FALSE      
## 3      2    1     2    1  3      5               SYMBOL     TRUE     n
## 5      2    1     2    1  5      9                 expr    FALSE      
## 4      2    3     2    4  4      9          LEFT_ASSIGN     TRUE    <-
## 6      2    6     2    7  6      7            NUM_CONST     TRUE    10
## 7      2    6     2    7  7      9                 expr    FALSE      
## 34     3    1     3   23 34      0                 expr    FALSE      
## 12     3    1     3    1 12     14               SYMBOL     TRUE     y
## 14     3    1     3    1 14     34                 expr    FALSE      
## 13     3    3     3    4 13     34          LEFT_ASSIGN     TRUE    <-
## 32     3    6     3   23 32     34                 expr    FALSE      
## 15     3    6     3   10 15     17 SYMBOL_FUNCTION_CALL     TRUE rnorm
## 17     3    6     3   10 17     32                 expr    FALSE      
## 16     3   11     3   11 16     32                  '('     TRUE     (
## 18     3   12     3   12 18     20               SYMBOL     TRUE     n
## 20     3   12     3   12 20     32                 expr    FALSE      
## 19     3   13     3   13 19     32                  ','     TRUE     ,
## 24     3   15     3   18 24     32           SYMBOL_SUB     TRUE  mean
## 25     3   20     3   20 25     32               EQ_SUB     TRUE     =
## 26     3   22     3   22 26     27            NUM_CONST     TRUE     1
## 27     3   22     3   22 27     32                 expr    FALSE      
## 28     3   23     3   23 28     32                  ')'     TRUE     )

Now the parse data looks much richer, as the data frame above has many more rows than in the previous example. As you could imagine, a typical R script with hundreds of lines of code could produce thousands of rows of such parse data.

Again, we could extract all SYMBOL tokens using the same method:

pd[pd$token == "SYMBOL", ]

##    line1 col1 line2 col2 id parent  token terminal text
## 3      2    1     2    1  3      5 SYMBOL     TRUE    n
## 12     3    1     3    1 12     14 SYMBOL     TRUE    y
## 18     3   12     3   12 18     20 SYMBOL     TRUE    n

Now we get three SYMBOL tokens, each used in a different location and playing a different role.

Similarly, if we want to know which functions we call in the parsed expressions, we could simply extract all SYMBOL_FUNCTION_CALL tokens from the parse data:

pd[pd$token == "SYMBOL_FUNCTION_CALL", ]

##    line1 col1 line2 col2 id parent                token terminal  text
## 15     3    6     3   10 15     17 SYMBOL_FUNCTION_CALL     TRUE rnorm
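The same subsetting can be turned into a quick tally of which functions a script calls and how often. The following is a minimal sketch using only base R; the extra line assigning z is added here purely for illustration and is not part of the example above.

```r
code <- "
n <- 10
y <- rnorm(n, mean = 1)
z <- mean(c(y, rnorm(5)))
"
# keep.source = TRUE makes sure parse data is available outside interactive use
pd <- getParseData(parse(text = code, keep.source = TRUE))

# Names of all called functions, in order of appearance
calls <- pd$text[pd$token == "SYMBOL_FUNCTION_CALL"]
calls
# "rnorm" "mean" "c" "rnorm"

# Count how often each function is called
call_counts <- table(calls)
call_counts[["rnorm"]]
# 2
```

Note that the argument name mean in rnorm(n, mean = 1) is a SYMBOL_SUB token, so it is not counted as a call to mean().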

The above examples might be the simplest use of the parse data. In practical static code analysis, we usually need to do something more complex than that. For example, what if we want to know all symbols we defined via <- assignment operators?

This sounds less straightforward since it cannot be done with a simple subset of the parse data. Instead, we might be able to do it with multiple steps:

Find all the <- assignment operators.
Find the SYMBOL on the left of each <-.

But what does “on the left” mean?

It looks like we have to translate these operations into the logical relationships between the tokens in terms of their ranges and parent.

Find all the <- assignment operators (LEFT_ASSIGN as the token type).
For each <-:
1. Look at its parent to find which expr it belongs to
2. Take the first child expr of that parent and find the SYMBOL inside it

The following code is an attempt to find all assignment expressions:

assign_id <- pd[pd$token == "LEFT_ASSIGN", "parent"]
lapply(assign_id, function(id) {
  pd[pd$parent == id, ]
})

## [[1]]
##   line1 col1 line2 col2 id parent       token terminal text
## 5     2    1     2    1  5      9        expr    FALSE     
## 4     2    3     2    4  4      9 LEFT_ASSIGN     TRUE   <-
## 7     2    6     2    7  7      9        expr    FALSE     
## 
## [[2]]
##    line1 col1 line2 col2 id parent       token terminal text
## 14     3    1     3    1 14     34        expr    FALSE     
## 13     3    3     3    4 13     34 LEFT_ASSIGN     TRUE   <-
## 32     3    6     3   23 32     34        expr    FALSE

To extract the LHS symbol in these assignment expressions, we could find the id of the first expr in each assignment expression, and see which SYMBOL has a parent of that id.

lapply(assign_id, function(id) {
  expr_id <- pd[pd$parent == id, "id"][1]
  pd[pd$token == "SYMBOL" & pd$parent == expr_id, ]
})

## [[1]]
##   line1 col1 line2 col2 id parent  token terminal text
## 3     2    1     2    1  3      5 SYMBOL     TRUE    n
## 
## [[2]]
##    line1 col1 line2 col2 id parent  token terminal text
## 12     3    1     3    1 12     14 SYMBOL     TRUE    y

Now we have all the SYMBOL tokens we define in the parsed expressions. The above method should work with most code, but it ignores assignment via -> and =. Also, a for loop defines new symbols too. Just imagine how complex the conditions would be if we wanted to cover all these cases.
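To give a feeling for how the conditions grow, here is a sketch that extends the data.frame approach to = and ->. It assumes the token names EQ_ASSIGN and RIGHT_ASSIGN, which is what recent R versions emit for these operators, and the helper name defined_symbols is made up for this illustration. It still ignores for loops and other ways of defining symbols.

```r
defined_symbols <- function(code) {
  pd <- getParseData(parse(text = code, keep.source = TRUE))
  ops <- pd[pd$token %in% c("LEFT_ASSIGN", "EQ_ASSIGN", "RIGHT_ASSIGN"), ]
  out <- character()
  for (i in seq_len(nrow(ops))) {
    # All direct children of the expression that contains the operator
    siblings <- pd[pd$parent == ops$parent[i], ]
    # The assigned target is the first child for <- and =, the last one for ->
    target <- if (ops$token[i] == "RIGHT_ASSIGN") {
      siblings[nrow(siblings), ]
    } else {
      siblings[1, ]
    }
    # Keep the target only if it is a plain symbol (not x[1], names(x), ...)
    sym <- pd[pd$parent == target$id & pd$token == "SYMBOL", "text"]
    out <- c(out, sym)
  }
  out
}

defined_symbols("a <- 1\nb = 2\n3 -> c")
# "a" "b" "c"
```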

The XML approach

The parse data is essentially a table representation of an abstract syntax tree generated by the R parser. XML is a much more powerful representation of such tree structure and we could use XPath to concisely represent a wide range of logical conditions to select one or a set of XML nodes. It is widely used in web development and data scraping. To know more basics about XML, the XML Tutorial could be helpful to begin with.

xmlparsedata is a package that generates XML text from parsed R expressions. We could use this package to transform the R expressions into XML representation and use xml2 to read the XML text and query the XML document easily with XPath expressions.

The following code examples attempt to reproduce what we have already done above. Let’s see how much easier XML and XPath could make the work.

First, we transform the expressions into XML text:

xml_text <- xmlparsedata::xml_parse_data(expr, pretty = TRUE)
cat(xml_text)

## <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
## <exprlist>
##   <expr line1="2" col1="1" line2="2" col2="7" start="49" end="55">
##     <expr line1="2" col1="1" line2="2" col2="1" start="49" end="49">
##       <SYMBOL line1="2" col1="1" line2="2" col2="1" start="49" end="49">n</SYMBOL>
##     </expr>
##     <LEFT_ASSIGN line1="2" col1="3" line2="2" col2="4" start="51" end="52">&lt;-</LEFT_ASSIGN>
##     <expr line1="2" col1="6" line2="2" col2="7" start="54" end="55">
##       <NUM_CONST line1="2" col1="6" line2="2" col2="7" start="54" end="55">10</NUM_CONST>
##     </expr>
##   </expr>
##   <expr line1="3" col1="1" line2="3" col2="23" start="73" end="95">
##     <expr line1="3" col1="1" line2="3" col2="1" start="73" end="73">
##       <SYMBOL line1="3" col1="1" line2="3" col2="1" start="73" end="73">y</SYMBOL>
##     </expr>
##     <LEFT_ASSIGN line1="3" col1="3" line2="3" col2="4" start="75" end="76">&lt;-</LEFT_ASSIGN>
##     <expr line1="3" col1="6" line2="3" col2="23" start="78" end="95">
##       <expr line1="3" col1="6" line2="3" col2="10" start="78" end="82">
##         <SYMBOL_FUNCTION_CALL line1="3" col1="6" line2="3" col2="10" start="78" end="82">rnorm</SYMBOL_FUNCTION_CALL>
##       </expr>
##       <OP-LEFT-PAREN line1="3" col1="11" line2="3" col2="11" start="83" end="83">(</OP-LEFT-PAREN>
##       <expr line1="3" col1="12" line2="3" col2="12" start="84" end="84">
##         <SYMBOL line1="3" col1="12" line2="3" col2="12" start="84" end="84">n</SYMBOL>
##       </expr>
##       <OP-COMMA line1="3" col1="13" line2="3" col2="13" start="85" end="85">,</OP-COMMA>
##       <SYMBOL_SUB line1="3" col1="15" line2="3" col2="18" start="87" end="90">mean</SYMBOL_SUB>
##       <EQ_SUB line1="3" col1="20" line2="3" col2="20" start="92" end="92">=</EQ_SUB>
##       <expr line1="3" col1="22" line2="3" col2="22" start="94" end="94">
##         <NUM_CONST line1="3" col1="22" line2="3" col2="22" start="94" end="94">1</NUM_CONST>
##       </expr>
##       <OP-RIGHT-PAREN line1="3" col1="23" line2="3" col2="23" start="95" end="95">)</OP-RIGHT-PAREN>
##     </expr>
##   </expr>
## </exprlist>

The above output is the XML representation of the parse data. It might look verbose and not straightforward at first, but once you are more familiar with XML syntax, it is easy to see that the XML simply uses a nested structure to represent the syntax tree.

Note that the token type is the name of the XML nodes and the token ranges are represented by several attributes such as line1, col1, line2 and col2. The token text is put as the value of the XML nodes.

You might notice that no literal <- appears in the XML text. This is because special characters such as < must be escaped in XML node values, so <- becomes &lt;- in the value of the LEFT_ASSIGN node. For more details about XML escaping, see https://www.w3.org/TR/xml/#syntax.

Also, characters like <, >, (, { are not allowed in XML node names. To work around this, xml_parse_data() maps such tokens to valid XML node names (for example, ( becomes OP-LEFT-PAREN, as seen above). xmlparsedata::xml_parse_token_map stores the mapping.
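Both points can be checked with a small experiment. This is only a sketch; the variable names xml_string, doc, assign_text and n_paren are chosen here for illustration.

```r
library(xml2)

expr <- parse(text = "x <- (1)", keep.source = TRUE)
xml_string <- xmlparsedata::xml_parse_data(expr)

# In the raw XML text the operator is escaped ...
grepl("&lt;-", xml_string, fixed = TRUE)
# TRUE

# ... but xml2 unescapes it again when we ask for the node text
doc <- read_xml(xml_string)
assign_text <- xml_text(xml_find_first(doc, "//LEFT_ASSIGN"))
assign_text
# "<-"

# The '(' token shows up under its mapped node name
n_paren <- length(xml_find_all(doc, "//OP-LEFT-PAREN"))

# Peek at the mapping from parser tokens to XML node names
head(xmlparsedata::xml_parse_token_map)
```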

Next, we use xml2::read_xml() to read the XML text and get a parsed XML document that is ready for query.

library(xml2)
xml <- read_xml(xml_text)

Then we could use XPath to represent our selection conditions. For example, to select all XML nodes of SYMBOL at any level, we use an XPath expression //SYMBOL:

syms <- xml_find_all(xml, "//SYMBOL")
syms

## {xml_nodeset (3)}
## [1] <SYMBOL line1="2" col1="1" line2="2" col2="1" start="49" end="49">n</SYMBOL>
## [2] <SYMBOL line1="3" col1="1" line2="3" col2="1" start="73" end="73">y</SYMBOL>
## [3] <SYMBOL line1="3" col1="12" line2="3" col2="12" start="84" end="84">n</SYMBOL>

Now we get all XML nodes selected by the above XPath expression. Then we use xml_text() to extract their inner texts as all the symbols that appear in the code.

xml_text(syms)

## [1] "n" "y" "n"
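Since the token ranges are stored as XML attributes, xml_attr() can bring them back into a table next to the symbol names. The following self-contained sketch rebuilds the same XML document; the data frame name sym_pos is just for illustration.

```r
library(xml2)

code <- "
n <- 10
y <- rnorm(n, mean = 1)
"
expr <- parse(text = code, keep.source = TRUE)
xml <- read_xml(xmlparsedata::xml_parse_data(expr))
syms <- xml_find_all(xml, "//SYMBOL")

# xml_attr() returns character vectors, so convert positions to integers
sym_pos <- data.frame(
  name = xml_text(syms),
  line = as.integer(xml_attr(syms, "line1")),
  col  = as.integer(xml_attr(syms, "col1"))
)
sym_pos
```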

To get all symbols we defined in the code, a single XPath like the following could do the work:

xml_find_all(xml,
  "//expr[LEFT_ASSIGN]/expr[1]/SYMBOL")

## {xml_nodeset (2)}
## [1] <SYMBOL line1="2" col1="1" line2="2" col2="1" start="49" end="49">n</SYMBOL>
## [2] <SYMBOL line1="3" col1="1" line2="3" col2="1" start="73" end="73">y</SYMBOL>

Basically, it selects all exprs that have a LEFT_ASSIGN child node, then selects the first expr child of each, then selects the SYMBOL child node. If anything in this chain is absent, the node won’t appear in the result set. In other words, it selects only the nodes that match the whole XPath expression, so we don’t have to worry about whether it is safe to access expr[1] and the SYMBOL in it when we write it.
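A small experiment illustrates how an incomplete chain simply drops out of the result. In the sketch below (the code string is made up for this illustration), only the first assignment has a plain symbol on its left-hand side; x[1] and names(x) are nested expressions with no direct SYMBOL child, so they are not selected.

```r
library(xml2)

code <- "
x <- c(1, 2)
x[1] <- 5
names(x) <- c('a', 'b')
"
expr <- parse(text = code, keep.source = TRUE)
xml <- read_xml(xmlparsedata::xml_parse_data(expr))

# Same XPath as above: only plain-symbol targets survive the whole chain
lhs <- xml_find_all(xml, "//expr[LEFT_ASSIGN]/expr[1]/SYMBOL")
xml_text(lhs)
# "x"
```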

In fact, XPath is so expressive that there could be alternative ways to select the same set of XML nodes in a particular XML document. In this example, the following XPath expression also works:

xml_find_all(xml,
  "//expr/LEFT_ASSIGN/preceding-sibling::expr/SYMBOL")

## {xml_nodeset (2)}
## [1] <SYMBOL line1="2" col1="1" line2="2" col2="1" start="49" end="49">n</SYMBOL>
## [2] <SYMBOL line1="3" col1="1" line2="3" col2="1" start="73" end="73">y</SYMBOL>

This XPath expression selects all exprs, then their LEFT_ASSIGN children, then the preceding sibling exprs (those before the LEFT_ASSIGN), and finally the SYMBOL node in each. All XML nodes that match this XPath expression will appear in the result set, and the result set contains only such nodes.

The power of XPath should be obvious now: we could optionally specify predicates at any level (via []), and we could also use XPath axes to specify the relationship between the context (current node) and nodes at nested levels (via e.g. / and //) or at the same level (via e.g. preceding-sibling, following-sibling).
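As one more illustration of predicates, the sketch below nests two of them to find the argument names used in calls to a particular function. The choice of rnorm and the variable name arg_names are only for this example.

```r
library(xml2)

code <- "
n <- 10
y <- rnorm(n, mean = 1)
"
expr <- parse(text = code, keep.source = TRUE)
xml <- read_xml(xmlparsedata::xml_parse_data(expr))

# Select call expressions whose first child expr holds the function name
# 'rnorm', then pick the named-argument tokens (SYMBOL_SUB) of those calls.
arg_names <- xml_find_all(
  xml,
  "//expr[expr[1]/SYMBOL_FUNCTION_CALL[text() = 'rnorm']]/SYMBOL_SUB"
)
xml_text(arg_names)
# "mean"
```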

Now consider this problem: Find symbols we defined with an expression where a function is called.

In the parsed R expressions, the first expression n <- 10 does not satisfy this condition since no function call is involved in the calculation of its value 10. By contrast, the second expression y <- rnorm(n, mean = 1) does satisfy the condition, as rnorm() is the function call we are looking for.

How could we translate such a request to an XPath expression? The following is an example that works:

xml_find_all(xml,
  "//expr/LEFT_ASSIGN[following-sibling::expr//SYMBOL_FUNCTION_CALL]
  /preceding-sibling::expr[1]/SYMBOL")

## {xml_nodeset (1)}
## [1] <SYMBOL line1="3" col1="1" line2="3" col2="1" start="73" end="73">y</SYMBOL>

Now we have an impression of how static R code analysis can be done by working with the parse data, represented either as a data.frame or as an XML document.

Following are some more practical pieces of code I wrote using XML and XPath to do code analysis:

missing_package_linter added to lintr to detect library() calls with packages that are not installed.
namespace_linter added to lintr to detect invalid namespace and exported/non-exported objects accessed via :: and :::.
Go-to-definition implemented in languageserver to find the definition of a symbol with scoping rules taken into account.
Scope completion implemented in languageserver to find the symbols that are accessible at a particular cursor position with scoping rules taken into account.

With some basic knowledge of static analysis of R code, we can write our own linters that check the code style or quality according to our needs or standards.
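As a toy illustration of such a check, the following base R sketch reports every top-level style use of = for assignment together with its position. It is not how lintr implements its linters; the function name lint_eq_assign and the message text are invented for this example.

```r
lint_eq_assign <- function(code) {
  pd <- getParseData(parse(text = code, keep.source = TRUE))
  # EQ_ASSIGN is assignment via =; named arguments use EQ_SUB instead
  hits <- pd[pd$token == "EQ_ASSIGN", ]
  sprintf(
    "%d:%d: use <- instead of = for assignment",
    hits$line1, hits$col1
  )
}

lint_eq_assign("a = 1\nb <- f(x = 2)")
# "1:3: use <- instead of = for assignment"
```

The named argument x = 2 is not reported, because the parser already distinguishes the two uses of = by token type.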

If you are interested in XML and XPath, I recommend that you take a look at the following tutorials or online tools: