February 15, 2014

A principle of writing robust R programs

Writing R code can be very easy. It depends on how much you want to achieve with your code and what features you want your code to support.

To test a random thought that needs some statistical evidence, you only need to casually import data, slightly transform the data to a necessary form, and perform some statistical tests and see the conclusions. You don't need to create a project, model the project structure, design the features, and implement the code from one stage to another.

However, if you are working on an R project that has the potential to grow in size and complexity, you had better keep one principle in mind while creating the first prototype. The principle is: make different components independent.

For a typical project that involves statistical computing, writing code is actually designing a machine. Although it's not a physical machine full of gears and wheels, it runs mechanically according to the logic we specify in the code. For a machine, we usually don't want it to work only once. Rather, we often want to rerun the machine with different sets of input and see what output it produces. To make it easier, it is better to identify three elementary components in your project: Logic, data, and input.

Logic is how the program runs, and it is specified in the code we write. By comparison, the logic of a machine is the way its gears and wheels interact with each other. Similarly, in a program, the logic is the way different objects interact with each other. Just as road and fuel are not built-in components of a car, data and input should not be built into the logic. In fact, a robust programming logic describes how data and input should interact with each other, and nothing more.

If this principle is implemented, that is, if logic, data, and input are separated, the program will serve as a much handier machine. For example, if you want to run the program with a different set of inputs, you only need to create a new profile of inputs and run it without changing the internal logic or the external data, just like pressing different buttons on the panel of a microwave and running it without reconnecting the wires and cables inside.

If the logic is independent of the data, it is very convenient to run it over another data set with the same structure. For example, if your code refers to column names that are specific to one data set, the code is not robust: swapping in another data set means editing the code. Instead, we should specify the column names somewhere in the input and refer to a unified set of column names in the logic. Then the code becomes much more robust.
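As a minimal sketch of this idea, suppose the input states which raw column plays which role. The names raw, column_map, and standardize_columns are hypothetical and exist only for illustration; the point is that the logic at the end touches only the unified names id and value.

```r
# Hypothetical raw data set with data-set-specific column names.
raw <- data.frame(cust_no = 1:3, amt_usd = c(10.5, 20.0, 7.25))

# Input: maps unified names (used by the logic) to raw names (used by the data).
# In practice this mapping could live in a JSON profile.
column_map <- c(id = "cust_no", value = "amt_usd")

# Select the mapped columns and rename them to the unified names.
standardize_columns <- function(data, map) {
  missing <- setdiff(map, names(data))
  if (length(missing) > 0) {
    stop("Columns not found in data: ", paste(missing, collapse = ", "))
  }
  out <- data[, unname(map), drop = FALSE]
  names(out) <- names(map)
  out
}

std <- standardize_columns(raw, column_map)

# The logic refers to the unified names only.
total <- sum(std$value)
```

Switching to a data set with different column names then only requires a new column_map, not a change to the logic.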

To make it easier to implement this principle, I recommend using JSON to encode inputs. JSON stands for JavaScript Object Notation, but here we make no use of JavaScript itself. We only borrow its syntax as an easy way to specify a set of inputs.

For example, a simple set of JSON inputs can be written like

{
    "name": "test1",
    "n": 20,
    "random": ["rnorm", "runif", "rnorm"],
    "range": [0.2, 0.8],
    "columns": ["a","b","c"]
}

The text above defines five fields: name as a string, n as an integer, random as a string vector, range as a numeric vector, and columns as another string vector.

Several packages implement JSON reading in R. I recommend the jsonlite package, which lets us read JSON text or a JSON file directly by calling fromJSON.
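As a quick sketch, assuming jsonlite is installed, the JSON shown earlier can be parsed straight from a string. The variable name json_text is only illustrative.

```r
library(jsonlite)

# JSON text held in an R string; fromJSON also accepts a file path.
json_text <- '{
  "name": "test1",
  "n": 20,
  "random": ["rnorm", "runif", "rnorm"],
  "range": [0.2, 0.8],
  "columns": ["a", "b", "c"]
}'

profile <- fromJSON(json_text)

# With the default simplification, the JSON object becomes a named list
# and arrays of primitives become atomic vectors.
str(profile)
```

The fields are then available as ordinary list elements, such as profile$n and profile$range.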

Here is a small example that illustrates how handy it is to operate a machine whose logic, inputs, and data are independent. Suppose we want to build a program that produces a data frame with a number of columns, each of which has a prespecified name and is a random numeric vector generated by different random generators. The random numbers are truncated within a certain range.

First, we write a simple JSON file (test1.json) to encode the default settings.

{
    "name": "test1",
    "n": 20,
    "random": ["rnorm", "runif", "rnorm"],
    "range": [0.2, 0.8],
    "columns": ["a","b","c"]
}

Then we write the R code as our logic to implement the idea.

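library(jsonlite)

# Load the settings from the JSON profile.
profile <- fromJSON("test1.json")

# Generate one truncated random vector per generator name.
data <- lapply(profile$random, function(fun) {
  rnd <- get(fun, mode = "function")  # look up the generator by name
  val <- rnd(n = profile$n)
  # Clamp the values to the range given in the profile.
  val[val < profile$range[1]] <- profile$range[1]
  val[val > profile$range[2]] <- profile$range[2]
  val
})

# Bind the vectors as columns and apply the configured column names.
df <- data.frame(do.call(cbind, data))
colnames(df) <- profile$columns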

Here we call the fromJSON function to load the settings, and call lapply to generate random numbers according to the specification in a robust way. The code above contains no hard-coded settings or data, so we can rerun the machine with different settings without changing a single line of code.
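One way to make rerunning even more explicit is to wrap the logic in a function that takes the profile as its only argument. This is a sketch rather than the code used in this post: generate_data is a hypothetical name, and the profile is written as a plain R list with the same fields as test1.json so that the block runs without any file.

```r
# Hypothetical helper: builds the data frame from a profile list
# (for example, the result of jsonlite::fromJSON("test1.json")).
generate_data <- function(profile) {
  columns <- lapply(profile$random, function(fun) {
    rnd <- get(fun, mode = "function")  # look up the generator by name
    val <- rnd(n = profile$n)
    # Clamp the values to [range[1], range[2]].
    pmin(pmax(val, profile$range[1]), profile$range[2])
  })
  names(columns) <- profile$columns
  as.data.frame(columns)
}

# Same fields as test1.json, written directly as an R list.
profile <- list(
  name    = "test1",
  n       = 20,
  random  = c("rnorm", "runif", "rnorm"),
  range   = c(0.2, 0.8),
  columns = c("a", "b", "c")
)

df <- generate_data(profile)

# Rerun with a different setting without touching the function.
profile$n <- 5
df_small <- generate_data(profile)
```

Here pmin and pmax do the same clamping as the two indexed assignments in the original code.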

In some situations, our program has more than one profile. These profiles may be duplicates of each other except for a subset of settings, so changing one field in every profile can be time-consuming. A decent solution is to adopt profile overriding. To proceed, we first create a default profile (default.json) that defines the template.

{
    "name": "default",
    "n": 20,
    "random": ["rnorm", "runif", "rnorm"],
    "range": [0.2, 0.8],
    "columns": ["a","b","c"]
}

Then we create another overriding profile (test2.json) that only contains updates to the default one. For example:

{
    "name": "test2",
    "n": 50,
    "range": [0.4, 0.6]
}

To make it work, we use the modifyList function to update the list created from default.json with the one created from test2.json.

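library(jsonlite)

# Start from the default profile and apply the overrides from test2.json.
profile <- modifyList(fromJSON("default.json"),
                      fromJSON("test2.json"))

# Generate one truncated random vector per generator name.
data <- lapply(profile$random, function(fun) {
  rnd <- get(fun, mode = "function")  # look up the generator by name
  val <- rnd(n = profile$n)
  # Clamp the values to the range given in the profile.
  val[val < profile$range[1]] <- profile$range[1]
  val[val > profile$range[2]] <- profile$range[2]
  val
})

# Bind the vectors as columns and apply the configured column names.
df <- data.frame(do.call(cbind, data))
colnames(df) <- profile$columns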

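modifyList is a built-in function in R (it lives in the utils package, which is attached by default). It updates a list recursively: fields present in the update replace the corresponding fields of the original, new fields are added, and nested lists are merged in the same way.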
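A small sketch of this behavior, using plain R lists instead of JSON files. The nested plot field and the seed field are hypothetical and only serve to show the recursive merge and the addition of a new field.

```r
# Default settings, including a nested list (hypothetical "plot" field).
default <- list(
  name = "default",
  n    = 20,
  plot = list(color = "black", size = 1)
)

# Overrides: change n, change one nested field, and add a new field.
override <- list(
  n    = 50,
  plot = list(color = "red"),
  seed = 1
)

profile <- modifyList(default, override)

# Inspect the merged result.
str(profile)
```

Note that plot$size survives the merge because only plot$color was overridden, while name is untouched because the override does not mention it.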

Now we can create several different sets of settings and operate the machine in a very flexible way.
