In both research and application, we need to manipulate data frames by selecting desired columns, filtering records, transforming and aggregating data.
R provides built-in functions for data frame manipulation. Suppose df
is the data frame we are dealing with. We use df[1:100,]
to select the first 100 rows, df[,c("price","volume")]
to select price
and volume
columns, df[df$price >= mean(df$price),]
to single out records with prices no less than their average, transform(df, totalValue=price*volume)
to add a new column totalValue
for each record, apply(df,2,mean)
to calculate the mean of each column.
However, if we want to do something more, together, the R code will be totally a mess. Say we want to sort df
by a new column totalValue
, which equals price
times volume
, and then average the price
and totalValue
columns for the top 20 records. The R code, if written in several lines, can be this:
df$totalValue <- df$price * df$volume
df.sorted <- df[order(df$totalValue,decreasing=T),]
df.subset <- df.sorted[1:20,c("price","totalValue")]
apply(df.subset,2,mean)
The above code introduces several intermediary variables. If we want to do more with the built-in functions, the code will quickly be unreadable.
In the realm of database, people already have a decent solution to regular data manipulation. It is a language called SQL (Structured Query Language). You can write SQL as English-like statements to query the database in order to finish all tasks mentioned above, including doing much more complex tasks, together, even in one line. SQL boosts how people interact with data. If you are not familiar with it, you may learn some essentials here.
sqldf
package provides an interface between SQLite memory database and R through SQL. You even don’t need to know database stuff since this package makes everything happen in the back-stage. With this package, you only need to call sqldf()
to query a data frame in the current environment. Here you only need to know some basic SQL to begin with.
For example, if you need to select price
and volume
columns, the SQL statement should be:
SELECT price, volume FROM df
The statement looks very much like an English sentence. It’s true, and it is exactly one of the purposes for which SQL was designed. In R, sqldf
allows us to directly query df
by calling sqldf()
function:
sqldf("SELECT price, volume FROM df")
To introduce a new colume totalValue
, the SQL statement should be:
SELECT *, price * volume AS totalValue FROM df
In fact, SQL keywords (e.g. SELECT
, AS
, FROM
) are not case sensitive. You can write lower capital counterparts instead. However, R variable names are case sensitive, so you need to be careful with the cases of data frame columns.
Filtering can be very easy too. Here we use WHERE
to select records where totalValue
is no less than 3000.
SELECT *, price * volume AS totalValue FROM df WHERE totalValue >= 3000
Sorting can also be simple. Here we use ORDER BY
to sort the records by totalValue
in a descending way.
SELECT *, price * volume AS totalValue FROM df ORDER BY totalValue DESC
The code for subsetting a table is also intuitive. Here we use LIMIT
to select only the top 30 records with the highest totalValue
.
SELECT *, price * volume AS totalValue
FROM df
ORDER BY totalValue DESC
LIMIT 30
Note that we break the lines to make the statement clear. It works perfectly in the same way as a statement without line breaks.
The power of SQL may not be very clear yet, unless we combine them together. For example, if we want to finish all the tasks in the first paragraph in one SQL statement, here it is:
SELECT AVG(price), AVG(totalValue)
FROM
(SELECT *, price * volume AS totalValue
FROM df
ORDER BY totalValue DESC
LIMIT 20)
Here we embed a SQL statement inside another.
Another example is to select the top 100 records ordered by totalValue
in descending way where their prices are no less than the average price.
SELECT *, price * volume AS totalValue
FROM df
WHERE price >= (SELECT AVG(price) FROM df)
ORDER BY totalValue DESC
LIMIT 100
If you are familiar with SQL, the statement above is almost as friendly as plain English, and it does not matter whether we write it in one line or in several lines. Here we separate the different clauses in the statement for greater readability.
You may try to implement it only by built-in R functions and you will certainly find SQL a very powerful tool. Here I should remark that sqldf
is based on SQLite memory database and provides its select
functionality. Since different database engines support the standard of SQL to a different degree, we are only allowed to use the SQL-SELECT statements within the support of SQLite database engine. You may get more information here.
In conclusion, SQL is a powerful tool so that R users should pick it up. And sqldf
is the way we use this language with R to operate data frame in a more decent way. sqldf
is listed on CRAN and is hosted by GitHub. Its official website offers a comprehensive tutorial and frequently asked questions.