The data.table Format

The data.table format also helps shorten code when working with data.frame structures. Most importantly, data.table handles big data very efficiently. You can convert a data.frame to data.table and back if needed.

Basics

a) What is data.table?

data.table is an R package that provides an enhanced version of data.frames, which are the standard data structure for storing data in base R. In the Data section above, we already created a data.table using fread(). We can also create one using the data.table() function. Here is an example:

DT = data.table( 
  ID = c("b","b","b","a","a","c"), 
  a = 1:6, 
  b = 7:12, 
  c = 13:18 
) 
DT 
#    ID a  b  c 
# 1:  b 1  7 13 
# 2:  b 2  8 14 
# 3:  b 3  9 15 
# 4:  a 4 10 16 
# 5:  a 5 11 17 
# 6:  c 6 12 18 
class(DT$ID) 
# [1] "character" 

You can also convert existing objects to a data.table using setDT() (for data.frames and lists) and as.data.table() (for other structures); the difference is beyond the scope of this vignette, see ?setDT and ?as.data.table for more details.

Note that:

  • Unlike data.frames, columns of character type are never converted to factors by default.

  • Row numbers are printed with a : in order to visually separate the row number from the first column.

  • When the number of rows to print exceeds the global option datatable.print.nrows (default = 100), it automatically prints only the top 5 and bottom 5 rows (as can be seen in the Data section). If you've had a lot of experience with data.frames, you may have found yourself waiting around while larger tables print-and-page, sometimes seemingly endlessly. You can query the default number like so:

    getOption("datatable.print.nrows") 
  • data.table doesn't set or use row names, ever. We will see why in the "Keys and fast binary search based subset" vignette.

b) General form - in what way is a data.table enhanced?

In contrast to a data.frame, you can do a lot more than just subsetting rows and selecting columns within the frame of a data.table, i.e., within [ ... ] (NB: we might also refer to writing things inside DT[...] as "querying DT", in analogy to SQL). To understand it we will have to first look at the general form of data.table syntax, as shown below:

DT[i, j, by] 
##   R:                 i                 j        by 
## SQL:  where | order by   select | update  group by 

Users who have an SQL background might perhaps immediately relate to this syntax.

The way to read it (out loud) is:

Take DT, subset/reorder rows using i, then calculate j, grouped by by.

Let's begin by looking at i and j first - subsetting rows and operating on columns.

c) Subset rows in i

– Get all the flights with "JFK" as the origin airport in the month of June.

ans <- flights[origin == "JFK" & month == 6L] 
head(ans) 
#    year month day dep_delay arr_delay carrier origin dest air_time distance hour 
# 1: 2014     6   1        -9        -5      AA    JFK  LAX      324     2475    8 
# 2: 2014     6   1       -10       -13      AA    JFK  LAX      329     2475   12 
# 3: 2014     6   1        18        -1      AA    JFK  LAX      326     2475    7 
# 4: 2014     6   1        -6       -16      AA    JFK  LAX      320     2475   10 
# 5: 2014     6   1        -4       -45      AA    JFK  LAX      326     2475   18 
# 6: 2014     6   1        -6       -23      AA    JFK  LAX      329     2475   14 
  • Within the frame of a data.table, columns can be referred to as if they are variables, much like in SQL or Stata. Therefore, we simply refer to origin and month as if they are variables. We do not need to add the prefix flights$ each time. Nevertheless, using flights$origin and flights$month would work just fine.

  • The row indices that satisfy the condition origin == "JFK" & month == 6L are computed, and since there is nothing else left to do, all columns from flights at rows corresponding to those row indices are simply returned as a data.table.

  • A comma after the condition in i is not required. But flights[origin == "JFK" & month == 6L, ] would work just fine. In data.frames, however, the comma is necessary.

– Get the first two rows from flights.

ans <- flights[1:2] 
ans 
#    year month day dep_delay arr_delay carrier origin dest air_time distance hour 
# 1: 2014     1   1        14        13      AA    JFK  LAX      359     2475    9 
# 2: 2014     1   1        -3        13      AA    JFK  LAX      363     2475   11 
  • In this case, there is no condition. The row indices are already provided in i. We therefore return a data.table with all columns from flights at rows for those row indices.

– Sort flights first by column origin in ascending order, and then by dest in descending order:

We can use the R function order() to accomplish this.

ans <- flights[order(origin, -dest)] 
head(ans) 
#    year month day dep_delay arr_delay carrier origin dest air_time distance hour 
# 1: 2014     1   5         6        49      EV    EWR  XNA      195     1131    8 
# 2: 2014     1   6         7        13      EV    EWR  XNA      190     1131    8 
# 3: 2014     1   7        -6       -13      EV    EWR  XNA      179     1131    8 
# 4: 2014     1   8        -7       -12      EV    EWR  XNA      184     1131    8 
# 5: 2014     1   9        16         7      EV    EWR  XNA      181     1131    8 
# 6: 2014     1  13        66        66      EV    EWR  XNA      188     1131    9 

order() is internally optimised

  • We can use "-" on a character columns within the frame of a data.table to sort in decreasing order.

  • In addition, order(...) within the frame of a data.table uses data.table's internal fast radix order forder(). This sort provided such a compelling improvement over R's base::order that the R project adopted the data.table algorithm as its default sort in 2016 for R 3.3.0, see ?sort and the R Release NEWS.

We will discuss data.table's fast order in more detail in the data.table internals vignette.


d) Select column(s) in j

– Select arr_delay column, but return it as a vector.

ans <- flights[, arr_delay] 
head(ans) 
# [1]  13  13   9 -26   1   0 
  • Since columns can be referred to as if they are variables within the frame of data.tables, we directly refer to the variable we want to subset. Since we want all the rows, we simply skip i.

  • It returns all the rows for the column arr_delay.

– Select arr_delay column, but return as a data.table instead.

ans <- flights[, list(arr_delay)] 
head(ans) 
#    arr_delay 
# 1:        13 
# 2:        13 
# 3:         9 
# 4:       -26 
# 5:         1 
# 6:         0 
  • We wrap the variables (column names) within list(), which ensures that a data.table is returned. In case of a single column name, not wrapping with list() returns a vector instead, as seen in the previous example.

  • data.table also allows wrapping columns with .() instead of list(). It is an alias to list(); they both mean the same. Feel free to use whichever you prefer; we have noticed most users seem to prefer .() for conciseness, so we will continue to use .() hereafter.

data.tables (and data.frames) are internally lists as well, with the stipulation that each element has the same length and the list has a class attribute. Allowing j to return a list enables converting and returning data.table very efficiently.

Tip:

As long as j-expression returns a list, each element of the list will be converted to a column in the resulting data.table. This makes j quite powerful, as we will see shortly. It is also very important to understand this for when you'd like to make more complicated queries!!

– Select both arr_delay and dep_delay columns.

ans <- flights[, .(arr_delay, dep_delay)] 
head(ans) 
#    arr_delay dep_delay 
# 1:        13        14 
# 2:        13        -3 
# 3:         9         2 
# 4:       -26        -8 
# 5:         1         2 
# 6:         0         4 
## alternatively 
# ans <- flights[, list(arr_delay, dep_delay)] 
  • Wrap both columns within .(), or list(). That's it.

– Select both arr_delay and dep_delay columns and rename them to delay_arr and delay_dep.

Since .() is just an alias for list(), we can name columns as we would while creating a list.

ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)] 
head(ans) 
#    delay_arr delay_dep 
# 1:        13        14 
# 2:        13        -3 
# 3:         9         2 
# 4:       -26        -8 
# 5:         1         2 
# 6:         0         4 

That's it.


e) Compute or do in j

– How many trips have had total delay < 0?

ans <- flights[, sum( (arr_delay + dep_delay) < 0 )] 
ans 
# [1] 141814 

What's happening here?

  • data.table's j can handle more than just selecting columns - it can handle expressions, i.e., computing on columns. This shouldn't be surprising, as columns can be referred to as if they are variables. Then we should be able to compute by calling functions on those variables. And that's what precisely happens here.


f) Subset in i and do in j

– Calculate the average arrival and departure delay for all flights with "JFK" as the origin airport in the month of June.

ans <- flights[origin == "JFK" & month == 6L, 
               .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))] 
ans 
#       m_arr    m_dep 
# 1: 5.839349 9.807884 
  • We first subset in i to find matching row indices where origin airport equals "JFK", and month equals 6L. We do not subset the entire data.table corresponding to those rows yet.

  • Now, we look at j and find that it uses only two columns. And what we have to do is to compute their mean(). Therefore we subset just those columns corresponding to the matching rows, and compute their mean().

Because the three main components of the query (i, j and by) are together inside [...], data.table can see all three and optimise the query altogether before evaluation, not each separately. We are able to therefore avoid the entire subset (i.e., subsetting the columns besides arr_delay and dep_delay), for both speed and memory efficiency.

– How many trips have been made in 2014 from "JFK" airport in the month of June?

ans <- flights[origin == "JFK" & month == 6L, length(dest)] 
ans 
# [1] 8422 

The function length() requires an input argument. We just needed to compute the number of rows in the subset. We could have used any other column as input argument to length() really. This approach is reminiscent of SELECT COUNT(dest) FROM flights WHERE origin = 'JFK' AND month = 6 in SQL.

This type of operation occurs quite frequently, especially while grouping (as we will see in the next section), to the point where data.table provides a special symbol .N for it.

Special symbol .N:

.N is a special built-in variable that holds the number of observations in the current group. It is particularly useful when combined with by as we'll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset.

So we can now accomplish the same task by using .N as follows:

ans <- flights[origin == "JFK" & month == 6L, .N] 
ans 
# [1] 8422 
  • Once again, we subset in i to get the row indices where origin airport equals "JFK", and month equals 6.

  • We see that j uses only .N and no other columns. Therefore the entire subset is not materialised. We simply return the number of rows in the subset (which is just the length of row indices).

  • Note that we did not wrap .N with list() or .(). Therefore a vector is returned.

We could have accomplished the same operation by doing nrow(flights[origin == "JFK" & month == 6L]). However, it would have to subset the entire data.table first corresponding to the row indices in i and then return the rows using nrow(), which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the data.table design vignette.


g) Great! But how can I refer to columns by names in j (like in a data.frame)?

If you're writing out the column names explicitly, there's no difference vis-a-vis data.frame (since v1.9.8).

– Select both arr_delay and dep_delay columns the data.frame way.

ans <- flights[, c("arr_delay", "dep_delay")] 
head(ans) 
#    arr_delay dep_delay 
# 1:        13        14 
# 2:        13        -3 
# 3:         9         2 
# 4:       -26        -8 
# 5:         1         2 
# 6:         0         4 

If you've stored the desired columns in a character vector, there are two options: Using the .. prefix, or using the with argument.

– Select columns named in a variable using the .. prefix

select_cols = c("arr_delay", "dep_delay") 
flights[ , ..select_cols] 
#         arr_delay dep_delay 
#      1:        13        14 
#      2:        13        -3 
#      3:         9         2 
#      4:       -26        -8 
#      5:         1         2 
#     ---                     
# 253312:       -30         1 
# 253313:       -14        -5 
# 253314:        16        -8 
# 253315:        15        -4 
# 253316:         1        -5 

For those familiar with the Unix terminal, the .. prefix should be reminiscent of the "up-one-level" command, which is analogous to what's happening here – the .. signals to data.table to look for the select_cols variable "up-one-level", i.e., in the global environment in this case.

– Select columns named in a variable using with = FALSE

flights[ , select_cols, with = FALSE] 
#         arr_delay dep_delay 
#      1:        13        14 
#      2:        13        -3 
#      3:         9         2 
#      4:       -26        -8 
#      5:         1         2 
#     ---                     
# 253312:       -30         1 
# 253313:       -14        -5 
# 253314:        16        -8 
# 253315:        15        -4 
# 253316:         1        -5 

The argument is named with after the R function with() because of similar functionality. Suppose you have a data.frame DF and you'd like to subset all rows where x > 1. In base R you can do the following:

DF = data.frame(x = c(1,1,1,2,2,3,3,3), y = 1:8) 
 
## (1) normal way 
DF[DF$x > 1, ] # data.frame needs that ',' as well 
#   x y 
# 4 2 4 
# 5 2 5 
# 6 3 6 
# 7 3 7 
# 8 3 8 
## (2) using with 
DF[with(DF, x > 1), ] 
#   x y 
# 4 2 4 
# 5 2 5 
# 6 3 6 
# 7 3 7 
# 8 3 8 
  • Using with() in (2) allows using DF's column x as if it were a variable.

    Hence the argument name with in data.table. Setting with = FALSE disables the ability to refer to columns as if they are variables, thereby restoring the "data.frame mode".

  • We can also deselect columns using - or !. For example:

    ## not run 
    # returns all columns except arr_delay and dep_delay 
    ans <- flights[, !c("arr_delay", "dep_delay")] 
    # or 
    ans <- flights[, -c("arr_delay", "dep_delay")] 
  • From v1.9.5+, we can also select by specifying start and end column names, e.g., year:day to select the first three columns.

    ## not run 
    # returns year,month and day 
    ans <- flights[, year:day] 
    # returns day, month and year 
    ans <- flights[, day:year] 
    # returns all columns except year, month and day 
    ans <- flights[, -(year:day)] 
    ans <- flights[, !(year:day)] 

    This is particularly handy while working interactively.

with = TRUE is the default in data.table because we can do much more by allowing j to handle expressions - especially when combined with by, as we'll see in a moment.