The data.table Format
The data.table format also helps shorten code when working with data.frame structures. Most importantly, data.table handles big data very efficiently. You can convert a data.frame to data.table and back if needed.
Basics
a) What is data.table
?
data.table
is an R package that provides an enhanced version of data.frame
s, which are the standard data structure for storing data in base
R. In the Data section above, we already created a data.table
using fread()
. We can also create one using the data.table()
function. Here is an example:
DT = data.table( ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18 ) DT # ID a b c # 1: b 1 7 13 # 2: b 2 8 14 # 3: b 3 9 15 # 4: a 4 10 16 # 5: a 5 11 17 # 6: c 6 12 18 class(DT$ID) # [1] "character"
You can also convert existing objects to a data.table
using setDT()
(for data.frame
s and list
s) and as.data.table()
(for other structures); the difference is beyond the scope of this vignette, see ?setDT
and ?as.data.table
for more details.
Note that:
-
Unlike
data.frame
s, columns ofcharacter
type are never converted tofactors
by default. -
Row numbers are printed with a
:
in order to visually separate the row number from the first column. -
When the number of rows to print exceeds the global option
datatable.print.nrows
(default = 100), it automatically prints only the top 5 and bottom 5 rows (as can be seen in the Data section). If you've had a lot of experience withdata.frame
s, you may have found yourself waiting around while larger tables print-and-page, sometimes seemingly endlessly. You can query the default number like so:getOption("datatable.print.nrows")
-
data.table
doesn't set or use row names, ever. We will see why in the "Keys and fast binary search based subset" vignette.
b) General form - in what way is a data.table
enhanced?
In contrast to a data.frame
, you can do a lot more than just subsetting rows and selecting columns within the frame of a data.table
, i.e., within [ ... ]
(NB: we might also refer to writing things inside DT[...]
as "querying DT
", in analogy to SQL). To understand it we will have to first look at the general form of data.table
syntax, as shown below:
DT[i, j, by] ## R: i j by ## SQL: where | order by select | update group by
Users who have an SQL background might perhaps immediately relate to this syntax.
The way to read it (out loud) is:
Take DT
, subset/reorder rows using i
, then calculate j
, grouped by by
.
Let's begin by looking at i
and j
first - subsetting rows and operating on columns.
c) Subset rows in i
– Get all the flights with "JFK" as the origin airport in the month of June.
ans <- flights[origin == "JFK" & month == 6L] head(ans) # year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 6 1 -9 -5 AA JFK LAX 324 2475 8 # 2: 2014 6 1 -10 -13 AA JFK LAX 329 2475 12 # 3: 2014 6 1 18 -1 AA JFK LAX 326 2475 7 # 4: 2014 6 1 -6 -16 AA JFK LAX 320 2475 10 # 5: 2014 6 1 -4 -45 AA JFK LAX 326 2475 18 # 6: 2014 6 1 -6 -23 AA JFK LAX 329 2475 14
-
Within the frame of a
data.table
, columns can be referred to as if they are variables, much like in SQL or Stata. Therefore, we simply refer toorigin
andmonth
as if they are variables. We do not need to add the prefixflights$
each time. Nevertheless, usingflights$origin
andflights$month
would work just fine. -
The row indices that satisfy the condition
origin == "JFK" & month == 6L
are computed, and since there is nothing else left to do, all columns fromflights
at rows corresponding to those row indices are simply returned as adata.table
. -
A comma after the condition in
i
is not required. Butflights[origin == "JFK" & month == 6L, ]
would work just fine. Indata.frame
s, however, the comma is necessary.
– Get the first two rows from flights
.
ans <- flights[1:2] ans # year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 # 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
- In this case, there is no condition. The row indices are already provided in
i
. We therefore return adata.table
with all columns fromflights
at rows for those row indices.
– Sort flights
first by column origin
in ascending order, and then by dest
in descending order:
We can use the R function order()
to accomplish this.
ans <- flights[order(origin, -dest)] head(ans) # year month day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 2014 1 5 6 49 EV EWR XNA 195 1131 8 # 2: 2014 1 6 7 13 EV EWR XNA 190 1131 8 # 3: 2014 1 7 -6 -13 EV EWR XNA 179 1131 8 # 4: 2014 1 8 -7 -12 EV EWR XNA 184 1131 8 # 5: 2014 1 9 16 7 EV EWR XNA 181 1131 8 # 6: 2014 1 13 66 66 EV EWR XNA 188 1131 9
order()
is internally optimised
-
We can use "-" on a
character
columns within the frame of adata.table
to sort in decreasing order. -
In addition,
order(...)
within the frame of adata.table
usesdata.table
's internal fast radix orderforder()
. This sort provided such a compelling improvement over R'sbase::order
that the R project adopted thedata.table
algorithm as its default sort in 2016 for R 3.3.0, see?sort
and the R Release NEWS.
We will discuss data.table
's fast order in more detail in the data.table
internals vignette.
d) Select column(s) in j
– Select arr_delay
column, but return it as a vector.
ans <- flights[, arr_delay] head(ans) # [1] 13 13 9 -26 1 0
-
Since columns can be referred to as if they are variables within the frame of
data.table
s, we directly refer to the variable we want to subset. Since we want all the rows, we simply skipi
. -
It returns all the rows for the column
arr_delay
.
– Select arr_delay
column, but return as a data.table
instead.
ans <- flights[, list(arr_delay)] head(ans) # arr_delay # 1: 13 # 2: 13 # 3: 9 # 4: -26 # 5: 1 # 6: 0
-
We wrap the variables (column names) within
list()
, which ensures that adata.table
is returned. In case of a single column name, not wrapping withlist()
returns a vector instead, as seen in the previous example. -
data.table
also allows wrapping columns with.()
instead oflist()
. It is an alias tolist()
; they both mean the same. Feel free to use whichever you prefer; we have noticed most users seem to prefer.()
for conciseness, so we will continue to use.()
hereafter.
data.table
s (and data.frame
s) are internally list
s as well, with the stipulation that each element has the same length and the list
has a class
attribute. Allowing j
to return a list
enables converting and returning data.table
very efficiently.
Tip:
As long as j-expression
returns a list
, each element of the list will be converted to a column in the resulting data.table
. This makes j
quite powerful, as we will see shortly. It is also very important to
understand this for when you'd like to make more complicated queries!!
– Select both arr_delay
and dep_delay
columns.
ans <- flights[, .(arr_delay, dep_delay)] head(ans) # arr_delay dep_delay # 1: 13 14 # 2: 13 -3 # 3: 9 2 # 4: -26 -8 # 5: 1 2 # 6: 0 4 ## alternatively # ans <- flights[, list(arr_delay, dep_delay)]
- Wrap both columns within
.()
, orlist()
. That's it.
– Select both arr_delay
and dep_delay
columns and rename them to delay_arr
and delay_dep
.
Since .()
is just an alias for list()
, we can name columns as we would while creating a list
.
ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)] head(ans) # delay_arr delay_dep # 1: 13 14 # 2: 13 -3 # 3: 9 2 # 4: -26 -8 # 5: 1 2 # 6: 0 4
That's it.
e) Compute or do in j
– How many trips have had total delay < 0?
ans <- flights[, sum( (arr_delay + dep_delay) < 0 )] ans # [1] 141814
What's happening here?
data.table
'sj
can handle more than just selecting columns - it can handle expressions, i.e., computing on columns. This shouldn't be surprising, as columns can be referred to as if they are variables. Then we should be able to compute by calling functions on those variables. And that's what precisely happens here.
f) Subset in i
and do in j
– Calculate the average arrival and departure delay for all flights with "JFK" as the origin airport in the month of June.
ans <- flights[origin == "JFK" & month == 6L, .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))] ans # m_arr m_dep # 1: 5.839349 9.807884
-
We first subset in
i
to find matching row indices whereorigin
airport equals"JFK"
, andmonth
equals6L
. We do not subset the entiredata.table
corresponding to those rows yet. -
Now, we look at
j
and find that it uses only two columns. And what we have to do is to compute theirmean()
. Therefore we subset just those columns corresponding to the matching rows, and compute theirmean()
.
Because the three main components of the query (i
, j
and by
) are together inside [...]
, data.table
can see all three and optimise the query altogether before evaluation, not each separately. We are able to therefore avoid the entire subset (i.e., subsetting the columns besides arr_delay
and dep_delay
), for both speed and memory efficiency.
– How many trips have been made in 2014 from "JFK" airport in the month of June?
ans <- flights[origin == "JFK" & month == 6L, length(dest)] ans # [1] 8422
The function length()
requires an input argument. We
just needed to compute the number of rows in the subset. We could have
used any other column as input argument to length()
really. This approach is reminiscent of SELECT COUNT(dest) FROM flights WHERE origin = 'JFK' AND month = 6
in SQL.
This type of operation occurs quite frequently, especially while
grouping (as we will see in the next section), to the point where data.table
provides a special symbol .N
for it.
Special symbol .N
:
.N
is a special built-in variable that holds the number of observations in the current group. It is particularly useful when combined with by
as we'll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset.
So we can now accomplish the same task by using .N
as follows:
ans <- flights[origin == "JFK" & month == 6L, .N] ans # [1] 8422
-
Once again, we subset in
i
to get the row indices whereorigin
airport equals "JFK", andmonth
equals 6. -
We see that
j
uses only.N
and no other columns. Therefore the entire subset is not materialised. We simply return the number of rows in the subset (which is just the length of row indices). -
Note that we did not wrap
.N
withlist()
or.()
. Therefore a vector is returned.
We could have accomplished the same operation by doing nrow(flights[origin == "JFK" & month == 6L])
. However, it would have to subset the entire data.table
first corresponding to the row indices in i
and then return the rows using nrow()
, which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the data.table
design vignette.
g) Great! But how can I refer to columns by names in j
(like in a data.frame
)?
If you're writing out the column names explicitly, there's no difference vis-a-vis data.frame
(since v1.9.8).
– Select both arr_delay
and dep_delay
columns the data.frame
way.
ans <- flights[, c("arr_delay", "dep_delay")] head(ans) # arr_delay dep_delay # 1: 13 14 # 2: 13 -3 # 3: 9 2 # 4: -26 -8 # 5: 1 2 # 6: 0 4
If you've stored the desired columns in a character vector, there are two options: Using the ..
prefix, or using the with
argument.
– Select columns named in a variable using the ..
prefix
select_cols = c("arr_delay", "dep_delay") flights[ , ..select_cols] # arr_delay dep_delay # 1: 13 14 # 2: 13 -3 # 3: 9 2 # 4: -26 -8 # 5: 1 2 # --- # 253312: -30 1 # 253313: -14 -5 # 253314: 16 -8 # 253315: 15 -4 # 253316: 1 -5
For those familiar with the Unix terminal, the ..
prefix should be reminiscent of the "up-one-level" command, which is analogous to what's happening here – the ..
signals to data.table
to look for the select_cols
variable "up-one-level", i.e., in the global environment in this case.
– Select columns named in a variable using with = FALSE
flights[ , select_cols, with = FALSE] # arr_delay dep_delay # 1: 13 14 # 2: 13 -3 # 3: 9 2 # 4: -26 -8 # 5: 1 2 # --- # 253312: -30 1 # 253313: -14 -5 # 253314: 16 -8 # 253315: 15 -4 # 253316: 1 -5
The argument is named with
after the R function with()
because of similar functionality. Suppose you have a data.frame
DF
and you'd like to subset all rows where x > 1
. In base
R you can do the following:
DF = data.frame(x = c(1,1,1,2,2,3,3,3), y = 1:8) ## (1) normal way DF[DF$x > 1, ] # data.frame needs that ',' as well # x y # 4 2 4 # 5 2 5 # 6 3 6 # 7 3 7 # 8 3 8 ## (2) using with DF[with(DF, x > 1), ] # x y # 4 2 4 # 5 2 5 # 6 3 6 # 7 3 7 # 8 3 8
-
Using
with()
in (2) allows usingDF
's columnx
as if it were a variable.Hence the argument name
with
indata.table
. Settingwith = FALSE
disables the ability to refer to columns as if they are variables, thereby restoring the "data.frame
mode". -
We can also deselect columns using
-
or!
. For example:## not run # returns all columns except arr_delay and dep_delay ans <- flights[, !c("arr_delay", "dep_delay")] # or ans <- flights[, -c("arr_delay", "dep_delay")]
-
From
v1.9.5+
, we can also select by specifying start and end column names, e.g.,year:day
to select the first three columns.## not run # returns year,month and day ans <- flights[, year:day] # returns day, month and year ans <- flights[, day:year] # returns all columns except year, month and day ans <- flights[, -(year:day)] ans <- flights[, !(year:day)]
This is particularly handy while working interactively.
with = TRUE
is the default in data.table
because we can do much more by allowing j
to handle expressions - especially when combined with by
, as we'll see in a moment.