The data.table Format
The data.table format also helps shorten code when working with data.frame structures. Most importantly, data.table handles big data very efficiently. You can convert a data.frame to data.table and back if needed.
Aggregations
We've already seen i
and j
from data.table
's general form in the previous section. In this section, we'll see how they can be combined together with by
to perform operations by group. Let's look at some examples.
a) Grouping using by
– How can we get the number of trips corresponding to each origin airport?
ans <- flights[, .(.N), by = .(origin)] ans # origin N # 1: JFK 81483 # 2: LGA 84433 # 3: EWR 87400 ## or equivalently using a character vector in 'by' # ans <- flights[, .(.N), by = "origin"]
-
We know
.N
is a special variable that holds the number of rows in the current group. Grouping byorigin
obtains the number of rows,.N
, for each group. -
By doing
head(flights)
you can see that the origin airports occur in the order "JFK", "LGA" and "EWR". The original order of grouping variables is preserved in the result. This is important to keep in mind! -
Since we did not provide a name for the column returned in
j
, it was namedN
automatically by recognising the special symbol.N
. -
by
also accepts a character vector of column names. This is particularly useful for coding programmatically, e.g., designing a function with the grouping columns as a (character
vector) function argument. -
When there's only one column or expression to refer to in
j
andby
, we can drop the.()
notation. This is purely for convenience. We could instead do:ans <- flights[, .N, by = origin] ans # origin N # 1: JFK 81483 # 2: LGA 84433 # 3: EWR 87400
We'll use this convenient form wherever applicable hereafter.
– How can we calculate the number of trips for each origin airport for carrier code "AA"
?
The unique carrier code "AA"
corresponds to American Airlines Inc.
ans <- flights[carrier == "AA", .N, by = origin] ans # origin N # 1: JFK 11923 # 2: LGA 11730 # 3: EWR 2649
-
We first obtain the row indices for the expression
carrier == "AA"
fromi
. -
Using those row indices, we obtain the number of rows while grouped by
origin
. Once again no columns are actually materialised here, because thej-expression
does not require any columns to be actually subsetted and is therefore fast and memory efficient.
– How can we get the total number of trips for each origin, dest
pair for carrier code "AA"
?
ans <- flights[carrier == "AA", .N, by = .(origin, dest)] head(ans) # origin dest N # 1: JFK LAX 3387 # 2: LGA PBI 245 # 3: EWR LAX 62 # 4: JFK MIA 1876 # 5: JFK SEA 298 # 6: EWR MIA 848 ## or equivalently using a character vector in 'by' # ans <- flights[carrier == "AA", .N, by = c("origin", "dest")]
by
accepts multiple columns. We just provide all the columns by which to group by. Note the use of.()
again inby
– again, this is just shorthand forlist()
, andlist()
can be used here as well. Again, we'll stick with.()
in this vignette.
– How can we get the average arrival and departure delay for each orig,dest
pair for each month for carrier code "AA"
?
ans <- flights[carrier == "AA", .(mean(arr_delay), mean(dep_delay)), by = .(origin, dest, month)] ans # origin dest month V1 V2 # 1: JFK LAX 1 6.590361 14.2289157 # 2: LGA PBI 1 -7.758621 0.3103448 # 3: EWR LAX 1 1.366667 7.5000000 # 4: JFK MIA 1 15.720670 18.7430168 # 5: JFK SEA 1 14.357143 30.7500000 # --- # 196: LGA MIA 10 -6.251799 -1.4208633 # 197: JFK MIA 10 -1.880184 6.6774194 # 198: EWR PHX 10 -3.032258 -4.2903226 # 199: JFK MCO 10 -10.048387 -1.6129032 # 200: JFK DCA 10 16.483871 15.5161290
-
Since we did not provide column names for the expressions in
j
, they were automatically generated asV1
andV2
. -
Once again, note that the input order of grouping columns is preserved in the result.
Now what if we would like to order the result by those grouping columns origin
, dest
and month
?
b) Sorted by
: keyby
data.table
retaining the original order of groups is
intentional and by design. There are cases when preserving the original
order is essential. But at times we would like to automatically sort by
the variables in our grouping.
– So how can we directly order by all the grouping variables?
ans <- flights[carrier == "AA", .(mean(arr_delay), mean(dep_delay)), keyby = .(origin, dest, month)] ans # origin dest month V1 V2 # 1: EWR DFW 1 6.427673 10.0125786 # 2: EWR DFW 2 10.536765 11.3455882 # 3: EWR DFW 3 12.865031 8.0797546 # 4: EWR DFW 4 17.792683 12.9207317 # 5: EWR DFW 5 18.487805 18.6829268 # --- # 196: LGA PBI 1 -7.758621 0.3103448 # 197: LGA PBI 2 -7.865385 2.4038462 # 198: LGA PBI 3 -5.754098 3.0327869 # 199: LGA PBI 4 -13.966667 -4.7333333 # 200: LGA PBI 5 -10.357143 -6.8571429
- All we did was to change
by
tokeyby
. This automatically orders the result by the grouping variables in increasing order. In fact, due to the internal implementation ofby
first requiring a sort before recovering the original table's order,keyby
is typically faster thanby
because it doesn't require this second step.
Keys: Actually keyby
does a little more than just ordering. It also sets a key after ordering by setting an attribute
called sorted
.
We'll learn more about keys
in the Keys and fast binary search based subset vignette; for now, all you have to know is that you can use keyby
to automatically order the result by the columns specified in by
.
c) Chaining
Let's reconsider the task of getting the total number of trips for each origin, dest
pair for carrier "AA".
ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
– How can we order ans
using the columns origin
in ascending order, and dest
in descending order?
We can store the intermediate result in a variable, and then use order(origin, -dest)
on that variable. It seems fairly straightforward.
ans <- ans[order(origin, -dest)] head(ans) # origin dest N # 1: EWR PHX 121 # 2: EWR MIA 848 # 3: EWR LAX 62 # 4: EWR DFW 1618 # 5: JFK STT 229 # 6: JFK SJU 690
-
Recall that we can use
-
on acharacter
column inorder()
within the frame of adata.table
. This is possible to duedata.table
's internal query optimisation. -
Also recall that
order(...)
with the frame of adata.table
is automatically optimised to usedata.table
's internal fast radix orderforder()
for speed.
But this requires having to assign the intermediate result and then overwriting that result. We can do one better and avoid this intermediate assignment to a temporary variable altogether by chaining expressions.
ans <- flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)] head(ans, 10) # origin dest N # 1: EWR PHX 121 # 2: EWR MIA 848 # 3: EWR LAX 62 # 4: EWR DFW 1618 # 5: JFK STT 229 # 6: JFK SJU 690 # 7: JFK SFO 1312 # 8: JFK SEA 298 # 9: JFK SAN 299 # 10: JFK ORD 432
-
We can tack expressions one after another, forming a chain of operations, i.e.,
DT[ ... ][ ... ][ ... ]
. -
Or you can also chain them vertically:
DT[ ... ][ ... ][ ... ]
d) Expressions in by
– Can by
accept expressions as well or does it just take columns?
Yes it does. As an example, if we would like to find out how many flights started late but arrived early (or on time), started and arrived late etc…
ans <- flights[, .N, .(dep_delay>0, arr_delay>0)] ans # dep_delay arr_delay N # 1: TRUE TRUE 72836 # 2: FALSE TRUE 34583 # 3: FALSE FALSE 119304 # 4: TRUE FALSE 26593
-
The last row corresponds to
dep_delay > 0 = TRUE
andarr_delay > 0 = FALSE
. We can see that 26593 flights started late but arrived early (or on time). -
Note that we did not provide any names to
by-expression
. Therefore, names have been automatically assigned in the result. As withj
, you can name these expressions as you would elements of anylist
, e.g.DT[, .N, .(dep_delayed = dep_delay>0, arr_delayed = arr_delay>0)]
. -
You can provide other columns along with expressions, for example:
DT[, .N, by = .(a, b>0)]
.
e) Multiple columns in j
- .SD
– Do we have to compute mean()
for each column individually?
It is of course not practical to have to type mean(myCol)
for every column one by one. What if you had 100 columns to average mean()
?
How can we do this efficiently, concisely? To get there, refresh on this tip - "As long as the j
-expression returns a list
, each element of the list
will be converted to a column in the resulting data.table
". Suppose we can refer to the data subset for each group as a variable while grouping, then we can loop through all the columns of that variable using the already- or soon-to-be-familiar base function lapply()
. No new names to learn specific to data.table
.
Special symbol .SD
:
data.table
provides a special symbol, called .SD
. It stands for Subset of Data. It by itself is a data.table
that holds the data for the current group defined using by
.
Recall that a data.table
is internally a list
as well with all its columns of equal length.
Let's use the data.table
DT
from before to get a glimpse of what .SD
looks like.
DT # ID a b c # 1: b 1 7 13 # 2: b 2 8 14 # 3: b 3 9 15 # 4: a 4 10 16 # 5: a 5 11 17 # 6: c 6 12 18 DT[, print(.SD), by = ID] # a b c # 1: 1 7 13 # 2: 2 8 14 # 3: 3 9 15 # a b c # 1: 4 10 16 # 2: 5 11 17 # a b c # 1: 6 12 18 # Empty data.table (0 rows and 1 cols): ID
-
.SD
contains all the columns except the grouping columns by default. -
It is also generated by preserving the original order - data corresponding to
ID = "b"
, thenID = "a"
, and thenID = "c"
.
To compute on (multiple) columns, we can then simply use the base R function lapply()
.
DT[, lapply(.SD, mean), by = ID] # ID a b c # 1: b 2.0 8.0 14.0 # 2: a 4.5 10.5 16.5 # 3: c 6.0 12.0 18.0
-
.SD
holds the rows corresponding to columnsa
,b
andc
for that group. We compute themean()
on each of these columns using the already-familiar base functionlapply()
. -
Each group returns a list of three elements containing the mean value which will become the columns of the resulting
data.table
. -
Since
lapply()
returns alist
, so there is no need to wrap it with an additional.()
(if necessary, refer to this tip).
We are almost there. There is one little thing left to address. In our flights
data.table
, we only wanted to calculate the mean()
of two columns arr_delay
and dep_delay
. But .SD
would contain all the columns other than the grouping variables by default.
– How can we specify just the columns we would like to compute the mean()
on?
.SDcols
Using the argument .SDcols
. It accepts either column names or column indices. For example, .SDcols = c("arr_delay", "dep_delay")
ensures that .SD
contains only these two columns for each group.
Similar to part g), you can also provide the columns to remove instead of columns to keep using -
or !
sign as well as select consecutive columns as colA:colB
and deselect consecutive columns as !(colA:colB)
or -(colA:colB)
.
Now let us try to use .SD
along with .SDcols
to get the mean()
of arr_delay
and dep_delay
columns grouped by origin
, dest
and month
.
flights[carrier == "AA", ## Only on trips with carrier "AA" lapply(.SD, mean), ## compute the mean by = .(origin, dest, month), ## for every 'origin,dest,month' .SDcols = c("arr_delay", "dep_delay")] ## for just those specified in .SDcols # origin dest month arr_delay dep_delay # 1: JFK LAX 1 6.590361 14.2289157 # 2: LGA PBI 1 -7.758621 0.3103448 # 3: EWR LAX 1 1.366667 7.5000000 # 4: JFK MIA 1 15.720670 18.7430168 # 5: JFK SEA 1 14.357143 30.7500000 # --- # 196: LGA MIA 10 -6.251799 -1.4208633 # 197: JFK MIA 10 -1.880184 6.6774194 # 198: EWR PHX 10 -3.032258 -4.2903226 # 199: JFK MCO 10 -10.048387 -1.6129032 # 200: JFK DCA 10 16.483871 15.5161290
f) Subset .SD
for each group:
– How can we return the first two rows for each month
?
ans <- flights[, head(.SD, 2), by = month] head(ans) # month year day dep_delay arr_delay carrier origin dest air_time distance hour # 1: 1 2014 1 14 13 AA JFK LAX 359 2475 9 # 2: 1 2014 1 -3 13 AA JFK LAX 363 2475 11 # 3: 2 2014 1 -1 1 AA JFK LAX 358 2475 8 # 4: 2 2014 1 -5 3 AA JFK LAX 358 2475 11 # 5: 3 2014 1 -11 36 AA JFK LAX 375 2475 8 # 6: 3 2014 1 -3 14 AA JFK LAX 368 2475 11
-
.SD
is adata.table
that holds all the rows for that group. We simply subset the first two rows as we have seen here already. -
For each group,
head(.SD, 2)
returns the first two rows as adata.table
, which is also alist
, so we do not have to wrap it with.()
.
g) Why keep j
so flexible?
So that we have a consistent syntax and keep using already existing
(and familiar) base functions instead of learning new functions. To
illustrate, let us use the data.table
DT
that we created at the very beginning under What is a data.table? section.
– How can we concatenate columns a
and b
for each group in ID
?
DT[, .(val = c(a,b)), by = ID] # ID val # 1: b 1 # 2: b 2 # 3: b 3 # 4: b 7 # 5: b 8 # 6: b 9 # 7: a 4 # 8: a 5 # 9: a 10 # 10: a 11 # 11: c 6 # 12: c 12
- That's it. There is no special syntax required. All we need to know is the base function
c()
which concatenates vectors and the tip from before.
– What if we would like to have all the values of column a
and b
concatenated, but returned as a list column?
DT[, .(val = list(c(a,b))), by = ID] # ID val # 1: b 1,2,3,7,8,9 # 2: a 4, 5,10,11 # 3: c 6,12
-
Here, we first concatenate the values with
c(a,b)
for each group, and wrap that withlist()
. So for each group, we return a list of all concatenated values. -
Note those commas are for display only. A list column can contain any object in each cell, and in this example, each cell is itself a vector and some cells contain longer vectors than others.
Once you start internalising usage in j
, you will realise how powerful the syntax can be. A very useful way to understand it is by playing around, with the help of print()
.
For example:
## (1) look at the difference between DT[, print(c(a,b)), by = ID] # [1] 1 2 3 7 8 9 # [1] 4 5 10 11 # [1] 6 12 # Empty data.table (0 rows and 1 cols): ID ## (2) and DT[, print(list(c(a,b))), by = ID] # [[1]] # [1] 1 2 3 7 8 9 # # [[1]] # [1] 4 5 10 11 # # [[1]] # [1] 6 12 # Empty data.table (0 rows and 1 cols): ID
In (1), for each group, a vector is returned, with length = 6,4,2
here. However (2) returns a list of length 1 for each group, with its
first element holding vectors of length 6,4,2. Therefore (1) results in a
length of 6+4+2 = 12
, whereas (2) returns 1+1+1=3
.