The data.table Format

The data.table format also helps shorten code when working with data.frame structures. More importantly, data.table handles large data sets very efficiently. You can convert a data.frame to a data.table with as.data.table() or setDT(), and back again if needed.
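As a minimal sketch of the conversion in both directions, the example below uses a small made-up data.frame (the column names id and x are illustrative only). It assumes the data.table package is installed.

```r
library(data.table)

# A small, made-up data.frame used only for illustration
df <- data.frame(id = c("a", "b", "c"), x = 1:3)

# as.data.table() returns a new data.table and leaves df unchanged
DT <- as.data.table(df)

# setDT() converts an existing data.frame in place (by reference)
df2 <- data.frame(id = c("a", "b", "c"), x = 1:3)
setDT(df2)
was_dt <- is.data.table(df2)   # TRUE at this point

# Converting back: as.data.frame() returns a copy, setDF() works in place
back <- as.data.frame(DT)
setDF(df2)

class(DT)    # "data.table" "data.frame"
class(back)  # "data.frame"
```

Because a data.table also inherits from data.frame, it can usually be passed to functions that expect a data.frame without converting it back first.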

Summary

The general form of data.table syntax is:

DT[i, j, by] 

So far, we have seen the following.


Using i:

  • We can subset rows much as in a data.frame, except that you don't have to use DT$ repetitively, since columns within the frame of a data.table are seen as if they are variables.

  • We can also sort a data.table using order(), which internally uses data.table's fast order for performance.
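The two uses of i above can be sketched as follows. The table and its values are invented for illustration; colA and colB simply reuse the placeholder names from this summary.

```r
library(data.table)

# Made-up data for illustration
DT <- data.table(
  colA = c(5L, 2L, 9L, 2L),
  colB = c("x", "y", "x", "z")
)

# Subset rows: columns are referred to directly, no DT$ prefix needed
sub <- DT[colA > 2 & colB == "x"]

# The same subset written the data.frame way, for comparison
df <- as.data.frame(DT)
sub_df <- df[df$colA > 2 & df$colB == "x", ]

# Sort by colA ascending, then by colB descending.
# Inside a data.table, "-" also works on character columns.
sorted <- DT[order(colA, -colB)]
```

Note that the comma after the row condition is optional in a data.table: DT[colA > 2] returns all columns for the matching rows.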

We can do much more in i by keying a data.table, which allows blazing fast subsets and joins. We will see this in the "Keys and fast binary search based subsets" and "Joins and rolling joins" vignettes.


Using j:

  1. Select columns the data.table way: DT[, .(colA, colB)].

  2. Select columns the data.frame way: DT[, c("colA", "colB")].

  3. Compute on columns: DT[, .(sum(colA), mean(colB))].

  4. Provide names if necessary: DT[, .(sA = sum(colA), mB = mean(colB))].

  5. Combine with i: DT[colA > value, sum(colB)].
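The five forms of j listed above can be tried on a small table. This is a sketch with made-up numbers; value is a hypothetical threshold defined outside the table, and selecting columns with a character vector assumes a reasonably recent data.table release.

```r
library(data.table)

# Made-up data for illustration
DT <- data.table(colA = c(1, 2, 3, 4), colB = c(10, 20, 30, 40))

# 1. Select columns the data.table way
sel_dt <- DT[, .(colA, colB)]

# 2. Select columns the data.frame way
sel_df <- DT[, c("colA", "colB")]

# 3. Compute on columns; unnamed results get default names V1, V2
unnamed <- DT[, .(sum(colA), mean(colB))]

# 4. Provide names for the computed columns
named <- DT[, .(sA = sum(colA), mB = mean(colB))]

# 5. Combine with i: sum colB over the rows where colA exceeds value
value <- 2   # hypothetical threshold, not a column of DT
total <- DT[colA > value, sum(colB)]   # a plain numeric vector: 70
```

In form 5, j is not wrapped in .(), so the result is a vector rather than a data.table.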

Using by:

  • Using by, we can group by columns by specifying a list of columns, a character vector of column names, or even expressions. The flexibility of j, combined with by and i, makes for a very powerful syntax.

  • by can handle multiple columns and also expressions.

  • We can use keyby instead of by to automatically sort the grouped result by the grouping columns.

  • We can use .SD and .SDcols in j to operate on multiple columns using already familiar base functions. Here are some examples:

    1. DT[, lapply(.SD, fun), by = ..., .SDcols = ...] - applies fun to all columns specified in .SDcols while grouping by the columns specified in by.

    2. DT[, head(.SD, 2), by = ...] - returns the first two rows for each group.

    3. DT[col > val, head(.SD, 1), by = ...] - combines i along with j and by.
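The grouping patterns above can be sketched on a small table. The data and the column names grp, x and y are invented for illustration, and mean stands in for the generic fun.

```r
library(data.table)

# Made-up data for illustration
DT <- data.table(
  grp = c("b", "a", "b", "a", "b"),
  x   = c(1, 2, 3, 4, 5),
  y   = c(10, 20, 30, 40, 50)
)

# by: groups are returned in order of first appearance (b, then a)
by_grp <- DT[, .(sx = sum(x)), by = grp]

# keyby: same computation, but the result is sorted by grp (a, then b)
keyed <- DT[, .(sx = sum(x)), keyby = grp]

# by also accepts expressions; here rows are grouped by whether x > 2
by_expr <- DT[, .N, by = .(big = x > 2)]

# Apply a function to the columns named in .SDcols, per group
means <- DT[, lapply(.SD, mean), by = grp, .SDcols = c("x", "y")]

# First two rows of each group
first_two <- DT[, head(.SD, 2), by = grp]

# Combine i, j and by: first row per group among rows with x > 1
first_big <- DT[x > 1, head(.SD, 1), by = grp]
```

In addition to sorting, keyby sets a key on the result, which can be checked with key(keyed).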

And remember the tip:

As long as j returns a list, each element of the list will become a column in the resulting data.table.
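A brief sketch of this tip, again with made-up data: .() is an alias for list(), so any expression in j that evaluates to a list yields one column per list element.

```r
library(data.table)

# Made-up data for illustration
DT <- data.table(colA = 1:4, colB = c(2, 4, 6, 8))

# list() and .() are interchangeable in j
as_list <- DT[, list(lo = min(colA), hi = max(colA))]
as_dot  <- DT[, .(lo = min(colA), hi = max(colA))]

# Without a list, j returns a plain vector
as_vec <- DT[, colA]

# Wrapped in .(), the same column comes back as a one-column data.table
as_col <- DT[, .(colA)]

# Any function that returns a list works: each element becomes a column
rng <- DT[, as.list(range(colB))]
```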

We will see how to add/update/delete columns by reference and how to combine them with i and by in the next vignette.