Formulas and Model Families

Formulas are the R versions of statistical equations passed to the R functions for estimation. We use formulas to specify the models, such as what terms the model will have and their transformations. This section introduces various options for specifying a model using formulas. Pay attention to specifying the intercept and interactions of variables.

Introduction

You've seen formulas before when using facet_wrap() and facet_grid(). In R, formulas provide a general way of getting "special behaviour". Rather than evaluating the values of the variables right away, they capture them so they can be interpreted by the function.

The majority of modelling functions in R use a standard conversion from formulas to functions. You've seen one simple conversion already: y ~ x is translated to y = a_1 + a_2 * x. If you want to see what R actually does, you can use the model_matrix() function. It takes a data frame and a formula and returns a tibble that defines the model equation: each column in the output is associated with one coefficient in the model, the function is always y = a_1 * out1 + a_2 * out_2. For the simplest case of y ~ x1 this shows us something interesting:

df <- tribble(
~y, ~x1, ~x2,
4, 2, 5,
5, 1, 6
)
model_matrix(df, y ~ x1)
#> # A tibble: 2 x 2
#>   `(Intercept)`    x1
#>           <dbl> <dbl>
#> 1             1     2
#> 2             1     1

The way that R adds the intercept to the model is just by having a column that is full of ones. By default, R will always add this column. If you don't want, you need to explicitly drop it with -1:

model_matrix(df, y ~ x1 - 1)
#> # A tibble: 2 x 1
#>      x1
#>   <dbl>
#> 1     2
#> 2     1

The model matrix grows in an unsurprising way when you add more variables to the the model:

model_matrix(df, y ~ x1 + x2)
#> # A tibble: 2 x 3
#>   `(Intercept)`    x1    x2
#>           <dbl> <dbl> <dbl>
#> 1             1     2     5
#> 2             1     1     6

This formula notation is sometimes called "Wilkinson-Rogers notation", and was initially described in Symbolic Description of Factorial Models for Analysis of Variance, by G. N. Wilkinson and C. E. Rogers https://www.jstor.org/stable/2346786. It's worth digging up and reading the original paper if you'd like to understand the full details of the modelling algebra.

The following sections expand on how this formula notation works for categorical variables, interactions, and transformation.


Source: H. Wickham and G. Grolemund, https://r4ds.had.co.nz/model-basics.html
Creative Commons License This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.