## Formulas and Model Families

Formulas are R's way of writing statistical equations so that they can be passed to modelling functions for estimation. We use them to specify a model: which terms it contains and how those terms are transformed. This section introduces the main options for specifying a model with a formula. Pay particular attention to how the intercept and the interactions between variables are specified.

### Introduction

You've seen formulas before when using `facet_wrap()` and `facet_grid()`.
In R, formulas provide a general way of getting "special behaviour".
Rather than evaluating the values of the variables right away, they
capture them so they can be interpreted by the function.
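
As a rough illustration of what "capturing" means, the base R sketch below builds a formula from variables that do not exist yet and only later interprets it against a data frame. The data frame `toy` and its values are made up for this example.

```r
# A formula is stored unevaluated: x and y do not need to exist
# when the formula is created.
f <- y ~ x
class(f)     # "formula"
all.vars(f)  # "y" "x"

# The variables are only looked up when a function interprets the
# formula, here against the columns of a hypothetical data frame.
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6))
mf <- model.frame(f, data = toy)
names(mf)    # "y" "x"
```
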

The majority of modelling functions in R use a standard conversion
from formulas to functions. You've seen one simple conversion already:
`y ~ x` is translated to `y = a_1 + a_2 * x`. If you want to see what R
actually does, you can use the `model_matrix()` function from the modelr
package. It takes a data frame and a formula and returns a tibble that
defines the model equation: each column in the output is associated
with one coefficient in the model, and the function is always
`y = a_1 * out_1 + a_2 * out_2`. For the simplest case of `y ~ x1`,
this shows us something interesting:

```r
library(tibble)  # tribble()
library(modelr)  # model_matrix()

df <- tribble(
  ~y, ~x1, ~x2,
  4, 2, 5,
  5, 1, 6
)
model_matrix(df, y ~ x1)
#> # A tibble: 2 x 2
#>   `(Intercept)`    x1
#>           <dbl> <dbl>
#> 1             1     2
#> 2             1     1
```
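
To make the link between model matrix columns and coefficients concrete, here is a minimal sketch in base R. It assumes `stats::model.matrix()` as a stand-in for `model_matrix()` (which wraps it and returns a tibble) and rebuilds the same two-row data as a plain data frame. Multiplying the model matrix by the fitted coefficients should reproduce the model's predictions.

```r
# Same toy data as above, as a base R data frame.
df <- data.frame(y = c(4, 5), x1 = c(2, 1), x2 = c(5, 6))

# Base R model matrix for y ~ x1: an intercept column and an x1 column.
X <- model.matrix(y ~ x1, data = df)

# Fit the model and extract one coefficient per column of X.
fit <- lm(y ~ x1, data = df)
a <- coef(fit)

# Prediction = a_1 * (Intercept column) + a_2 * (x1 column).
pred_manual <- as.vector(X %*% a)
all.equal(pred_manual, unname(fitted(fit)))
```

With only two observations and two coefficients the line passes through both points, so this is a check of the mechanics rather than a meaningful fit.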

The way that R adds the intercept to the model is just by having a column
that is full of ones. By default, R will always add this column. If you
don't want it, you need to explicitly drop it with `-1`:

```r
model_matrix(df, y ~ x1 - 1)
#> # A tibble: 2 x 1
#>      x1
#>   <dbl>
#> 1     2
#> 2     1
```
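
Base R's formula syntax also accepts `+ 0` (or a leading `0 +`) as a way of dropping the intercept. The sketch below uses base R's `model.matrix()` rather than `model_matrix()` and rebuilds the toy data as a plain data frame; it compares the column names produced by each spelling.

```r
df <- data.frame(y = c(4, 5), x1 = c(2, 1), x2 = c(5, 6))

# Default: the intercept column is added automatically.
with_int <- colnames(model.matrix(y ~ x1, data = df))

# Three spellings that drop the intercept.
no_int_a <- colnames(model.matrix(y ~ x1 - 1, data = df))
no_int_b <- colnames(model.matrix(y ~ x1 + 0, data = df))
no_int_c <- colnames(model.matrix(y ~ 0 + x1, data = df))

with_int  # "(Intercept)" "x1"
no_int_a  # "x1"
```
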

The model matrix grows in an unsurprising way when you add more variables to the model:

```r
model_matrix(df, y ~ x1 + x2)
#> # A tibble: 2 x 3
#>   `(Intercept)`    x1    x2
#>           <dbl> <dbl> <dbl>
#> 1             1     2     5
#> 2             1     1     6
```
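
If you want to inspect how R has parsed a formula without building a model matrix, one option is base R's `terms()`. This is a small sketch rather than part of the original text; it lists the terms of a formula and reports whether an intercept is implied.

```r
# Parse the formula into a "terms" object; no data is needed here.
tt <- terms(y ~ x1 + x2)
attr(tt, "term.labels")  # "x1" "x2"
attr(tt, "intercept")    # 1 means an intercept column will be added

# Dropping the intercept is recorded in the same attribute.
tt0 <- terms(y ~ x1 + x2 - 1)
attr(tt0, "intercept")   # 0
```
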

This formula notation is sometimes called "Wilkinson-Rogers notation", and was initially described in *Symbolic Description of Factorial Models for Analysis of Variance*, by G. N. Wilkinson and C. E. Rogers (<https://www.jstor.org/stable/2346786>). It's worth digging up and reading the original paper if you'd like to understand the full details of the modelling algebra.

The following sections expand on how this formula notation works for categorical variables, interactions, and transformations.

Source: H. Wickham and G. Grolemund, https://r4ds.had.co.nz/model-basics.html

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.