Formulas and Model Families

Formulas are the R versions of statistical equations passed to the R functions for estimation. We use formulas to specify the models, such as what terms the model will have and their transformations. This section introduces various options for specifying a model using formulas. Pay attention to specifying the intercept and interactions of variables.

Interactions (continuous and categorical)

What happens when you combine a continuous and a categorical variable? sim3 contains a categorical predictor and a continuous predictor. We can visualise it with a simple plot:

ggplot(sim3, aes(x1, y)) + 
geom_point(aes(colour = x2))

There are two possible models you could fit to this data:

mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)

When you add variables with +, the model will estimate each effect independent of all the others. It's possible to fit the so-called interaction by using *. For example, y ~ x1 * x2 is translated to y = a_0 + a_1 * x1 + a_2 * x2 + a_12 * x1 * x2. Note that whenever you use *, both the interaction and the individual components are included in the model.

To visualise these models we need two new tricks:

  1. We have two predictors, so we need to give data_grid() both variables. It finds all the unique values of x1 and x2 and then generates all combinations.

  2. To generate predictions from both models simultaneously, we can use gather_predictions() which adds each prediction as a row. The complement of gather_predictions() is spread_predictions() which adds each prediction to a new column.

Together this gives us:

grid <- sim3 %>% 
data_grid(x1, x2) %>% 
gather_predictions(mod1, mod2)
grid
#> # A tibble: 80 x 4
#>   model    x1 x2     pred
#>   <chr> <int> <fct> <dbl>
#> 1 mod1      1 a      1.67
#> 2 mod1      1 b      4.56
#> 3 mod1      1 c      6.48
#> 4 mod1      1 d      4.03
#> 5 mod1      2 a      1.48
#> 6 mod1      2 b      4.37
#> # … with 74 more rows

We can visualise the results for both models on one plot using facetting:

ggplot(sim3, aes(x1, y, colour = x2)) + 
geom_point() + 
geom_line(data = grid, aes(y = pred)) + 
facet_wrap(~ model)

Note that the model that uses + has the same slope for each line, but different intercepts. The model that uses * has a different slope and intercept for each line.

Which model is better for this data? We can take look at the residuals. Here I've facetted by both model and x2 because it makes it easier to see the pattern within each group.

sim3 <- sim3 %>% 
gather_residuals(mod1, mod2)

ggplot(sim3, aes(x1, resid, colour = x2)) + 
geom_point() + 
facet_grid(model ~ x2)

There is little obvious pattern in the residuals for mod2. The residuals for mod1 show that the model has clearly missed some pattern in b, and less so, but still present is pattern in c, and d. You might wonder if there's a precise way to tell which of mod1 or mod2 is better. There is, but it requires a lot of mathematical background, and we don't really care. Here, we're interested in a qualitative assessment of whether or not the model has captured the pattern that we're interested in.