Formulas and Model Families

Formulas are R's versions of statistical equations: you pass them to modelling functions like lm() for estimation. We use formulas to specify models, i.e. which terms the model includes and how they are transformed. This section introduces the main options for specifying a model with formulas. Pay particular attention to how the intercept and interactions between variables are specified, as sketched below.
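
As a quick sketch of the most common variations (using the simulated sim4 dataset from the modelr package, with illustrative model names mod_a through mod_d that aren't used later in this section):

library(tidyverse)
library(modelr)

mod_a <- lm(y ~ x1, data = sim4)       # y = a_0 + a_1 * x1
mod_b <- lm(y ~ x1 - 1, data = sim4)   # removes the intercept: y = a_1 * x1
mod_c <- lm(y ~ x1 + x2, data = sim4)  # two main effects, no interaction
mod_d <- lm(y ~ x1 * x2, data = sim4)  # main effects plus the x1:x2 interaction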

Interactions (two continuous)

Let's take a look at the equivalent model for two continuous variables. Initially things proceed almost identically to the previous example:

mod1 <- lm(y ~ x1 + x2, data = sim4)
mod2 <- lm(y ~ x1 * x2, data = sim4)
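
The * in a formula is shorthand: y ~ x1 * x2 expands to y ~ x1 + x2 + x1:x2, so mod2 fits both main effects plus their interaction. If you want to check this, model_matrix() from modelr shows exactly which columns a formula generates (a quick sanity check, not part of the original example):

model_matrix(sim4, y ~ x1 * x2)
# the resulting tibble has columns `(Intercept)`, x1, x2, and x1:x2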

grid <- sim4 %>% 
  data_grid(
    x1 = seq_range(x1, 5), 
    x2 = seq_range(x2, 5)
  ) %>% 
  gather_predictions(mod1, mod2)
grid
#> # A tibble: 50 x 4
#>   model    x1    x2   pred
#>   <chr> <dbl> <dbl>  <dbl>
#> 1 mod1   -1    -1    0.996
#> 2 mod1   -1    -0.5 -0.395
#> 3 mod1   -1     0   -1.79 
#> 4 mod1   -1     0.5 -3.18 
#> 5 mod1   -1     1   -4.57 
#> 6 mod1   -0.5  -1    1.91 
#> # … with 44 more rows
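
gather_predictions() adds each model's prediction as a separate row. If you'd rather have one column per model, for example to compute the difference between the two fits directly, modelr also provides spread_predictions(); here's a brief sketch (the diff column is an illustrative addition, not from the original example):

sim4 %>% 
  data_grid(
    x1 = seq_range(x1, 5), 
    x2 = seq_range(x2, 5)
  ) %>% 
  spread_predictions(mod1, mod2) %>% 
  mutate(diff = mod2 - mod1)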

Note my use of seq_range() inside data_grid(). Instead of using every unique value of x, I'm going to use a regularly spaced grid of five values between the minimum and maximum values. It's probably not super important here, but it's a useful technique in general. There are three other useful arguments to seq_range():

  • pretty = TRUE will generate a "pretty" sequence, i.e. something that looks nice to the human eye. This is useful if you want to produce tables of output:

    seq_range(c(0.0123, 0.923423), n = 5)
    #> [1] 0.0123000 0.2400808 0.4678615 0.6956423 0.9234230
    seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
    #> [1] 0.0 0.2 0.4 0.6 0.8 1.0
  • trim = 0.1 will trim off 10% of the tail values. This is useful if the variables have a long-tailed distribution and you want to focus on generating values near the center:

    x1 <- rcauchy(100)
    seq_range(x1, n = 5)
    #> [1] -115.86934  -83.52130  -51.17325  -18.82520   13.52284
    seq_range(x1, n = 5, trim = 0.10)
    #> [1] -13.841101  -8.709812  -3.578522   1.552767   6.684057
    seq_range(x1, n = 5, trim = 0.25)
    #> [1] -2.17345439 -1.05938856  0.05467728  1.16874312  2.28280896
    seq_range(x1, n = 5, trim = 0.50)
    #> [1] -0.7249565 -0.2677888  0.1893788  0.6465465  1.1037141
  • expand = 0.1 is in some sense the opposite of trim: it expands the range by 10%.

    x2 <- c(0, 1)
    seq_range(x2, n = 5)
    #> [1] 0.00 0.25 0.50 0.75 1.00
    seq_range(x2, n = 5, expand = 0.10)
    #> [1] -0.050  0.225  0.500  0.775  1.050
    seq_range(x2, n = 5, expand = 0.25)
    #> [1] -0.1250  0.1875  0.5000  0.8125  1.1250
    seq_range(x2, n = 5, expand = 0.50)
    #> [1] -0.250  0.125  0.500  0.875  1.250
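
These arguments carry over directly into data_grid(). For instance, with a long-tailed predictor you could trim before building the grid; this is just a sketch with illustrative values of n and trim (the sim4 predictors don't actually need trimming):

sim4 %>% 
  data_grid(
    x1 = seq_range(x1, n = 5, trim = 0.1), 
    x2 = seq_range(x2, n = 5, trim = 0.1)
  )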

Next let's try and visualise that model. We have two continuous predictors, so you can imagine the model like a 3d surface. We could display that using geom_tile():

ggplot(grid, aes(x1, x2)) + 
  geom_tile(aes(fill = pred)) + 
  facet_wrap(~ model)

That doesn't suggest that the models are very different! But that's partly an illusion: our eyes and brains are not very good at accurately comparing shades of colour. Instead of looking at the surface from the top, we could look at it from either side, showing multiple slices:

ggplot(grid, aes(x1, pred, colour = x2, group = x2)) + 
  geom_line() +
  facet_wrap(~ model)
ggplot(grid, aes(x2, pred, colour = x1, group = x1)) + 
  geom_line() +
  facet_wrap(~ model)


This shows you that an interaction between two continuous variables works basically the same way as an interaction between a categorical and a continuous variable. An interaction says that there's not a fixed offset: to predict y you need to consider the values of x1 and x2 simultaneously.
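
You can see the same thing in the coefficients: in mod2 the slope of x1 isn't a single number, it depends on the value of x2 (and vice versa). A small sketch of that calculation, using the coefficient names that lm() produces:

coefs <- coef(mod2)
# In mod2, the predicted slope of x1 at a given x2 is a_1 + a_3 * x2:
slope_x1 <- function(x2) coefs[["x1"]] + coefs[["x1:x2"]] * x2
slope_x1(-1)  # slope of x1 when x2 = -1
slope_x1(1)   # slope of x1 when x2 =  1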

You can see that even with just two continuous variables, coming up with good visualisations is hard. But that's reasonable: you shouldn't expect it to be easy to understand how three or more variables simultaneously interact! But again, we're saved a little because we're using models for exploration, and you can gradually build up your model over time. The model doesn't have to be perfect; it just has to help you reveal a little more about your data.

I spent some time looking at the residuals to see whether I could figure out if mod2 did better than mod1. I think it does, but it's pretty subtle. You'll have a chance to work on it in the exercises.
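
If you want a head start on that comparison, gather_residuals() works just like gather_predictions() but attaches residuals instead of predictions; one possible sketch (not the only way to approach the exercise):

sim4 %>% 
  gather_residuals(mod1, mod2) %>% 
  ggplot(aes(x1, resid)) + 
  geom_point(alpha = 1/4) + 
  facet_wrap(~ model)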