Formulas and Model Families
Formulas are R's way of writing the statistical equations that you pass to modelling functions for estimation. We use formulas to specify models: which terms the model includes and how those terms are transformed. This section introduces the options for specifying a model with a formula. Pay particular attention to how the intercept and interactions between variables are specified.
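As a quick illustration of both points, you can inspect exactly which columns a formula generates with base R's model.matrix(). This is a toy sketch on made-up data (the data frame here is my own invention, not one used later in this section):

```r
# Made-up data just to show how a formula expands into model terms.
df <- data.frame(y = c(1, 2, 3), x1 = c(0, 1, 2), x2 = c(5, 4, 3))

# Default: an intercept column plus one column per term.
model.matrix(y ~ x1 + x2, data = df)

# Drop the intercept with - 1 (or equivalently + 0).
model.matrix(y ~ x1 + x2 - 1, data = df)

# x1 * x2 is shorthand for x1 + x2 + x1:x2 (main effects plus interaction).
model.matrix(y ~ x1 * x2, data = df)
```

The column names of the last matrix ((Intercept), x1, x2, x1:x2) show how the * operator expands into main effects plus an interaction term.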
Interactions (two continuous)
Let's take a look at the equivalent model for two continuous variables. Initially things proceed almost identically to the previous example:
```r
mod1 <- lm(y ~ x1 + x2, data = sim4)
mod2 <- lm(y ~ x1 * x2, data = sim4)

grid <- sim4 %>%
  data_grid(
    x1 = seq_range(x1, 5),
    x2 = seq_range(x2, 5)
  ) %>%
  gather_predictions(mod1, mod2)
grid
#> # A tibble: 50 x 4
#>   model    x1    x2   pred
#>   <chr> <dbl> <dbl>  <dbl>
#> 1 mod1   -1    -1    0.996
#> 2 mod1   -1    -0.5 -0.395
#> 3 mod1   -1     0   -1.79
#> 4 mod1   -1     0.5 -3.18
#> 5 mod1   -1     1   -4.57
#> 6 mod1   -0.5  -1    1.91
#> # … with 44 more rows
```
Note my use of seq_range() inside data_grid(). Instead of using every unique value of x, I'm going to use a regularly spaced grid of five values between the minimum and maximum numbers. It's probably not super important here, but it's a useful technique in general. There are three other useful arguments to seq_range():
- pretty = TRUE will generate a "pretty" sequence, i.e. something that looks nice to the human eye. This is useful if you want to produce tables of output:

  ```r
  seq_range(c(0.0123, 0.923423), n = 5)
  #> [1] 0.0123000 0.2400808 0.4678615 0.6956423 0.9234230

  seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
  #> [1] 0.0 0.2 0.4 0.6 0.8 1.0
  ```
- trim = 0.1 will trim off 10% of the tail values. This is useful if the variables have a long-tailed distribution and you want to focus on generating values near the center:

  ```r
  x1 <- rcauchy(100)
  seq_range(x1, n = 5)
  #> [1] -115.86934  -83.52130  -51.17325  -18.82520   13.52284
  seq_range(x1, n = 5, trim = 0.10)
  #> [1] -13.841101  -8.709812  -3.578522   1.552767   6.684057
  seq_range(x1, n = 5, trim = 0.25)
  #> [1] -2.17345439 -1.05938856  0.05467728  1.16874312  2.28280896
  seq_range(x1, n = 5, trim = 0.50)
  #> [1] -0.7249565 -0.2677888  0.1893788  0.6465465  1.1037141
  ```
- expand = 0.1 is in some sense the opposite of trim(): it expands the range by 10%:

  ```r
  x2 <- c(0, 1)
  seq_range(x2, n = 5)
  #> [1] 0.00 0.25 0.50 0.75 1.00
  seq_range(x2, n = 5, expand = 0.10)
  #> [1] -0.050  0.225  0.500  0.775  1.050
  seq_range(x2, n = 5, expand = 0.25)
  #> [1] -0.1250  0.1875  0.5000  0.8125  1.1250
  seq_range(x2, n = 5, expand = 0.50)
  #> [1] -0.250  0.125  0.500  0.875  1.250
  ```
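To make the trim and expand behaviour concrete, here's a simplified base-R sketch of what seq_range() does with those arguments. This is my own approximation (assuming trim drops tails symmetrically via quantiles and expand widens the range symmetrically), not modelr's actual implementation:

```r
# Simplified sketch of seq_range()'s trim/expand behaviour (not modelr's code).
my_seq_range <- function(x, n, trim = 0, expand = 0) {
  rng <- if (trim > 0) {
    # Drop trim/2 of the values from each tail.
    stats::quantile(x, c(trim / 2, 1 - trim / 2), names = FALSE)
  } else {
    range(x)
  }
  if (expand > 0) {
    # Widen the range by expand/2 of its width on each side.
    spread <- rng[2] - rng[1]
    rng <- c(rng[1] - expand / 2 * spread, rng[2] + expand / 2 * spread)
  }
  seq(rng[1], rng[2], length.out = n)
}

my_seq_range(c(0, 1), n = 5, expand = 0.10)
#> [1] -0.050  0.225  0.500  0.775  1.050
```

Note that this sketch reproduces the expanded sequence for c(0, 1) shown above.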
Next let's try and visualise that model. We have two continuous predictors, so you can imagine the model like a 3d surface. We could display that using geom_tile():
```r
ggplot(grid, aes(x1, x2)) +
  geom_tile(aes(fill = pred)) +
  facet_wrap(~ model)
```
That doesn't suggest that the models are very different! But that's partly an illusion: our eyes and brains are not very good at accurately comparing shades of colour. Instead of looking at the surface from the top, we could look at it from either side, showing multiple slices:
```r
ggplot(grid, aes(x1, pred, colour = x2, group = x2)) +
  geom_line() +
  facet_wrap(~ model)

ggplot(grid, aes(x2, pred, colour = x1, group = x1)) +
  geom_line() +
  facet_wrap(~ model)
```
This shows you that an interaction between two continuous variables works basically the same way as an interaction between a categorical and a continuous variable. An interaction says that there's not a fixed offset: you need to consider both values of x1 and x2 simultaneously in order to predict y.
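You can also see this directly in the fitted coefficients: with an x1:x2 term, the slope of the prediction with respect to x1 is the x1 coefficient plus the x1:x2 coefficient times x2, so it changes as x2 changes. A minimal sketch with simulated data (my own toy example, not the sim4 data used above):

```r
set.seed(1)
dat <- data.frame(x1 = runif(200, -1, 1), x2 = runif(200, -1, 1))
# True model: the slope in the x1 direction is 2 + 4 * x2.
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + 4 * dat$x1 * dat$x2 + rnorm(200, sd = 0.1)

mod <- lm(y ~ x1 * x2, data = dat)

# The fitted slope with respect to x1 depends on where you are in x2:
slope_x1 <- function(x2) coef(mod)[["x1"]] + coef(mod)[["x1:x2"]] * x2
slope_x1(-1)  # roughly 2 + 4 * (-1) = -2
slope_x1(1)   # roughly 2 + 4 * 1  =  6
```

With an additive model (y ~ x1 + x2) the slope would be the same constant everywhere, which is exactly the "fixed offset" assumption the interaction relaxes.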
You can see that even with just two continuous variables, coming up with good visualisations is hard. But that's reasonable: you shouldn't expect it to be easy to understand how three or more variables simultaneously interact! But again, we're saved a little because we're using models for exploration, and you can gradually build up your model over time. The model doesn't have to be perfect; it just has to help you reveal a little more about your data.
I spent some time looking at the residuals to see if I could figure out whether mod2 did better than mod1. I think it does, but it's pretty subtle. You'll have a chance to work on it in the exercises.