Formulas and Model Families
Formulas are the R versions of statistical equations passed to the R functions for estimation. We use formulas to specify the models, such as what terms the model will have and their transformations. This section introduces various options for specifying a model using formulas. Pay attention to specifying the intercept and interactions of variables.
Categorical variables
Generating a function from a formula is straight forward when the
predictor is continuous, but things get a bit more complicated when the
predictor is categorical. Imagine you have a formula like y ~ sex
, where sex could either be male or female. It doesn't make sense to convert that to a formula like y = x_0 + x_1 * sex
because sex
isn't a number - you can't multiply it! Instead what R does is convert it to y = x_0 + x_1 * sex_male
where sex_male
is one if sex
is male and zero otherwise:
df <- tribble( ~ sex, ~ response, "male", 1, "female", 2, "male", 1 ) model_matrix(df, response ~ sex) #> # A tibble: 3 x 2 #> `(Intercept)` sexmale #> <dbl> <dbl> #> 1 1 1 #> 2 1 0 #> 3 1 1
You might wonder why R also doesn't create a sexfemale
column. The problem is that would create a column that is perfectly predictable based on the other columns (i.e. sexfemale = 1 - sexmale
).
Unfortunately the exact details of why this is a problem is beyond the
scope of this book, but basically it creates a model family that is too
flexible, and will have infinitely many models that are equally close to
the data.
Fortunately, however, if you focus on visualising predictions you
don't need to worry about the exact parameterisation. Let's look at some
data and models to make that concrete. Here's the sim2
dataset from modelr:
ggplot(sim2) + geom_point(aes(x, y))
We can fit a model to it, and generate predictions:
mod2 <- lm(y ~ x, data = sim2) grid <- sim2 %>% data_grid(x) %>% add_predictions(mod2) grid #> # A tibble: 4 x 2 #> x pred #> <chr> <dbl> #> 1 a 1.15 #> 2 b 8.12 #> 3 c 6.13 #> 4 d 1.91
Effectively, a model with a categorical x
will predict the mean value for each category. (Why? Because the mean
minimises the root-mean-squared distance.) That's easy to see if we
overlay the predictions on top of the original data:
ggplot(sim2, aes(x)) + geom_point(aes(y = y)) + geom_point(data = grid, aes(y = pred), colour = "red", size = 4)
You can't make predictions about levels that you didn't observe. Sometimes you'll do this by accident so it's good to recognise this error message:
tibble(x = "e") %>% add_predictions(mod2) #> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor x has new level e