Model Basics
Models are simplified representations of reality based on available observations. Both the observations and our assumptions about the form of the existing relationships affect the model we get as an outcome of the analysis. Here you will learn the general approach for specifying and estimating a linear model in statistics.
Introduction
The goal of a model is to provide a simple low-dimensional summary of a dataset. In the context of this book we're going to use models to partition data into patterns and residuals. Strong patterns will hide subtler trends, so we'll use models to help peel back layers of structure as we explore a dataset.
However, before we can start using models on interesting, real, datasets, you need to understand the basics of how models work. For that reason, this chapter of the book is unique because it uses only simulated datasets. These datasets are very simple, and not at all interesting, but they will help you understand the essence of modelling before you apply the same techniques to real data in the next chapter.
There are two parts to a model:

First, you define a family of models that express a precise, but generic, pattern that you want to capture. For example, the pattern might be a straight line, or a quadratic curve. You will express the model family as an equation like y = a_1 * x + a_2 or y = a_1 * x ^ a_2. Here, x and y are known variables from your data, and a_1 and a_2 are parameters that can vary to capture different patterns.

Next, you generate a fitted model by finding the model from the family that is the closest to your data. This takes the generic model family and makes it specific, like y = 3 * x + 7 or y = 9 * x ^ 2.
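The distinction between a family and a fitted model can be sketched in a few lines of R. This is just an illustration: linear_family and fitted are made-up names, and the parameter values 3 and 7 are arbitrary.

```r
# A model family: the pattern (a straight line) is fixed, the parameters vary.
# Calling linear_family() with specific parameters picks one member of the family.
linear_family <- function(a_1, a_2) {
  function(x) a_1 * x + a_2
}

# A fitted model: one specific member of the family, here y = 3 * x + 7
fitted <- linear_family(3, 7)
fitted(2)
#> [1] 13
```

Fitting a model is the process of choosing which member of the family (which values of a_1 and a_2) is closest to the data.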
It's important to understand that a fitted model is just the closest model from a family of models. That implies that you have the "best" model (according to some criteria); it doesn't imply that you have a good model and it certainly doesn't imply that the model is "true". George Box puts this well in his famous aphorism:
All models are wrong, but some are useful.
It's worth reading the fuller context of the quote:
Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an "ideal" gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules.
For such a model there is no need to ask the question "Is the model true?". If "truth" is to be the "whole truth" the answer must be "No". The only question of interest is "Is the model illuminating and useful?".
The goal of a model is not to uncover truth, but to discover a simple approximation that is still useful.
Prerequisites
In this chapter we'll use the modelr package which wraps around base R's modelling functions to make them work naturally in a pipe.
library(tidyverse)
library(modelr)
options(na.action = na.warn)
A simple model
Let's take a look at the simulated dataset sim1, included with the modelr package. It contains two continuous variables, x and y. Let's plot them to see how they're related:
ggplot(sim1, aes(x, y)) + 
  geom_point()
You can see a strong pattern in the data. Let's use a model to capture that pattern and make it explicit. It's our job to supply the basic form of the model. In this case, the relationship looks linear, i.e. y = a_0 + a_1 * x. Let's start by getting a feel for what models from that family look like by randomly generating a few and overlaying them on the data. For this simple case, we can use geom_abline(), which takes a slope and intercept as parameters. Later on we'll learn more general techniques that work with any model.
models <- tibble(
  a1 = runif(250, -20, 40),
  a2 = runif(250, -5, 5)
)

ggplot(sim1, aes(x, y)) + 
  geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
  geom_point()
There are 250 models on this plot, but a lot are really bad! We need to find the good models by making precise our intuition that a good model is "close" to the data. We need a way to quantify the distance between the data and a model. Then we can fit the model by finding the value of a_0 and a_1 that generate the model with the smallest distance from this data.
One easy place to start is to find the vertical distance between each point and the model, as in the following diagram. (Note that I've shifted the x values slightly so you can see the individual distances).
This distance is just the difference between the y value given by the model (the prediction), and the actual y value in the data (the response).
To compute this distance, we first turn our model family into an R function. This takes the model parameters and the data as inputs, and gives values predicted by the model as output:
model1 <- function(a, data) {
  a[1] + data$x * a[2]
}
model1(c(7, 1.5), sim1)
#> [1]  8.5  8.5  8.5 10.0 10.0 10.0 11.5 11.5 11.5 13.0 13.0 13.0 14.5 14.5 14.5
#> [16] 16.0 16.0 16.0 17.5 17.5 17.5 19.0 19.0 19.0 20.5 20.5 20.5 22.0 22.0 22.0
Next, we need some way to compute an overall distance between the predicted and actual values. In other words, the plot above shows 30 distances: how do we collapse that into a single number?
One common way to do this in statistics is to use the "root-mean-squared deviation". We compute the difference between actual and predicted, square them, average them, and then take the square root. This distance has lots of appealing mathematical properties, which we're not going to talk about here. You'll just have to take my word for it!
measure_distance <- function(mod, data) {
  diff <- data$y - model1(mod, data)
  sqrt(mean(diff ^ 2))
}
measure_distance(c(7, 1.5), sim1)
#> [1] 2.665212
Now we can use purrr to compute the distance for all the models defined above. We need a helper function because our distance function expects the model as a numeric vector of length 2.
sim1_dist <- function(a1, a2) {
  measure_distance(c(a1, a2), sim1)
}

models <- models %>% 
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models
#> # A tibble: 250 x 3
#>       a1      a2  dist
#>    <dbl>   <dbl> <dbl>
#> 1 -15.2   0.0889  30.8
#> 2  30.1  -0.827   13.2
#> 3  16.0   2.27    13.2
#> 4 -10.6   1.38    18.7
#> 5 -19.6  -1.04    41.8
#> 6   7.98  4.59    19.3
#> # … with 244 more rows
Next, let's overlay the 10 best models on to the data. I've coloured the models by -dist: this is an easy way to make sure that the best models (i.e. the ones with the smallest distance) get the brightest colours.
ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = a1, slope = a2, colour = -dist), 
    data = filter(models, rank(dist) <= 10)
  )
We can also think about these models as observations, and visualise them with a scatterplot of a1 vs a2, again coloured by -dist.
We can no longer directly see how the model compares to the data, but we can see many models at once. Again, I've highlighted the 10 best models, this time by drawing red circles underneath them.
ggplot(models, aes(a1, a2)) +
  geom_point(data = filter(models, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))
Instead of trying lots of random models, we could be more systematic and generate an evenly spaced grid of points (this is called a grid search). I picked the parameters of the grid roughly by looking at where the best models were in the plot above.
grid <- expand.grid(
  a1 = seq(-5, 20, length = 25),
  a2 = seq(1, 3, length = 25)
) %>% 
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))

grid %>% 
  ggplot(aes(a1, a2)) +
  geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))
When you overlay the best 10 models back on the original data, they all look pretty good:
ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = a1, slope = a2, colour = -dist), 
    data = filter(grid, rank(dist) <= 10)
  )
You could imagine iteratively making the grid finer and finer until you narrowed in on the best model. But there's a better way to tackle that problem: a numerical minimisation tool called Newton-Raphson search. The intuition of Newton-Raphson is pretty simple: you pick a starting point and look around for the steepest slope. You then ski down that slope a little way, and then repeat again and again, until you can't go any lower. In R, we can do that with optim():
best <- optim(c(0, 0), measure_distance, data = sim1)
best$par
#> [1] 4.222248 2.051204

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(intercept = best$par[1], slope = best$par[2])
Don't worry too much about the details of how optim() works. It's the intuition that's important here. If you have a function that defines the distance between a model and a dataset, and an algorithm that can minimise that distance by modifying the parameters of the model, you can find the best model. The neat thing about this approach is that it will work for any family of models that you can write an equation for.
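As a sketch of that generality, the same optim() recipe can fit a completely different family, say y = a_1 * x ^ a_2, just by swapping in a new prediction function. The names model2, measure_distance2, and best2 below are made up for this illustration; they aren't part of modelr.

```r
library(modelr)  # for the sim1 dataset

# A different model family: y = a_1 * x ^ a_2
model2 <- function(a, data) {
  a[1] * data$x ^ a[2]
}

# The same root-mean-squared distance, but against model2's predictions
measure_distance2 <- function(mod, data) {
  diff <- data$y - model2(mod, data)
  sqrt(mean(diff ^ 2))
}

# optim() only needs a starting point and a distance function;
# nothing else about the recipe changes.
best2 <- optim(c(1, 1), measure_distance2, data = sim1)
best2$par
```

The fitted parameters will differ from the linear fit, but the machinery (define a family, define a distance, minimise) is identical.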
There's one more approach that we can use for this model, because it's a special case of a broader family: linear models. A linear model has the general form y = a_1 + a_2 * x_1 + a_3 * x_2 + ... + a_n * x_(n - 1). So this simple model is equivalent to a general linear model where n is 2 and x_1 is x. R has a tool specifically designed for fitting linear models called lm(). lm() has a special way to specify the model family: formulas. Formulas look like y ~ x, which lm() will translate to a function like y = a_1 + a_2 * x. We can fit the model and look at the output:
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
#> (Intercept)           x 
#>    4.220822    2.051533
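These coefficients can also be reproduced by hand. As a simplified sketch, the "normal equations" from linear algebra give the closest line in a single step (lm() itself uses a more numerically stable QR decomposition rather than this direct formula):

```r
library(modelr)  # for the sim1 dataset

# Build the design matrix: a column of 1s for the intercept, plus x
X <- cbind(1, sim1$x)
y <- sim1$y

# Normal equations: solve (X'X) b = X'y for the coefficients in one step
coefs <- solve(t(X) %*% X, t(X) %*% y)
coefs
```

The resulting values match coef(sim1_mod) above; no iterative search is involved.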
These are exactly the same values we got with optim()! Behind the scenes lm() doesn't use optim() but instead takes advantage of the mathematical structure of linear models. Using some connections between geometry, calculus, and linear algebra, lm() actually finds the closest model in a single step, using a sophisticated algorithm. This approach is both faster, and guarantees that there is a global minimum.

Source: H. Wickham and G. Grolemund, https://r4ds.had.co.nz/model-basics.html
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.