The Assumptions of Simple Linear Regression

Site: Saylor Academy
Course: CS250: Python for Data Science
Book: The Assumptions of Simple Linear Regression

Description

The key to creating any statistical model is to verify if the model actually explains the data. In addition to simple visual inspection, residuals provide a pathway for making a rigorous estimate of model accuracy when applying any form of regression. Read this overview of how residuals are applied.

Overview

How do we evaluate a model? How do we know if the model we are using is good? One way to consider these questions is to assess whether the assumptions underlying the simple linear regression model seem reasonable when applied to the dataset in question. Since the assumptions relate to the (population) prediction errors, we do this through the study of the (sample) estimated errors, the residuals.

We focus in this lesson on graphical residual analysis. When we revisit this topic in the context of multiple linear regression in Lesson 7, we'll also study some statistical tests for assessing the assumptions. We'll consider various remedies for when linear regression model assumptions fail throughout the rest of the course, particularly in Lesson 9.

Objectives

Upon completion of this lesson, you should be able to:

  • Understand why we need to check the assumptions of our model.
  • Know the things that can go wrong with the linear regression model.
  • Know how we can detect various problems with the model using a residuals vs. fits plot.
  • Know how we can detect various problems with the model using residuals vs. predictor plots.
  • Know how we can detect a certain kind of dependent error terms using residuals vs. order plots.
  • Know how we can detect non-normal error terms using a normal probability plot.

Source: The Pennsylvania State University, https://online.stat.psu.edu/stat501/lesson/4
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 License.

Background

In this lesson, we learn how to check the appropriateness of a simple linear regression model. Recall that the four conditions ("LINE") that comprise the simple linear regression model are:

  • Linear Function: The mean of the response, \mbox{E}(Y_i), at each value of the predictor, x_i, is a Linear function of the x_i.
  • Independent: The errors,  \epsilon_{i}, are Independent.
  • Normally Distributed: The errors,  \epsilon_{i}, at each value of the predictor, x_i, are Normally distributed.
  • Equal variances: The errors,  \epsilon_{i}, at each value of the predictor, x_i, have Equal variances (denoted \sigma^{2}).


An equivalent way to think of the first (linearity) condition is that the mean of the error, \mbox{E}(\epsilon_i), at each value of the predictor, x_i, is zero. An alternative way to describe all four assumptions is that the errors, \epsilon_i, are independent normal random variables with mean zero and constant variance, \sigma^2.
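The four "LINE" conditions can be made concrete by simulating data that satisfies them. Here is a minimal sketch; the parameter values are arbitrary, chosen purely for illustration:

```python
import numpy as np

# Hypothetical parameter values for illustration only.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 5.0, 2.0, 1.5
n = 100

x = rng.uniform(0, 10, size=n)
# Errors: Independent, Normally distributed, mean zero, Equal variance sigma^2.
eps = rng.normal(loc=0.0, scale=sigma, size=n)
# Linear mean function plus error: E(Y_i) = beta0 + beta1 * x_i.
y = beta0 + beta1 * x + eps
```

Every one of the diagnostics in this lesson amounts to asking whether real data could plausibly have been generated by a mechanism like this one.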

The four conditions of the model pretty much tell us what can go wrong with our model, namely:

  • The population regression function is not linear. That is, the response Y_{i} is not a linear trend ( \beta_{0} +  \beta_{1} x_i ) plus some error \epsilon_i .
  • The error terms are not independent.
  • The error terms are not normally distributed.
  • The error terms do not have equal variance.

In this lesson, we learn ways to detect the above four situations, as well as learn how to identify the following two problems:

  • The model fits all but one or a few unusual observations. That is, are there any "outliers"?
  • An important predictor variable has been left out of the model. That is, could we do better by adding a second or third predictor into the model and instead use a multiple regression model to answer our research questions?

Before jumping in, let's make sure it's clear why we have to evaluate any regression model that we formulate and subsequently estimate. In short, it's because:

  • All of the estimates, intervals, and hypothesis tests arising in a regression analysis have been developed assuming that the model is correct. That is, all the formulas depend on the model being correct!
  • If the model is incorrect, then the formulas and methods we use are at risk of being incorrect.

The good news is that some of the model conditions are more forgiving than others. So, we really need to learn when we should worry the most and when it's okay to be more carefree about model violations. Here's a pretty good summary of the situation:

  • All tests and intervals are very sensitive to even minor departures from independence.
  • All tests and intervals are sensitive to moderate departures from equal variance.
  • The hypothesis tests and confidence intervals for  \beta_{0} and  \beta_{1} are fairly "robust" (that is, forgiving) against departures from normality.
  • Prediction intervals are quite sensitive to departures from normality.

The important thing to remember is that the severity of the consequences is always related to the severity of the violation. And how much you should worry about a model violation depends on how you plan to use your regression model. For example, if all you want to do with your model is test for a relationship between x and y, i.e., test whether the slope  \beta_{1} is 0, you should be okay even if it appears that the normality condition is violated. On the other hand, if you want to use your model to predict a future response y_{\text{new}}, then you are likely to get inaccurate results if the error terms are not normally distributed.

In short, you'll need to learn how to worry just the right amount. Worry when you should, and don't ever worry when you shouldn't! And when you are worried, there are remedies available, which we'll learn more about later in the course. For example, one thing to try is transforming either the response variable, predictor variable, or both - there is an example of this in Section 4.8, and we'll see more examples in Lesson 9.

This is definitely a lesson in which you are exposed to the idea that data analysis is an art (subjective decisions!) based on science (objective tools!). We might, therefore, call data analysis "an artful science!" Let's get to it!


The basic idea of residual analysis

Recall that not all of the data points in a sample will fall right on the least squares regression line. The vertical distance between any one data point y_i and its estimated value \hat{y}_i is its observed "residual":

e_i = y_i-\hat{y}_i

Each observed residual can be thought of as an estimate of the actual unknown "true error" term:

\epsilon_i = Y_i-E(Y_i)

Let's look at an illustration of the distinction between a residual e_{i} and an unknown true error term  \epsilon_{i}. The solid line on the plot describes the true (unknown) linear relationship in the population. Most often, we can't know this line. However, if we could, the true error would be the distance from the data point to the solid line.

On the other hand, the dashed line on the plot represents the estimated linear relationship for a random sample. The residual error is the distance from the data point to the dashed line.



The observed residuals should reflect the properties assumed for the unknown true error terms. The basic idea of residual analysis, therefore, is to investigate the observed residuals to see if they behave "properly". That is, we analyze the residuals to see if they support the assumptions of linearity, independence, normality, and equal variances.
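In code, the residuals are simply the observed responses minus the fitted values. A minimal NumPy sketch, using simulated data rather than any of the course datasets:

```python
import numpy as np

# Simulated stand-in data (not the Alcohol Arm data).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 - 0.5 * x + rng.normal(0, 1, size=50)

# Least squares fit: np.polyfit with deg=1 returns [slope, intercept].
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x   # fitted values
resid = y - y_hat     # observed residuals e_i = y_i - y_hat_i
```

A useful sanity check: for a least squares fit that includes an intercept, the residuals always sum to zero (up to floating-point error), which is why the residual = 0 line is the natural reference line in the plots that follow.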

Residuals vs. Fits Plot

When conducting a residual analysis, a "residuals versus fits plot" is the most frequently created plot. It is a scatter plot of residuals on the y-axis and fitted values (estimated responses) on the x-axis. The plot is used to detect non-linearity, unequal error variances, and outliers.

Let's look at an example to see what a "well-behaved" residual plot looks like. Some researchers (Urbano-Marquez, et al., 1989) were interested in determining whether or not alcohol consumption was linearly related to muscle strength. The researchers measured the total lifetime consumption of alcohol (x) on a random sample of n = 50 alcoholic men. They also measured the strength (y) of the deltoid muscle in each person's non-dominant arm. A fitted line plot of the resulting data (Alcohol Arm data) looks like:

[Figure: strength vs. alcohol plot]



The plot suggests that there is a decreasing linear relationship between alcohol and arm strength. It also suggests that there are no unusual data points in the data set. And it illustrates that the variation around the estimated regression line is constant, suggesting that the assumption of equal error variances is reasonable.

Here's what the corresponding residuals versus fits plot looks like for the data set's simple linear regression model with arm strength as the response and level of alcohol consumption as the predictor:

[Figure: residuals vs. fitted values plot]


Note that, as defined, the residuals appear on the y-axis and the fitted values appear on the x-axis. You should be able to look back at the scatter plot of the data and see how the data points there correspond to the data points in the residual versus fits plot here. In case you're having trouble with doing that, look at the five data points in the original scatter plot that appear in red. Note that the predicted response (fitted value) of these men (whose alcohol consumption is around 40) is about 14. Also, note the pattern in which the five data points deviate from the estimated regression line.

Now, look at how and where these five data points appear in the residuals versus fits plot. Their fitted value is about 14, and their deviation from the residual = 0 line shares the same pattern as their deviation from the estimated regression line. Do you see the connection? Any data point that falls directly on the estimated regression line has a residual of 0. Therefore, the residual = 0 line corresponds to the estimated regression line.

This plot is a classic example of a well-behaved residuals vs. fits plot. Here are the characteristics of a well-behaved residuals vs. fits plot and what they suggest about the appropriateness of the simple linear regression model:

  • The residuals "bounce randomly" around the residual = 0 line. This suggests that the assumption that the relationship is linear is reasonable.
  • The residuals roughly form a "horizontal band" around the residual = 0 line. This suggests that the variances of the error terms are equal.
  • No one residual "stands out" from the basic random pattern of residuals. This suggests that there are no outliers.
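A residuals vs. fits plot like the one described above takes only a few lines of matplotlib. This sketch uses simulated data as a stand-in for the Alcohol Arm data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

# Simulated stand-in data (not the Alcohol Arm data).
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 3.0 - 0.5 * x + rng.normal(0, 1, size=50)

b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x
resid = y - fitted

fig, ax = plt.subplots()
ax.scatter(fitted, resid)
ax.axhline(0, linestyle="--")  # the residual = 0 reference line
ax.set_xlabel("Fitted value")
ax.set_ylabel("Residual")
ax.set_title("Residuals vs. fits")
fig.savefig("resid_vs_fits.png")
```

Because the simulated errors here really are independent, normal, and of equal variance, the resulting plot should show the "random bounce" and "horizontal band" described above.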


In general, you want your residual vs. fits plots to look something like the above plot. Don't forget, though, that interpreting these plots is subjective. My experience has been that students learning residual analysis for the first time tend to over-interpret these plots, looking at every twist and turn as something potentially troublesome. You'll especially want to be careful about putting too much weight on residual vs. fits plots based on small data sets. Sometimes the data sets are just too small to make interpretation of a residuals vs. fits plot worthwhile.

Don't worry! You will learn - with practice - how to "read" these plots, although you will also discover that interpreting residual plots like this is not straightforward. Humans love to seek out order in chaos and patterns in randomness. It's like looking up at the clouds in the sky - sooner or later, you start to see images of animals. Resist this tendency when doing graphical residual analysis. Unless something is pretty obvious, try not to get too excited, particularly if the "pattern" you think you are seeing is based on just a few observations. You will learn some numerical methods for supplementing the graphical analyses in Lesson 7. For now, just do the best you can, and if you're not sure if you see a pattern or not, just say that.


Try it! Residual analysis

The least squares estimates from fitting a line to the data points in the Residual dataset are b_0 = 6 and b_1 = 3. (You can check this claim, of course.)

  1. Copy x-values in, say, column C1 and y-values in column C2 of a Minitab worksheet.
  2. Using the least squares estimates, create a new column that contains the predicted values, \hat{y}_i, for each x_{i} - you can use Minitab's calculator to do this. Select Calc >> Calculator... In the box labeled "Store result in variable", specify the new column, say C3, where you want the predicted values to appear. In the box labeled Expression, type 6+3*C1. Select OK. The predicted values, \hat{y}_i, should appear in column C3. You might want to label this column "fitted". You might also convince yourself that you indeed calculated the predicted values by checking one of the calculations by hand.
  3. Now, create a new column, say C4, that contains the residual values - again, use Minitab's calculator to do this. Select Calc >> Calculator... In the box labeled "Store result in variable", specify the new column, say C4, where you want the residuals to appear. In the box labeled Expression, type C2-C3. Select OK. The residuals, e_{i}, should appear in column C4. You might want to label this column "resid". You might also convince yourself that you indeed calculated the residuals by checking one of the calculations by hand.
  4. Create a "residuals versus fits" plot, that is, a scatter plot with the residuals (e_{i}) on the vertical axis and the fitted values (\hat{y}_i) on the horizontal axis. (See Minitab Help Section - Creating a basic scatter plot). Around what horizontal line (residual = ??) do the residuals "bounce randomly?" What does this horizontal line represent?
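The same steps translate directly to Python. The x- and y-values below are illustrative stand-ins for the Residual dataset, not the actual data:

```python
import numpy as np

# Illustrative stand-in values (not the actual Residual dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.5, 11.8, 15.2, 18.1, 20.9])

b0, b1 = 6.0, 3.0      # the given least squares estimates
fitted = b0 + b1 * x   # step 2: predicted values y_hat_i = 6 + 3 * x_i
resid = y - fitted     # step 3: residuals e_i = y_i - y_hat_i

# Step 4: in a scatter plot of resid (y-axis) vs. fitted (x-axis), the
# residuals bounce around the horizontal line residual = 0, which
# corresponds to the estimated regression line itself.
```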

Residuals vs. Predictor Plot

An alternative to the residuals vs. fits plot is a "residuals vs. predictor plot". It is a scatter plot of residuals on the y-axis and the predictor (x) values on the x-axis. For a simple linear regression model, if the predictor on the x-axis is the same predictor that is used in the regression model, the residuals vs. predictor plot offers no new information to that which is already learned by the residuals vs. fits plot. On the other hand, if the predictor on the x-axis is a new and different predictor, the residuals vs. predictor plot can help to determine whether the predictor should be added to the model (and hence a multiple regression model used instead).

The interpretation of a "residuals vs. predictor plot" is identical to that of a "residuals vs. fits plot". That is, a well-behaved plot will bounce randomly and form a roughly horizontal band around the residual = 0 line. And, no data points will stand out from the basic random pattern of the other residuals.

Here's the residuals vs. predictor plot for the data set's simple linear regression model with arm strength as the response and level of alcohol consumption as the predictor:

[Figure: residuals vs. alcohol plot]

Note that, as defined, the residuals appear on the y-axis, and the predictor values - the lifetime alcohol consumption for the men - appear on the x-axis. Now, you should be able to look back at the scatter plot of the data:

[Figure: strength vs. alcohol plot]

and the residuals vs. fits plot:

[Figure: residuals vs. fitted values plot]

to see how the data points there correspond to the data points in the residuals versus predictor plot:

[Figure: residuals vs. alcohol plot]

The five red data points should help you out again. The alcohol consumption of the five men is about 40, which is why the points now appear on the "right side" of the plot. In essence, for this example, the residuals vs. predictor plot is just a mirror image of the residuals vs. fits plot. The residuals vs. predictor plot offers no new information.

Let's take a look at an example in which the residuals vs. predictor plot is used to determine whether or not another predictor should be added to the model. A researcher is interested in determining which of the following - age, weight, and duration of hypertension - are good predictors of the diastolic blood pressure of an individual with high blood pressure. The researcher measured the age (in years), weight (in pounds), duration of hypertension (in years), and diastolic blood pressure (in mm Hg) on a sample of n = 20 hypertensive individuals (Blood Pressure data).

The regression of the response diastolic blood pressure (BP) on the predictor age:
[Figure: BP vs. age plot]

suggests that there is a moderately strong linear relationship (r^2 = 43.44%) between diastolic blood pressure and age. The regression of the response diastolic blood pressure (BP) on the predictor weight:

[Figure: BP vs. weight plot]

suggests that there is a strong linear relationship (r^2 = 90.26%) between diastolic blood pressure and weight. And the regression of the response diastolic blood pressure (BP) on the predictor duration:

[Figure: BP vs. duration plot]

suggests that there is little linear association (r^2 = 8.6%) between diastolic blood pressure and the duration of hypertension. In summary, it appears as if weight has the strongest association with diastolic blood pressure, age has the second strongest association, and duration the weakest.

Let's investigate various residuals vs. predictors plots to learn whether adding predictors to any of the above three simple linear regression models is advised. Upon regressing blood pressure on age, obtaining the residuals, and plotting the residuals against the predictor weight, we obtain the following "residuals versus weight" plot:

[Figure: residuals vs. weight plot]

This "residuals versus weight" plot can be used to determine whether we should add the predictor weight to the model that already contains the predictor age. In general, if there is some non-random pattern to the plot, it indicates that it would be worthwhile adding the predictor to the model. In essence, you can think of the residuals on the y-axis as a "new response," namely the individual's diastolic blood pressure adjusted for their age. If a plot of the "new response" against a predictor shows a non-random pattern, it indicates that the predictor explains some of the remaining variability in the new (adjusted) response. Here, there is a pattern in the plot. It appears that adding the predictor weight to the model already containing age would help to explain some of the remaining variability in the response.

We haven't yet learned about multiple linear regression models - regression models with more than one predictor. But, you'll soon learn that it's a straightforward extension of simple linear regression. Suppose we fit the model with blood pressure as the response and age and weight as the two predictors. Should we also add the predictor duration to the model? Let's investigate! Upon regressing blood pressure on weight and age, obtaining the residuals, and plotting the residuals against the predictor duration, we obtain the following "residuals versus duration" plot:

[Figure: residuals vs. duration plot]

The points on the plot show no pattern or trend, suggesting that there is no relationship between the residuals and duration. That is, the residuals vs. duration plot tells us that there is no sense in adding duration to the model that already contains age and weight. Once we've explained the variation in the individuals' blood pressures by taking into account the individuals' ages and weights, no significant amount of the remaining variability can be explained by the individuals' durations.
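The "adjusted response" idea can be sketched numerically: fit the model containing the current predictors, then look at how strongly the residuals relate to a candidate predictor. The variable names below mimic the Blood Pressure data, but the numbers are simulated, not the actual dataset:

```python
import numpy as np

# Simulated stand-ins for the Blood Pressure data (n = 20 individuals).
rng = np.random.default_rng(3)
n = 20
age = rng.uniform(40, 60, n)
weight = rng.uniform(150, 220, n)
duration = rng.uniform(1, 12, n)
# Blood pressure generated from age and weight only, by construction.
bp = 30 + 0.3 * age + 0.3 * weight + rng.normal(0, 1, n)

# Fit bp ~ age + weight by least squares.
X = np.column_stack([np.ones(n), age, weight])
coef, *_ = np.linalg.lstsq(X, bp, rcond=None)
resid = bp - X @ coef  # the "new response": bp adjusted for age and weight

# A residuals vs. duration plot of (duration, resid) should show no
# pattern here; the correlation offers a quick numerical companion check.
r = np.corrcoef(resid, duration)[0, 1]
```

A correlation near zero, like a patternless plot, suggests the candidate predictor would explain little of the remaining variability once the current predictors are accounted for.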


Try it! Residual analysis

The basic idea (continued)
In the practice problems in the previous section, you created a residuals versus fits plot "by hand" for the data contained in the Residuals dataset. Now, create a residuals versus predictor plot, that is, a scatter plot with the residuals (e_i) on the y-axis and the predictor (x_i) values on the x-axis. In what way - if any - does this plot differ from the residuals versus fits plot you obtained previously?

The only difference between the plots is the scale on the horizontal axis.

Using residual plots to help identify other good predictors

To assess physical conditioning in normal individuals, it is useful to know how much energy they are capable of expending. Since the process of expending energy requires oxygen, one way to evaluate this is to look at the rate at which they use oxygen at peak physical activity. To examine the peak physical activity, tests have been designed where an individual runs on a treadmill. At specified time intervals, the speed at which the treadmill moves and the grade of the treadmill both increase. The individual is then systematically run to maximum physical capacity. The maximum capacity is determined by the individual; the person stops when unable to go any further. A researcher subjected 44 healthy individuals to such a treadmill test, collecting the following data:

  • vo_{2} (max) = a measure of oxygen consumption, defined as the volume of oxygen used per minute per kilogram of body weight
  • dur = how long, in seconds, the individual lasted on the treadmill
  • age = age, in years, of the individual
The data set Treadmill Dataset contains the data on the 44 individuals.

  1. Fit a simple linear regression model using Minitab's fitted line plot treating vo_{2} as the response y and dur as the predictor x. (See Minitab Help Section - Creating a fitted line plot). Does there appear to be a linear relationship between vo_{2} and dur?

    Yes, there appears to be a strong linear relationship between vo_{2} and dur based on the scatterplot and r-squared = 81.9%.

  2. Fit a simple linear regression model using Minitab's fitted line plot treating vo_{2} as the response y and age as the predictor x. Does there appear to be a linear relationship between vo_{2} and age?

    Yes, there appears to be a moderate linear relationship between vo_{2} and age based on the scatterplot and r-squared = 44.3%

  3. Fit a simple linear regression model using Minitab's fitted line plot treating dur as the response y and age as the predictor x. Does there appear to be a linear relationship between age and dur?

    Yes, there appears to be a moderate linear relationship between age and dur based on the scatterplot and r-squared = 43.6%.


  4. Now, fit a simple linear regression model using Minitab's regression command treating vo_{2} as the response y and dur as the predictor x. In doing so, request a residuals vs. age plot. (See Minitab Help Section - Creating residual plots). Does the residuals vs. age plot suggest that age would be an additional good predictor to add to the model to help explain some of the variation in vo_{2}?

    After fitting a simple linear regression model with vo_{2} as the response y and dur as the predictor x, the residuals vs. age plot does not suggest that age would be an additional good predictor to add to the model to help explain some of the variation in vo_{2} since there does not appear to be a strong linear trend in this plot.

  5. Now, fit a simple linear regression model using Minitab's regression command treating vo_{2} as the response y and age as the predictor x. In doing so, request a residuals vs. dur plot. Does the residuals vs. dur plot suggest that dur would be an additional good predictor to add to the model to help explain some of the variation in vo_{2}?

    After fitting a simple linear regression model with vo_{2} as the response y and age as the predictor x, the residuals vs. dur plot suggests that dur could be an additional good predictor to add to the model to help explain some of the variation in vo_{2} since there is a moderate linear trend in this plot.

  6. Summarize what is happening here.

    Of the two predictors, dur has a stronger linear association with vo_{2} than age. So, there is no benefit to adding age to a model including dur. However, there is some benefit to adding dur to a model including age. If you do this and fit a multiple linear regression model with both age and dur as the predictors then it turns out that dur is significant (at the 0.05 level) but age is not. In summary, the "best" model includes just dur, but a model with age and dur is better than a model with just age.

Identifying Specific Problems Using Residual Plots

In this section, we learn how to use residuals versus fits (or predictor) plots to detect problems with our formulated regression model. Specifically, we investigate:

  • how a non-linear regression function shows up on a residuals vs. fits plot
  • how unequal error variances show up on a residuals vs. fits plot
  • how an outlier shows up on a residuals vs. fits plot.

Note that although we will use residuals vs. fits plots throughout our discussion here, we could just as easily use residuals vs. predictor plots (provided the predictor is the one in the model).

Residuals vs. Order Plot

Recall that the second condition - the "I" condition - of the linear regression model is that the error terms are independent. In this section, we learn how to use a "residuals vs. order plot" as a way of detecting a particular form of non-independence of the error terms, namely serial correlation. If the data are obtained in a time (or space) sequence, a residuals vs. order plot helps to see if there is any correlation between the error terms that are near each other in the sequence.

The plot is only appropriate if you know the order in which the data were collected! Highlight this, underline this, circle this, ... er, on second thought, don't do that if you are reading it on a computer screen. Do whatever it takes to remember it, though - it is a very common mistake made by people new to regression analysis.

So, what is this residuals vs. order plot all about? As its name suggests, it is a scatter plot with residuals on the y-axis and the order in which the data were collected on the x-axis. Here's an example of a well-behaved residual vs. order plot:

[Figure: residuals vs. observation order plot]



The residuals bounce randomly around the residual = 0 line, as we would hope. In general, residuals exhibiting normal random noise around the residual = 0 line suggest that there is no serial correlation.
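Alongside the plot, a quick numerical companion is the lag-1 correlation of the residuals - each residual paired with the one collected just before it - which should be near zero for independent errors. A sketch contrasting independent errors with positively serially correlated ones (both simulated):

```python
import numpy as np

rng = np.random.default_rng(4)

def lag1_corr(e):
    # Correlation of each residual with its immediate predecessor in time.
    return np.corrcoef(e[:-1], e[1:])[0, 1]

# Independent errors: lag-1 correlation should be near zero.
resid_independent = rng.normal(0, 1, size=100)

# Positively serially correlated "errors" (an AR(1) process) for contrast:
# each term carries over most of the previous one, so residuals tend to be
# followed by residuals of the same sign and similar magnitude.
e = np.zeros(100)
for t in range(1, 100):
    e[t] = 0.9 * e[t - 1] + rng.normal(0, 1)
```

Plotting each series against its index reproduces the well-behaved and positively serially correlated residuals vs. order patterns discussed in this section.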

Let's take a look at examples of the different kinds of residuals vs. order plots we can obtain and learn what each tells us.


A time trend

A residuals vs. order plot that exhibits a (positive) trend, as the following plot does:

[Figure: residuals vs. observation order plot]



suggests that some of the variation in the response is due to time. Therefore, it might be a good idea to add the predictor "time" to the model. That is, you interpret this plot just as you would interpret any other residuals vs. predictor plot. It's just that here your predictor is "time".


Positive serial correlation

A residuals vs. order plot that looks like the following plot:

[Figure: residuals vs. observation order plot]



suggests that there is a "positive serial correlation" among the error terms. That is, a positive serial correlation exists when residuals tend to be followed, in time, by residuals of the same sign and about the same magnitude. The plot suggests that the assumption of independent error terms is violated.

Here is another less obvious example of a data set exhibiting positive serial correlation:




Can you see a cyclical trend -- up and then down, up and down, and up again?  Certainly, the positive serial correlation in the error terms is not as obvious here as in the previous example. These two examples taken together are a nice illustration of "the severity of the consequences is related to the severity of the violation." The violation in the previous example is much more severe than in this example. Therefore, we should expect that the consequences of using a regression model in the previous example would be much greater than using one in this example. In either case, you would be advised to move out of the realm of regression analysis and into that of "time series modeling".


Negative serial correlation

A residuals vs. order plot that looks like the following plot:

[Figure: residuals vs. observation order plot]


suggests that there is a "negative serial correlation" among the error terms. A negative serial correlation exists when residuals of one sign tend to be followed, in time, by residuals of the opposite sign. What? Can't you see it? If you connect the dots in order from left to right, you should be able to see the pattern:



Negative, positive, negative, positive, negative, positive, and so on. The plot suggests that the assumption of independent error terms is violated. If you obtain a residuals vs. order plot that looks like this, you would again be advised to move out of the realm of regression analysis and into that of "time series modeling".

Normal Probability Plot of Residuals

Recall that the third condition - the "N" condition - of the linear regression model is that the error terms are normally distributed. In this section, we learn how to use a "normal probability plot of the residuals" as a way of learning whether it is reasonable to assume that the error terms are normally distributed.

Here's the basic idea behind any normal probability plot: if the data follow a normal distribution with mean \mu and variance \sigma^{2}, then a plot of the theoretical percentiles of the normal distribution versus the observed sample percentiles should be approximately linear. Since we are concerned about the normality of the error terms, we create a normal probability plot of the residuals. If the resulting plot is approximately linear, we proceed to assume that the error terms are normally distributed.

The theoretical p-th percentile of any normal distribution is the value such that p% of the measurements fall below the value.



The problem is that to determine the percentile value of a normal distribution, you need to know the mean \mu and the variance \sigma^{2}. And, of course, the parameters \mu and \sigma^{2} are typically unknown. Statistical theory says it's okay just to assume that \mu = 0 and \sigma^{2} = 1. Once you do that, determining the percentiles of the standard normal curve is straightforward. The p-th percentile value reduces to just a "Z-score" (or "normal score").



The sample p-th percentile of any data set is, roughly speaking, the value such that p% of the measurements fall below the value. For example, the median, which is just a special name for the 50th percentile, is the value such that 50%, or half, of your measurements fall below the value. Now, if you are asked to determine the 27th percentile, you take your ordered data set, and you determine the value so that 27% of the data points in your dataset fall below the value. And so on.

Consider a simple linear regression model fit to a simulated dataset with 9 observations, so that we're considering the 10th, 20th, ..., 90th percentiles. A normal probability plot of the residuals is a scatter plot with the theoretical percentiles of the normal distribution on the x-axis and the sample percentiles of the residuals on the y-axis, for example:

[Figure: normal probability plot of the residuals]



The diagonal line (which passes through the lower and upper quartiles of the theoretical distribution) provides a visual aid to help assess whether the relationship between the theoretical and sample percentiles is linear.

Note that the relationship between the theoretical percentiles and the sample percentiles is approximately linear. Therefore, the normal probability plot of the residuals suggests that the error terms are indeed normally distributed.
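Constructing such a plot by hand for nine residuals follows the recipe above exactly: sort the residuals, pair them with the 10th through 90th theoretical percentiles of the standard normal, and check for linearity. The residual values in this sketch are illustrative, not from any course dataset:

```python
import numpy as np
from statistics import NormalDist  # standard library, Python 3.8+

# Nine illustrative residuals from a fitted simple linear regression.
resid = np.array([-1.4, -0.9, -0.5, -0.2, 0.1, 0.3, 0.7, 1.0, 1.6])

n = len(resid)
sample_pct = np.sort(resid)              # ordered residuals = sample percentiles
probs = np.arange(1, n + 1) / (n + 1)    # 0.1, 0.2, ..., 0.9
# Theoretical percentiles of the standard normal ("normal scores").
theo_pct = np.array([NormalDist().inv_cdf(p) for p in probs])

# Plotting theo_pct (x-axis) vs. sample_pct (y-axis) gives the normal
# probability plot; a correlation near 1 indicates an approximately
# linear relationship, supporting the normality condition.
r = np.corrcoef(theo_pct, sample_pct)[0, 1]
```

Here the residuals were chosen to be roughly symmetric, so the pairing is close to linear; skewed or heavy-tailed residuals would bend the plot away from the straight line, as the examples below show.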

Statistical software sometimes provides normality tests to complement the visual assessment available in a normal probability plot. Different software packages sometimes switch the axes for this plot, but its interpretation remains the same.

Let's take a look at examples of the different kinds of normal probability plots we can obtain and learn what each tells us.


Normally distributed residuals

Histogram

The following histogram of residuals suggests that the residuals (and hence the error terms) are normally distributed:

histogram of residuals



Normal Probability Plot

The normal probability plot of the residuals is approximately linear, supporting the condition that the error terms are normally distributed.

normal probability plot


Normal residuals but with one outlier

Histogram

The following histogram of residuals suggests that the residuals (and hence the error terms) are normally distributed. But, there is one extreme outlier (with a value larger than 4):

histogram of residuals



Normal Probability Plot

Here's the corresponding normal probability plot of the residuals:

normal probability plot


This is a classic example of what a normal probability plot looks like when the residuals are normally distributed, but there is just one outlier. The relationship is approximately linear with the exception of one data point. We could proceed with the assumption that the error terms are normally distributed upon removing the outlier from the data set.
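A quick numeric screen for such an outlier is to standardize the residuals and flag any that fall far from the rest. This is only an illustrative sketch with hypothetical residual values; the two-standard-deviation cutoff is a rough screening rule, not a formal outlier test:

```python
from statistics import fmean, stdev

# Hypothetical residuals with one extreme value (> 4, as in the histogram above).
residuals = [-1.2, -0.7, -0.4, -0.1, 0.0, 0.3, 0.5, 0.9, 4.6]

mu, sd = fmean(residuals), stdev(residuals)

# Flag residuals more than two standard deviations from the mean
# (a rough screening cutoff, not a formal outlier test).
flagged = [r for r in residuals if abs(r - mu) > 2 * sd]
print(flagged)  # → [4.6]
```

Note that a single extreme residual also inflates the standard deviation itself, which is one reason visual inspection of the normal probability plot remains valuable.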


Skewed residuals

Histogram

The following histogram of residuals suggests that the residuals (and hence the error terms) are not normally distributed. On the contrary, the distribution of the residuals is quite skewed.

histogram of residuals



Normal Probability Plot

Here's the corresponding normal probability plot of the residuals:

normal probability plot


This is a classic example of what a normal probability plot looks like when the residuals are skewed. Clearly, the condition that the error terms are normally distributed is not met.


Heavy-tailed residuals

Histogram

The following histogram of residuals suggests that the residuals (and hence the error terms) are not normally distributed. There are too many extreme positive and negative residuals. We say the distribution is "heavy-tailed".

histogram of residuals



Normal Probability Plot

Here's the corresponding normal probability plot of the residuals:

normal probability plot



The relationship between the sample percentiles and theoretical percentiles is not linear. Again, the condition that the error terms are normally distributed is not met.

Assessing Linearity by Visual Inspection

The first simple linear regression model condition concerns linearity: the mean of the response at each predictor value should be a linear function of the predictor. The neat thing about simple linear regression - in which there is a response y and just one predictor x - is that we can get a good feel for this condition just by looking at a simple scatter plot (so, in this case, we don't even need to look at a residual plot). Let's start by looking at three different examples.


Skin Cancer and Mortality

Do the data suggest that a linear function is adequate in describing the relationship between skin cancer mortality and latitude (Skin Cancer dataset)?

mortality vs latitude plot


The answer is yes! It appears as if the relationship between latitude and skin cancer mortality is indeed linear, and therefore it would be best if we summarized the trend in the data using a linear function.


Alligators

The length of an alligator can be estimated fairly accurately from aerial photographs or from a boat. Estimating the weight of the alligator, however, is a much greater challenge. One approach is to use a regression model that summarizes the trend between the length and weight of alligators. The length of an alligator obtained from an aerial photograph or boat can then be used to predict the weight of the alligator. In taking this approach, some wildlife biologists captured a random sample of n = 25 alligators. They measured the length (x, in inches) and weight (y, in pounds) of each alligator. (Alligator dataset)

Do the resulting data suggest that a linear function is adequate in describing the relationship between the length and weight of an alligator?

weight vs length plot


The answer is no! Don't you think a curved function would more adequately describe the trend? The scatter plot gives us a pretty good indication that a linear model is inadequate in this case.


Alloy Corrosion

Thirteen (n = 13) alloy specimens composed of 90% copper and 10% nickel - each with a specific iron content - were tested for corrosion. Each specimen was rotated in salty seawater at 30 feet per second for 60 days. Corrosion was measured as weight loss in milligrams per square decimeter per day. The researchers were interested in studying the relationship between iron content (x) and weight loss due to corrosion (y). (Corrosion dataset)

Do the resulting data that appear in the following plot suggest that a linear function is adequate in describing the relationship between iron content and weight loss due to corrosion?

weight loss vs iron content plot


The answer is yes! As in the first example, our visual inspection of the data suggests that a linear model would be adequate in describing the trend between iron content and weight loss due to corrosion.


Try It! Visual inspection of plots

  1. Income and time to the first child. The Income and Birth data set contains husbands' annual incomes (inc, in dollars) and the time (time, in months) between marriage and first child for n = 20 couples. (As you can tell by the incomes, the data set is rather old!)
    1. Create a fitted line plot treating time as the response and inc as the predictor. (See Minitab Help: Creating a fitted line plot).
    2. Looking at the plot, is a linear function adequate in describing the relationship between inc and time? Explain your answer.

No, the data displays a curvilinear relationship between Y = time and X = inc.

scatterplot


  2. Bluegill fish. The Blue Gills dataset contains the lengths (in mm) and ages (in years) of n = 78 bluegill fish.
    1. Create a fitted line plot treating length as the response and age as the predictor.
    2. Looking at the plot, is a linear function adequate in describing the relationship between age and length? Explain your answer.

Probably not, because the growth pattern seems steeper than the fitted line for ages 1-4, and then length seems to level out for ages 5-6.

scatterplot


  3. Gesell adaptive scores. The Adaptive dataset contains the Gesell adaptive scores and ages (in months) of n = 21 children with cyanotic heart disease.

    1. Create a fitted line plot treating score as the response and age as the predictor.
    2. Looking at the plot, is a linear function adequate in describing the relationship between age and score? Explain your answer.

The linear function describes the relationship reasonably well for most of the data points but seems strongly influenced by the point for age = 42 at the far right, and the point with score = 120 at the top does not seem to fit the overall trend very well.

scatterplot

Further Examples

Example 4-1: Residual Plot with No Apparent Problems

Below is a plot of residuals versus fits after a straight-line model was used on data for y = handspan (cm) and x = height (inches) for n = 167 students (Hand and Height dataset).

graph

Interpretation: This plot looks good in that the variance is roughly the same all the way across, and there are no worrisome patterns. There seem to be no difficulties with the model or data.


Example 4-2: Residual Plot Resulting from Using the Wrong Model

Below is a plot of residuals versus fits after a straight-line model was used on data for y = concentration of a chemical solution and x = time after the solution was made (Solutions Concentration dataset).

graph

Interpretation: This plot of residuals versus fits shows two difficulties. First, the pattern is curved, which indicates that the wrong type of equation was used. Second, the variance (vertical spread) increases as the fitted values (predicted values) increase.
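The curvature symptom is easy to reproduce with a small hypothetical example: fit a straight line to data that actually follow a curve, and the residuals show a systematic sign pattern (positive at both ends, negative in the middle) rather than random scatter. A stdlib-only sketch:

```python
# Hypothetical data following a curved (quadratic) trend.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

# Ordinary least squares slope and intercept for a straight-line fit.
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = ybar - slope * xbar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# A curved trend leaves a telltale sign pattern in the residuals.
print([round(r) for r in residuals])  # → [12, 4, -2, -6, -8, -8, -6, -2, 4, 12]
```

A residuals-vs-fits plot of these values would show exactly the kind of U-shaped pattern described above.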


Example 4-3: Indications that Assumption of Constant Variance is Not Valid

Below is a plot of residuals versus fits after a straight-line model was used on data for y = sale price of a home and x = square foot area of the home (Real estate dataset).

graph

Interpretation: This plot of residuals versus fits shows that the residual variance (vertical spread) increases as the fitted values (predicted values of the sale price) increase. This violates the assumption of constant error variance.


Example 4-4: Indications that Assumption of Normal Distribution for Errors is Valid

The graphs below are a histogram and a normal probability plot of the residuals after a straight-line model was used to fit y = time to next eruption and x = duration of the last eruption for Old Faithful geyser eruptions (Old Faithful dataset).

graph

graph

Interpretation: The histogram is roughly bell-shaped so it is an indication that it is reasonable to assume that the errors have a normal distribution. The pattern of the normal probability plot is straight, so this plot also provides evidence that it is reasonable to assume that the errors have a normal distribution.
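The "straightness" of a normal probability plot can also be summarized with a single number: the correlation between the ordered residuals and their normal scores, with values near 1 supporting normality. This stdlib-only sketch illustrates the idea behind such numeric normality checks; it is not the exact test any particular software package runs, and the residual values are hypothetical:

```python
from statistics import NormalDist, fmean

def probability_plot_correlation(residuals):
    """Correlation between ordered residuals and their normal scores.

    Values near 1 indicate an approximately straight normal probability
    plot; this is an illustrative sketch, not a formal packaged test.
    """
    n = len(residuals)
    sample = sorted(residuals)
    scores = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    sbar, zbar = fmean(sample), fmean(scores)
    num = sum((s - sbar) * (z - zbar) for s, z in zip(sample, scores))
    den = (sum((s - sbar) ** 2 for s in sample)
           * sum((z - zbar) ** 2 for z in scores)) ** 0.5
    return num / den

# Hypothetical, roughly bell-shaped residuals give a correlation near 1.
print(round(probability_plot_correlation([-1.5, -0.8, -0.3, 0.0, 0.2, 0.7, 1.4]), 3))
```

Heavy-tailed or skewed residuals, by contrast, pull this correlation noticeably below 1.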


Example 4-5: Indications that Assumption of Normal Distribution for Errors is Not Valid

Below is a normal probability plot for the residuals from a straight-line regression with y = infection risk in a hospital and x = average length of stay in the hospital. The observational units are hospitals, and the data are taken from regions 1 and 2 (Infection Risk dataset).

graph

Interpretation: The plot shows some deviation from the straight-line pattern indicating a distribution with heavier tails than a normal distribution.


Example 4-6: Stopping Distance Data

We investigate how transforming y can sometimes help us with nonconstant variance problems. We will look at the stopping distance data with y = stopping distance of a car and x = speed of the car when the brakes were applied (Car Stopping data). A graph of the data is given below.

graph

Fitting a simple linear regression model to these data leads to problems with both curvature and nonconstant variance. One possible remedy is to transform y. With some trial and error, we find that there is an approximately linear relationship between \sqrt{y} and x, with no suggestion of nonconstant variance.

The Minitab output below gives the regression equation for square root distance on speed along with predicted values and prediction intervals for speeds of 10, 20, 30, and 40 mph. The predictions are for the square root of stopping distance.


Regression Equation

sqrtdist = 0.918 + 0.253 Speed

Speed   Fit     95% PI
10      3.44    (1.98, 4.90)
20      5.97    (4.52, 7.42)
30      8.50    (7.03, 9.97)
40      11.03   (9.53, 12.52)

Then, the output below shows predicted values and prediction intervals when we square the results (i.e., transform back to the scale of the original data).

Speed   Fit      95% PI
10      11.83    (3.92, 24.01)
20      35.64    (20.43, 55.06)
30      72.25    (49.42, 99.40)
40      121.66   (90.82, 156.75)

Notice that the predicted values coincide more or less with the average pattern in the scatterplot of speed and stopping distance above. Also, notice that the prediction intervals for stopping distance are becoming increasingly wide as speed increases. This reflects the nonconstant variance in the original data.
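The prediction steps above can be sketched directly from the rounded coefficients in the regression output (0.918 and 0.253). Because the coefficients are rounded, the squared fits differ slightly from Minitab's table (e.g. 11.89 here versus 11.83 there):

```python
def fitted_sqrt_distance(speed):
    # Rounded regression equation from the output above:
    # sqrtdist = 0.918 + 0.253 * Speed
    return 0.918 + 0.253 * speed

# Predict on the square-root scale, then square to return to the
# original stopping-distance scale.
for speed in (10, 20, 30, 40):
    fit = fitted_sqrt_distance(speed)
    print(f"speed {speed}: sqrt-scale fit {fit:.2f}, stopping distance {fit ** 2:.2f}")
```

The same back-transformation applied to the prediction interval endpoints produces the increasingly wide intervals shown in the second table, reflecting the nonconstant variance in the original data.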