This section works through a complete linear regression analysis: presenting the data, using a scatter plot to identify a linear pattern, fitting a linear model by least squares, and making statistical inferences about the correlation coefficient and the slope parameter.
A Complete Example
Learning Objective
- To see a complete linear correlation and regression analysis, in a practical setting, as a cohesive whole.
In the preceding sections numerous concepts were introduced and illustrated, but the analysis was broken into disjoint pieces by sections. In this section we will go through a complete example of the use of correlation and regression analysis of data from start to finish, touching on all the topics of this chapter in sequence.
In general, educators are convinced that, all other factors being equal, class attendance has a significant bearing on course performance. To investigate the relationship between attendance and performance, an education researcher selects for study a multiple-section introductory statistics course at a large university. Instructors in the course agree to keep an accurate record of attendance throughout one semester. At the end of the semester 26 students are selected at random. For each student in the sample two measurements are taken: $x$, the number of days the student was absent, and $y$, the student's score on the common final exam in the course. The data are summarized in Table 10.4 "Absence and Score Data".
Table 10.4 Absence and Score Data
Absences | Score | Absences | Score |
---|---|---|---|
x | y | x | y |
2 | 76 | 4 | 41 |
7 | 29 | 5 | 63 |
2 | 96 | 4 | 88 |
7 | 63 | 0 | 98 |
2 | 79 | 1 | 99 |
7 | 71 | 0 | 89 |
0 | 88 | 1 | 96 |
0 | 92 | 3 | 90 |
6 | 55 | 1 | 90 |
6 | 70 | 3 | 68 |
2 | 80 | 1 | 84 |
2 | 75 | 3 | 80 |
1 | 63 | 1 | 78 |
A scatter plot of the data is given in Figure 10.13 "Plot of the Absence and Exam Score Pairs". There is a downward trend in the plot which indicates that on average students with more absences tend to do worse on the final examination.
Figure 10.13 Plot of the Absence and Exam Score Pairs
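As a rough numerical companion to the scatter plot, the following Python sketch groups the scores from Table 10.4 by number of absences and prints the mean score in each group. The variable names are illustrative choices, not part of the original text.

```python
from collections import defaultdict

# Data from Table 10.4: absences (x) and final exam scores (y),
# left-hand columns first, then right-hand columns.
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]

# Collect the scores observed for each number of absences.
scores_by_absences = defaultdict(list)
for xi, yi in zip(x, y):
    scores_by_absences[xi].append(yi)

# Mean score for each observed number of absences, in increasing order of x.
mean_score = {k: sum(v) / len(v) for k, v in sorted(scores_by_absences.items())}

for k, m in mean_score.items():
    print(f"{k} absences: n = {len(scores_by_absences[k])}, mean score = {m:.2f}")
```

With these data the group means fall steadily as the number of absences rises, which is the same downward trend the plot shows.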

The trend observed in Figure 10.13 "Plot of the Absence and Exam Score Pairs" as well as the fairly constant width of the apparent band of points in the plot makes it reasonable to assume a relationship between $x$ and $y$ of the form

$$y = \beta_1 x + \beta_0 + \varepsilon$$

where $\beta_1$ and $\beta_0$ are unknown parameters and $\varepsilon$ is a normal random variable with mean zero and unknown standard deviation $\sigma$. Note carefully that this model is being proposed for the population of all students taking this course, not just those taking it this semester, and certainly not just those in the sample. The numbers $\beta_1$, $\beta_0$, and $\sigma$ are parameters relating to this large population.
First we perform preliminary computations that will be needed later. The data are processed in Table 10.5 "Processed Absence and Score Data".
Table 10.5 Processed Absence and Score Data
x | y | x² | xy | y² | x | y | x² | xy | y² |
---|---|---|---|---|---|---|---|---|---|
2 | 76 | 4 | 152 | 5776 | 4 | 41 | 16 | 164 | 1681 |
7 | 29 | 49 | 203 | 841 | 5 | 63 | 25 | 315 | 3969 |
2 | 96 | 4 | 192 | 9216 | 4 | 88 | 16 | 352 | 7744 |
7 | 63 | 49 | 441 | 3969 | 0 | 98 | 0 | 0 | 9604 |
2 | 79 | 4 | 158 | 6241 | 1 | 99 | 1 | 99 | 9801 |
7 | 71 | 49 | 497 | 5041 | 0 | 89 | 0 | 0 | 7921 |
0 | 88 | 0 | 0 | 7744 | 1 | 96 | 1 | 96 | 9216 |
0 | 92 | 0 | 0 | 8464 | 3 | 90 | 9 | 270 | 8100 |
6 | 55 | 36 | 330 | 3025 | 1 | 90 | 1 | 90 | 8100 |
6 | 70 | 36 | 420 | 4900 | 3 | 68 | 9 | 204 | 4624 |
2 | 80 | 4 | 160 | 6400 | 1 | 84 | 1 | 84 | 7056 |
2 | 75 | 4 | 150 | 5625 | 3 | 80 | 9 | 240 | 6400 |
1 | 63 | 1 | 63 | 3969 | 1 | 78 | 1 | 78 | 6084 |
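Adding up the numbers in each column in Table 10.5 "Processed Absence and Score Data" gives

$$\Sigma x = 71, \quad \Sigma y = 2001, \quad \Sigma x^2 = 329, \quad \Sigma xy = 4758, \quad \Sigma y^2 = 161{,}511$$

Then, with $n = 26$,

$$SS_{xx} = \Sigma x^2 - \frac{1}{n}(\Sigma x)^2 = 329 - \frac{71^2}{26} \approx 135.1154$$

$$SS_{xy} = \Sigma xy - \frac{1}{n}(\Sigma x)(\Sigma y) = 4758 - \frac{(71)(2001)}{26} \approx -706.2692$$

$$SS_{yy} = \Sigma y^2 - \frac{1}{n}(\Sigma y)^2 = 161{,}511 - \frac{2001^2}{26} \approx 7510.9615$$

and

$$\bar{x} = \frac{71}{26} \approx 2.7308, \qquad \bar{y} = \frac{2001}{26} \approx 76.9615$$

The least squares estimates of the slope and intercept are

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-706.2692}{135.1154} \approx -5.227156, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \approx 76.9615 - (-5.227156)(2.7308) \approx 91.2357$$

Rounding these numbers to two decimal places, the least squares regression line for these data is

$$\hat{y} = -5.23x + 91.24$$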
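The column sums and the least squares estimates can be checked with a short Python sketch. It uses the usual shortcut formulas for $SS_{xx}$, $SS_{xy}$, and $SS_{yy}$; the variable names are assumptions made for this illustration.

```python
# Data from Table 10.4: absences (x) and final exam scores (y).
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]
n = len(x)

# Column sums, as in Table 10.5.
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(v * v for v in x)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_y2 = sum(v * v for v in y)

# Shortcut formulas for the sums of squares.
ss_xx = sum_x2 - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
ss_yy = sum_y2 - sum_y ** 2 / n

# Least squares slope and intercept.
beta1_hat = ss_xy / ss_xx
beta0_hat = sum_y / n - beta1_hat * sum_x / n

print(f"sums: {sum_x}, {sum_y}, {sum_x2}, {sum_xy}, {sum_y2}")
print(f"SSxx = {ss_xx:.4f}, SSxy = {ss_xy:.4f}, SSyy = {ss_yy:.4f}")
print(f"y-hat = {beta1_hat:.2f} x + {beta0_hat:.2f}")
```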
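The goodness of fit of this line to the scatter plot, the sum of its squared errors, is

$$SSE = SS_{yy} - \hat{\beta}_1 SS_{xy} = 7510.9615 - (-5.227156)(-706.2692) \approx 3819.18$$

This number is not particularly informative in itself, but we use it to compute the important statistic

$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{3819.18}{24}} \approx 12.61$$

The statistic $s_\varepsilon$ estimates the standard deviation $\sigma$ of the normal random variable $\varepsilon$ in the model. Its meaning is that, among all students with the same number of absences, the standard deviation of their final exam scores is about 12.6 points.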
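A minimal Python sketch of this step follows. It computes $SSE$ two ways, by the shortcut formula $SS_{yy} - \hat{\beta}_1 SS_{xy}$ and directly from the residuals, and then forms $s_\varepsilon = \sqrt{SSE/(n-2)}$. The variable names are illustrative assumptions.

```python
import math

# Data from Table 10.4: absences (x) and final exam scores (y).
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]
n = len(x)

ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n

beta1_hat = ss_xy / ss_xx
beta0_hat = sum(y) / n - beta1_hat * sum(x) / n

# SSE by the shortcut formula.
sse = ss_yy - beta1_hat * ss_xy

# SSE directly, as the sum of squared residuals y - y_hat.
sse_direct = sum((yi - (beta1_hat * xi + beta0_hat)) ** 2 for xi, yi in zip(x, y))

# Sample standard deviation of errors, with n - 2 degrees of freedom.
s_eps = math.sqrt(sse / (n - 2))

print(f"SSE (shortcut) = {sse:.4f}")
print(f"SSE (direct)   = {sse_direct:.4f}")
print(f"s_eps          = {s_eps:.4f}")
```

The two values of $SSE$ should agree up to floating-point rounding, which is a useful check on the hand computation.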
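The size and sign of the slope $\hat{\beta}_1 = -5.23$ indicate that, for every class missed, students tend to score about 5.23 points lower on the final exam on average.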
Since 0 is in the range of $x$-values in the data set, the $y$-intercept also has meaning in this problem. It is an estimate of the average grade on the final exam of all students who have perfect attendance. The predicted average of such students is $\hat{\beta}_0 = 91.24$.
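Before we use the regression equation further, or perform other analyses, it would be a good idea to examine the utility of the linear regression model. We can do this in two ways: 1) by computing the correlation coefficient $r$ to see how strongly the number of absences $x$ and the final exam score $y$ are correlated, and 2) by testing the null hypothesis $H_0: \beta_1 = 0$ (the slope of the population regression line is zero, so $x$ is not a useful predictor of $y$) against the natural alternative $H_a: \beta_1 < 0$ (the slope is negative, so final exam scores go down as absences go up).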
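The correlation coefficient $r$ is

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{-706.2692}{\sqrt{(135.1154)(7510.9615)}} \approx -0.70$$

a moderate negative correlation.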
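The correlation coefficient can be computed from the same sums of squares. The sketch below is one way to do it in Python using only the standard library; the names are illustrative assumptions.

```python
import math

# Data from Table 10.4: absences (x) and final exam scores (y).
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]
n = len(x)

ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n

# Linear correlation coefficient r = SSxy / sqrt(SSxx * SSyy).
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(f"r = {r:.4f}")
```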
Turning to the test of hypotheses, let us test at the commonly used 5% level of significance. The test is

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 < 0 \quad \text{at } \alpha = 0.05$$
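From Figure 12.3 "Critical Values of $t$", with $df = 26 - 2 = 24$ degrees of freedom, $t_{0.05} = 1.711$, so the rejection region is $(-\infty, -1.711]$. The value of the standardized test statistic is

$$t = \frac{\hat{\beta}_1 - 0}{s_\varepsilon / \sqrt{SS_{xx}}} = \frac{-5.227156}{12.6148 / \sqrt{135.1154}} \approx -4.817$$

which falls in the rejection region. We reject $H_0$ in favor of $H_a$. The data provide sufficient evidence, at the 5% level of significance, to conclude that $\beta_1$ is negative, meaning that as the number of absences increases, the average score on the final exam decreases.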
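The test statistic can be reproduced with the following Python sketch. It assumes a left-tailed test of $H_0: \beta_1 = 0$ at the 5% level; the critical value 1.711 for 24 degrees of freedom is copied from a $t$ table, not computed by the code, and the variable names are illustrative.

```python
import math

# Data from Table 10.4: absences (x) and final exam scores (y).
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]
n = len(x)

ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n

beta1_hat = ss_xy / ss_xx
sse = ss_yy - beta1_hat * ss_xy
s_eps = math.sqrt(sse / (n - 2))

# Assumption: t_0.05 for df = n - 2 = 24, taken from a printed t table.
T_CRIT = 1.711

# Standardized test statistic for H0: beta1 = 0.
t_stat = (beta1_hat - 0) / (s_eps / math.sqrt(ss_xx))

# Left-tailed test: reject H0 when t falls at or below -T_CRIT.
reject_h0 = t_stat <= -T_CRIT

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```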
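As already noted, the value $\hat{\beta}_1 = -5.23$ gives a point estimate of how much one additional absence is reflected in the average score on the final exam: for each additional absence the average drops by about 5.23 points. We can widen this point estimate to a confidence interval for $\beta_1$. At the 95% confidence level, Figure 12.3 "Critical Values of $t$" with $df = 24$ gives $t_{\alpha/2} = t_{0.025} = 2.064$. The 95% confidence interval for $\beta_1$ is

$$\hat{\beta}_1 \pm t_{\alpha/2}\,\frac{s_\varepsilon}{\sqrt{SS_{xx}}} = -5.23 \pm 2.064 \cdot \frac{12.6148}{\sqrt{135.1154}} \approx -5.23 \pm 2.24$$

or $(-7.47, -2.99)$. We are 95% confident that, among all students who ever take this course, for each additional class missed the average score on the final exam goes down by between 2.99 and 7.47 points.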
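A sketch of the interval computation in Python is given here. It works from unrounded intermediate values with $s_\varepsilon = \sqrt{SSE/(n-2)}$; the critical value 2.064 for 24 degrees of freedom is taken from a $t$ table and hard-coded, and the variable names are assumptions for this illustration.

```python
import math

# Data from Table 10.4: absences (x) and final exam scores (y).
x = [2, 7, 2, 7, 2, 7, 0, 0, 6, 6, 2, 2, 1,
     4, 5, 4, 0, 1, 0, 1, 3, 1, 3, 1, 3, 1]
y = [76, 29, 96, 63, 79, 71, 88, 92, 55, 70, 80, 75, 63,
     41, 63, 88, 98, 99, 89, 96, 90, 90, 68, 84, 80, 78]
n = len(x)

ss_xx = sum(v * v for v in x) - sum(x) ** 2 / n
ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
ss_yy = sum(v * v for v in y) - sum(y) ** 2 / n

beta1_hat = ss_xy / ss_xx
sse = ss_yy - beta1_hat * ss_xy
s_eps = math.sqrt(sse / (n - 2))

# Assumption: t_0.025 for df = n - 2 = 24, taken from a printed t table.
T_HALF_ALPHA = 2.064

# Margin of error and 95% confidence interval for the slope beta1.
margin = T_HALF_ALPHA * s_eps / math.sqrt(ss_xx)
lower, upper = beta1_hat - margin, beta1_hat + margin

print(f"margin = {margin:.2f}")
print(f"95% CI for beta1: ({lower:.2f}, {upper:.2f})")
```

Because the whole interval lies below zero, it is consistent with the conclusion of the hypothesis test that the slope is negative.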
This text was adapted by Saylor Academy under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License without attribution as requested by the work's original creator or licensor.