At various points throughout the course, it will be necessary to build a statistical model that can be used to estimate or forecast a value based on a set of given values. When one has data where a dependent variable is assumed to depend upon a set of independent variables, linear regression is often applied as the first step in data analysis. This is because parameters for the linear model are very tractable and easily interpreted. You will be implementing this model in various ways using Python.
The Regression Equation
Data rarely fit a straight line exactly. Usually, you must be satisfied with rough predictions. Typically, you have a set of data whose scatter plot appears to fit a straight line. The line we fit to such data is called a line of best fit or least-squares regression line.
Collaborative Exercise
If you know a person's pinky (smallest) finger length, do you think you could predict that person's height? Collect data from your class (pinky finger length, in inches). The independent variable, x, is pinky finger length and the dependent variable, y, is height. For each set of data, plot the points on graph paper. Make your graph big enough and use a ruler. Then, by eye, draw a line that appears to fit the data. For your line, pick two convenient points and use them to find the slope of the line. Find the y-intercept of the line by extending your line so it crosses the y-axis. Using the slopes and the y-intercepts, write your equation of best fit. Do you think everyone will have the same equation? Why or why not? According to your equation, what is the predicted height for a pinky length of 2.5 inches?
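If you would like to check your hand-drawn line in Python, the short sketch below computes the slope and y-intercept from two points read off an eyeballed line and then predicts the height for a pinky length of 2.5 inches. The two points used here are purely hypothetical; substitute the points you picked from your own graph.

```python
# A minimal sketch of the "by eye" line from the exercise, assuming two
# hypothetical points read off a hand-drawn line: (2.0, 61) and (3.0, 69),
# i.e., (pinky length in inches, height in inches). Your points will differ.
x1, y1 = 2.0, 61.0
x2, y2 = 3.0, 69.0

slope = (y2 - y1) / (x2 - x1)    # rise over run between the two chosen points
intercept = y1 - slope * x1      # extend the line back to where it crosses the y-axis

print(f"eyeball line: y = {intercept:.1f} + {slope:.1f}x")
print(f"predicted height for a 2.5-inch pinky: {intercept + slope * 2.5:.1f} inches")
```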
Example 12.5
A random sample of 11 statistics students produced the data in Table 12.1, where x is the third exam score out of 80 and y is the final exam score out of 200. Can you predict the final exam score of a random student if you know the third exam score?
x (third exam score) | y (final exam score) |
---|---|
65 | 175 |
67 | 133 |
71 | 185 |
71 | 163 |
66 | 126 |
75 | 198 |
67 | 153 |
70 | 163 |
71 | 159 |
69 | 151 |
69 | 159 |
Table 12.1

Try It 12.5
SCUBA divers have maximum dive times they cannot exceed when going to different depths. The data in Table 12.2 show different depths (in feet) with the maximum dive times (in minutes). Use these data to find the least-squares regression line and predict the maximum dive time for a depth of 110 feet.
x (depth, in feet) | y (maximum dive time, in minutes) |
---|---|
50 | 80 |
60 | 55 |
70 | 45 |
80 | 35 |
90 | 25 |
100 | 22 |
Table 12.2
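If you would rather work this Try It in Python than on a calculator, one possible approach is sketched below: fit a least-squares line to the depth/dive-time data in Table 12.2 with scipy.stats.linregress, then evaluate it at an example depth of 110 feet.

```python
# Sketch: least-squares fit for the Try It 12.5 dive data (Table 12.2),
# then an example prediction at a depth of 110 feet.
from scipy import stats

depth = [50, 60, 70, 80, 90, 100]        # x, depth in feet
max_time = [80, 55, 45, 35, 25, 22]      # y, maximum dive time in minutes

fit = stats.linregress(depth, max_time)
print(f"y-hat = {fit.intercept:.2f} + {fit.slope:.2f}x")
print(f"predicted maximum dive time at 110 ft: "
      f"{fit.intercept + fit.slope * 110:.1f} minutes")
```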
The third exam score, x, is the independent variable, and the final exam score, y, is the dependent variable. We will plot a regression line that best fits the data. If each of you were to fit a line by eye, you would draw different lines. We can obtain a line of best fit using either the median–median line approach or by calculating the least-squares regression line.
x (third exam score) | y (final exam score) |
---|---|
65 | 175 |
66 | 126 |
67 | 133 |
67 | 153 |
69 | 151 |
69 | 159 |
70 | 163 |
71 | 159 |
71 | 163 |
71 | 185 |
75 | 198 |
Table 12.3
Group | x (third exam score) | y (final exam score) | Median x value | Median y value |
---|---|---|---|---|
1 | 65, 66, 67, 67 | 126, 133, 153, 175 | 66.5 | 143 |
2 | 69, 69, 70 | 151, 159, 163 | 69 | 159 |
3 | 71, 71, 71, 75 | 159, 163, 185, 198 | 71 | 174 |
Table 12.4
When this is completed, we can write the ordered pairs for the median values. This allows us to find the slope and y-intercept of the median–median line.
The ordered pairs are (66.5, 143), (69, 159), and (71, 174).
The slope can be calculated using the formula

m = (y₃ − y₁)/(x₃ − x₁),

where (x₁, y₁) and (x₃, y₃) are the median points of the first and third groups. Substituting the median x and y values from the first and third groups gives

m = (174 − 143)/(71 − 66.5) = 31/4.5 ≈ 6.9.

The y-intercept may be found using the formula

b = [(sum of the median y values) − m(sum of the median x values)]/3.

The sum of the median x values is 206.5, and the sum of the median y values is 476. Substituting these sums and the slope into the formula gives

b = [476 − 6.9(206.5)]/3 ≈ –316.3.

The line of best fit is represented as ŷ = mx + b. Thus, the equation can be written as ŷ = 6.9x – 316.3.

The median–median line may also be found using your graphing calculator. You can enter the x and y values into two separate lists; choose Stat, Calc, Med-Med, and press Enter. The slope, a, and y-intercept, b, will be provided. The calculator shows a slight deviation from the previous manual calculation as a result of rounding. Rounding to the nearest tenth, the calculator gives the median–median line as ŷ = 6.9x – 315.5.
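If you are working in Python rather than on a graphing calculator, the same median–median construction can be reproduced directly. The sketch below groups the sorted data as in Table 12.4, takes the group medians with NumPy, and applies the slope and y-intercept formulas used above.

```python
# Sketch: median-median line for the grouped exam data (Tables 12.3-12.4).
import numpy as np

# The three groups from Table 12.4: (x values, y values)
group1 = ([65, 66, 67, 67], [126, 133, 153, 175])
group2 = ([69, 69, 70],     [151, 159, 163])
group3 = ([71, 71, 71, 75], [159, 163, 185, 198])

# Median point (x, y) of each group
med = [(np.median(g[0]), np.median(g[1])) for g in (group1, group2, group3)]

# Slope from the first and third median points
m = (med[2][1] - med[0][1]) / (med[2][0] - med[0][0])

# Intercept: (sum of median y values - slope * sum of median x values) / 3,
# which is the average of the three single-point intercepts
b = (sum(y for _, y in med) - m * sum(x for x, _ in med)) / 3

print(f"median-median line: y-hat = {b:.1f} + {m:.1f}x")
```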
The symbol ŷ is read "y hat" and is the estimated value of y; it is the value of y obtained using the regression line. It is not generally equal to the observed y from the data, but it is still important because it can be used to make predictions for other values of x.

The term y₀ − ŷ₀ = ε₀ is called the error or residual. It is not an error in the sense of a mistake; its absolute value measures the vertical distance between the actual data point and the corresponding point predicted by the line.

If the observed data point lies above the line, the residual is positive and the line underestimates the actual data value for y. If the observed data point lies below the line, the residual is negative and the line overestimates the actual data value for y.

In Figure 12.6, the residual y₀ − ŷ₀ = ε₀ is shown for one of the data points.
For each data point, you can calculate the residual or error, yᵢ − ŷᵢ = εᵢ, for i = 1, 2, ..., 11.

Each |εᵢ| is a vertical distance from the observed data point to the line.

For the example about the third exam scores and the final exam scores for the 11 statistics students, there are 11 data points. Therefore, there are 11 residuals, ε₁, ε₂, ..., ε₁₁. If you square each residual and add the squares, you get ε₁² + ε₂² + ... + ε₁₁².
This is called the sum of squared errors (SSE).
Using calculus, you can determine the values of a and b that make the SSE a minimum. When you make the SSE a minimum, you have determined the points that are on the line of best fit. It turns out that the line of best fit has the equation ŷ = a + bx,
where

a = ȳ − bx̄

and

b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)².

The sample means of the x values and the y values are x̄ and ȳ, respectively.
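These formulas translate directly into Python. The sketch below computes b and a for the 11 exam-score pairs in Table 12.1 from the sample means, and also evaluates the SSE that this line minimizes; it is a hand-rolled illustration rather than the only way to fit the line.

```python
# Sketch: least-squares slope and intercept from the formulas above,
# applied to the exam data in Table 12.1.
import numpy as np

x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69])              # third exam scores
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159])   # final exam scores

x_bar, y_bar = x.mean(), y.mean()

b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
a = y_bar - b * x_bar                                             # y-intercept

y_hat = a + b * x
sse = np.sum((y - y_hat) ** 2)   # the sum of squared errors this line minimizes

print(f"y-hat = {a:.2f} + {b:.2f}x   (SSE = {sse:.1f})")
```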
Least-Squares Criteria for Best Fit
The criterion for the best-fit line is that the sum of squared errors is made as small as possible; any other line you might choose would have a larger SSE than the best-fit line.
Note
Computer spreadsheets, statistical software, and many calculators can quickly calculate the best-fit line and create its graph.
Third Exam vs. Final Exam Example
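For the third-exam/final-exam data, the least-squares (best-fit) line works out to ŷ = –173.51 + 4.83x, as the calculator output later in this section confirms. Once the equation is known, making a prediction is a one-line calculation; the sketch below evaluates the line at a third-exam score of 73, a value chosen only for illustration.

```python
# Sketch: using the best-fit line y-hat = -173.51 + 4.83x to predict a
# final exam score. The third-exam score of 73 is a hypothetical example.
a, b = -173.51, 4.83

def predict_final(third_exam_score):
    """Estimated final exam score for a given third exam score."""
    return a + b * third_exam_score

print(predict_final(73))   # about 179.1
```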

Understanding and Interpreting the y-intercept
The y-intercept, a, of the line describes where the plot line crosses the y-axis. The y-intercept of the best-fit line tells us the best value of the relationship when x is zero. In some cases, it does not make sense to figure out what y is when x = 0. For example, in the third exam vs. final exam example, the y-intercept occurs when the third exam score, or x, is zero. Since all the scores are grouped around a passing grade, there is no need to figure out what the final exam score, or y, would be when the third exam score was zero.

However, the y-intercept is very useful in many cases. In many science experiments, the y-intercept gives the baseline reading when the experimental conditions aren't applied to the experimental system. This baseline indicates how much the experimental condition affects the system. It can also be used to check that equipment and measurements are calibrated properly before starting the experiment.
Concentration (mM) | Absorbance (mAU) |
---|---|
125 | 0.021 |
250 | 0.023 |
500 | 0.068 |
750 | 0.086 |
1,000 | 0.105 |
1,500 | 0.124 |
2,000 | 0.146 |
Table 12.5
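As an illustration of a useful y-intercept, a straight line can be fit to the calibration data in Table 12.5; the fitted intercept then estimates the instrument's baseline absorbance when the concentration is zero. The sketch below uses NumPy's polyfit for the fit.

```python
# Sketch: calibration-curve fit for Table 12.5; the intercept estimates the
# baseline absorbance reading at zero concentration.
import numpy as np

concentration = np.array([125, 250, 500, 750, 1000, 1500, 2000])          # mM
absorbance = np.array([0.021, 0.023, 0.068, 0.086, 0.105, 0.124, 0.146])  # mAU

slope, intercept = np.polyfit(concentration, absorbance, deg=1)
print(f"absorbance = {intercept:.4f} + {slope:.6f} * concentration")
print(f"baseline reading at zero concentration: about {intercept:.4f} mAU")
```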

Understanding Slope
The slope of the line, b, describes how changes in the variables are related. It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English.

Interpretation of the Slope: The slope of the best-fit line tells us how the dependent variable (y) changes, on average, for every one-unit increase in the independent variable (x). For the third-exam/final-exam example, b = 4.83, so we estimate that the final exam score increases by about 4.83 points, on average, for each additional point earned on the third exam.
Using the TI-83, 83+, 84, 84+ Calculator
Using the Linear Regression T Test: LinRegTTest
- In the STAT list editor, enter the x data in list L1 and the y data in list L2, paired so that the corresponding (x, y) values are next to each other in the lists. (If a particular pair of values is repeated, enter it as many times as it appears in the data.)
- On the STAT TESTS menu, scroll down and select LinRegTTest. (Be careful to select LinRegTTest. Some calculators may also have a different item called LinRegTInt.)
- On the LinRegTTest input screen, enter Xlist: L1, Ylist: L2, and Freq: 1.
- On the next line, at the prompt β or ρ, highlight ≠ 0 and press ENTER.
- Leave the line for RegEQ: blank.
- Highlight Calculate and press ENTER.

The output screen contains a lot of information. For now, let's focus on a few items from the output and return to the other items later.
The second line says y = a + bx. Scroll down to find the values a = –173.513 and b = 4.8273.
The equation of the best-fit line is ŷ = –173.51 + 4.83x.
The two items at the bottom are r2 = .43969 and r = .663. For now, just note where to find these values; we examine them in the next two sections.
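If you are working in Python instead of on a TI calculator, scipy.stats.linregress reports the same quantities that the LinRegTTest output screen shows: the intercept a, the slope b, r (and therefore r²), and the p-value for the test that the slope is zero.

```python
# Sketch: a Python counterpart to LinRegTTest for the exam data.
from scipy import stats

x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]             # third exam scores
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]  # final exam scores

result = stats.linregress(x, y)
print(f"a (intercept) = {result.intercept:.3f}")    # about -173.513
print(f"b (slope)     = {result.slope:.4f}")        # about 4.8273
print(f"r             = {result.rvalue:.4f}")       # about 0.6631
print(f"r^2           = {result.rvalue ** 2:.4f}")  # about 0.4397
print(f"p-value       = {result.pvalue:.4f}")       # for the test that the slope is 0
```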
Graphing the Scatter Plot and Regression Line
- We are assuming the x data are already entered in list L1 and the y data are in list L2.
- Press 2nd STATPLOT ENTER to use Plot 1.
- On the input screen for PLOT 1, highlight On, and press ENTER.
- For TYPE, highlight the first icon, which is the scatter plot, and press ENTER.
- Indicate Xlist: L1 and Ylist: L2.
- For Mark, it does not matter which symbol you highlight.
- Press the ZOOM key and then the number 9 (for menu item ZoomStat); the calculator fits the window to the data.
- To graph the best-fit line, press the Y= key and type the equation –173.5 + 4.83X into equation Y1. (The X key is immediately left of the STAT key.) Press ZOOM 9 again to graph it.
- Optional: If you want to change the viewing window, press the WINDOW key. Enter your desired window using Xmin, Xmax, Ymin, and Ymax.
NOTE
Another way to graph the line after you create a scatter plot is to use LinRegTTest.
- Make sure you have done the scatter plot. Check it on your screen.
- Go to LinRegTTest and enter the lists.
- At RegEq, press VARS and arrow over to Y-VARS. Press 1 for 1:Function. Press 1 for 1:Y1. Then, arrow down to Calculate and do the calculation for the line of best fit.
- Press Y= (you will see the regression equation).
- Press GRAPH, and the line will be drawn.
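A Python version of the same picture can be drawn with Matplotlib: scatter the data, then overlay the best-fit line, as sketched below.

```python
# Sketch: scatter plot of the exam data with the best-fit line overlaid.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69])
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159])

plt.scatter(x, y, label="data")

x_line = np.linspace(x.min(), x.max(), 100)
plt.plot(x_line, -173.51 + 4.83 * x_line, color="red",
         label="y-hat = -173.51 + 4.83x")

plt.xlabel("Third exam score")
plt.ylabel("Final exam score")
plt.legend()
plt.show()
```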
The Correlation Coefficient r
Besides looking at the scatter plot and seeing that a line seems reasonable, how can you determine whether the line is a good predictor? Use the correlation coefficient as another indicator (besides the scatter plot) of the strength of the relationship between x and y.

The correlation coefficient, r, developed by Karl Pearson during the early 1900s, is numeric and provides a measure of the strength and direction of the linear association between the independent variable x and the dependent variable y.
If you suspect a linear relationship between x and y, then r can measure the strength of the linear relationship.
What the Value of r Tells Us
- The value of r is always between –1 and +1. In other words, –1 ≤ r ≤ 1.
- The size of the correlation r indicates the strength of the linear relationship between x and y. Values of r close to –1 or to +1 indicate a stronger linear relationship between x and y.
- If r = 0, there is absolutely no linear relationship between x and y (no linear correlation).
- If r = 1, there is perfect positive correlation. If r = –1, there is perfect negative correlation. In both these cases, all the original data points lie on a straight line. Of course, in the real world, this does not generally happen.
What the Sign of r Tells Us
- A positive value of r means that when x increases, y tends to increase and when x decreases, y tends to decrease (positive correlation).
- A negative value of r means that when x increases, y tends to decrease and when x decreases, y tends to increase (negative correlation).
- The sign of r is the same as the sign of the slope, b, of the best-fit line.
Note
A strong correlation does not suggest that x causes y or y causes x. We say correlation does not imply causation.

The formula for r is

r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²]),

where n is the number of data points.
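The formula is straightforward to evaluate directly. The sketch below computes r for the exam data and checks the result against NumPy's built-in correlation function.

```python
# Sketch: computing r from the formula above for the exam data,
# then checking it against numpy.corrcoef.
import numpy as np

x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69])
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159])
n = len(x)

numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
denominator = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                      (n * np.sum(y ** 2) - np.sum(y) ** 2))
r = numerator / denominator

print(f"r from the formula: {r:.4f}")                       # about 0.6631
print(f"r from numpy:       {np.corrcoef(x, y)[0, 1]:.4f}")
```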

The Coefficient of Determination
The variable r² is called the coefficient of determination; it is the square of the correlation coefficient but is usually stated as a percentage rather than in decimal form. It has an interpretation in the context of the data: r², when expressed as a percentage, represents the percentage of variation in the dependent (predicted) variable y that can be explained by variation in the independent (explanatory) variable x using the regression (best-fit) line.
1 – r², when expressed as a percentage, represents the percentage of variation in y that is not explained by variation in x using the regression line. This unexplained variation can be seen as the scattering of the observed data points about the regression line.
Consider the third exam/final exam example introduced in the previous section.
- The line of best fit is: ŷ = –173.51 + 4.83x.
- The correlation coefficient is r = .6631.
- The coefficient of determination is r² = 0.6631² = 0.4397.
- Approximately 44 percent of the variation (0.4397 is approximately 0.44) in the final exam grades can be explained by the variation in the grades on the third exam, using the best-fit regression line.
- Therefore, the rest of the variation (1 – 0.44 = 0.56, or 56 percent) in the final exam grades cannot be explained by the variation in the grades on the third exam with the best-fit regression line. This unexplained variation appears as the scattering of the data points about the regression line, some farther from the line than others.
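One way to see the explained-variation interpretation concretely is to compute r² as 1 minus the ratio of the unexplained variation (SSE) to the total variation in y; the sketch below does this for the exam data using the rounded best-fit coefficients.

```python
# Sketch: r^2 as the proportion of variation in y explained by the line,
# computed as 1 - SSE/SST for the exam data.
import numpy as np

x = np.array([65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69])
y = np.array([175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159])

y_hat = -173.51 + 4.83 * x                 # predictions from the best-fit line
sse = np.sum((y - y_hat) ** 2)             # unexplained (residual) variation
sst = np.sum((y - y.mean()) ** 2)          # total variation in y

r_squared = 1 - sse / sst
print(f"r^2 = {r_squared:.4f}  ->  about {100 * r_squared:.0f}% of the variation is explained")
```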