Linear Regression

Site: Saylor Academy
Course: CS250: Python for Data Science
Book: Linear Regression

Description

At various points throughout the course, it will be necessary to build a statistical model that can be used to estimate or forecast a value based on a set of given values. When one has data where a dependent variable is assumed to depend upon a set of independent variables, linear regression is often applied as the first step in data analysis. This is because parameters for the linear model are very tractable and easily interpreted. You will be implementing this model in various ways using Python.

Linear Equations

Linear regression for two variables is based on a linear equation with one independent variable. The equation has the form

y=a+bx

where a and b are constant numbers.

The variable x is the independent variable; y is the dependent variable. Typically, you choose a value to substitute for the independent variable and then solve for the dependent variable.
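Since this course builds these models in Python, here is a minimal sketch (my own illustration, not from the text) of choosing a value for the independent variable and solving for the dependent variable; the helper name `linear` is hypothetical.

```python
# Evaluate the linear equation y = a + b*x for given constants a and b.
# The function and its name are illustrative, not from the text.
def linear(a, b, x):
    """Return the dependent value y = a + b*x."""
    return a + b * x

# The first equation from Example 12.1, y = 3 + 2x, evaluated at x = 4:
print(linear(3, 2, 4))   # prints 11
```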


Example 12.1

The following examples are linear equations.

y=3+2x
y=–0.01+1.2x

Try It 12.1

Is the following an example of a linear equation?
y = –0.125 – 3.5x

The graph of a linear equation of the form y = a + bx is a straight line. Any line that is not vertical can be described by this equation.


Example 12.2

Graph the equation y = –1 + 2x.

Graph of the equation y = −1 + 2x: a straight line that crosses the y-axis at −1 and slopes upward to the right.

Figure 12.2

Try It 12.2

Is the following an example of a linear equation? Why or why not?

Graph of an equation; the x-axis is labeled in intervals of 2 from 0 to 14, and the y-axis is labeled in intervals of 2.

Figure 12.3


Example 12.3

Aaron's Word Processing Service does word processing. The rate for services is $32 per hour plus a $31.50 one-time charge. The total cost to a customer depends on the number of hours it takes to complete the job.

Find the equation that expresses the total cost in terms of the number of hours required to complete the job.

Solution

Let x = the number of hours it takes to get the job done.
Let y = the total cost to the customer.

The $31.50 is a fixed cost. If it takes x hours to complete the job, then (32)(x) is the cost of the word processing only. The total cost is y = 31.50 + 32x.
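The cost equation can be evaluated directly in Python; this is a sketch, and the helper name `total_cost` is my own.

```python
# Total cost for Aaron's Word Processing Service: y = 31.50 + 32x,
# where x is the number of hours the job takes.
def total_cost(hours):
    """Fixed charge of $31.50 plus $32 per hour of word processing."""
    return 31.50 + 32 * hours

print(total_cost(2))   # a 2-hour job costs 95.5 dollars
```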


Try It 12.3

Emma's Extreme Sports hires hang-gliding instructors and pays them a fee of $50 per class, as well as $20 per student in the class. The total cost Emma pays depends on the number of students in a class. Find the equation that expresses the total cost in terms of the number of students in a class.


Slope and y-intercept of a Linear Equation

For the linear equation y = a + bx, b = slope and a = y-intercept. From algebra, recall that the slope is a number that describes the steepness of a line; the y-intercept is the y-coordinate of the point (0, a), where the line crosses the y-axis.

Please note that in previous courses you learned y=mx+b
was the slope-intercept form of the equation, where m represented the slope and b represented the y-intercept. In this text, the form y=a+bx is used, where a is the y-intercept and b is the slope. The key is remembering the coefficient of x is the slope, and the constant number is the y-intercept.



Figure 12.4 Three possible graphs of y = a + bx. (a) If b > 0, the line slopes upward to the right. (b) If b = 0, the line is horizontal. (c) If b < 0, the line slopes downward to the right.


Example 12.4

Svetlana tutors to make extra money for college. For each tutoring session, she charges a one-time fee of $25 plus $15 per hour of tutoring. A linear equation that expresses the total amount of money Svetlana earns for each session she tutors is y = 25 + 15x.

What are the independent and dependent variables? What is the y-intercept, and what is the slope? Interpret them using complete sentences.

Solution

The independent variable (x) is the number of hours Svetlana tutors each session. The dependent variable (y) is the amount, in dollars, Svetlana earns for each session.

The y-intercept is 25 (a = 25). At the start of the tutoring session, Svetlana charges a one-time fee of $25 (this is when x = 0). The slope is 15 (b = 15). For each session, Svetlana earns $15 for each hour she tutors.


Try It 12.4

Ethan repairs household appliances such as dishwashers and refrigerators. For each visit, he charges $25 plus $20 per hour of work. A linear equation that expresses the total amount of money Ethan earns per visit is y = 25 + 20x.

What are the independent and dependent variables? What is the y-intercept, and what is the slope? Interpret them using complete sentences.


Source: OpenStax, https://openstax.org/books/statistics/pages/12-introduction
Creative Commons License This work is licensed under a Creative Commons Attribution 4.0 License.

The Regression Equation

Data rarely fit a straight line exactly. Usually, you must be satisfied with rough predictions. Typically, you have a set of data whose scatter plot appears to fit a straight line. This line is called a line of best fit or least-squares regression line.


Collaborative Exercise

If you know a person's pinky (smallest) finger length, do you think you could predict that person's height? Collect data from your class (pinky finger length, in inches). The independent variable, x, is pinky finger length and the dependent variable, y, is height. For each set of data, plot the points on graph paper. Make your graph big enough and use a ruler. Then, by eye, draw a line that appears to fit the data. For your line, pick two convenient points and use them to find the slope of the line. Find the y-intercept of the line by extending your line so it crosses the y-axis. Using the slopes and the y-intercepts, write your equation of best fit. Do you think everyone will have the same equation? Why or why not? According to your equation, what is the predicted height for a pinky length of 2.5 inches?


Example 12.5

A random sample of 11 statistics students produced the data in Table 12.1, where x is the third exam score out of 80 and y is the final exam score out of 200. Can you predict the final exam score of a random student if you know the third exam score?

x (third exam score) y (final exam score)
65 175
67 133
71 185
71 163
66 126
75 198
67 153
70 163
71 159
69 151
69 159

Table 12.1

Figure 12.5 Using the x- and y-coordinates in the table, we plot the points on a graph to create the scatter plot showing the scores on the final exam based on scores from the third exam.


Try It 12.5
SCUBA divers have maximum dive times they cannot exceed when going to different depths. The data in Table 12.2 show different depths in feet, with the maximum dive times in minutes. Use your calculator to find the least squares regression line and predict the maximum dive time for 110 feet.

x (depth) y (maximum dive time)
50 80
60 55
70 45
80 35
90 25
100 22

Table 12.2
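If no calculator is at hand, the least-squares line for Table 12.2 can be computed in plain Python using the standard formulas b = Σ(x−x̄)(y−ȳ)/Σ(x−x̄)² and a = ȳ − bx̄, which are derived later in this section. This is a sketch of the computation, not the calculator procedure the exercise asks for.

```python
# Least-squares regression line for the SCUBA data in Table 12.2,
# then a prediction of the maximum dive time at 110 feet.
depth = [50, 60, 70, 80, 90, 100]     # x, depth in feet
times = [80, 55, 45, 35, 25, 22]      # y, maximum dive time in minutes

n = len(depth)
x_bar = sum(depth) / n
y_bar = sum(times) / n

# slope and intercept from the usual least-squares formulas
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(depth, times)) \
    / sum((x - x_bar) ** 2 for x in depth)
a = y_bar - b * x_bar

print(f"y-hat = {a:.2f} + ({b:.4f})x")
print(f"predicted maximum dive time at 110 ft: {a + b * 110:.2f} minutes")
```

Running this gives roughly ŷ = 127.24 − 1.11x, so the predicted maximum dive time at 110 feet is about 4.7 minutes.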

The third exam score, x, is the independent variable, and the final exam score, y, is the dependent variable. We will plot a regression line that best fits the data. If each of you were to fit a line by eye, you would draw different lines. We can obtain a line of best fit using either the median–median line approach or by calculating the least-squares regression line.

Let's first find the line of best fit for the relationship between the third exam score and the final exam score using the median–median line approach. Remember that this is the data from Example 12.5 after the ordered pairs have been listed in order of their x values; if multiple data points share the same x value, they are listed in order from least to greatest y (see the data values where x = 71). We then divide the scores into three groups with approximately equal numbers of x values per group, where the first and third groups have the same number of x values. The corresponding y values are recorded with each x. To find the medians, however, the y values within each group must also be rearranged from least to greatest. Table 12.3 shows the correct ordering of the x values but does not show a reordering of the y values.

x (third exam score) y (final exam score)
65 175
66 126
67 133
67 153
69 151
69 159
70 163
71 159
71 163
71 185
75 198

Table 12.3

With this set of data, the first and last groups each have four x values and four corresponding y values. The second group has three x values and three corresponding y values. We need to organize the x and y values per group and find the median x and y values for each group. Let's now write out our y values for each group in ascending order. For group 1, the y values in order are 126, 133, 153, and 175. For group 2, the y values are already in order. For group 3, the y values are also already in order. We can represent these data as shown in Table 12.4, but notice that we have broken the ordered pairs; (65, 126) is not a data point in our original set:

Group   x (third exam score)   y (final exam score)   Median x value   Median y value
1       65, 66, 67, 67         126, 133, 153, 175     66.5             143
2       69, 69, 70             151, 159, 163          69               159
3       71, 71, 71, 75         159, 163, 185, 198     71               174

Table 12.4

When this is completed, we can write the ordered pairs for the median values. This allows us to find the slope and y-intercept of the median–median line.

The ordered pairs are (66.5, 143), (69, 159), and (71, 174).

The slope can be calculated using the formula m=\dfrac{y_2−y_1}{x_2−x_1}.
Substituting the median x and y values from the first and third groups gives m=\dfrac{174−143}{71−66.5}, which simplifies to m≈ 6.9.

The y-intercept may be found using the formula b=\dfrac{Σy−mΣx}{3}, which means the quantity of the sum of the median y values minus the slope times the sum of the median x values divided by three.

The sum of the median x values is 206.5, and the sum of the median y values is 476. Substituting these sums and the slope into the formula gives b=\dfrac{476−6.9(206.5)}{3}, which simplifies to b≈−316.3.

The line of best fit is represented as y=mx+b.

Thus, the equation can be written as y = 6.9x − 316.3.
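The grouping-and-medians procedure above can be reproduced in a few lines of Python with `statistics.median`. This sketch keeps the slope unrounded, so its intercept comes out near −315.5 rather than the −316.3 obtained by hand after rounding the slope to 6.9.

```python
# Median-median line for the exam data, following the grouping in Table 12.4.
from statistics import median

groups = [
    ([65, 66, 67, 67], [126, 133, 153, 175]),   # group 1
    ([69, 69, 70],     [151, 159, 163]),        # group 2
    ([71, 71, 71, 75], [159, 163, 185, 198]),   # group 3
]

med_x = [median(xs) for xs, ys in groups]   # 66.5, 69, 71
med_y = [median(ys) for xs, ys in groups]   # 143, 159, 174

# slope from the first and third median points, intercept from all three
m = (med_y[2] - med_y[0]) / (med_x[2] - med_x[0])
b = (sum(med_y) - m * sum(med_x)) / 3

print(f"y = {m:.1f}x + ({b:.1f})")   # y = 6.9x + (-315.5)
```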

The median–median line may also be found using your graphing calculator. You can enter the x and y values into two separate lists; choose Stat, Calc, Med-Med, and press Enter. The slope, a, and y-intercept, b, will be provided. The calculator shows a slight deviation from the previous manual calculation as a result of rounding. Rounding to the nearest tenth, the calculator gives the median–median line as y=6.9x−315.5. Each point of data is of the form (x, y), and each point of the line of best fit using least-squares linear regression has the form (x, ŷ).

The ŷ is read y hat and is the estimated value of y. It is the value of y obtained using the regression line. It is not generally equal to y from data, but it is still important because it can help make predictions for other values.
Scatter plot of the exam scores with the line of best fit; one data point is highlighted along with the corresponding predicted point on the line.
Figure 12.6

The term y_0 – ŷ_0 = ε_0 is called the error or residual. It is not an error in the sense of a mistake. The absolute value of a residual measures the vertical distance between the actual value of y and the estimated value of y. In other words, it measures the vertical distance between the actual data point and the predicted point on the line, or it measures how far the estimate is from the actual data value.

If the observed data point lies above the line, the residual is positive and the line underestimates the actual data value for y. If the observed data point lies below the line, the residual is negative and the line overestimates that actual data value for y.

In Figure 12.6, y_0 – ŷ_0 = ε_0 is the residual for the point shown. Here the point lies above the line and the residual is positive.

ε = the Greek letter epsilon

For each data point, you can calculate the residuals or errors, y_i – ŷ_i = ε_i for i = 1, 2, 3, . . . , 11.

Each |ε| is a vertical distance.

For the example about the third exam scores and the final exam scores for the 11 statistics students, there are 11 data points. Therefore, there are 11 ε values. If you square each ε and add them, you get the sum of ε squared from i = 1 to i = 11, as shown below.

(ε_1)^2+(ε_2)^2+...+(ε_{11})^2=Σ^{11}_{i = 1}(ε_i)^2.

This is called the sum of squared errors (SSE).

Using calculus, you can determine the values of a and b that make the SSE a minimum. When you make the SSE a minimum, you have determined the points that are on the line of best fit. It turns out that the line of best fit has the equation

ŷ=a+bx

where

a=\overline y−b\overline x

and b=\dfrac{∑(x−\overline x)(y−\overline y)}{∑(x−\overline x)^2}.

The sample means of the x values and the y values are \overline x and \overline y, respectively. The best-fit line always passes through the point (\overline x, \overline y).

The slope (b) can be written as b=r(\dfrac{s_y}{s_x}) where s_y = the standard deviation of the y values and s_x = the standard deviation of the x values. r is the correlation coefficient, which shows the relationship between the x and y values. This will be discussed in more detail in the next section.
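These formulas translate directly into Python. The sketch below computes a and b for the exam data in Table 12.1 and checks that the line passes through (x̄, ȳ):

```python
# Least-squares slope and intercept for the third-exam/final-exam data.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2);  a = y_bar - b*x_bar
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

# the best-fit line always passes through (x_bar, y_bar)
assert abs((a + b * x_bar) - y_bar) < 1e-9

print(f"y-hat = {a:.2f} + {b:.2f}x")   # y-hat = -173.51 + 4.83x
```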


Least-Squares Criteria for Best Fit

The process of fitting the best-fit line is called linear regression. We assume that the data are scattered about a straight line. To find that line, we minimize the sum of the squared errors (SSE), or make it as small as possible. Any other line you might choose would have a higher SSE than the best-fit line. This best-fit line is called the least-squares regression line.

Note
Computer spreadsheets, statistical software, and many calculators can quickly calculate the best-fit line and create the graphs. The calculations tend to be tedious if done by hand. Instructions to use the TI-83, TI-83+, and TI-84+ calculators to find the best-fit line and create a scatter plot are shown at the end of this section.


Third Exam vs. Final Exam Example

The graph of the line of best fit for the third exam/final exam example is as follows:

Scatter plot of the third exam and final exam scores with the line of best fit.
Figure 12.7

The least-squares regression line (best-fit line) for the third exam/final exam example has the equation

ŷ=−173.51+4.83x.


Understanding and Interpreting the y-intercept

The y-intercept, a, of the line describes where the plot line crosses the y-axis. The y-intercept of the best-fit line tells us the best value of the relationship when x is zero. In some cases, it does not make sense to figure out what y is when x = 0. For example, in the third exam vs. final exam example, the y-intercept occurs when the third exam score, or x, is zero. Since all the scores are grouped around a passing grade, there is no need to figure out what the final exam score, or y, would be when the third exam was zero.

However, the y-intercept is very useful in many cases. For many examples in science, the y-intercept gives the baseline reading when the experimental conditions aren't applied to an experimental system. This baseline indicates how much the experimental condition affects the system. It could also be used to ensure that equipment and measurements are calibrated properly before starting the experiment.

In biology, the concentration of proteins in a sample can be measured using a chemical assay that changes color depending on how much protein is present. The more protein present, the darker the color. The amount of color can be measured by the absorbance reading. Table 12.5 shows the expected absorbance readings at different protein concentrations. This is called a standard curve for the assay.

Concentration (mM) Absorbance (mAU)
125 0.021
250 0.023
500 0.068
750 0.086
1,000 0.105
1,500 0.124
2,000 0.146

Table 12.5

The scatter plot Figure 12.8 includes the line of best fit.

Scatter plot of the absorbance readings at each protein concentration from Table 12.5, with the line of best fit.

Figure 12.8

The y-intercept of this line occurs at 0.0226 mAU. This means the assay gives a reading of 0.0226 mAU when there is no protein present. That is, it is the baseline reading that can be attributed to something else, which, in this case, is some other non-protein chemicals that are absorbing light. We can tell that this line of best fit is reasonable because the y-intercept is small, close to zero. When there is no protein present in the sample, we expect the absorbance to be very small, or close to zero, as well.
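As a check, applying the same least-squares formulas to the data in Table 12.5 in Python recovers the small y-intercept described above:

```python
# Fit the standard-curve data (Table 12.5) and inspect the y-intercept.
conc = [125, 250, 500, 750, 1000, 1500, 2000]              # concentration, mM
absb = [0.021, 0.023, 0.068, 0.086, 0.105, 0.124, 0.146]   # absorbance, mAU

n = len(conc)
x_bar, y_bar = sum(conc) / n, sum(absb) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(conc, absb)) \
    / sum((x - x_bar) ** 2 for x in conc)
a = y_bar - b * x_bar

print(f"y-intercept = {a:.4f} mAU")   # y-intercept = 0.0226 mAU
```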

Understanding Slope

The slope of the line, b, describes how changes in the variables are related. It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English.

Interpretation of the Slope: The slope of the best-fit line tells us how the dependent variable (y) changes for every one unit increase in the independent (x) variable, on average.

Third Exam vs. Final Exam Example
Slope: The slope of the line is b = 4.83.
Interpretation: For a 1-point increase in the score on the third exam, the final exam score increases by 4.83 points, on average.


Using the TI-83, 83+, 84, 84+ Calculator

Using the Linear Regression T Test: LinRegTTest

  1. In the STAT list editor, enter the x data in list L1 and the y data in list L2, paired so that the corresponding (x, y) values are next to each other in the lists. (If a particular pair of values is repeated, enter it as many times as it appears in the data.)
  2. On the STAT TESTS menu, scroll down and select LinRegTTest. (Be careful to select LinRegTTest. Some calculators may also have a different item called LinRegTInt.)
  3. On the LinRegTTest input screen, enter Xlist: L1, Ylist: L2, and Freq: 1.
  4. On the next line, at the prompt β or ρ, highlight ≠ 0 and press ENTER.
  5. Leave the line for RegEQ: blank.
  6. Highlight Calculate and press ENTER.
Calculator input screen for LinRegTTest with entries matching the instructions above, and the corresponding output screen.
Figure 12.9

The output screen contains a lot of information. For now, let's focus on a few items from the output and return to the other items later.
The second line says y = a + bx. Scroll down to find the values a = –173.513 and b = 4.8273.

The equation of the best-fit line is ŷ = –173.51 + 4.83x.
The two items at the bottom are r^2 = .43969 and r = .663. For now, just note where to find these values; we examine them in the next two sections.

Graphing the Scatter Plot and Regression Line

  1. We are assuming the x data are already entered in list L1 and the y data are in list L2.
  2. Press 2nd STATPLOT ENTER to use Plot 1.
  3. On the input screen for PLOT 1, highlight On, and press ENTER.
  4. For TYPE, highlight the first icon, which is the scatter plot, and press ENTER.
  5. Indicate Xlist: L1 and Ylist: L2.
  6. For Mark, it does not matter which symbol you highlight.
  7. Press the ZOOM key and then the number 9 (for menu item ZoomStat); the calculator fits the window to the data.
  8. To graph the best-fit line, press the Y= key and type the equation –173.5 + 4.83X into equation Y1. (The X key is immediately left of the STAT key.) Press ZOOM 9 again to graph it.
  9. Optional: If you want to change the viewing window, press the WINDOW key. Enter your desired window using Xmin, Xmax, Ymin, and Ymax.

NOTE
Another way to graph the line after you create a scatter plot is to use LinRegTTest.

  1. Make sure you have done the scatter plot. Check it on your screen.
  2. Go to LinRegTTest and enter the lists.
  3. At RegEq, press VARS and arrow over to Y-VARS. Press 1 for 1:Function. Press 1 for 1:Y1. Then, arrow down to Calculate and do the calculation for the line of best fit.
  4. Press Y= (you will see the regression equation).
  5. Press GRAPH, and the line will be drawn.

The Correlation Coefficient r

Besides looking at the scatter plot and seeing that a line seems reasonable, how can you determine whether the line is a good predictor? Use the correlation coefficient as another indicator (besides the scatter plot) of the strength of the relationship between x and y.

The correlation coefficient, r, developed by Karl Pearson during the early 1900s, is numeric and provides a measure of the strength and direction of the linear association between the independent variable x and the dependent variable y.

If you suspect a linear relationship between x and y, then r can measure the strength of the linear relationship.

What the Value of r Tells Us
  • The value of r is always between –1 and +1. In other words, –1 ≤ r ≤ 1.
  • The size of the correlation r indicates the strength of the linear relationship between x and y. Values of r close to –1 or to +1 indicate a stronger linear relationship between x and y.
  • If r = 0, there is absolutely no linear relationship between x and y (no linear correlation).
  • If r = 1, there is perfect positive correlation. If r = –1, there is perfect negative correlation. In both these cases, all the original data points lie on a straight line. Of course, in the real world, this does not generally happen.

What the Sign of r Tells Us
  • A positive value of r means that when x increases, y tends to increase and when x decreases, y tends to decrease (positive correlation).
  • A negative value of r means that when x increases, y tends to decrease and when x decreases, y tends to increase (negative correlation).
  • The sign of r is the same as the sign of the slope, b, of the best-fit line.

Note
A strong correlation does not suggest that x causes y or y causes x. We say correlation does not imply causation.

The correlation coefficient is calculated as the quantity of data points times the sum of the quantity of the x-coordinates times the y-coordinates, minus the quantity of the sum of the x-coordinates times the sum of the y-coordinates, all divided by the square root of the quantity of data points times the sum of the x-coordinates squared minus the square of the sum of the x-coordinates, times the number of data points times the sum of the y-coordinates squared minus the square of the sum of the y-coordinates. It can be summarized by the following equation:

r=\dfrac{nΣ(xy)−(Σx)(Σy)}{\sqrt{[nΣx^2−(Σx)^2][nΣy^2−(Σy)^2]}}

where n is the number of data points.


Figure 12.10 (a) A scatter plot showing data with a positive correlation: 0 < r < 1. (b) A scatter plot showing data with a negative correlation: –1 < r < 0. (c) A scatter plot showing data with zero correlation: r = 0.

The formula for r looks formidable. However, computer spreadsheets, statistical software, and many calculators can calculate r quickly. The correlation coefficient, r, is the bottom item in the output screens for the LinRegTTest on the TI-83, TI-83+, or TI-84+ calculator (see previous section for instructions).
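The computational formula above can also be applied to the exam data in plain Python; this sketch squares r as well, anticipating the coefficient of determination discussed next.

```python
# Correlation coefficient r for the exam data, via the computational formula.
from math import sqrt

x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]
n = len(x)

sx, sy = sum(x), sum(y)                     # sums of x and y
sxy = sum(a * b for a, b in zip(x, y))      # sum of x*y
sxx = sum(a * a for a in x)                 # sum of x^2
syy = sum(b * b for b in y)                 # sum of y^2

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}")   # r = 0.6631, r^2 = 0.4397
```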


The Coefficient of Determination

The variable r^2 is called the coefficient of determination; it is the square of the correlation coefficient but is usually stated as a percentage rather than in decimal form. It has an interpretation in the context of the data:

  • r^2, when expressed as a percent, represents the percentage of variation in the dependent (predicted) variable y that can be explained by variation in the independent (explanatory) variable x using the regression (best-fit) line.
  • 1 – r^2, when expressed as a percentage, represents the percentage of variation in y that is not explained by variation in x using the regression line. This can be seen as the scattering of the observed data points about the regression line.

Consider the third exam/final exam example introduced in the previous section.

  • The line of best fit is: ŷ = –173.51 + 4.83x.
  • The correlation coefficient is r = .6631.
  • The coefficient of determination is r^2 = .6631^2 = .4397.
Interpret r^2 in the context of this example.
  • Approximately 44 percent of the variation (0.4397 is approximately 0.44) in the final exam grades can be explained by the variation in the grades on the third exam, using the best-fit regression line.
  • Therefore, the rest of the variation (1 – 0.44 = 0.56 or 56 percent) in the final exam grades cannot be explained by the variation of the grades on the third exam with the best-fit regression line. These are the variation of the points that are not as close to the regression line as others.

Testing the Significance of the Correlation Coefficient (Optional)

The correlation coefficient, r, tells us about the strength and direction of the linear relationship between x and y. However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the correlation coefficient r and the sample size n, together.

We perform a hypothesis test of the significance of the correlation coefficient to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

The sample data are used to compute r, the correlation coefficient for the sample. If we had data for the entire population, we could find the population correlation coefficient. But, because we have only sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, r, is our estimate of the unknown population correlation coefficient.

  • The symbol for the population correlation coefficient is ρ, the Greek letter rho.
  • ρ = population correlation coefficient (unknown).
  • r = sample correlation coefficient (known; calculated from sample data).

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is close to zero or significantly different from zero. We decide this based on the sample correlation coefficient r and the sample size n.

If the test concludes the correlation coefficient is significantly different from zero, we say the correlation coefficient is significant.

  • Conclusion: There is sufficient evidence to conclude there is a significant linear relationship between x and y because the correlation coefficient is significantly different from zero.
  • What the conclusion means: There is a significant linear relationship between x and y. We can use the regression line to model the linear relationship between x and y in the population.

If the test concludes the correlation coefficient is not significantly different from zero (it is close to zero), we say the correlation coefficient is not significant.

  • Conclusion: There is insufficient evidence to conclude there is a significant linear relationship between x and y because the correlation coefficient is not significantly different from zero.
  • What the conclusion means: There is not a significant linear relationship between x and y. Therefore, we cannot use the regression line to model a linear relationship between x and y in the population.


Note

  • If r is significant and the scatter plot shows a linear trend, the line can be used to predict the value of y for values of x that are within the domain of observed x values.
  • If r is not significant or if the scatter plot does not show a linear trend, the line should not be used for prediction.
  • If r is significant and the scatter plot shows a linear trend, the line may not be appropriate or reliable for prediction outside the domain of observed x values in the data.


Performing the Hypothesis Test

  • Null hypothesis: H0: ρ = 0.
  • Alternate hypothesis: Ha: ρ ≠ 0.

What the Hypothesis Means in Words:

  • Null hypothesis H0: The population correlation coefficient is not significantly different from zero. There is not a significant linear relationship (correlation) between x and y in the population.
  • Alternate hypothesis Ha: The population correlation coefficient is significantly different from zero. There is a significant linear relationship (correlation) between x and y in the population.

Drawing a Conclusion: There are two methods for drawing a conclusion. The two methods are equivalent and give the same result.

  • Method 1: Use the p-value.
  • Method 2: Use a table of critical values.

In this chapter, we will always use a significance level of 5 percent, α = 0.05.

Note

Using the p-value method, you could choose any appropriate significance level you want; you are not limited to using α = 0.05. But, the table of critical values provided in this textbook assumes we are using a significance level of 5 percent, α = 0.05. If we wanted to use a significance level different from 5 percent with the critical value method, we would need different tables of critical values that are not provided in this textbook.

METHOD 1: Using a p-value to Make a Decision

Using the TI-83, 83+, 84, 84+ Calculator

To calculate the p-value using LinRegTTEST:

  1. Complete the same steps as the LinRegTTest performed previously in this chapter, making sure that on the line prompt for β or ρ, ≠ 0 is highlighted.
  2. When looking at the output screen, the p-value is on the line that reads p =.
If the p-value is less than the significance level (α = 0.05):
  • Decision: Reject the null hypothesis.
  • Conclusion: There is sufficient evidence to conclude there is a significant linear relationship between x and y because the correlation coefficient is significantly different from zero.
If the p-value is not less than the significance level (α = 0.05):
  • Decision: Do not reject the null hypothesis.
  • Conclusion: There is insufficient evidence to conclude there is a significant linear relationship between x and y because the correlation coefficient is not significantly different from zero.

You will use technology to calculate the p-value, but it is useful to know that the p-value is calculated using a t distribution with n – 2 degrees of freedom and that the p-value is the combined area in both tails.

An alternative way to calculate the p-value (p) given by LinRegTTest is the command 2*tcdf(abs(t),10^99, n–2) in 2nd DISTR.
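Without a calculator, the same p-value can be approximated in standard-library Python: form the t statistic t = r√(n−2)/√(1−r²) and find the two-tailed area under a t density with n − 2 degrees of freedom. This is a sketch of the underlying computation (using simple numerical integration), not the TI procedure.

```python
# Two-tailed p-value for H0: rho = 0, from r and n, with df = n - 2.
from math import sqrt, gamma, pi

def t_tail(t, df, hi=100.0, steps=100000):
    """Approximate P(T > t) for a Student-t distribution with df degrees
    of freedom by trapezoidal integration of the density on [t, hi]."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    f = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = (hi - t) / steps
    area = 0.5 * (f(t) + f(hi))
    for i in range(1, steps):
        area += f(t + i * h)
    return area * h

r, n = 0.6631, 11
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # t statistic with n - 2 = 9 df
p = 2 * t_tail(t, n - 2)                 # combined area in both tails
print(f"t = {t:.3f}, p = {p:.3f}")       # t = 2.658, p = 0.026
```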

Third Exam vs. Final Exam Example: p-value Method
  • Consider the third exam/final exam example.
  • The line of best fit is ŷ = –173.51 + 4.83x, with r = 0.6631, and there are n = 11 data points.
  • Can the regression line be used for prediction? Given a third exam score (x value), can we use the line to predict the final exam score (predicted y value)?

H0: ρ = 0

Ha: ρ ≠ 0

α = 0.05

  • The p-value is 0.026 (from LinRegTTest on a calculator or from computer software).
  • The p-value, 0.026, is less than the significance level of α = 0.05.
  • Decision: Reject the null hypothesis H0.
  • Conclusion: There is sufficient evidence to conclude there is a significant linear relationship between the third exam score (x) and the final exam score (y) because the correlation coefficient is significantly different from zero.

Because r is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.


METHOD 2: Using a Table of Critical Values to Make a Decision

The 95 Percent Critical Values of the Sample Correlation Coefficient Table (Table 12.9) can be used to give you a good idea of whether the computed value of r is significant. Use it to find the critical values using the degrees of freedom, df = n – 2. The table has already been calculated with α = 0.05. The table tells you the positive critical value, but you should also make that number negative to have two critical values. If r is not between the positive and negative critical values, then the correlation coefficient is significant. If r is significant, then you may use the line for prediction. If r is not significant (between the critical values), you should not use the line to make predictions.
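The decision rule can be written as a one-line helper (the function name is my own); the values below are those of Examples 12.6 through 12.8.

```python
# r is significant when it lies outside the interval (-critical, +critical),
# where critical is the positive table value for df = n - 2.
def is_significant(r, critical):
    """Return True when |r| exceeds the positive critical value."""
    return abs(r) > critical

print(is_significant(0.801, 0.632))    # True:  the line may be used
print(is_significant(-0.624, 0.532))   # True
print(is_significant(0.776, 0.811))    # False: do not use for prediction
```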

Example 12.6

Suppose you computed r = 0.801 using n = 10 data points. The degrees of freedom would be 8 (df = n – 2 = 10 – 2 = 8). Using Table 12.9 with df = 8, we find that the critical value is 0.632. This means the critical values are really ±0.632. Since r = 0.801 and 0.801 > 0.632, r is significant and the line may be used for prediction. If you view this example on a number line, it will help you to see that r is not between the two critical values.

Figure 12.11 r is not between –0.632 and 0.632, so r is significant.


Try It 12.6

For a given line of best fit, you computed that r = 0.6501 using n = 12 data points, and the critical value found on the table is 0.576. Can the line be used for prediction? Why or why not?


Example 12.7

Suppose you computed r = –0.624 with 14 data points, where df = 14 – 2 = 12. The critical values are –0.532 and 0.532. Since –0.624 < –0.532, r is significant and the line can be used for prediction.

Figure 12.12 r = –0.624 and –0.624 < –0.532. Therefore, r is significant.


Try It 12.7

For a given line of best fit, you compute that r = 0.5204 using n = 9 data points, and the critical values are ±0.666. Can the line be used for prediction? Why or why not?


Example 12.8

Suppose you computed r = 0.776 and n = 6, with df = 6 – 2 = 4. The critical values are –0.811 and 0.811. Since 0.776 is between the two critical values, r is not significant. The line should not be used for prediction.

Figure 12.13 –0.811 < r = 0.776 < 0.811. Therefore, r is not significant.


Try It 12.8

For a given line of best fit, you compute that r = –0.7204 using n = 8 data points, and the critical value is 0.707. Can the line be used for prediction? Why or why not?


Third Exam vs. Final Exam Example: Critical Value Method

Consider the third exam/final exam example. The line of best fit is ŷ = –173.51 + 4.83x, with r = 0.6631, and there are n = 11 data points. Can the regression line be used for prediction? Given a third exam score (x value), can we use the line to predict the final exam score (predicted y value)?

  • H0: ρ = 0
  • Ha: ρ ≠ 0
  • α = 0.05
  • Use the 95 Percent Critical Values table for r with df = n – 2 = 11 – 2 = 9.
  • Using the table with df = 9, we find that the critical value listed is 0.602. Therefore, the critical values are ±0.602.
  • Since 0.6631 > 0.602, r is significant.
  • Decision: Reject the null hypothesis.
  • Conclusion: There is sufficient evidence to conclude there is a significant linear relationship between the third exam score (x) and the final exam score (y) because the correlation coefficient is significantly different from zero.

Because r is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.


Example 12.9

Suppose you computed the following correlation coefficients. Using the table at the end of the chapter, determine whether r is significant and whether the line of best fit associated with each correlation coefficient can be used to predict a y value. If it helps, draw a number line.

  1. r = –0.567 and the sample size, n, is 19.

    To solve this problem, first find the degrees of freedom. df = n - 2 = 17.
    Then, using the table, the critical values are ±0.456.
    –0.567 < –0.456, or you may say that –0.567 is not between the two critical values.
    r is significant and may be used for predictions.
  2. r = 0.708 and the sample size, n, is 9.

    df = n - 2 = 7
    The critical values are ±0.666.
    0.708 > 0.666.
    r is significant and may be used for predictions.
  3. r = 0.134 and the sample size, n, is 14.

    df = 14 – 2 = 12.
    The critical values are ±0.532. 0.134 is between –0.532 and 0.532. r is not significant and may not be used for predictions.
  4. r = 0 and the sample size, n, is 5.

    It doesn't matter what the degrees of freedom are because r = 0 will always be between the two critical values, so r is not significant and may not be used for predictions.

Try It 12.9

For a given line of best fit, you compute that r = 0 using n = 100 data points. Can the line be used for prediction? Why or why not?


Assumptions in Testing the Significance of the Correlation Coefficient

Testing the significance of the correlation coefficient requires that certain assumptions about the data be satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between x and y in the sample data provides strong enough evidence that we can conclude there is a linear relationship between x and y in the population.

The regression line equation that we calculate from the sample data gives the best-fit line for our particular sample. We want to use this best-fit line for the sample as an estimate of the best-fit line for the population. Examining the scatter plot and testing the significance of the correlation coefficient helps us determine whether it is appropriate to do this.

The assumptions underlying the test of significance are as follows:
  • There is a linear relationship in the population that models the sample data. Our regression line from the sample is our best estimate of this line in the population.
  • The y values for any particular x value are normally distributed about the line. This implies there are more y values scattered closer to the line than are scattered farther away. Assumption 1 implies that these normal distributions are centered on the line; the means of these normal distributions of y values lie on the line.
  • Normal distributions of all the y values have the same shape and spread about the line.
  • The residual errors are mutually independent (no pattern).
  • The data are produced from a well-designed, random sample or randomized experiment.

Figure 12.14 The y values for each x value are normally distributed about the line with the same standard deviation. For each x value, the mean of the y values lies on the regression line. More y values lie near the line than are scattered farther away from the line.

Prediction (Optional)

Recall the third exam/final exam example (Example 12.5).

We found the equation of the best-fit line for the final exam grade as a function of the grade on the third exam. We can now use the least-squares regression line for prediction.

Suppose you want to estimate, or predict, the mean final exam score of statistics students who received a 73 on the third exam. The exam scores (x values) range from 65 to 75. Since 73 is between the x values 65 and 75, substitute x = 73 into the equation. Then,

ŷ = –173.51 + 4.83(73) = 179.08.

We predict that statistics students who earn a grade of 73 on the third exam will earn a grade of 179.08 on the final exam, on average.
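In Python, the substitution is a one-line function. This simply encodes the example's fitted equation; the function name is our own, and the range caution from the text is noted in the docstring:

```python
def predict_final(third_exam):
    """Predicted final exam score from the example's line of best fit.
    Reliable only for x values inside the observed range, 65 to 75."""
    return -173.51 + 4.83 * third_exam

print(round(predict_final(73), 2))  # 179.08
print(round(predict_final(66), 2))  # 145.27
```

Calling the function with a value outside 65–75 (such as 90) still returns a number, but, as discussed below, that extrapolated prediction is not reliable.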


Example 12.10

Recall the third exam/final exam example.

a. What would you predict the final exam score to be for a student who scored a 66 on the third exam?

Solution 1
a. 145.27

b. What would you predict the final exam score to be for a student who scored a 90 on the third exam?

Solution 2
b. The x values in the data are between 65 and 75. 90 is outside the domain of the observed x values in the data (independent variable), so you cannot reliably predict the final exam score for this student. Even though it is possible to enter 90 into the equation for x and calculate a corresponding y value, the y value that you get will not be reliable.

To understand how unreliable the prediction can be outside the x values observed in the data, make the substitution x = 90 into the equation:

ŷ=–173.51+4.83(90)=261.19.

The final exam score is predicted to be 261.19. The most points that can be awarded for the final exam are 200.


Try It 12.10

Data are collected on the relationship between the number of hours per week practicing a musical instrument and scores on a math test. The line of best fit is as follows:

ŷ = 72.5 + 2.8x.

What would you predict the score on a math test will be for a student who practices a musical instrument for five hours a week?

Outliers

In some data sets, there are values (observed data points) called outliers. Outliers are observed data points that are far from the least-squares line. They have large errors, where the error or residual is not very close to the best-fit line.

Outliers need to be examined closely. Sometimes they should not be included in the analysis of the data, such as when it is possible that an outlier is a result of incorrect data. Other times, an outlier may hold valuable information about the population under study and should remain included in the data. The key is to examine carefully what causes a data point to be an outlier.

Besides outliers, a sample may contain one or a few points that are called influential points. Influential points are observed data points that are far from the other observed data points in the horizontal direction. These points may have a big effect on the slope of the regression line. To begin to identify an influential point, you can remove it from the data set and determine whether the slope of the regression line is changed significantly.

You also want to examine how the correlation coefficient, r, has changed. Sometimes, it is difficult to discern a significant change in slope, so you need to look at how the strength of the linear relationship has changed. Computers and many calculators can be used to identify outliers and influential points. Regression analysis can determine if an outlier is, indeed, an influential point. The new regression will show how omitting the outlier will affect the correlation among the variables, as well as the fit of the line. A graph showing both regression lines helps determine how removing an outlier affects the fit of the model.


Identifying Outliers

We could guess at outliers by looking at a graph of the scatter plot and best-fit line. However, we would like some guideline regarding how far away a point needs to be to be considered an outlier. As a rough rule of thumb, we can flag as an outlier any point that is located farther than two standard deviations above or below the best-fit line. The standard deviation used is the standard deviation of the residuals or errors.

We can do this visually in the scatter plot by drawing an extra pair of lines that are two standard deviations above and below the best-fit line. Any data points outside this extra pair of lines are flagged as potential outliers. Or, we can do this numerically by calculating each residual and comparing it with twice the standard deviation. With regard to the TI-83, 83+, or 84+ calculators, the graphical approach is easier. The graphical procedure is shown first, followed by the numerical calculations. You would generally need to use only one of these methods.


Example 12.11

In the third exam/final exam example, you can determine whether there is an outlier. If there is an outlier, as an exercise, delete it and fit the remaining data to a new line. For this example, the new line ought to fit the remaining data better. This means the SSE (sum of the squared errors) should be smaller and the correlation coefficient ought to be closer to 1 or –1.

Solution 1

Graphical Identification of Outliers

With the TI-83, 83+, or 84+ graphing calculators, it is easy to identify the outliers graphically and visually. If we were to measure the vertical distance from any data point to the corresponding point on the line of best fit and that distance were equal to 2 s or more, then we would consider the data point to be too far from the line of best fit. We need to find and graph the lines that are two standard deviations below and above the regression line. Any points that are outside these two lines are outliers. Let's call these lines Y2 and Y3.

As we did with the equation of the regression line and the correlation coefficient, we will use technology to calculate this standard deviation for us. Using the LinRegTTest with these data, scroll down through the output screens to find s = 16.412.

Line Y2 = –173.5 + 4.83 x – 2(16.4), and line Y3 = –173.5 + 4.83 x + 2(16.4),
where ŷ = –173.5 + 4.83 x is the line of best fit. Y2 and Y3 have the same slope as the line of best fit.

Graph the scatter plot with the best-fit line in equation Y1, then enter the two extra lines as Y2 and Y3 in the Y= equation editor. Press ZOOM-9 to get a good view. You will see that the only point that is not between Y2 and Y3 is the point (65, 175). On the calculator screen, it is barely outside these lines, but it is considered an outlier because it is more than two standard deviations away from the best-fit line. The outlier is the student who had a grade of 65 on the third exam and 175 on the final exam.

Sometimes a point is so close to the lines used to flag outliers on the graph that it is difficult to tell whether the point is between or outside the lines. On a computer, enlarging the graph may help; on a small calculator screen, zooming in may make the graph clearer. Note that when the graph does not give a clear enough picture, you can use the numerical comparisons to identify outliers.


Figure 12.15
 

Try It 12.11

Identify the potential outlier in the scatter plot. The standard deviation of the residuals, or errors, is approximately 8.6.


Figure 12.16

Numerical Identification of Outliers

In Table 12.6, the first two columns include the third exam and final exam data. The third column shows the predicted ŷ values calculated from the line of best fit, ŷ = –173.5 + 4.83x. The residuals, or errors, that were mentioned in Section 3 of this chapter have been calculated in the fourth column of the table: observed y value – predicted y value = y – ŷ.

s is the standard deviation of all the y – ŷ = ε values, where n is the total number of data points. If each residual is calculated and squared, and the results are added, we get the SSE. The standard deviation of the residuals is calculated from the SSE as

s=\sqrt{\dfrac{SSE}{n-2}}

Note

We divide by (n – 2) because the regression model involves estimating two parameters, the slope and the intercept.

Rather than calculate the value of s ourselves, we can find s using a computer or calculator. For this example, the calculator function LinRegTTest found s = 16.4 as the standard deviation of the residuals: 35, –17, 16, –6, –19, 9, 3, –1, –10, –9, –1.

x	y	ŷ	y – ŷ
65	175	140	175 – 140 = 35
67	133	150	133 – 150 = –17
71	185	169	185 – 169 = 16
71	163	169	163 – 169 = –6
66	126	145	126 – 145 = –19
75	198	189	198 – 189 = 9
67	153	150	153 – 150 = 3
70	163	164	163 – 164 = –1
71	159	169	159 – 169 = –10
69	151	160	151 – 160 = –9
69	159	160	159 – 160 = –1

Table 12.6

We are looking for all data points for which the residual is greater than 2 s = 2(16.4) = 32.8 or less than –32.8. Compare these values with the residuals in column four of the table. The only such data point is the student who had a grade of 65 on the third exam and 175 on the final exam; the residual for this student is 35.
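The whole numerical check can be sketched in a few lines of Python, using the exam data from Table 12.6 and the rounded best-fit coefficients from the text (the function name is our own):

```python
import math

x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

def flag_outliers(x, y, a=-173.51, b=4.83):
    """Return the (x, y) points whose residual exceeds two standard
    deviations of the residuals from the line y-hat = a + b*x."""
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    s = math.sqrt(sse / (len(x) - 2))   # standard deviation of the residuals
    return [(xi, yi) for xi, yi, e in zip(x, y, residuals) if abs(e) > 2 * s]

print(flag_outliers(x, y))  # [(65, 175)]
```

The only flagged point is (65, 175), matching the conclusion from the table of residuals.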


How Does the Outlier Affect the Best-Fit Line?

Numerically and graphically, we have identified point (65, 175) as an outlier. Recall that recalculation of the least-squares regression line and summary statistics, following deletion of an outlier, may be used to determine whether an outlier is also an influential point. This process also allows you to compare the strength of the correlation of the variables and possible changes in the slope both before and after the omission of any outliers.

Compute a new best-fit line and correlation coefficient using the 10 remaining points.

On the TI-83, TI-83+, or TI-84+ calculators, delete the outlier from L1 and L2. Using the LinRegTTest, found under Stat and Tests, the new line of best fit and correlation coefficient are the following:

ŷ=−355.19+7.39x and r=0.9121.

The slope is now 7.39, compared to the previous slope of 4.83. This seems significant, but we need to look at the change in r-values as well. The new line shows r=0.9121, which indicates a stronger correlation than the original line, with r=0.6631, since r=0.9121 is closer to 1. This means the new line is a better fit to the data values. The line can better predict the final exam score given the third exam score. It also means the outlier of (65, 175) was an influential point, since there is a sizeable difference in r-values. We must now decide whether to delete the outlier. If the outlier was recorded erroneously, it should certainly be deleted. Because it produces such a profound effect on the correlation, the new line of best fit allows for better prediction and an overall stronger model.
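The refit can also be reproduced in Python without a calculator. This sketch computes the least-squares coefficients and r directly from their defining sums (a standard derivation not shown in this section); the point (65, 175) is simply left out of the data lists:

```python
def least_squares(x, y):
    """Intercept, slope, and correlation coefficient of the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return intercept, slope, r

# Third exam / final exam data with the outlier (65, 175) removed
x = [67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [133, 185, 163, 126, 198, 153, 163, 159, 151, 159]
a, b, r = least_squares(x, y)
print(round(a, 2), round(b, 2), round(r, 4))  # -355.19 7.39 0.9121
```

The output matches the calculator's new line of best fit, ŷ = –355.19 + 7.39x, and r = 0.9121.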

You may use Excel to graph the two least-squares regression lines and compare the slopes and fit of the lines to the data, as shown in Figure 12.17.


Figure 12.17

You can see that the second graph shows less deviation from the line of best fit. It is clear that omission of the influential point produced a line of best fit that more closely models the data.


Numerical Identification of Outliers: Calculating s and Finding Outliers Manually

If you do not have the function LinRegTTest on your calculator, then you can identify the outlier in the first example by doing the following. First, square each |y – ŷ| value.

The squares are 35², 17², 16², 6², 19², 9², 3², 1², 10², 9², and 1².

Then, add (sum) all the |y – ŷ| squared terms using the formula

\sum_{i=1}^{11}(|y_i − ŷ_i|)^2 = \sum_{i=1}^{11} ε_i^2 (recall that y_i – ŷ_i = ε_i)

= 35² + 17² + 16² + 6² + 19² + 9² + 3² + 1² + 10² + 9² + 1²

= 2,440 = SSE.

The result, SSE, is the sum of squared errors.

Next, calculate s, the standard deviation of all the y – ŷ = ε values, where n = the total number of data points.

The calculation is s=\sqrt{\dfrac{SSE}{n-2}}

For the third exam/final exam example, s=\sqrt{\dfrac{2440}{11-2}} = 16.47.

Next, multiply s by 2: (2)(16.47) = 32.94.

32.94 is two standard deviations away from the mean of the y – ŷ values.

If we were to measure the vertical distance from any data point to the corresponding point on the line of best fit and that distance is at least 2 s, then we would consider the data point to be too far from the line of best fit. We call that point a potential outlier.

For the example, if any of the |y – ŷ| values are at least 32.94, the corresponding (x, y) data point is a potential outlier.

For the third exam/final exam example, all the |y – ŷ| values are less than 32.94 except for the first one, which is 35:

35 > 32.94. That is, |y – ŷ| ≥ (2)(s).

The point that corresponds to |y – ŷ| = 35 is (65, 175). Therefore, the data point (65, 175) is a potential outlier. For this example, we will delete it. (Remember, we do not always delete an outlier.)
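These manual steps translate directly into a short Python sketch, using the residuals from Table 12.6:

```python
import math

# Residuals from Table 12.6 (observed y minus predicted y)
residuals = [35, -17, 16, -6, -19, 9, 3, -1, -10, -9, -1]

sse = sum(e ** 2 for e in residuals)           # sum of squared errors
s = math.sqrt(sse / (len(residuals) - 2))      # divide by n - 2
print(sse, round(s, 2))  # 2440 16.47

# Flag any residual at least two standard deviations from the line
print([e for e in residuals if abs(e) >= 2 * s])  # [35]
```

Only the residual of 35, belonging to the point (65, 175), exceeds two standard deviations.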


Note

When outliers are deleted, the researcher should either record that data were deleted, and why, or the researcher should provide results both with and without the deleted data. If data are erroneous and the correct values are known (e.g., student 1 actually scored a 70 instead of a 65), then this correction can be made to the data.

The next step is to compute a new best-fit line using the 10 remaining points. The new line of best fit and the correlation coefficient are

ŷ = –355.19 + 7.39x and r = 0.9121.


Example 12.12

Using this new line of best fit (based on the remaining 10 data points in the third exam/final exam example), what would a student who receives a 73 on the third exam expect to receive on the final exam? Is this the same as the prediction made using the original line?

Solution 1

Using the new line of best fit, ŷ = –355.19 + 7.39(73) = 184.28. A student who scored 73 points on the third exam would expect to earn 184 points on the final exam. The original line predicted that ŷ = –173.51 + 4.83(73) = 179.08, so the prediction using the new line with the outlier eliminated differs from the original prediction.


Try It 12.12

Consider the following data points: (1, 5), (2, 7), (2, 6), (3, 9), (4, 12), (4, 13), (5, 18), (6, 19), (7, 12), and (7, 21). Remove the outlier and recalculate the line of best fit. Find the value of ŷ when x = 10.


Example 12.13

The consumer price index (CPI) measures the average change over time in prices paid by urban consumers for consumer goods and services. The CPI affects nearly all Americans because of the many ways it is used. One of its biggest uses is as a measure of inflation. By providing information about price changes in the nation's economy to government, businesses, and labor forces, the CPI helps them make economic decisions. The president, U.S. Congress, and the Federal Reserve Board use CPI trends to form monetary and fiscal policies. In the following table, x is the year and y is the CPI.

x y x y
1915 10.1 1969 36.7
1926 17.7 1975 49.3
1935 13.7 1979 72.6
1940 14.7 1980 82.4
1947 24.1 1986 109.6
1952 26.5 1991 130.7
1964 31.0 1999 166.6

Table 12.7
  1. Draw a scatter plot of the data.
  2. Calculate the least-squares line. Write the equation in the form ŷ = a + bx.
  3. Draw the line on a scatter plot.
  4. Find the correlation coefficient. Is it significant?
  5. What is the average CPI for the year 1990?

Solution 1

  1. See Figure 12.18.
  2. Using our calculator, ŷ = –3204 + 1.662 x is the equation of the line of best fit.
  3. See Figure 12.18.
  4. r = 0.8694. The number of data points is n = 14. Use the 95 Percent Critical Values of the Sample Correlation Coefficient table at the end of Chapter 12: In this case, df = 12. The corresponding critical values from the table are ±0.532. Since 0.8694 > 0.532, r is significant. We can use the predicted regression line we found above to make the prediction for x = 1990.
  5. ŷ = –3204 + 1.662(1990) = 103.4 CPI.

Figure 12.18
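The least-squares line in this example can be reproduced in Python. The helper below (our own name) computes the coefficients from their defining sums; because the chapter's coefficients are rounded, the results land near, not exactly on, the quoted values:

```python
def fit_line(x, y):
    """Least-squares intercept and slope from the defining sums."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

years = [1915, 1926, 1935, 1940, 1947, 1952, 1964, 1969, 1975, 1979, 1980, 1986, 1991, 1999]
cpi = [10.1, 17.7, 13.7, 14.7, 24.1, 26.5, 31.0, 36.7, 49.3, 72.6, 82.4, 109.6, 130.7, 166.6]

a, b = fit_line(years, cpi)
print(round(b, 3))           # slope, close to the chapter's 1.662
print(round(a + b * 1990, 1))  # predicted 1990 CPI, close to the chapter's 103.4
```

Small differences from the chapter's numbers come from rounding the coefficients before substituting x = 1990.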

Note

In the example, notice the pattern of the points compared with the line. Although the correlation coefficient is significant, the pattern in the scatter plot indicates that a curve would be a more appropriate model to use than a line. In this example, a statistician would prefer to use other methods to fit a curve to these data, rather than model the data with the line we found. In addition to doing the calculations, it is always important to look at the scatter plot when deciding whether a linear model is appropriate.

If you are interested in seeing more years of data, visit the Bureau of Labor Statistics CPI website (ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt). Our data are taken from the column Annual Avg. (third column from the right). For example, you could add more current years of data. Try adding the more recent years: 2004, CPI = 188.9; 2008, CPI = 215.3; and 2011, CPI = 224.9. See how this affects the model. (Check: ŷ = –4436 + 2.295 x; r = 0.9018. Is r significant? Is the fit better with the addition of the new points?)


Try It 12.13

The following table shows economic development measured in per capita income (PCINC).

Year PCINC Year PCINC
1870 340 1920 1,050
1880 499 1930 1,170
1890 592 1940 1,364
1900 757 1950 1,836
1910 927 1960 2,132

Table 12.8

  1. What are the independent and dependent variables?
  2. Draw a scatter plot.
  3. Use regression to find the line of best fit and the correlation coefficient.
  4. Interpret the significance of the correlation coefficient.
  5. Is there a linear relationship between the variables?
  6. Find the coefficient of determination and interpret it.
  7. What is the slope of the regression equation? What does it mean?
  8. Use the line of best fit to estimate PCINC for 1900 and for 2000.
  9. Determine whether there are any outliers.

95 Percent Critical Values of the Sample Correlation Coefficient Table

Degrees of Freedom: n – 2 Critical Values: + and –
1 0.997
2 0.950
3 0.878
4 0.811
5 0.754
6 0.707
7 0.666
8 0.632
9 0.602
10 0.576
11 0.555
12 0.532
13 0.514
14 0.497
15 0.482
16 0.468
17 0.456
18 0.444
19 0.433
20 0.423
21 0.413
22 0.404
23 0.396
24 0.388
25 0.381
26 0.374
27 0.367
28 0.361
29 0.355
30 0.349
40 0.304
50 0.273
60 0.250
70 0.232
80 0.217
90 0.205
100 0.195

Table 12.9