Introduction to Linear Regression

Site: Saylor Academy
Course: MA121: Introduction to Statistics
Book: Introduction to Linear Regression
Printed by: Guest user
Date: Tuesday, April 23, 2024, 8:00 AM

Description

This section defines simple linear regression, uses scatter plots to reveal linear patterns, and talks about prediction error. It also discusses how to compute regression line by minimizing squared errors.

Introduction to Linear Regression

Learning Objectives

  1. Define linear regression
  2. Identify errors of prediction in a scatter plot with a regression line

In simple linear regression, we predict scores on one variable from the scores on a second variable. The variable we are predicting is called the criterion variable and is referred to as Y. The variable we are basing our predictions on is called the predictor variable and is referred to as X. When there is only one predictor variable, the prediction method is called simple regression. In simple linear regression, the topic of this section, the predictions of Y when plotted as a function of X form a straight line.

The example data in Table 1 are plotted in Figure 1. You can see that there is a positive relationship between X and Y. If you were going to predict Y from X, the higher the value of X, the higher your prediction of Y.

Table 1. Example data.

X Y
1.00 1.00
2.00 2.00
3.00 1.30
4.00 3.75
5.00 2.25


Figure 1. A scatter plot of the example data.

Linear regression consists of finding the best-fitting straight line through the points. The best-fitting line is called a regression line. The black diagonal line in Figure 2 is the regression line and consists of the predicted score on Y for each possible value of X. The vertical lines from the points to the regression line represent the errors of prediction. As you can see, the red point is very near the regression line; its error of prediction is small. By contrast, the yellow point is much higher than the regression line and therefore its error of prediction is large.


Figure 2. A scatter plot of the example data. The black line consists of the predictions, the points are the actual data, and the vertical lines between the points and the black line represent errors of prediction.

The error of prediction for a point is the value of the point minus the predicted value (the value on the line). Table 2 shows the predicted values (Y') and the errors of prediction (Y-Y'). For example, the first point has a Y of 1.00 and a predicted Y (called Y') of 1.21. Therefore, its error of prediction is -0.21.

Table 2. Example data.

X Y Y' Y-Y' (Y-Y')^2
1.00 1.00 1.210 -0.210 0.044
2.00 2.00 1.635 0.365 0.133
3.00 1.30 2.060 -0.760 0.578
4.00 3.75 2.485 1.265 1.600
5.00 2.25 2.910 -0.660 0.436


You may have noticed that we did not specify what is meant by "best-fitting line." By far, the most commonly-used criterion for the best-fitting line is the line that minimizes the sum of the squared errors of prediction. That is the criterion that was used to find the line in Figure 2. The last column in Table 2 shows the squared errors of prediction. The sum of the squared errors of prediction shown in Table 2 is lower than it would be for any other regression line.

The formula for a regression line is

Y' = bX + A

where Y' is the predicted score, b is the slope of the line, and A is the Y intercept. The equation for the line in Figure 2 is

Y' = 0.425X + 0.785

For X = 1,

Y' = (0.425)(1) + 0.785 = 1.21.

For  X = 2,

Y' = (0.425)(2) + 0.785 = 1.64.


Computing the Regression Line

In the age of computers, the regression line is typically computed with statistical software. However, the calculations are relatively easy, and are given here for anyone who is interested. The calculations are based on the statistics shown in Table 3. M_X is the mean of X, M_Y is the mean of Y, s_X is the standard deviation of X, s_Y is the standard deviation of Y, and r is the correlation between X and Y.

Table 3. Statistics for computing the regression line.

M_X M_Y s_X s_Y r
3 2.06 1.581 1.072 0.627


The slope (b) can be calculated as follows:

b = r s_Y/s_X

and the intercept (A) can be calculated as

A = M_Y - bM_X.

For these data,

b = (0.627)(1.072)/1.581 = 0.425

A = 2.06 - (0.425)(3) = 0.785

Note that the calculations have all been shown in terms of sample statistics rather than population parameters. The formulas are the same; simply use the parameter values for means, standard deviations, and the correlation.


Standardized Variables

The regression equation is simpler if variables are standardized so that their means are equal to 0 and standard deviations are equal to 1, for then b = r and A = 0. This makes the regression line:

Z_{Y'} = (r)(Z_X)

where Z_{Y'} is the predicted standard score for Y, r is the correlation, and Z_X is the standardized score for X. Note that the slope of the regression equation for standardized variables is r.


A Real Example

The case study "SAT and College GPA" contains high school and university grades for 105 computer science majors at a local state school. We now consider how we could predict a student's university GPA if we knew his or her high school GPA.

Figure 3 shows a scatter plot of University GPA as a function of High School GPA. You can see from the figure that there is a strong positive relationship. The correlation is 0.78. The regression equation is

\text {University GPA'} = (0.675) \text{ (High School
    GPA)} + 1.097

Therefore, a student with a high school GPA of 3 would be predicted to have a university GPA of

\text{ University GPA'} = (0.675)(3)
    + 1.097 = 3.12.


Figure 3. University GPA as a function of High School GPA.


Assumptions

It may surprise you, but the calculations shown in this section are assumption-free. Of course, if the relationship between X and Y were not linear, a different shaped function could fit the data better. Inferential statistics in regression are based on several assumptions, and these assumptions are presented in a later section of this chapter.



Source: David M. Lane , https://onlinestatbook.com/2/regression/intro.html
Public Domain Mark This work is in the Public Domain.

Video

 

Questions

Question 1 out of 7.
The formula for a regression equation is Y' = 3X - 2. What would be the predicted Y score for a person scoring 4 on X?


Question 2 out of 7.
Suppose it is possible to predict a person's score on Test B from the person's score on Test A. The regression equation is: B' = 2.3A + 9.5. What is a person's predicted score on Test B assuming this person got a 40 on Test A?


Question 3 out of 7.
Suppose a person got a score of 32.5 on Test A and a score of 95.25 on Test B. Using the same regression equation as in the previous problem (B' = 2.3A
 + 9.5), what is the error of prediction for this person?


Question 4 out of 7.
What is the most common criterion used to determine the best-fitting line?

The line that goes through the most points

The line that has the same number of points above it as below it

The line that minimizes the sum of squared errors of prediction


Question 5 out of 7.
The mean of X is 3 and the mean of Y is 7. The regression line that predicts Y from X necessarily goes through the point (3,7).

True

False


Question 6 out of 7.
You want to be able to predict a woman's shoe size from her height. You have gathered this information from your female classmates. The mean height of women in your class is 64 inches, and the standard deviation is 2 inches. The mean shoe size is 8, and the standard deviation is 1. The correlation between these two variables is .5. What is the slope of the regression line?


Question 7 out of 7.
What is the slope of the regression line when predicting Y from X?

X	Y
 10	 11
 16	 16
 13	 14
 10	 12
 12	 12
 13	 10
 11	 12
 11	 11
  9	  9
 10	 12
  8	 12
 13	 12
  8	  8
 10	 12
 12	 12
  8	  8
  6	 11
  8	  7
 12	 13
 13	 10
  6	  9
 13	 12
 12	 10
 13	 12
  8	  9

Answers

  1. Plug  X = 4 into the equation to find  Y'= 3(4) - 2 = 10

  2. Plug  A = 40 into the equation to find  B'= 2.3(40) + 9.5 = 101.5

  3. Predicted value,  B' = 2.3(32.5) + 9.5 = 84.25 ; Error of prediction =  B - B' = 95.25 - 84.25 = 11

  4. Plug   A = 40 into the equation to find  B'= 2.3(40) + 9.5 = 101.5

  5. Someone who scored the mean on  X would be predicted to score the mean on  Y .

  6.  b = r(sY/sX) = .5(1/2) = .25

  7.  b = 0.9143