Regression Basics

Read this chapter, which provides a general overview of regression. Focus on the Correlation and covariance section. How would you define correlation and covariance?

Simple regression and least squares method

To estimate the unknown population parameters of the regression line, we need a method that makes the vertical distances between the yet-to-be-estimated regression line and the observed values in our sample as small as possible. Each such distance is a sample error, more commonly called a residual and denoted by e. More formally, the residual for each pair of observations on x and y is the difference between y and its predicted value. Some of these residuals will be positive (the observation lies above the estimated line) and others will be negative (below the line). If we simply added them up, the positive and negative residuals would cancel each other out, so we square each residual before summing over the sample. This gives the following criterion for our minimization problem:

S=\operatorname{Min} \sum_{i=1}^{n}(y-\hat{y})^{2}

S is the sum of squares of the residuals. Minimizing S for any given set of observations on x and y yields the following formula for the slope b of the estimated line:

b=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sum(x-\bar{x})^{2}}

After computing b from the above formula using our sample data, together with the means of the two data series x and y, we can recover the intercept of the estimated line from the following equation:

a=\bar{y}-b \bar{x}
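
For readers who want the intermediate steps, both formulas follow from setting the partial derivatives of S with respect to a and b equal to zero. The lines below are a sketch of that standard derivation, with the fitted line written as \hat{y}=a+b x and all sums taken over the n observations.

S=\sum(y-a-b x)^{2}

\frac{\partial S}{\partial a}=-2 \sum(y-a-b x)=0 \Rightarrow a=\bar{y}-b \bar{x}

\frac{\partial S}{\partial b}=-2 \sum x(y-a-b x)=0

Substituting a=\bar{y}-b \bar{x} into the second condition, and using the fact that \sum(y-\bar{y})=0 and \sum(x-\bar{x})=0, gives:

\sum(x-\bar{x})\left[(y-\bar{y})-b(x-\bar{x})\right]=0 \Rightarrow b=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sum(x-\bar{x})^{2}}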

For the sample data, and given the estimated intercept and slope, for each observation we can define a residual as:

e=y-\hat{y}=y-a-b x

Given the estimated values for the intercept and slope, we can draw the estimated line together with all the sample data in the x-y plane. Such graphs are known as scatter diagrams. Consider our analysis of the price of one-bedroom apartments in Nelson, BC. We would collect data for y = price of a one-bedroom apartment, x1 = its distance from downtown, and x2 = the size of the apartment, as shown in Table 8.1.

y= price of apartments in $1000
x1= distance of each apartment from downtown in kilometres
x2= size of the apartment in square feet
y       x1      x2
55      1.5     350
51      3       450
60      1.75    300
75      1       450
55.5    3.1     385
49      1.6     210
65      2.3     380
61.5    2       600
55      4       450
45      5       325
75      0.65    424
65      2       285

Table 8.1 Data for Price, Size, and Distance of Apartments in Nelson, BC


The graph (shown in Figure 8.1) is a scatter plot of the prices of the apartments and their distances from downtown, along with a proposed regression line.

Figure 8.1 Scatter Plot of Price, Distance from Downtown, along with a Proposed Regression Line


To plot such a scatter diagram, you can use any of the many available statistical software packages, including Excel, SAS, and Minitab. The diagram shows a negatively sloped simple regression line. The estimated equation for this scatter diagram from Excel is:

\hat{y}=71.84-5.38 x

Here, a=71.84 and b=-5.38. In other words, for every additional kilometre an apartment is located from downtown, its estimated price is $5,380 lower, i.e. 5.38*$1000=$5,380. One might also be curious about the fitted values from this estimated model. You can simply plug each actual value of x into the estimated line to find the fitted price of each apartment; the residual is then the actual price minus the fitted price. The residuals for all 12 observations are shown in Figure 8.2.

Figure 8.2
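
The estimates, fitted values, and residuals discussed above can also be reproduced outside of Excel. The following is a minimal Python sketch that applies the formulas for b and a to the price and distance columns of Table 8.1. The function name `fit_simple_regression` and the variable names are illustrative choices, not part of the original text.

```python
# Table 8.1: price (in $1000) and distance from downtown (in km)
price = [55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65]
distance = [1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2]


def fit_simple_regression(x, y):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar  # intercept recovered from the two means
    return a, b


a, b = fit_simple_regression(distance, price)

# Fitted value: plug each observed x into the estimated line
fitted = [a + b * xi for xi in distance]
# Residual: actual price minus fitted price
residuals = [yi - fi for yi, fi in zip(price, fitted)]

print(f"a = {a:.2f}, b = {b:.2f}")
for xi, yi, fi, ei in zip(distance, price, fitted, residuals):
    print(f"x = {xi:5.2f}  y = {yi:5.1f}  fitted = {fi:6.2f}  residual = {ei:6.2f}")
```

Rounded to two decimals, the first printed line should agree with the Excel estimates a=71.84 and b=-5.38 reported above.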


Notice that minimizing the errors does not eliminate them; the method of least squares only guarantees the best-fitting estimated regression line for the sample data, in the sense that it has the smallest sum of squared residuals.
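
To make this point concrete, the sketch below evaluates the sum of squared residuals S for the least squares line and for a few alternative lines, using the price and distance data from Table 8.1. The alternative intercepts and slopes are arbitrary values chosen for illustration, and the helper name `sum_squared_residuals` is likewise an assumption of this example.

```python
# Table 8.1: price (in $1000) and distance from downtown (in km)
price = [55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65]
distance = [1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2]


def sum_squared_residuals(a, b, x, y):
    """S = sum of (y - a - b*x)^2 over all observations."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


# Least squares estimates from the slope and intercept formulas
n = len(price)
x_bar = sum(distance) / n
y_bar = sum(price) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(distance, price))
     / sum((xi - x_bar) ** 2 for xi in distance))
a = y_bar - b * x_bar

s_min = sum_squared_residuals(a, b, distance, price)
print(f"least squares line: S = {s_min:.2f}")

# Arbitrary alternative lines (illustrative values only)
alternatives = [(a + 1.0, b), (a - 1.0, b), (a, b + 0.5), (a, b - 0.5), (70.0, -5.0)]
for alt_a, alt_b in alternatives:
    s_alt = sum_squared_residuals(alt_a, alt_b, distance, price)
    print(f"a = {alt_a:6.2f}, b = {alt_b:5.2f}: S = {s_alt:.2f}")
```

Every alternative line should report a larger S than the least squares line, while S for the least squares line itself remains well above zero: the errors are minimized, not eliminated.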

The remaining errors suggest that other factors, not included in our regression model, may be responsible for the fluctuations in the residuals. By adding these excluded but relevant factors to the model, we would expect the remaining errors to show less systematic fluctuation. In determining the price of these apartments, the missing factors may include the age of the apartment, its size, etc. Because this type of regression model includes only one explanatory variable, leaving out other relevant factors, and assumes a linear relationship, it is known as a simple linear regression model.
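
As a rough illustration of adding an excluded factor, the sketch below refits the model with the size column (x2) from Table 8.1 alongside distance and compares the sums of squared residuals. It goes beyond the simple regression formulas given in this section: it assumes the standard normal equations approach for several explanatory variables, and the helpers `solve_linear_system` and `least_squares` are hypothetical names written for this example only.

```python
# Table 8.1: price (in $1000), distance (km), and size (square feet)
price = [55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65]
distance = [1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2]
size = [350, 450, 300, 450, 385, 210, 380, 600, 450, 325, 424, 285]


def solve_linear_system(matrix, rhs):
    """Solve matrix * coeffs = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (aug[row][n] - known) / aug[row][row]
    return coeffs


def least_squares(columns, y):
    """Fit y = c0 + c1*x1 + ... and return (coefficients, sum of squared residuals)."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    # Normal equations: (X'X) c = X'y
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    coeffs = solve_linear_system(xtx, xty)
    sse = sum((yi - sum(c * v for c, v in zip(coeffs, row))) ** 2
              for row, yi in zip(design, y))
    return coeffs, sse


coeffs_simple, sse_simple = least_squares([distance], price)
coeffs_multi, sse_multi = least_squares([distance, size], price)

print("distance only:     coefficients =", [round(c, 4) for c in coeffs_simple],
      " S =", round(sse_simple, 2))
print("distance and size: coefficients =", [round(c, 4) for c in coeffs_multi],
      " S =", round(sse_multi, 2))
```

Note that the sum of squared residuals can never increase when an explanatory variable is added, so a lower S on its own does not establish that size is a relevant factor; it only shows how the comparison would be set up.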