Regression Basics

Read this chapter, which provides a general overview of regression. Focus on the Correlation and covariance section. How would you define correlation and covariance?

Simple regression and least squares method

Coefficient of Determination

If you use the sample mean to predict the price of each apartment, your error is (y-\bar{y}) for each apartment. Squaring each error so that negative and positive errors do not cancel, and then adding the squared errors together, gives you a measure of the total mistake you make if you use the mean to predict y. Your total mistake is \Sigma(y-\bar{y})^{2}. The total mistake you make using the regression model would be \Sigma(y-\hat{y})^{2}. The difference between the mistakes, a raw measure of how much your prediction has improved, is \Sigma(\hat{y}-\bar{y})^{2}. To make this raw measure of the improvement meaningful, you need to compare it to one of the two measures of the total mistake. This means that there are two measures of "how good" your regression equation is. One compares the improvement to the mistakes still made with regression. The other compares the improvement to the mistakes that would be made if the mean was used to predict. The first is called an F-score because the sampling distribution of these measures follows the F-distribution introduced in "F-Test and One-Way ANOVA". The second is called R^{2}, or the coefficient of determination.

All of these mistakes and improvements have names, and talking about them will be easier once you know those names. The total mistake made using the sample mean to predict, \Sigma(y-\bar{y})^{2}, is called the sum of squares, total. The total mistake made using the regression, \Sigma(y-\hat{y})^{2}, is called the sum of squares, error (residual). The general improvement made by using regression, \Sigma(\hat{y}-\bar{y})^{2}, is called the sum of squares, regression, or sum of squares, model. You should be able to see that:

sum of squares, total = sum of squares, regression + sum of squares, error (residual)

\sum(y-\bar{y})^{2}=\sum(\hat{y}-\bar{y})^{2}+\sum(y-\hat{y})^{2}

In other words, the total variations in y can be partitioned into two sources: the explained variations and the unexplained variations. Further, we can rewrite the above equation as:

SST = SSR + SSE

where SST stands for the sum of squares due to total variations, SSR measures the sum of squares due to the estimated regression model (the variation explained by the variable x), and SSE measures all the variations due to other factors excluded from the estimated model.
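To make the decomposition concrete, here is a minimal sketch in Python (not part of the chapter's Excel templates) that computes SST, SSR, and SSE for a small made-up set of distance and price data and checks that the two sides of the identity agree. The numbers are illustrative only, not the Nelson, BC data.

import numpy as np

# Hypothetical data: distance to downtown (km) and apartment price ($1,000s).
x = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([160, 155, 150, 148, 140, 135, 130, 128])

# Fit the simple regression y-hat = b0 + b1*x by least squares.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total variation (mistake made using the mean)
ssr = np.sum((y_hat - y_bar) ** 2)  # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)      # residual variation (mistakes still made)

print(sst, ssr + sse)               # the two values should agree: SST = SSR + SSE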

Going back to the idea of goodness of fit, one should be able to easily calculate the percentage of each variation with respect to the total variations. In particular, the strength of the estimated regression model can now be measured. Since we are interested in the explained part of the variations by the estimated model, we simply divide both sides of the above equation by SST, and we get:

SST/SST = SSR/SST + SSE/SST

We then rearrange this equation to isolate the explained proportion, known as R-squared:

R^{2} = 1 - SSE/SST

Only in cases where an intercept is included in a simple regression model will the value of R^{2} be bounded between zero and one. The closer R^{2} is to one, the stronger the model is. Alternatively, R^{2} can also be found as:

R^{2}=\frac{\text{Sum of Squares Regression}}{\text{Sum of Squares Total}}

This is the ratio of the improvement made using the regression to the mistakes made using the mean. The numerator is the improvement regression makes over using the mean to predict; the denominator is the mistakes (errors) made using the mean. Thus R^{2} simply shows what proportion of the mistakes made using the mean are eliminated by using regression.
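Continuing the sketch above (reusing x, y, sst, ssr, sse, and the numpy import from it), both forms of R^{2} can be checked against each other; for a simple regression with an intercept they also equal the squared correlation between x and y.

r_squared_1 = 1 - sse / sst                    # R^2 = 1 - SSE/SST
r_squared_2 = ssr / sst                        # R^2 = SSR/SST
r_squared_3 = np.corrcoef(x, y)[0, 1] ** 2     # squared correlation between x and y
print(round(r_squared_1, 4), round(r_squared_2, 4), round(r_squared_3, 4))  # all three agree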

In the case of the market for one-bedroom apartments in Nelson, BC, the percentage of the variations in price explained by the model is estimated to be around 50%. This indicates that only half of the fluctuations in apartment prices around the average price can be explained by the apartments' distance from downtown. The other 50% is not captured by the model (that is, it is unexplained) and is subject to further research. One typical approach is to add more relevant factors to the simple regression model. In this case, the estimated model is referred to as a multiple regression model.

While R^{2} is not used to test hypotheses, it has a more intuitive meaning than the F-score. The F-score is the measure usually used in a hypothesis test to see if the regression made a significant improvement over using the mean. It is used because the sampling distribution of F-scores is printed in the tables at the back of most statistics books, so that it can be used for hypothesis testing, and it works no matter how many explanatory variables are used. More formally, consider a population of multivariate observations, \left(y, x_{1}, x_{2}, \ldots, x_{m}\right), where there is no linear relationship between y and the x's, so that y \neq \mathrm{f}\left(x_{1}, x_{2}, \ldots, x_{m}\right). If samples of n observations are taken, a regression equation estimated for each sample, and a statistic, F, found for each sample regression, then those F's will be distributed like those shown in Figure 8.5, the F-table with (m, n-m-1) df.
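As a rough illustration of this claim (my own sketch, not the textbook's template), the following simulation draws many samples in which y is generated independently of the x's, estimates a regression for each, and computes an F-score for each; the resulting F's behave like the F-distribution with (m, n-m-1) df. The sample size, number of explanatory variables, and number of repetitions are arbitrary choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m, reps = 30, 2, 5000                  # assumed sample size, number of x's, repetitions
f_scores = np.empty(reps)

for i in range(reps):
    X = rng.normal(size=(n, m))
    y_sim = rng.normal(size=n)            # y generated with no relationship to the x's
    X1 = np.column_stack([np.ones(n), X])           # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y_sim, rcond=None)
    y_hat_sim = X1 @ beta
    ssr_sim = np.sum((y_hat_sim - y_sim.mean()) ** 2)
    sse_sim = np.sum((y_sim - y_hat_sim) ** 2)
    f_scores[i] = (ssr_sim / m) / (sse_sim / (n - m - 1))

# Roughly 5% of the simulated F's should exceed the table F for alpha = .05 and (m, n-m-1) df.
table_f = stats.f.ppf(0.95, m, n - m - 1)
print(np.mean(f_scores > table_f))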


Figure 8.5 Interactive Excel Template of an F-Table

The value of F can be calculated as:

F=\frac{\frac{\text{Sum of Squares Regression}}{m}}{\frac{\text{Sum of Squares Residual}}{(n-m-1)}}

=\frac{\frac{\text{improvement made}}{m}}{\frac{\text{mistakes still made}}{(n-m-1)}}

=\frac{\frac{\sum(\hat{y}-\bar{y})^{2}}{m}}{\frac{\sum(y-\hat{y})^{2}}{(n-m-1)}}

where n is the size of the sample, and m is the number of explanatory variables (how many x's there are in the regression equation).
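Using the quantities from the first sketch (one explanatory variable and the eight made-up observations), the F-score follows directly from this formula; this is only a continuation of that illustrative example.

m = 1                                     # one explanatory variable (distance) in the simple regression
n = len(y)                                # sample size from the sketch above
f_score = (ssr / m) / (sse / (n - m - 1))
print(round(f_score, 2))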

If \Sigma(\hat{y}-\bar{y})^{2}, the sum of squares regression (the improvement), is large relative to \Sigma(y-\hat{y})^{2}, the sum of squares residual (the mistakes still made), then the F-score will be large. In a population where there is no functional relationship between y and the x's, the regression line will have a slope of zero (it will be flat), and the \hat{y} will be close to \bar{y}. As a result, very few samples from such populations will have a large sum of squares regression and large F-scores. Because this F-score is distributed like the one in the F-tables, the tables can tell you whether the F-score a sample regression equation produces is large enough to be judged unlikely to occur if y \neq \mathrm{f}\left(x_{1}, x_{2}, \ldots, x_{m}\right). The sum of squares regression is divided by the number of explanatory variables to account for the fact that it always increases when more variables are added. You can also look at this as finding the improvement per explanatory variable. The sum of squares residual is divided by a number very close to the number of observations because it always increases if more observations are added. You can also look at this as the approximate mistake per observation.

To test to see if a regression equation was worth estimating, test to see if there seems to be a functional relationship:

H_{0}: y \neq f\left(x_{1}, x_{2}, \cdots, x_{m}\right)

H_{a}: y = f\left(x_{1}, x_{2}, \cdots, x_{m}\right)

This might look like a two-tailed test since H_{a} has an equal sign. But, by looking at the equation for the F-score, you should be able to see that the data support H_{a} only if the F-score is large. This is because the data support the existence of a functional relationship if the sum of squares regression is large relative to the sum of squares residual. Since F-tables are usually one-tail tables, choose an \alpha, go to the F-tables for that \alpha and (m, n-m-1) df, and find the table F. If the computed F is greater than the table F, then the computed F is unlikely to have occurred if H_{0} is true, and you can safely decide that the data support H_{a}: there is a functional relationship in the population.
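The same decision rule can be sketched in code, continuing the illustrative example above (f_score, m, and n come from it); alpha = .05 is an arbitrary choice here, and scipy's F-distribution stands in for the printed F-table.

from scipy import stats

alpha = 0.05                                     # chosen significance level (assumption)
table_f = stats.f.ppf(1 - alpha, m, n - m - 1)   # the "table F" for (m, n-m-1) df
p_value = stats.f.sf(f_score, m, n - m - 1)      # area to the right of the computed F

if f_score > table_f:
    print(f"F = {f_score:.2f} > table F = {table_f:.2f}: the data support Ha")
else:
    print(f"F = {f_score:.2f} <= table F = {table_f:.2f}: do not reject Ho")
print(f"p-value = {p_value:.4f}")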

Now that you have learned all the necessary steps in estimating a simple regression model, you may take some time to re-estimate the Nelson apartment model, or any other simple regression model, using the interactive Excel template shown in Figure 8.6. Like all other interactive templates in this textbook, you can change the values in the yellow cells only. The result will be shown automatically within this template. For this template, you can only estimate simple regression models with 30 observations. Use Paste Special > Values when you paste your data from other spreadsheets. The first step is to enter your data under the independent and dependent variables. Next, select your alpha level. Check your results in terms of both individual and overall significance. Once the model has passed all these requirements, you can select an appropriate value for the independent variable, which in this example is the distance to downtown, to estimate both the confidence interval for the average price of such an apartment and the prediction interval for the selected distance. Both these intervals are discussed later in this chapter. Remember that by changing any of the values in the yellow areas in this template, all calculations will be updated, including the tests of significance and the values for both confidence and prediction intervals.


Figure 8.6 Interactive Excel Template for Simple Regression.