Regression Basics

Read this chapter, which provides a general overview of regression. Focus on the Correlation and covariance section. How would you define correlation and covariance?

Simple regression and least squares method

Multiple Regression Analysis

When we add more explanatory variables to a simple regression model to strengthen its ability to explain real-world data, we convert it into a multiple regression model. The least squares approach we used for simple regression can still be used for multiple regression analysis.

As discussed in the simple regression model section, our low estimated R^{2} indicated that only 50% of the variation in the price of apartments in Nelson, BC, was explained by their distance from downtown. There are likely more relevant factors that can be added to this model to make it stronger. Let's add a second explanatory variable to this model: the area of each apartment in square feet (i.e., x_{2}), for which we collected data. If we go back to Excel and estimate our model including the newly added variable, we will see the printout shown in Figure 8.7.

Figure 8.7 Excel Printout


The estimated equation of the regression model is:

predicted price of apartments = 60.041 - 5.393*distance + .03*area

This is the equation for a plane, the three-dimensional equivalent of a straight line. It is still a linear function because none of the x's or y is raised to a power or taken to a root, and the x's are not multiplied together. You can have even more independent variables, and as long as the function is linear, you can estimate the slope, \beta, for each independent variable.
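If you would like to reproduce a least squares fit like this outside Excel, here is a minimal Python sketch. The twelve observations below are hypothetical placeholders, not the actual Nelson, BC data behind Figure 8.7; only the mechanics of stacking a design matrix and solving by least squares carry over.

```python
import numpy as np

# Hypothetical data for illustration only -- the chapter's actual 12
# observations come from the Nelson, BC apartment sample in Figure 8.7.
distance = np.array([1.0, 0.5, 2.0, 3.5, 0.8, 1.2,
                     4.0, 2.5, 0.3, 1.8, 3.0, 2.2])   # km from downtown
area = np.array([650, 800, 700, 550, 900, 750,
                 500, 600, 850, 720, 580, 640])        # square feet
price = np.array([58, 62, 50, 42, 66, 55,
                  38, 47, 64, 52, 44, 49])             # price in $1000s

# Design matrix with a column of 1s for the intercept: y = b0 + b1*x1 + b2*x2
X = np.column_stack([np.ones_like(distance), distance, area])
coeffs, residuals, rank, _ = np.linalg.lstsq(X, price, rcond=None)
b0, b1, b2 = coeffs
print(f"predicted price = {b0:.3f} + {b1:.3f}*distance + {b2:.3f}*area")
```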

Before using this estimated model for prediction and decision-making purposes, we should test three hypotheses. First, we can use the F-score to test whether the regression model improves our ability to predict the price of apartments. In other words, we test the overall significance of the estimated model. Second and third, we can use the t-scores to test whether the slopes of distance and area are different from zero. These two t-tests are also known as individual tests of significance.

To conduct the first test, we choose an \alpha=.05. The F-score is the regression or model mean square over the residual or error mean square, so the df for the F-statistic are first the df for the regression model and, second, the df for the error. There are 2 and 9 df for the F-test. According to the F-table, with 2 and 9 df, the critical F-score for \alpha=.05 is 4.26.

The hypotheses are:

H_{0}: \text{price} \neq f(\text{distance, area})

H_{a}: \text{price} = f(\text{distance, area})

Because the F-score from the regression, 6.812, is greater than the critical F-score, 4.26, we decide that the data support H_{a} and conclude that the model helps us predict the price of apartments. Alternatively, we say there is such a functional relationship in the population.
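The critical F-score can also be checked without a printed F-table. Here is a short Python sketch that assumes only the F-score of 6.812 reported on the printout above:

```python
from scipy import stats

# Critical F for alpha = .05 with 2 (regression) and 9 (error) df
f_crit = stats.f.ppf(1 - 0.05, dfn=2, dfd=9)
print(round(f_crit, 2))   # 4.26, matching the F-table value in the text

# The regression F-score from the Excel printout
f_score = 6.812
print(f_score > f_crit)   # True -> the data support Ha: the model is significant overall
```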

Now, we move to the individual tests of significance. We can test whether price depends on distance and on area. There are (n-m-1)=(12-2-1)=9 \mathrm{df}. There are two sets of hypotheses, one set for \beta_{1}, the slope for distance, and one set for \beta_{2}, the slope for area. For a small town, one may expect \beta_{1}, the slope for distance, to be negative, and expect \beta_{2} to be positive. Therefore, we will use a one-tail test on \beta_{1}, as well as on \beta_{2}:

H_{0}: \beta_{1} \geq 0 \quad H_{a}: \beta_{1} < 0

H_{0}: \beta_{2} \leq 0 \quad H_{a}: \beta_{2} > 0

Since we have two one-tail tests, the critical t-value we choose from the t-table will be the same for the two tests. Using \alpha=.05 and 9 df, the one-tail critical t-score is 1.833. Looking back at our Excel printout and checking the t-scores, we decide that distance does affect the price of apartments, but area is not a significant factor in explaining the price of apartments. Notice that the printout also gives a t-score for the intercept, so we could test whether the intercept equals zero or not.
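The one-tail critical value quoted above is easy to verify outside a t-table; a minimal Python sketch:

```python
from scipy import stats

# One-tail critical t for alpha = .05 with n - m - 1 = 9 df
t_crit = stats.t.ppf(1 - 0.05, df=9)
print(round(t_crit, 3))   # 1.833

# Compare each slope's t-score from the printout against this value,
# minding the expected sign: t for distance should fall below -1.833,
# and t for area should exceed +1.833, to be significant at alpha = .05.
```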

Alternatively, one may directly compare the p-values from the Excel printout against the assumed level of significance (i.e., \alpha=.05). We can easily see that the p-values associated with the intercept and distance are both less than alpha, and as a result we reject the hypothesis that the associated coefficients are zero (i.e., both are significant). However, area is not a significant factor, since its associated p-value is greater than alpha.

While there are other required assumptions and conditions in both simple and multiple regression models (we encourage students to consult an intermediate business statistics open textbook for more detailed discussions), here we only focus on two relevant points about the use and applications of multiple regression.

The first point is related to the interpretation of the estimated coefficients in a multiple regression model. In a simple regression model, the estimated coefficient of the independent variable is simply the slope of the line: it measures the response of the dependent variable to a one-unit change in the independent variable. In a multiple regression model, this interpretation must be adjusted slightly. Each estimated coefficient is the response of the dependent variable to a one-unit change in one of the independent variables when the levels of all other independent variables are held constant. In our example, the estimated coefficient of distance indicates that, for a given size of apartment, the price of an apartment in Nelson, BC, will drop by 5.393*1000 = $5,393 for every kilometre the apartment is farther from downtown.
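To see this "holding all else constant" interpretation numerically, here is a small Python sketch using the estimated equation from Figure 8.7; the 700-square-foot area and the particular distances are arbitrary choices for illustration:

```python
# Estimated equation from Figure 8.7 (price in $1000s):
def predicted_price(distance_km: float, area_sqft: float) -> float:
    return 60.041 - 5.393 * distance_km + 0.03 * area_sqft

# Hold area fixed at 700 sq ft and move one kilometre farther from downtown:
delta = predicted_price(2.0, 700) - predicted_price(1.0, 700)
print(round(delta * 1000))   # -5393: price drops by $5,393 per extra km
```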

The second point is about the use of R^{2} in multiple regression analysis. Technically, adding more independent variables to the model will increase the value of R^{2}, regardless of whether the added variables are relevant or irrelevant in explaining the variation in the dependent variable. To correct for the inflation of R^{2} caused by irrelevant variables added to the model, the following formula is recommended in the case of multiple regression:

R_{A d j}^{2}=1-\left(1-R^{2}\right) \frac{n-1}{n-k}

where n is the sample size, and k is the number of estimated parameters in our model.
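As a quick check of this formula, here is a short Python version applied to the figures from our example (n = 12 observations, k = 3 estimated parameters, R^{2} = .612):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2, where k counts all estimated parameters (incl. the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

print(round(adjusted_r2(0.612, n=12, k=3), 3))   # 0.526, as in the Excel printout
```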

Going back to our earlier Excel results for the multiple regression model estimated for the apartment example, we can see that while R^{2} has been inflated from .504 to .612 by the newly added factor, apartment size, the adjusted R^{2} brings this inflated value down to .526. To understand this better, pay attention to the p-value associated with the newly added factor. Since this value is more than .05, we cannot reject the hypothesis that the true coefficient of apartment size (area) is zero. In other words, in its current form, apartment size is not a significant factor, yet the value of R^{2} has been inflated!

Furthermore, the R^{2} indicates that only 61.2% of the variation in the price of one-bedroom apartments in Nelson, BC, can be explained by their location and size (the adjusted figure, 52.6%, is lower still). Almost 40% of the variation in price remains unexplained by these two factors. One may seek to improve this model by searching for more relevant factors, such as the style of the apartment, the year it was built, etc., and adding them to the model.

Using the interactive Excel template shown in Figure 8.8, you can estimate a multiple regression model. Again, enter your data into the yellow cells only. This template allows up to 50 observations per column. As with all other interactive templates in this textbook, use Paste Special > Values when you paste your data from other spreadsheets. If you have fewer than 50 data entries, you must also fill out the rest of the empty yellow cells under X1, X2, and Y with zeros. Next, select your alpha level. After pressing Enter, you will not only have all your estimated coefficients along with their t-values, etc., you will also be told whether the model is significant both overall and individually. If the p-value associated with the F-value in the ANOVA table is not less than the selected alpha level, you will see a message indicating that your estimated model is not significant overall, and as a result no values for the C.I. and P.I. will be shown. By changing the alpha level and/or adding more accurate data, it may be possible to estimate a significant multiple regression model.


Figure 8.8 Interactive Excel Template for Multiple Regression Model

One more point concerns the functional form of your assumed multiple regression model. The association between the dependent variable and the independent variables may not always be linear. In reality, you will face cases where such relationships are better captured by a nonlinear model. Without going into the details of such models, the key idea is that you can transform your selected data for X1, X2, and Y before estimating your model. For instance, one possible nonlinear multiple regression model is one in which both the dependent and independent variables are expressed in natural logarithms rather than levels. To estimate such a model with the template in Figure 8.8, transform the data in all three columns in a separate sheet from levels to logarithms: enter =LN(A1), where cell A1 holds the first observation of X1, then =LN(B1) for X2, and so on. (Excel's LN function gives the natural logarithm; LOG defaults to base 10.) Finally, copy and use Paste Special > Values to place the transformed data into the yellow columns of the template. You have now estimated a multiple regression model with both sides in nonlinear (i.e., log) form.
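For readers working outside Excel, the same log-log transformation can be sketched in Python. The four observations below are hypothetical, and np.log plays the role of Excel's =LN(...):

```python
import numpy as np

# x1, x2, y are level data, as in the template's yellow columns
# (hypothetical values for illustration only)
x1 = np.array([1.0, 0.5, 2.0, 3.5])
x2 = np.array([650, 800, 700, 550])
y = np.array([58, 62, 50, 42])

# np.log is the natural logarithm, the equivalent of Excel's =LN(...)
ln_x1, ln_x2, ln_y = np.log(x1), np.log(x2), np.log(y)

# Fit the log-log model: ln(y) = b0 + b1*ln(x1) + b2*ln(x2)
X = np.column_stack([np.ones_like(ln_x1), ln_x1, ln_x2])
coeffs, *_ = np.linalg.lstsq(X, ln_y, rcond=None)
print(coeffs)   # in a log-log model, the slopes are read as elasticities
```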