Regression Basics

Read this chapter, which provides a general overview of regression. Focus on the Correlation and covariance section. How would you define correlation and covariance?

Simple regression and least squares method

Testing your regression: does y really depend on x?

Understanding that there is a distribution of y (apartment price) values at each x (distance) is the key to understanding how regression results from a sample can be used to test the hypothesis that there is (or is not) a relationship between x and y. When you hypothesize that y=\mathrm{f}(x), you hypothesize that the slope of the line (\beta \text { in } y=\alpha+\beta x+\varepsilon) is not equal to zero. If \beta were equal to zero, changes in x would not cause any change in y. Choosing a sample of apartments, and finding each apartment's price and distance to downtown, gives you a sample of (x, y) pairs. Finding the equation of the line that best fits the sample will give you a sample intercept, a, and a sample slope, b. These sample statistics are unbiased estimators of the population intercept, \alpha, and slope, \beta. If another sample of the same size is taken, another sample equation could be generated. If many samples are taken, a sampling distribution of sample b's, the slopes of the sample lines, will be generated. Statisticians know that this sampling distribution of b's will be normal with a mean equal to \beta, the population slope. Because the standard deviation of this sampling distribution is seldom known, statisticians developed a method to estimate it from a single sample. With this estimated s_{b}, a t-statistic for each sample can be computed:

t=\frac{b-\beta}{\text{estimated standard deviation of } b\text{'s}}=\frac{b-\beta}{s_{b}}

where n = sample size

m = number of explanatory (x) variables

b = sample slope

\beta = population slope

s_{b} = estimated standard deviation of b's, often called the standard error

These t's follow the t-distribution in the tables with n-m-1 df.
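This behaviour can be illustrated with a short simulation. The sketch below is not from the text: it assumes a hypothetical apartment-price population with a known intercept and slope (all of the numbers are made up), draws many samples, fits a line to each, and computes t = (b - \beta)/s_{b} for every sample, showing that the sample slopes vary around \beta and the resulting t-scores behave like a t-distribution with n-m-1 degrees of freedom.

```python
# A minimal simulation sketch with assumed population values (not the textbook's data).
import numpy as np

rng = np.random.default_rng(0)
alpha_pop, beta_pop = 200_000.0, -8_000.0   # assumed population intercept and slope
n, m = 30, 1                                # sample size and number of x variables
t_scores = []

for _ in range(5_000):
    x = rng.uniform(1, 20, size=n)                                  # distance to downtown
    y = alpha_pop + beta_pop * x + rng.normal(0, 20_000, size=n)    # price with random error
    b, a = np.polyfit(x, y, 1)                                      # sample slope b, intercept a
    residuals = y - (a + b * x)
    s_e = np.sqrt(np.sum(residuals**2) / (n - m - 1))               # std. error of the estimate
    s_b = s_e / np.sqrt(np.sum((x - x.mean())**2))                  # standard error of b
    t_scores.append((b - beta_pop) / s_b)

print("mean of the sample t-scores:", np.mean(t_scores))            # close to zero
print("degrees of freedom:", n - m - 1)
```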

Computing s_{b} is tedious, and is almost always left to a computer, especially when there is more than one explanatory variable. The estimate is based on how much the sample points vary from the regression line. If the points in the sample are not very close to the sample regression line, it seems reasonable that the population points are also widely scattered around the population regression line and different samples could easily produce lines with quite varied slopes. Though there are other factors involved, in general when the points in the sample are farther from the regression line, s_{b} is greater. Rather than learn how to compute s_{b}, it is more useful for you to learn how to find it on the regression results that you get from statistical software. It is often called the standard error and there is one for each independent variable. The printout in Figure 8.3 is typical.

Figure 8.3 Typical Statistical Package Output for Linear Simple Regression Model
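As a concrete example of reading the standard error from software output, the sketch below fits a simple regression with the statsmodels library on a small set of made-up (distance, price) observations; the data and variable names are illustrative assumptions, not the figures from the text. The slope's standard error appears in the coefficient table of the summary and can also be read directly from the fitted model.

```python
# A hedged sketch: made-up observations used only to show where the standard
# error of the slope appears in a typical regression printout.
import numpy as np
import statsmodels.api as sm

distance = np.array([1.5, 2.0, 3.5, 5.0, 6.5, 8.0, 10.0, 12.5, 15.0, 18.0])     # x, in km
price = np.array([260, 255, 240, 232, 221, 210, 205, 190, 182, 170]) * 1000.0   # y, in dollars

X = sm.add_constant(distance)            # adds the intercept column
results = sm.OLS(price, X).fit()

print(results.summary())                 # full printout, like Figure 8.3
print("standard errors:", results.bse)   # one per coefficient; the second is s_b for the slope
print("t-scores:", results.tvalues)      # b / s_b for the test of H0: beta = 0
```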


You will need these standard errors to test whether y depends on x. You want to test whether the slope of the line in the population, \beta, is equal to zero or not. If the slope equals zero, then changes in x do not result in any change in y. Formally, for each independent variable, you will have a test of the hypotheses:

H_{o}: \beta=0

H_{a}: \beta \neq 0

If the t-score is large (either negative or positive), then the sample b is far from zero (the hypothesized \beta), and H_{\mathrm{a}} should be accepted. Substitute zero for \beta into the t-score equation, and if the t-score is small, b is close enough to zero to accept H_{\mathrm{o}}. To find out what t-value separates "close to zero" from "far from zero", choose an alpha, find the degrees of freedom, and use a t-table from any textbook, or simply use the interactive Excel template from Chapter 3, which is shown again in Figure 8.4.


Figure 8.4 Interactive Excel Template for Determining t-Value from the t-Table
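The decision rule can also be carried out directly in code rather than with a t-table. The sketch below uses hypothetical values for b and s_{b} (in practice they would come from a regression printout like Figure 8.3) and compares the t-score to the critical value obtained from scipy; all of the numbers are illustrative assumptions.

```python
# A sketch of the two-tail decision rule, with made-up values for b and s_b.
from scipy import stats

b, s_b = -7200.0, 2400.0        # hypothetical sample slope and its standard error
n, m = 30, 1                    # hypothetical sample size and number of x variables
df = n - m - 1

t_score = (b - 0.0) / s_b                  # H0: beta = 0
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df)    # two-tail critical value

print(f"t = {t_score:.2f}, critical value = +/-{t_crit:.2f}")
if abs(t_score) > t_crit:
    print("b is far from zero: accept Ha, so y depends on x")
else:
    print("b is close enough to zero to accept H0")
```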

Remember to halve alpha when conducting a two-tail test like this. The degrees of freedom equal n-m-1, where n is the size of the sample and m is the number of independent x variables. There is a separate hypothesis test for each independent variable. This means you test to see if y is a function of each x separately. You can also test to see if \beta>0 (or \beta<0) rather than \beta \neq 0 by using a one-tail test, or test to see if \beta equals a particular value by substituting that value for \beta when computing the sample t-score.
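These two variations can be sketched the same way. The example below, again with made-up numbers, shows a one-tail test of H_{\mathrm{a}}: \beta<0 and a test of whether \beta equals a particular hypothesized value; note that for the one-tail test the full alpha goes in a single tail, so it is not halved.

```python
# Hedged sketches of a one-tail test and of testing beta against a particular value.
from scipy import stats

b, s_b, df, alpha = -7200.0, 2400.0, 28, 0.05   # hypothetical values

# One-tail test: H0: beta >= 0 versus Ha: beta < 0 (alpha is not halved).
t_score = (b - 0.0) / s_b
t_crit_low = stats.t.ppf(alpha, df)             # critical value in the lower tail
print("one-tail test:", "accept Ha" if t_score < t_crit_low else "accept H0")

# Two-tail test of a particular value: H0: beta = -10000 (an assumed value of interest).
t_score_2 = (b - (-10000.0)) / s_b
t_crit = stats.t.ppf(1 - alpha / 2, df)
print("test of beta = -10000:", "reject H0" if abs(t_score_2) > t_crit else "accept H0")
```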