Simple regression and least squares method
Testing your regression: does y really depend on x?
Understanding that there is a distribution of y (apartment price) values at each x (distance) is the key to understanding how regression results from a sample can be used to test the hypothesis that there is (or is not) a relationship between x and y. When you hypothesize that y = f(x), you hypothesize that the slope of the line, β, is not equal to zero. If β were equal to zero, changes in x would not cause any change in y. Choosing a sample of apartments, and finding each apartment's distance to downtown, gives you a sample of (x, y). Finding the equation of the line that best fits the sample will give you a sample intercept, a, and a sample slope, b. These sample statistics are unbiased estimators of the population intercept, α, and slope, β. If another sample of the same size is taken, another sample equation could be generated. If many samples are taken, a sampling distribution of sample b's, the slopes of the sample lines, will be generated. Statisticians know that this sampling distribution of b's will be normal with a mean equal to β, the population slope. Because the standard deviation of this sampling distribution is seldom known, statisticians developed a method to estimate it from a single sample. With this estimated s_b, a t-statistic for each sample can be computed:

t = (b − β)/s_b

where:

m = number of explanatory (x) variables
s_b = estimated standard deviation of the b's, often called the standard error

These t's follow the t-distribution in the tables with n − m − 1 df.
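For concreteness, the whole computation can be sketched in a few lines of Python: the least-squares slope b, the estimated standard deviation s_b, and the resulting t-score. The apartment data below are hypothetical, invented purely for illustration:

```python
import math

# Hypothetical sample: x = distance to downtown (km), y = monthly rent ($)
x = [1, 2, 3, 5, 8, 10]
y = [1500, 1400, 1300, 1200, 1000, 950]

n = len(x)
m = 1  # one explanatory variable
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope b and intercept a
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# The estimate of s_b is based on how far the sample points lie
# from the sample regression line (the residuals).
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s_squared = sum(e ** 2 for e in residuals) / (n - m - 1)
s_b = math.sqrt(s_squared / sxx)

# t-score for H0: beta = 0 (substitute zero for beta)
t = (b - 0) / s_b
print(f"b = {b:.2f}, s_b = {s_b:.2f}, t = {t:.2f}, df = {n - m - 1}")
```

With these made-up numbers the slope is strongly negative and the t-score is far from zero, so a relationship between distance and price would be supported.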
Computing s_b is tedious, and is almost always left to a computer, especially when there is more than one explanatory variable. The estimate is based on how much the sample points vary from the regression line. If the points in the sample are not very close to the sample regression line, it seems reasonable that the population points are also widely scattered around the population regression line, and different samples could easily produce lines with quite varied slopes. Though there are other factors involved, in general, when the points in the sample are farther from the regression line, s_b is greater. Rather than learn how to compute s_b, it is more useful for you to learn how to find it in the regression results that you get from statistical software. It is often called the standard error, and there is one for each independent variable. The printout in Figure 8.3 is typical.
Figure 8.3 Typical Statistical Package Output for Linear Simple Regression Model
You will need these standard errors in order to test whether y depends on x or not. You want to test whether the slope of the line in the population, β, is equal to zero or not. If the slope equals zero, then changes in x do not result in any change in y. Formally, for each independent variable, you will have a test of the hypotheses:

H₀: β = 0
Hₐ: β ≠ 0

If the t-score is large (either negative or positive), then the sample b is far from zero (the hypothesized β), and Hₐ should be accepted. Substitute zero for β into the t-score equation, and if the t-score is small, b is close enough to zero to accept H₀. To find out what t-value separates "close to zero" from "far from zero", choose an alpha, find the degrees of freedom, and use a t-table from any textbook, or simply use the interactive Excel template from Chapter 3, which is shown again in Figure 8.4.
Figure 8.4 Interactive Excel Template for Determining t-Value from the t-Table
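Once you have a t-score and the degrees of freedom, the decision rule itself is simple to sketch in Python. The critical values below are copied from a standard t-table for α = .05 in a two-tail test (.025 in each tail); the function name is invented for illustration:

```python
# Two-tail critical t-values for alpha = .05 (.025 in each tail),
# copied from a standard t-table for a few degrees of freedom.
T_CRIT_05 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             10: 2.228, 20: 2.086, 30: 2.042}

def reject_h0(t_score, df, crit_table=T_CRIT_05):
    """Reject H0: beta = 0 when |t| exceeds the critical value for df."""
    return abs(t_score) > crit_table[df]

# A t-score of -12.56 with df = 4 is far beyond 2.776: reject H0.
print(reject_h0(-12.56, 4))
# A t-score of 1.5 with df = 10 is inside 2.228: accept H0.
print(reject_h0(1.5, 10))
```

A statistical package spares you the table lookup by reporting a p-value alongside each t-score, but the comparison it performs is the same.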
Remember to halve alpha when conducting a two-tail test like this. The degrees of freedom equal n − m − 1, where n is the size of the sample and m is the number of independent (x) variables. There is a separate hypothesis test for each independent variable. This means you test whether y is a function of each x separately. You can also test whether β > 0 (or β < 0) rather than β ≠ 0 by using a one-tail test, or test whether β equals a particular value by substituting that value for β when computing the sample t-score.
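All of these variants use the same t-score equation with a different hypothesized value substituted for β. A minimal sketch (the slope and standard error below are made-up sample results):

```python
def t_score(b, beta_0, s_b):
    """t-score for testing H0: beta = beta_0, given sample slope b."""
    return (b - beta_0) / s_b

# Hypothetical sample results: slope b = -60.9, standard error s_b = 4.85
t_zero = t_score(-60.9, 0, 4.85)     # usual test of H0: beta = 0
t_fifty = t_score(-60.9, -50, 4.85)  # test of H0: beta = -50, a particular value
print(round(t_zero, 2), round(t_fifty, 2))
```

For a one-tail test (say, Hₐ: β < 0), compute the same t-score but compare it against a critical value with all of alpha in one tail rather than half in each.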