Read this chapter, which provides a general overview of regression. Focus on the Correlation and covariance section. How would you define correlation and covariance?
What is regression?
Before starting to learn about regression, go back to algebra and review what a function is. The definition of a function can be formal, like the one in my freshman calculus text: "A function is a set of ordered pairs of numbers (x,y) such that to each value of the first variable there corresponds a unique value of the second variable." More intuitively, if there is a regular relationship between two variables, there is usually a function that describes the relationship. Functions are written in a number of forms. The most general is $y = f(x)$, which simply says that the value of $y$ depends on the value of $x$ in some regular fashion, though the form of the relationship is not specified. The simplest functional form is the linear function:
$$y = \alpha + \beta x$$
where $\alpha$ and $\beta$ are parameters, remaining constant as $x$ and $y$ change. $\alpha$ is the intercept and $\beta$ is the slope. If the values of $\alpha$ and $\beta$ are known, you can find the $y$ that goes with any $x$ by putting the $x$ into the equation and solving. There can be functions where one variable depends on the values of two or more other variables, such as $y = f(x, z)$, where $x$ and $z$ together determine the value of $y$. There can also be non-linear functions, where the value of the dependent variable depends on powers, roots, or products of the other variables rather than on the variables themselves.
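As a small illustration of "putting the x into the equation and solving", the sketch below evaluates a linear function for a few inputs. The function names and the parameter values (intercept 2.0, slope 0.5) are made up for the example, and the two-variable version assumes a linear form that the text does not specify.

```python
def linear(x, alpha, beta):
    """Return y = alpha + beta * x for intercept alpha and slope beta."""
    return alpha + beta * x


def linear_two_inputs(x, z, alpha, beta_x, beta_z):
    """One possible (assumed linear) form of y = f(x, z)."""
    return alpha + beta_x * x + beta_z * z


# Hypothetical parameter values, chosen only for illustration.
alpha, beta = 2.0, 0.5

# Each x has exactly one y, which is what makes this a function.
ys = [linear(x, alpha, beta) for x in (0, 1, 4)]
print(ys)  # [2.0, 2.5, 4.0]
```

Because the parameters stay fixed while x changes, every unit increase in x moves y by exactly the slope.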
First, let us consider the simple case of a two-variable function. You believe that $y$, the dependent variable, is a linear function of $x$, the independent variable; that is, $y$ depends on $x$. Collect a sample of (x, y) pairs, and plot them on a set of x, y axes. The basic idea behind regression is to find the equation of the straight line that comes as close as possible to as many of the points as possible. The parameters of the line drawn through the sample are unbiased estimators of the parameters of the line that would come as close as possible to as many of the points as possible in the population, if the population had been gathered and plotted. In keeping with the convention of using Greek letters for population values and Roman letters for sample values, the line drawn through a population is:
$$y = \alpha + \beta x$$
while the line drawn through a sample is:
$$y = a + b x$$
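The passage does not say how "as close as possible" is measured. A common choice, assumed here, is ordinary least squares, which picks the sample intercept a and slope b that minimize the sum of squared vertical distances from the points to the line. The sketch below uses the standard closed-form formulas; the function name `fit_line` and the five data points are invented for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the sample line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


# Made-up sample of (x, y) pairs that lie close to, but not on, a line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
print(f"a = {a:.2f}, b = {b:.2f}")  # a = 0.05, b = 1.99
```

The values of a and b computed from a sample serve as estimates of the population parameters, which are usually unknown.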
In most cases, even if the whole population had been gathered, the regression line would not go through every point. Most of the phenomena that business researchers deal with are not perfectly deterministic, so no function will perfectly predict or explain every observation.
Imagine that you wanted to study the price of a one-bedroom apartment in Nelson, BC. You decide to estimate the price as a function of the apartment's distance from downtown. If you collected 12 sample pairs, you might find several apartments located at the same distance from downtown, each with a different price. In other words, you could draw a distribution of prices for the apartments located at any given distance from downtown. When you use regression to estimate the parameters of price = f(distance), you are estimating the parameters of the line that connects the mean price at each location. Because the best that can be expected is to predict the mean price for a certain location, researchers often write their regression models with an extra term, the error term, which notes that many of the members of the population of (location, price of apartment) pairs will not have exactly the predicted price because many of the points do not lie directly on the regression line. The error term is usually denoted as $\varepsilon$, or epsilon, and you often see regression equations written:
$$y = \alpha + \beta x + \varepsilon$$
Strictly, the distribution of $\varepsilon$ at each location must be normal, and the distributions of $\varepsilon$ for all the locations must have the same variance (this is known as homoscedasticity to statisticians).
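To make the role of the error term concrete, the following sketch simulates apartment prices under the stated assumptions: a linear mean price plus a normally distributed error with the same standard deviation at every distance. All numbers (an intercept of 1500, a slope of -50 per kilometre, an error standard deviation of 100) and the function name `simulate_prices` are hypothetical and are not real data for Nelson, BC.

```python
import random
from statistics import mean, pstdev


def simulate_prices(distance_km, n, alpha, beta, sigma, rng):
    """Draw n prices at one distance: alpha + beta*distance + normal error."""
    return [alpha + beta * distance_km + rng.gauss(0.0, sigma) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population parameters, chosen only for illustration.
ALPHA, BETA, SIGMA = 1500.0, -50.0, 100.0

summary = {}
for d in (1, 5, 10):
    prices = simulate_prices(d, 2000, ALPHA, BETA, SIGMA, rng)
    # Mean and spread of the simulated price distribution at this distance.
    summary[d] = (mean(prices), pstdev(prices))

for d, (m, s) in summary.items():
    print(f"distance {d:>2} km: mean price {m:8.1f}, "
          f"line predicts {ALPHA + BETA * d:8.1f}, spread {s:6.1f}")
```

At each distance the individual prices scatter around the line, while their mean sits close to the value the line predicts and their spread stays roughly the same, which is the homoscedasticity condition described above.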