BUS204 Study Guide

Unit 6: Correlation and Regression

6a. Identify the dependent and independent variables in the linear regression model

  • Why would the x variable be called independent, and the y variable be called dependent? 

The independent variable is the input (usually  x ) variable in a linear regression equation. It is called independent because it can be selected by the researcher.

The dependent variable is the output (usually  y ) variable in a linear regression equation. It is called dependent because it cannot be selected and is the result of an equation.

To decide which is which, ask yourself which variable you are trying to predict from the other. The result of the study is the dependent variable; the variable you would have to input is the independent variable.

To review, see The Correlation Coefficient r and Examples of Univariate and Multivariate Data.

 

6b. Calculate the equation of the regression line and plot it

  • What does the regression line drawn on a scatter plot tell you?
  • What must you do before calculating the regression line? Why is this necessary?

The regression line (usually called the least-squares regression line) is a linear equation in the form  y=a+bx that gives the best possible linear description of the relationship between  x and  y .

The regression equation will always give you a best-fit line, but that line is only meaningful if there is actually a linear relationship between  x and  y . You should calculate the linear correlation coefficient first to determine whether a linear relationship exists; if the correlation is weak, a fitted line can be misleading. A relationship between two variables is linear if the dots in the resulting scatter plot form roughly a straight line and you can reasonably use a linear equation  y=a+bx to predict  y from  x .

There are formulas given in the book to calculate the correlation coefficient and the regression line slope, but typically you will do this using computer software, apps, websites, or graphing calculators.
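As a concrete sketch of what that software is doing, the formulas can be carried out directly in Python. The data points here are made up purely for illustration (think hours studied vs. exam score):

```python
# Least-squares regression line y = a + b*x computed from raw sums.
# The data are hypothetical (e.g., hours studied vs. exam score).
x = [1, 2, 3, 4, 5]
y = [52, 60, 61, 70, 77]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope b first, then intercept a from the means (a = ybar - b*xbar)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(f"y = {a:.2f} + {b:.2f}x")  # y = 46.00 + 6.00x
```

A graphing calculator, spreadsheet, or statistics package will report the same  a and  b for these points.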

To review, see The Regression Equation.

 

6c. Describe the importance of the correlation coefficient and r-squared, and apply these concepts

  • How does the correlation coefficient relate to the slope of the regression line? What do they have in common? Will a correlation of 1 always give you the same sloped line?
  • What is the relationship between the correlation coefficient and r-squared? What does each tell you?
  • If the points in the scatter plot make a perfect semicircular shape, there is a clear relationship between  x and  y . Does this mean that the correlation r should be close to 1? Why or why not?

A scatter plot is an x-y diagram where all of the data pairs are plotted.

The population correlation coefficient ( \rho ) measures the strength of the linear relationship between  x and  y . The sample correlation coefficient, calculated from the data, is represented by the letter  r . The correlation is always a number between -1 and 1, inclusive. The closer  r is to 1, the stronger the positive fit (positively sloped line); the closer to -1, the stronger the negative fit (negatively sloped line). A correlation of 0 means there is no linear relationship at all. Values of exactly -1, 0, and 1 are very rare in practice; even randomly paired  x and  y values will usually produce an  r around ±0.1 to 0.2. You will only see a +1 or a -1 if all dots in the scatter plot are perfectly aligned.

R-squared is the square of the correlation coefficient. It is interpreted as the proportion of the variation in the dependent variable  y that can be explained by variation in the independent variable  x . So if the correlation  r = 0.6 , then  r^2 = 0.36 , and about 36% of the variation in  y can be explained by variation in  x .
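As a quick sketch with hypothetical data, both  r and r-squared can be computed directly from the sums formula (software would normally do this for you):

```python
import math

# Sample correlation coefficient r and r-squared from hypothetical data.
x = [1, 2, 3, 4, 5]
y = [52, 60, 61, 70, 77]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sx2 = sum(xi * xi for xi in x)
sy2 = sum(yi * yi for yi in y)

r = (n * sxy - sx * sy) / (math.sqrt(n * sx2 - sx ** 2)
                           * math.sqrt(n * sy2 - sy ** 2))
r_squared = r ** 2

print(round(r, 3), round(r_squared, 3))  # 0.981 0.963
```

Here about 96% of the variation in  y is explained by  x , which matches the strong positive  r .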

To review, see The Correlation Coefficient.

 

6d. Define outlier, identify examples of outliers, and describe what an outlier can do to summaries of data

  • What is an outlier? What should be done when encountering one?
  • What does it mean for a statistic to be resistant to outliers? 
  • What is the difference between the mean and median, and which should be used when outliers are present?

An outlier is a data point that doesn't seem to fit with the rest. Outliers can occur in both univariate (single-variable) data and the bivariate data of this unit, and they happen for various reasons. They may be true data, or they may result from a miscalculation or mismeasurement; for example, the data entry person in a weight-loss study might input a loss of 95 pounds instead of 9.5. Outliers cannot simply be discarded. If the statistician is unsure, they should contact the researcher or double-check with the data entry person.

Outliers in data will severely affect the mean; in other words, the mean is not resistant to outliers. An unusually high data value will pull the mean above the median and stretch the data out to the right, which is why we refer to such data as right-skewed. The same thing happens if the outlier is well below the rest of the data: the mean will be much lower than the median, in which case the data is left-skewed. In symmetric (not necessarily normally distributed) data, the mean and median are roughly the same, and there are either no outliers or, in the case of a bell curve, the outliers fall evenly on both sides.
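The effect is easy to see with a small made-up example, reusing the mistyped weight-loss entry described above:

```python
from statistics import mean, median

# Hypothetical weight-loss data; in bad_data, 9.5 was mistyped as 95.
data = [8, 9, 9.5, 10, 11]
bad_data = [8, 9, 95, 10, 11]

print(mean(data), median(data))          # 9.5 9.5
print(mean(bad_data), median(bad_data))  # 26.6 10
```

One bad entry nearly triples the mean, while the median barely moves; the median is resistant to outliers.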

Because the least-squares regression line is based on the squared vertical distances between the data points and the line, it is not resistant to outliers, and a single outlier can severely affect the equation. In this case, it is best to investigate the outlying data point and determine whether it is true. If it is a false entry, it may be discarded. If it is a true entry, whether to keep or exclude it depends on the study and the researcher.
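A tiny sketch (hypothetical data again) shows how strongly one bad point can move the slope:

```python
# Slope of the least-squares line, computed from raw sums.
def slope(x, y):
    n = len(x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sx2 = sum(xi * xi for xi in x)
    return (n * sxy - sum(x) * sum(y)) / (n * sx2 - sum(x) ** 2)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]      # perfectly linear: slope 2
y_bad = [2, 4, 6, 8, 50]  # last value is an outlier (perhaps a typo for 10)

print(slope(x, y))      # 2.0
print(slope(x, y_bad))  # 10.0
```

A single outlying point quintuples the estimated slope, which is exactly why outliers must be investigated before fitting a line.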

To review, see Linear Regression.

 

6e. Estimate a regression line and identify the effect of the independent variable on the dependent variable

  • When finding the correlation coefficient  r for a group of bivariate data, how can we know whether that  r value is significant – that is, whether there is a good linear fit for the data?
  • What is a good way to estimate the line of best fit?

You can estimate the equation of a regression line by eyeballing a line that best fits the data, finding two points on that line, and then using the rules of algebra to find the equation of a line ( y=mx+b ).
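For example, suppose the eyeballed line appears to pass through the hypothetical points (2, 50) and (8, 80); the algebra looks like this:

```python
# Equation of a line through two eyeballed points, via basic algebra.
x1, y1 = 2, 50   # hypothetical point read off the scatter plot
x2, y2 = 8, 80   # second hypothetical point on the eyeballed line

m = (y2 - y1) / (x2 - x1)  # slope = rise / run
b = y1 - m * x1            # intercept, from y1 = m*x1 + b

print(f"y = {m}x + {b}")   # y = 5.0x + 40.0
```

This is only an estimate; the least-squares formulas give the exact best-fit line.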

The significance of the correlation coefficient can be tested with a t-test, where the test statistic is  t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}} and the degrees of freedom are  n-2 .
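For instance, with a hypothetical sample of n = 20 pairs and r = 0.6, the test statistic works out as:

```python
import math

# t-statistic for testing the significance of r (hypothetical values).
r, n = 0.6, 20

t = (r * math.sqrt(n - 2)) / math.sqrt(1 - r ** 2)
df = n - 2

print(round(t, 3), df)  # 3.182 18
```

You would then compare this t value against the critical value of the t-distribution with 18 degrees of freedom (or find its p-value) to decide whether the correlation is significant.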

To review, see Testing the Significance of the Correlation Coefficient.

 

6f. Draw a scatter plot, find the equation of a least-squares line, and draw a least-squares line

  • What is the first step you must do in the process of finding the regression line?
  • Why is it called "least squares" regression?
  • Could non-linear data still generate a believable regression line? If so, how can we avoid finding a regression line for data that is not linear?

A scatter plot is an x-y diagram where all of the data pairs are plotted. If the data is linear in shape, we can estimate a line of best fit  y=a+bx , also called a least-squares line. Imagine drawing a vertical segment between each point and the regression line and making that segment the side of a square. The least-squares regression line is the line for which the sum of the areas of those squares is as small as possible, hence the name "least squares".

There are formulas to calculate the correlation coefficient and the slope of a regression line,

 r= \frac{n\sum xy -\sum x \sum y}{\sqrt{n\sum x^2-(\sum x)^2}\sqrt{n\sum y^2 - (\sum y)^2}}

 b=r\frac{s_{y}}{s_{x}}

but usually, these tasks will be done using computer software, apps, websites, or graphing calculators.

To find the full equation, use the formula to calculate the slope  b , then plug the mean values of  x and  y , along with  b , into the equation  \bar{y}=a+b\bar{x} and solve for  a .
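A short sketch of this route, using hypothetical data and Python's standard statistics module: compute  r from the sums formula, get the slope from  b=r\frac{s_y}{s_x} , then solve for the intercept from the means.

```python
import math
from statistics import mean, stdev

# Hypothetical data; compute r, then slope b = r*sy/sx, then intercept a.
x = [1, 2, 3, 4, 5]
y = [52, 60, 61, 70, 77]
n = len(x)

r = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / (
    math.sqrt(n * sum(xi * xi for xi in x) - sum(x) ** 2)
    * math.sqrt(n * sum(yi * yi for yi in y) - sum(y) ** 2))

b = r * stdev(y) / stdev(x)  # slope: b = r * sy / sx
a = mean(y) - b * mean(x)    # intercept: solve ybar = a + b*xbar for a

print(f"y = {a:.2f} + {b:.2f}x")  # y = 46.00 + 6.00x
```

This agrees with the direct sums formula for the slope; both are just different arrangements of the same least-squares solution.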

To review, see The Regression Equation.

 

Unit 6 Vocabulary

This vocabulary list includes terms that might help you with the review items above and some terms you should be familiar with to be successful in completing the final exam for the course. 

Try to think of the reason why each term is included.

  • correlation coefficient
  • dependent variable
  • independent variable
  • R-squared
  • regression line