Correlation
Site: | Saylor Academy |
Course: | MA121: Introduction to Statistics |
Book: | Correlation |
Description
Read these sections on correlation. You will learn the interpretation and calculation of the correlation coefficient, how to test its significance, and the relation between correlation and causation.
The Correlation Coefficient r
Besides looking at the scatter plot and seeing that a line seems reasonable, how can you tell if the line is a good predictor? Use the correlation coefficient as another indicator (besides the scatter plot) of the strength of the relationship between $x$ and $y$.
The correlation coefficient, $r$, developed by Karl Pearson in the early 1900s, is a numerical measure of the strength of association between the independent variable $x$ and the dependent variable $y$.
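The correlation coefficient is calculated as
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
where $n$ = the number of data points.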
If you suspect a linear relationship between $x$ and $y$, then $r$ can measure how strong the linear relationship is.
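As a minimal sketch of this calculation, the Python function below evaluates the usual computational formula for $r$, $\bigl(n\sum xy - \sum x \sum y\bigr) / \sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}$, directly from the sums. The function name `correlation_coefficient` and the five data points are hypothetical and chosen only for illustration.

```python
import math

def correlation_coefficient(xs, ys):
    """Pearson's r computed from the sums in the computational formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical illustration data (not taken from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation_coefficient(xs, ys)
print(round(r, 4))
```

If every $x$ value (or every $y$ value) is identical, the denominator is zero and $r$ is undefined; this sketch lets Python raise `ZeroDivisionError` in that case.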
What the VALUE of $r$ tells us:
- The value of $r$ is always between -1 and +1: $-1 \le r \le 1$.
- The size of the correlation $r$ indicates the strength of the linear relationship between $x$ and $y$. Values of $r$ close to -1 or to +1 indicate a stronger linear relationship between $x$ and $y$.
- If $r = 0$, there is absolutely no linear relationship between $x$ and $y$ (no linear correlation).
- If $r = 1$, there is perfect positive correlation. If $r = -1$, there is perfect negative correlation. In both these cases, all of the original data points lie on a straight line. Of course, in the real world, this will not generally happen.
What the SIGN of $r$ tells us:
- A positive value of $r$ means that when $x$ increases, $y$ tends to increase and when $x$ decreases, $y$ tends to decrease (positive correlation).
- A negative value of $r$ means that when $x$ increases, $y$ tends to decrease and when $x$ decreases, $y$ tends to increase (negative correlation).
- The sign of $r$ is the same as the sign of the slope, $b$, of the best fit line.
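The last point holds because the least-squares slope and $r$ share the same numerator, $n\sum xy - (\sum x)(\sum y)$, while both denominators are positive. The sketch below illustrates this with a hypothetical helper `slope_and_r` and two small made-up data sets, one rising and one falling.

```python
import math

def slope_and_r(xs, ys):
    """Return (slope b of the least-squares line, correlation coefficient r)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sxx = n * sum(x * x for x in xs) - sum_x ** 2   # always positive if x varies
    syy = n * sum(y * y for y in ys) - sum_y ** 2   # always positive if y varies
    shared_numerator = n * sum_xy - sum_x * sum_y   # determines the sign of both
    b = shared_numerator / sxx
    r = shared_numerator / math.sqrt(sxx * syy)
    return b, r

# Hypothetical data: y rises with x, then y falls with x.
rising = slope_and_r([1, 2, 3, 4], [2, 3, 5, 6])
falling = slope_and_r([1, 2, 3, 4], [6, 5, 3, 2])
print(rising, falling)
```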
Strong correlation does not suggest that $x$ causes $y$ or $y$ causes $x$. We say "correlation does not imply causation". For example, every person who learned math in the 17th century is dead. However, learning math does not necessarily cause death!
a. A scatter plot showing data with a positive correlation. b. A scatter plot showing data with a negative correlation.
c. A scatter plot showing data with zero correlation.
The formula for $r$ looks formidable. However, computer spreadsheets, statistical software, and many calculators can quickly calculate $r$. The correlation coefficient $r$ is the bottom item in the output screens for the LinRegTTest on the TI-83, TI-83+, or TI-84+ calculator (see previous section for instructions).
The Coefficient of Determination
$r^2$ is called the coefficient of determination. $r^2$ is the square of the correlation coefficient $r$, but is usually stated as a percent, rather than in decimal form. $r^2$ has an interpretation in the context of the data:
- $r^2$, when expressed as a percent, represents the percent of variation in the dependent variable $y$ that can be explained by variation in the independent variable $x$ using the regression (best fit) line.
- $1 - r^2$, when expressed as a percent, represents the percent of variation in $y$ that is NOT explained by variation in $x$ using the regression line. This can be seen as the scattering of the observed data points about the regression line.
Consider the third exam/final exam example introduced in the previous section.
The line of best fit is: $\hat{y} = -173.51 + 4.83x$
The correlation coefficient is $r = 0.6631$
The coefficient of determination is $r^2 = 0.6631^2 = 0.4397$
Interpretation of $r^2$ in the context of this example:
Approximately 44% of the variation (0.4397 is approximately 0.44) in the final exam grades can be explained by the variation in the grades on the third exam, using the best fit regression line.
Therefore approximately 56% of the variation (1 - 0.44 = 0.56) in the final exam grades can NOT be explained by the variation in the grades on the third exam, using the best fit regression line. (This is seen as the scattering of the points about the line).
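The arithmetic behind these two percentages can be checked in a few lines. This is only a sketch using the example's correlation coefficient of 0.6631; the variable names are arbitrary.

```python
r = 0.6631                      # correlation coefficient from the example
r_squared = r ** 2              # coefficient of determination

explained_pct = round(100 * r_squared)          # variation explained by the line
unexplained_pct = round(100 * (1 - r_squared))  # variation left unexplained

print(f"r^2 = {r_squared:.4f}")
print(f"explained: {explained_pct}%, unexplained: {unexplained_pct}%")
```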
**With contributions from Roberta Bloom.
Glossary
Coefficient of Correlation
A measure developed by Karl Pearson (early 1900s) that gives the strength of association between the independent variable and the dependent variable. The formula is:
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
where $n$ is the number of data points. The coefficient cannot be more than 1 or less than -1. The closer the coefficient is to ±1, the stronger the evidence of a significant linear relationship between $x$ and $y$.
Source: Barbara Illowsky, Ph.D.,Susan Dean, https://cnx.org/contents/XgdE-Z55@40.9:XEKQgmhr@12/Correlation-Coefficient-and-Coefficient-of-Determination This work is licensed under a Creative Commons Attribution 3.0 License.
Testing the Significance of the Correlation Coefficient
The correlation coefficient, $r$, tells us about the strength of the linear relationship between $x$ and $y$. However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient $r$ and the sample size $n$, together.
We perform a hypothesis test of the "significance of the correlation coefficient" to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.
The sample data is used to compute $r$, the correlation coefficient for the sample. If we had data for the entire population, we could find the population correlation coefficient. But because we only have sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, $r$, is our estimate of the unknown population correlation coefficient.
The symbol for the population correlation coefficient is $\rho$, the Greek letter "rho".
$\rho$ = population correlation coefficient (unknown)
$r$ = sample correlation coefficient (known; calculated from sample data)
The hypothesis test lets us decide whether the value of the population correlation coefficient $\rho$ is "close to 0" or "significantly different from 0". We decide this based on the sample correlation coefficient $r$ and the sample size $n$.
If the test concludes that the correlation coefficient is significantly different from 0, we say that the correlation coefficient is "significant".
- Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship between $x$ and $y$ because the correlation coefficient is significantly different from 0".
- What the conclusion means: There is a significant linear relationship between $x$ and $y$. We can use the regression line to model the linear relationship between $x$ and $y$ in the population.
If the test concludes that the correlation coefficient is not significantly different from 0 (it is close to 0), we say that the correlation coefficient is "not significant".
- Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between $x$ and $y$ because the correlation coefficient is not significantly different from 0".
- What the conclusion means: There is not a significant linear relationship between $x$ and $y$. Therefore we can NOT use the regression line to model a linear relationship between $x$ and $y$ in the population.
- If $r$ is significant and the scatter plot shows a linear trend, the line can be used to predict the value of $y$ for values of $x$ that are within the domain of observed $x$ values.
- If $r$ is not significant OR if the scatter plot does not show a linear trend, the line should not be used for prediction.
- If $r$ is significant and if the scatter plot shows a linear trend, the line may NOT be appropriate or reliable for prediction OUTSIDE the domain of observed $x$ values in the data.
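One way to respect the domain restriction in practice is to refuse to extrapolate. The sketch below assumes the significance test and the scatter plot check have already been passed; the function name `predict_within_domain`, the fitted line, and the observed range are all hypothetical.

```python
def predict_within_domain(x, intercept, slope, x_min, x_max):
    """Predict y from the fitted line, but only inside the observed x range."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} is outside the observed domain [{x_min}, {x_max}]"
        )
    return intercept + slope * x

# Hypothetical fitted line y-hat = 10 + 2x with observed x values from 0 to 20.
inside = predict_within_domain(5, intercept=10, slope=2, x_min=0, x_max=20)

try:
    predict_within_domain(25, intercept=10, slope=2, x_min=0, x_max=20)
    extrapolated = True
except ValueError:
    extrapolated = False  # the out-of-range request was rejected

print(inside, extrapolated)
```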
PERFORMING THE HYPOTHESIS TEST
SETTING UP THE HYPOTHESES:
- Null Hypothesis: $H_0: \rho = 0$
- Alternate Hypothesis: $H_a: \rho \neq 0$
What the hypotheses mean in words:
- Null Hypothesis $H_0$: The population correlation coefficient IS NOT significantly different from 0. There IS NOT a significant linear relationship (correlation) between $x$ and $y$ in the population.
- Alternate Hypothesis $H_a$: The population correlation coefficient IS significantly DIFFERENT FROM 0. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between $x$ and $y$ in the population.
DRAWING A CONCLUSION:
There are two methods to make the decision. Both methods are equivalent and give the same result.
Method 1: Using the p-value
Method 2: Using a table of critical values
In this chapter of this textbook, we will always use a significance level of 5%, $\alpha = 0.05$.
Note: Using the p-value method, you could choose any appropriate significance level you want; you are not limited to using $\alpha = 0.05$. But the table of critical values provided in this textbook assumes that we are using a significance level of 5%, $\alpha = 0.05$. (If we wanted to use a different significance level than 5% with the critical value method, we would need different tables of critical values that are not provided in this textbook.)
METHOD 1: Using a p-value to make a decision
The linear regression t-test LinRegTTest on the TI-83+ or TI-84+ calculators calculates the p-value.
On the LinRegTTest input screen, on the line prompt for $\beta$ or $\rho$, highlight "$\neq 0$".
The output screen shows the p-value on the line that reads "p =".
(Most computer statistical software can calculate the p-value.)
If the p-value is less than the significance level ($\alpha = 0.05$):
- Decision: REJECT the null hypothesis.
- Conclusion: "There is sufficient evidence to conclude that there is a significant linear relationship between $x$ and $y$ because the correlation coefficient is significantly different from 0".
If the p-value is NOT less than the significance level ($\alpha = 0.05$):
- Decision: DO NOT REJECT the null hypothesis.
- Conclusion: "There is insufficient evidence to conclude that there is a significant linear relationship between $x$ and $y$ because the correlation coefficient is NOT significantly different from 0".
Calculation Notes:
You will use technology to calculate the p-value. The following describes the calculations used to compute the test statistic and the p-value:
The p-value is calculated using a $t$-distribution with $n - 2$ degrees of freedom.
The formula for the test statistic is $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$. The value of the test statistic, $t$, is shown in the computer or calculator output along with the p-value. The test statistic $t$ has the same sign as the correlation coefficient $r$.
The p-value is the combined area in both tails.
An alternative way to calculate the p-value ($p$) given by LinRegTTest is the command 2*tcdf(abs(t),10^99, n-2) in 2nd DISTR.
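For readers without a calculator, the same calculation can be sketched in plain Python. The code below assumes the standard test statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n - 2$ degrees of freedom, and approximates the two-tailed area by numerically integrating the $t$ density with Simpson's rule. This is a simplification chosen to avoid external libraries, not the method the calculator uses. The function names are hypothetical; the inputs $r = 0.6631$ and $n = 11$ correspond to the third exam/final exam example.

```python
import math

def t_statistic(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def t_pdf(u, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_const - (df + 1) / 2 * math.log(1 + u * u / df))

def two_tailed_p_value(t, df, steps=2000):
    """Combined area in both tails beyond |t| (Simpson's rule on [0, |t|])."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3          # area between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central_area)

r, n = 0.6631, 11
t = t_statistic(r, n)
p = two_tailed_p_value(t, n - 2)
reject_null = p < 0.05                    # significance level alpha = 0.05
print(f"t = {t:.3f}, p-value = {p:.3f}, reject H0: {reject_null}")
```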
THIRD EXAM vs FINAL EXAM EXAMPLE: p value method
- Consider the third exam/final exam example.
- The line of best fit is: $\hat{y} = -173.51 + 4.83x$ with $r = 0.6631$ and there are $n = 11$ data points.
- Can the regression line be used for prediction? Given a third exam score ($x$ value), can we use the line to predict the final exam score (predicted $y$ value)?
The p-value, 0.026, is less than the significance level of $\alpha = 0.05$.
Decision: Reject the Null Hypothesis $H_0$.
Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between $x$ and $y$ because the correlation coefficient is significantly different from 0.
Because $r$ is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.
METHOD 2: Using a table of Critical Values to make a decision
The 95% Critical Values of the Sample Correlation Coefficient Table at the end of this chapter (before the Summary) may be used to give you a good idea of whether the computed value of $r$ is significant or not. Compare $r$ to the appropriate critical value in the table. If $r$ is not between the positive and negative critical values, then the correlation coefficient is significant. If $r$ is significant, then you may want to use the line for prediction.
Suppose you computed $r = 0.801$ using $n = 10$ data points. $df = n - 2 = 10 - 2 = 8$. The critical values associated with $df = 8$ are -0.632 and +0.632. If $r <$ negative critical value or $r >$ positive critical value, then $r$ is significant. Since $r = 0.801$ and 0.801 > 0.632, $r$ is significant and the line may be used for prediction. Viewing this example on a number line may help.
Figure 1. $r$ is not significant between -0.632 and +0.632. $r = 0.801 > +0.632$. Therefore, $r$ is significant.
Suppose you computed $r = -0.624$ with 14 data points. $df = 14 - 2 = 12$. The critical values are -0.532 and 0.532. Since −0.624 < −0.532, $r$ is significant and the line may be used for prediction.
Figure 2. $r = -0.624 < -0.532$. Therefore, $r$ is significant.
Suppose you computed $r = 0.776$ and $n = 6$. $df = 6 - 2 = 4$. The critical values are -0.811 and 0.811. Since −0.811 < 0.776 < 0.811, $r$ is not significant and the line should not be used for prediction.
Figure 3. $-0.811 < r = 0.776 < 0.811$. Therefore, $r$ is not significant.
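The three comparisons above follow the same rule: look up the critical value for $df = n - 2$ and check whether $|r|$ exceeds it. A minimal sketch is shown below. The dictionary holds only the critical values quoted in this section, not the full table, and the names `CRITICAL_VALUES_95` and `is_significant` are hypothetical.

```python
# 95% critical values quoted in this section, keyed by df = n - 2.
# This is a partial table: only entries mentioned in the text are included.
CRITICAL_VALUES_95 = {4: 0.811, 7: 0.666, 8: 0.632, 9: 0.602, 12: 0.532, 17: 0.456}

def is_significant(r, n, table=CRITICAL_VALUES_95):
    """True if r lies outside the interval (-critical value, +critical value)."""
    df = n - 2
    if df not in table:
        raise KeyError(f"no critical value stored for df = {df}")
    return abs(r) > table[df]

results = [
    is_significant(0.801, 10),    # df = 8, critical value 0.632
    is_significant(-0.624, 14),   # df = 12, critical value 0.532
    is_significant(0.776, 6),     # df = 4, critical value 0.811
]
print(results)
```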
THIRD EXAM vs FINAL EXAM EXAMPLE: critical value method
- Consider the third exam/final exam example.
- The line of best fit is: $\hat{y} = -173.51 + 4.83x$ with $r = 0.6631$ and there are $n = 11$ data points.
- Can the regression line be used for prediction? Given a third exam score ($x$ value), can we use the line to predict the final exam score (predicted $y$ value)?
Use the "95% Critical Value" table for $r$ with $df = n - 2 = 11 - 2 = 9$.
The critical values are -0.602 and +0.602.
Since 0.6631 > 0.602, $r$ is significant.
Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between $x$ and $y$ because the correlation coefficient is significantly different from 0.
Because $r$ is significant and the scatter plot shows a linear trend, the regression line can be used to predict final exam scores.
Additional Practice Examples using Critical Values
Suppose you computed the following correlation coefficients. Using the table at the end of the chapter, determine if $r$ is significant and whether the line of best fit associated with each $r$ can be used to predict a $y$ value. If it helps, draw a number line.
- $r = -0.567$ and the sample size, $n$, is 19. The $df = n - 2 = 17$. The critical value is -0.456. −0.567 < −0.456 so $r$ is significant.
- $r = 0.708$ and the sample size, $n$, is 9. The $df = n - 2 = 7$. The critical value is 0.666. 0.708 > 0.666 so $r$ is significant.
- $r = 0.134$ and the sample size, $n$, is 14. The $df = 14 - 2 = 12$. The critical value is 0.532. 0.134 is between -0.532 and 0.532 so $r$ is not significant.
- $r = 0$ and the sample size, $n$, is 5. No matter what the $df$s are, $r = 0$ is between the two critical values so $r$ is not significant.
Assumptions in Testing the Significance of the Correlation Coefficient
Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between $x$ and $y$ in the sample data provides strong enough evidence so that we can conclude that there is a linear relationship between $x$ and $y$ in the population.
The regression line equation that we calculate from the sample data gives the best fit line for our particular sample. We want to use this best fit line for the sample as an estimate of the best fit line for the population. Examining the scatterplot and testing the significance of the correlation coefficient helps us determine if it is appropriate to do this.
The assumptions underlying the test of significance are:
- There is a linear relationship in the population that models the average value of $y$ for varying values of $x$. In other words, the expected value of $y$ for each particular $x$ value lies on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)
- The $y$ values for any particular $x$ value are normally distributed about the line. This implies that there are more $y$ values scattered closer to the line than are scattered farther away. The first assumption above implies that these normal distributions are centered on the line: the means of these normal distributions of $y$ values lie on the line.
- The standard deviations of the population $y$ values about the line are equal for each value of $x$. In other words, each of these normal distributions of $y$ values has the same shape and spread about the line.
- The residual errors are mutually independent (no pattern).
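One common way to summarize these four assumptions in symbols is the simple linear regression model below. The notation ($\beta_0$, $\beta_1$, $\varepsilon_i$, $\sigma$) is an assumption introduced here for illustration and is not taken from the text.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2) \ \text{independently for } i = 1, \dots, n
```

In this form, linearity is the straight-line mean $\beta_0 + \beta_1 x$, normality and equal spread come from the shared $N(0, \sigma^2)$ distribution of the errors, and independence is stated directly.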
Figure 4. The $y$ values for each $x$ value are normally distributed about the line with the same standard deviation. For each $x$ value, the mean of the $y$ values lies on the regression line. More $y$ values lie near the line than are scattered further away from the line.