Multicollinearity
| Site: | Saylor University |
| Course: | CS207: Fundamentals of Machine Learning |
| Book: | Multicollinearity |
Description
What is Multicollinearity?
In regression analysis, it's essential to recognize and address common pitfalls that can undermine the reliability of your models. The resources provided delve into key challenges such as multicollinearity, extrapolation, and other factors that can distort your results. Understanding these pitfalls and navigating them will greatly improve your ability to build reliable models.
As you work through the examples, take time to evaluate your own data for multicollinearity, extrapolation, and outliers. Explore different methods for detecting these issues and mitigating their impact. By understanding and addressing these common regression challenges, you will improve your models and gain valuable insights into the nuances of regression analysis.
As stated in the lesson overview, multicollinearity exists whenever two or more of the predictors in a regression model are moderately or highly correlated. Now, you might be wondering why a researcher can't just collect the data in such a way as to ensure that the predictors aren't highly correlated. Then multicollinearity wouldn't be a problem, and we wouldn't have to bother with this silly lesson.
Unfortunately, researchers often can't control the predictors. Obvious examples include a person's gender, race, grade point average, math SAT score, IQ, and starting salary. For each of these predictor examples, the researcher just observes the values as they occur for the people in her random sample.
Multicollinearity happens more often than not in such observational studies. And, unfortunately, regression analyses most often take place on data obtained from observational studies. If you aren't convinced, consider the example data sets for this course. Most of the data sets were obtained from observational studies, not experiments. It is for this reason that we need to fully understand the impact of multicollinearity on our regression analyses.
Types of multicollinearity
- Structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors - such as creating the predictor \(x^2\) from the predictor \(x\).
- Data-based multicollinearity, on the other hand, is a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.
In the case of structural multicollinearity, the multicollinearity is induced by what you have done. Data-based multicollinearity is the more troublesome of the two types of multicollinearity. Unfortunately, it is the type we encounter most often!
Example 12-1
Let's take a quick look at an example in which data-based multicollinearity exists. Some researchers observed - notice the choice of word! - the following Blood Pressure data on 20 individuals with high blood pressure:
- blood pressure (y = BP, in mm Hg)
- age (x1 = Age, in years)
- weight (x2 = Weight, in kg)
- body surface area (x3 = BSA, in sq m)
- duration of hypertension (x4 = Dur, in years)
- basal pulse (x5 = Pulse, in beats per minute)
- stress index (x6 = Stress)
The researchers were interested in determining if a relationship exists between blood pressure and age, weight, body surface area, duration, pulse rate, and/or stress level.
The matrix plot of BP, Age, Weight, and BSA:

[Matrix plot of BP, Age, Weight, and BSA]

and the matrix plot of BP, Dur, Pulse, and Stress:

[Matrix plot of BP, Dur, Pulse, and Stress]

allow us to investigate the various marginal relationships between the response BP and the predictors. Blood pressure appears to be related fairly strongly to Weight and BSA, and hardly related at all to the Stress level.
The matrix plots also allow us to investigate whether or not relationships exist among the predictors. For example, Weight and BSA appear to be strongly related, while Stress and BSA appear to be hardly related at all.
The following correlation matrix:
Correlation: BP, Age, Weight, BSA, Dur, Pulse, Stress
| | BP | Age | Weight | BSA | Dur | Pulse |
|---|---|---|---|---|---|---|
| Age | 0.659 | |||||
| Weight | 0.950 | 0.407 | ||||
| BSA | 0.866 | 0.378 | 0.875 | |||
| Dur | 0.293 | 0.344 | 0.201 | 0.131 | ||
| Pulse | 0.721 | 0.619 | 0.659 | 0.465 | 0.402 | |
| Stress | 0.164 | 0.368 | 0.034 | 0.018 | 0.312 | 0.506 |
provides further evidence of the above claims. Blood pressure appears to be related fairly strongly to Weight (r = 0.950) and BSA (r = 0.866), and hardly related at all to Stress level (r = 0.164). And, Weight and BSA appear to be strongly related (r = 0.875), while Stress and BSA appear to be hardly related at all (r = 0.018). The high correlation among some of the predictors suggests that data-based multicollinearity exists.
Now, what we need to learn is the impact of multicollinearity on regression analysis. Let's go do it!
Source: Penn State Eberly College of Science, https://online.stat.psu.edu/stat501/lesson/12/12.1
This work is licensed under a Creative Commons Attribution 4.0 License.
Extrapolation
"Extrapolation" beyond the "scope of the model" occurs when one uses an estimated regression equation to estimate a mean \(\mu_Y\) or to predict a new response \(y_{new}\) for x values not in the range of the sample data used to determine the estimated regression equation. In general, it is dangerous to extrapolate beyond the scope of the model. The following example illustrates why this is not a good thing to do.
Researchers measured the number of colonies of grown bacteria for various concentrations of urine (ml/plate). The scope of the model - that is, the range of the x values - was 0 to 5.80 ml/plate. The researchers obtained the following estimated regression equation:

\(\hat{y} = 16.0667 + 1.61576x\)
Using the estimated regression equation, the researchers predicted the number of colonies at 11.60 ml/plate to be 16.0667 + 1.61576(11.60), or 34.8 colonies. But when the researchers conducted the experiment at 11.60 ml/plate, they observed that the number of colonies decreased dramatically, to only about 15.1 colonies:

The moral of the story is that the trend in the data as summarized by the estimated regression equation does not necessarily hold outside the scope of the model.
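The arithmetic in the example is easy to verify. The equation and the numbers below come straight from the example; only the function name is my own.

```python
# Estimated regression equation from the urine-concentration study:
# y_hat = 16.0667 + 1.61576 * x, fitted over x in [0, 5.80] ml/plate.
def predict_colonies(x):
    return 16.0667 + 1.61576 * x

x_new = 11.60  # twice the largest x used to fit the model
predicted = predict_colonies(x_new)
print(round(predicted, 1))  # → 34.8

# The observed count at 11.60 ml/plate was only about 15.1 colonies,
# so the fitted linear trend does not hold outside the model's scope.
```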
Other Regression Pitfalls
Nonconstant Variance
Excessive nonconstant variance can create technical difficulties with a multiple linear regression model. For example, if the residual variance increases with the fitted values, then prediction intervals will tend to be wider than they should be at low fitted values and narrower than they should be at high fitted values. Some remedies for refining a model exhibiting excessive nonconstant variance include the following:
- Apply a variance-stabilizing transformation to the response variable, for example, a logarithmic transformation (or a square root transformation if a logarithmic transformation is "too strong" or a reciprocal transformation if a logarithmic transformation is "too weak"). We explored this in more detail in Lesson 9.
- Weight the variances so that they can be different for each set of predictor values. This leads to weighted least squares, in which the data observations are given different weights when estimating the model. We'll cover this in Lesson 13.
- A generalization of weighted least squares is to allow the regression errors to be correlated with one another in addition to having different variances. This leads to generalized least squares, in which various forms of nonconstant variance can be modeled.
- For some applications, we can explicitly model the variance as a function of the mean, E(Y). This approach uses the framework of generalized linear models, which we discuss in the optional content.
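To see the first remedy in action, here is a minimal sketch with simulated data (the multiplicative-error setup and every number in it are assumptions for illustration, not from the text): the raw standard deviation of the response grows with its mean, while the standard deviation on the log scale stays roughly constant.

```python
import math
import random
import statistics

random.seed(1)

def simulate(mean, n=500):
    """Responses with multiplicative errors: spread grows with the mean."""
    return [mean * math.exp(random.gauss(0, 0.2)) for _ in range(n)]

for mean in (10, 50, 250):
    ys = simulate(mean)
    raw_sd = statistics.stdev(ys)                          # grows with the mean
    log_sd = statistics.stdev(math.log(y) for y in ys)     # roughly constant
    print(f"mean={mean}: raw sd={raw_sd:.1f}, log sd={log_sd:.3f}")
```

With this error structure the log transformation stabilizes the variance exactly; in practice one compares residual plots before and after the transformation to judge whether it is "too strong" or "too weak".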
Autocorrelation
One common way for the "independence" condition in a multiple linear regression model to fail is when the sample data have been collected over time and the regression model fails to effectively capture any time trends. In such a circumstance, the random errors in the model are often positively correlated over time, so each random error is more likely to be similar to the previous random error than it would be if the random errors were independent of one another. This phenomenon is known as autocorrelation (or serial correlation) and can sometimes be detected by plotting the model residuals versus time. We'll explore this further in the optional content.
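A simple way to quantify what a residuals-versus-time plot shows is the lag-1 sample autocorrelation. The sketch below simulates positively correlated errors with an AR(1) process (the coefficient 0.8 and the other settings are arbitrary choices for illustration):

```python
import random

random.seed(2)

def ar1_errors(n, phi=0.8, sigma=1.0):
    """Errors where each value depends on the previous one (AR(1) process)."""
    e = [random.gauss(0, sigma)]
    for _ in range(n - 1):
        e.append(phi * e[-1] + random.gauss(0, sigma))
    return e

def lag1_autocorr(x):
    """Lag-1 sample autocorrelation: correlation of the series with itself shifted by one."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    den = sum((xi - mean) ** 2 for xi in x)
    return num / den

errors = ar1_errors(2000)
print(f"lag-1 autocorrelation = {lag1_autocorr(errors):.2f}")  # typically near 0.8
```

For truly independent errors this statistic hovers near zero; values well above zero are the numerical signature of the positive serial correlation described above.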
Overfitting
When building a regression model, we don't want to include unimportant or irrelevant predictors whose presence can overcomplicate the model and increase our uncertainty about the magnitudes of the effects for the important predictors (particularly if some of those predictors are highly collinear). Such "overfitting" can occur the more complicated a model becomes and the more predictor variables, transformations, and interactions are added to a model. It is always prudent to apply a sanity check to any model being used to make decisions. Models should always make sense, preferably grounded in some kind of background theory or sensible expectation about the types of associations allowed between variables. Predictions from the model should also be reasonable (over-complicated models can give quirky results that may not reflect reality).
Excluding Important Predictor Variables
However, there is a potentially greater risk from excluding important predictors than from including unimportant ones. The linear association between two variables ignoring other relevant variables can differ both in magnitude and direction from the association that controls for other relevant variables. Whereas the potential cost of including unimportant predictors might be increased difficulty with interpretation and reduced prediction accuracy, the potential cost of excluding important predictors can be a completely meaningless model containing misleading associations. Results can vary considerably depending on whether such predictors are (inappropriately) excluded or (appropriately) included. These predictors are sometimes called confounding or lurking variables, and their absence from a model can lead to misleading conclusions and poor decision-making.
Example 12-6: Simpson's Paradox
This example illustrates how a response variable can be positively associated with one predictor variable when ignoring a second predictor variable, but negatively associated with the first predictor when controlling for the second. The dataset used in the example is available in this file: paradox.txt.
Video Explanation
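The paradox can be reproduced with a toy dataset (the numbers below are made up for illustration, not taken from paradox.txt): within each group the slope of y on x is negative, yet the pooled data show a positive slope.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

# Hypothetical data: y falls with x inside each group, but group B sits
# at both higher x and higher y, dragging the pooled slope positive.
group_a = [(1, 10), (2, 9), (3, 8)]
group_b = [(6, 20), (7, 19), (8, 18)]
pooled = group_a + group_b

print(f"A: {slope(group_a):.2f}, B: {slope(group_b):.2f}, "
      f"pooled: {slope(pooled):.2f}")  # → A: -1.00, B: -1.00, pooled: 1.71
```

The lurking variable here is group membership: omit it and the fitted association reverses sign, exactly the risk described in the previous section.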
Missing Data
Real-world datasets frequently contain missing values, so we do not know the values of particular variables for some of the sample observations. For example, such values may be missing because they were impossible to obtain during data collection. Dealing with missing data is a challenging task. Missing data has the potential to adversely affect a regression analysis by reducing the total usable sample size. The best solution to this problem is to try extremely hard to avoid missing data in the first place. When there are missing values that are impossible or too costly to avoid, one approach is to replace the missing values with plausible estimates, known as imputation. Another (easier) approach is to consider only models that contain predictors with no (or few) missing values. This may be unsatisfactory, however, because even a predictor variable with a large number of missing values can contain useful information.
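As a concrete illustration of the imputation idea, here is a minimal sketch of mean imputation, the simplest common approach (the data values are made up):

```python
# Hypothetical sample with two missing values, marked as None.
data = [4.2, None, 5.1, 6.3, None, 5.0]

# Mean imputation: replace each missing value with the mean of the observed ones.
observed = [x for x in data if x is not None]
fill = sum(observed) / len(observed)
imputed = [fill if x is None else x for x in data]
print(imputed)  # the two Nones are replaced by 5.15
```

Mean imputation preserves the sample size but understates variability, since the filled-in values carry no uncertainty of their own; more careful approaches (regression imputation, multiple imputation) address this.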
Power and Sample Size
In small datasets, a lack of observations can lead to poorly estimated models with large standard errors. Such models are said to lack statistical power because there is insufficient data to detect significant associations between the response and predictors. So, how much data do we need to conduct a successful regression analysis? A common rule of thumb is that 10 data observations per predictor variable are a pragmatic lower bound for sample size. However, it is not so much the number of data observations that determines whether a regression model is going to be useful, but rather whether the resulting model satisfies the LINE conditions.

In some circumstances, a model applied to fewer than 10 data observations per predictor variable might be perfectly fine (if, say, the model fits the data really well and the LINE conditions seem fine), while in other circumstances a model applied to a few hundred data points per predictor variable might be pretty poor (if, say, the model fits the data badly and one or more conditions are seriously violated). For another example, in general, we'd need more data to model an interaction compared to a similar model without the interaction. However, it is difficult to say exactly how much data would be needed. It is possible that we could adequately model an interaction with a relatively small number of observations if the interaction effect was pronounced and there was little statistical error. Conversely, in datasets with only weak interaction effects and relatively large statistical errors, it might take a much larger number of observations to have a satisfactory model. In practice, we have methods for assessing the LINE conditions, so it is possible to consider whether an interaction model approximately satisfies the assumptions on a case-by-case basis.

In conclusion, there is not really a good standard for determining sample size given the number of predictors, since the only truthful answer is, "It depends".
In many cases, it soon becomes pretty clear when working on a particular dataset if we are trying to fit a model with too many predictor terms for the number of sample observations (results can start to get a little odd, and standard errors greatly increase). From a different perspective, if we are designing a study and need to know how much data to collect, then we need to get into sample size and power calculations, which rapidly become quite complex. Some statistical software packages will do sample size and power calculations, and there is even some software specifically designed to do just that. When designing a large, expensive study, it is recommended that you use such software or seek advice from a statistician with sample-size expertise.