Unit 5: Supervised Learning – Regression


5a. Implement linear regression models using Python

  • What is linear regression, and when is it appropriate to use this model?
  • How do you implement linear regression using Python libraries like Scikit-learn?
  • What is the difference between training and predicting in a regression task?

Linear regression is a fundamental supervised learning technique that models the relationship between one or more input features and a continuous target variable by finding the best-fitting linear equation. The core idea is to fit a straight line (in simple linear regression) or a hyperplane (in multiple regression) that minimizes the error between the predicted and actual values, typically using the least squares method.

The most common approach to implementing linear regression in Python is through the Scikit-learn library, a comprehensive machine learning library that provides simple, efficient tools for data analysis and modeling. The process starts with importing the necessary libraries and splitting the dataset into features (X) and targets (y). The LinearRegression() class from sklearn.linear_model is then fit to the training data using .fit(X, y); fitting is the process of training a machine learning model by having it learn patterns from the training data.

Once trained, the model can predict outcomes with .predict(X_new). Important outputs include the model coefficients (slopes), which are the learned parameters that quantify the strength and direction of the relationship between each input feature and the target variable, and the intercept, the predicted value when all features are zero; together these make the model interpretable. For example, in a housing price dataset, the model might predict price as a function of square footage, where the coefficient indicates how much price increases per additional square foot.
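As a minimal sketch of this workflow (the tiny dataset below is invented for illustration, with y = 2x + 1 exactly), the fit/predict steps look like this:

```python
from sklearn.linear_model import LinearRegression

# Tiny synthetic dataset: y = 2x + 1 exactly
X = [[1], [2], [3], [4]]      # features must be 2-D: (n_samples, n_features)
y = [3, 5, 7, 9]

model = LinearRegression()
model.fit(X, y)               # training: learn slope and intercept from data

print(model.coef_[0])         # learned slope, ~2.0
print(model.intercept_)       # learned intercept, ~1.0
print(model.predict([[10]]))  # prediction for a new input, ~21.0
```

Because the data lie exactly on a line, the learned coefficient and intercept recover the true slope and intercept; with real, noisy data they are least-squares estimates instead.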

Beginners often confuse regression with classification or attempt to use it on categorical data without proper preprocessing, such as encoding categorical variables. Also, without visualizing residuals, signs of poor model fit can be missed. Reinforcing implementation with real datasets and synthetic examples can help you understand how model training, evaluation, and prediction work together to build effective regressors.

To review, see:


5b. Evaluate regression models using metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²)

  • How does Mean Absolute Error (MAE) differ from Root Mean Squared Error (RMSE) in sensitivity to outliers?
  • Why does Mean Squared Error (MSE) penalize large errors more heavily than MAE?
  • How is R-squared (R²) interpreted as a "goodness-of-fit" measure, and what are its limitations?
  • When would you prioritize RMSE over MAE for model evaluation?

Evaluating regression models requires understanding how different metrics capture prediction error and model accuracy. Mean absolute error (MAE) calculates the average of the absolute differences between predicted and actual values, making it intuitive and robust to outliers. It is interpreted as the average error in the units of the target variable. For instance, if MAE = 15, the model is off by 15 units on average. In contrast, mean squared error (MSE) squares the differences before averaging, which penalizes larger errors more severely. This makes MSE more sensitive to outliers and useful when large errors have significant consequences, such as in financial modeling. Root mean squared error (RMSE) is simply the square root of MSE. It restores the unit consistency with the original target variable, maintaining MSE's outlier sensitivity while making interpretation easier (for example, putting error in dollars rather than dollars squared).
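These three error metrics can be computed directly from their definitions; the toy actual/predicted values below are invented for illustration, with one outlier-like error in the last prediction:

```python
import math

actual    = [10, 20, 30, 40]
predicted = [12, 18, 33, 49]   # last prediction has a large, outlier-like error

errors = [p - a for p, a in zip(predicted, actual)]

mae  = sum(abs(e) for e in errors) / len(errors)   # average absolute error
mse  = sum(e * e for e in errors) / len(errors)    # squaring penalizes big errors
rmse = math.sqrt(mse)                              # back in the target's units

print(mae, mse, rmse)   # 4.0, 24.5, ~4.95
```

Note that RMSE exceeds MAE here: the single large error (9 units) dominates the squared average, which is exactly the outlier sensitivity described above.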

Another key metric is R-squared (R²), which measures the proportion of variance in the dependent variable that is explained by the model. R² values range from 0 to 1, where higher values indicate better model fit. However, R² increases with more features, regardless of their relevance. Therefore, it can be misleading in high-dimensional models. For such cases, adjusted R² is preferred because it accounts for the number of predictors. Beginners often misinterpret R², assuming that R² > 0.7 is always "good", but in practice, this depends on the context and domain. For example, an R² of 0.95 might be expected in physics, whereas in social sciences, even 0.3 could be significant.
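R² and adjusted R² follow from a simple variance decomposition; the actual/predicted values and the feature count p below are invented for illustration:

```python
actual    = [3, 5, 7, 9]
predicted = [2.8, 5.1, 7.2, 8.9]

mean_y = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
ss_tot = sum((a - mean_y) ** 2 for a in actual)                # total sum of squares

r2 = 1 - ss_res / ss_tot   # proportion of variance explained

# Adjusted R² penalizes extra predictors (n samples, p features)
n, p = len(actual), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2, adj_r2)   # ~0.995, ~0.985
```

Adding an irrelevant feature can only raise r2, but it increases p and so can lower adj_r2, which is why adjusted R² is preferred for high-dimensional models.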

Each metric has its use case. MAE is better when all errors are equally important, and interpretability is needed. RMSE is useful when large errors are especially problematic. In practice, both are often reported together for a balanced view. Feature scaling can affect these metrics, so normalization is important during preprocessing.

To review, see:


5c. Discuss the limitations of regression models

  • How do violations of linearity, constant variance, and normality assumptions invalidate regression results?
  • Why does autocorrelation in time series data violate regression assumptions, and how does it distort statistical inference?
  • What problems arise when applying linear regression to binary outcomes, and how does this lead to nonsensical predictions?

Regression models rely on several key assumptions, and when these are violated, the model's predictions and inferences become unreliable. The linearity assumption implies a straight-line relationship between the independent and dependent variables. When the true relationship is nonlinear (for example, drug efficacy increases and then drops at high doses), regression produces biased results. A related issue is heteroscedasticity, also called non-constant variance: the variance of the residuals (prediction errors) changes across the range of fitted values, such as housing price prediction errors growing with house size. This distorts confidence intervals and undermines the validity of hypothesis tests. Additionally, if residuals do not follow a normal distribution, the model's p-values and t-statistics become unreliable, leading to a higher chance of false positives or negatives.
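One simple, informal check for non-constant variance is to compare residual spread across the fitted range; the residuals below are invented so that their spread grows with the fitted values (a formal test such as Breusch-Pagan would be used in practice):

```python
# Residuals ordered by fitted value; their spread clearly grows along the range
residuals = [0.5, -0.4, 0.6, -0.5, 2.0, -2.5, 3.0, -3.5]

half = len(residuals) // 2
spread_low  = sum(r * r for r in residuals[:half]) / half   # spread at low fitted values
spread_high = sum(r * r for r in residuals[half:]) / half   # spread at high fitted values

print(spread_low, spread_high)   # a large ratio suggests heteroscedasticity
```

A roughly constant ratio near 1 is consistent with homoscedastic errors; here the spread at high fitted values is many times larger, the pattern a residual plot would also reveal visually.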

Another major limitation is autocorrelation, which is the correlation between residuals at different time points or observations, meaning that the error terms are not independent of each other, particularly in time series data. When error terms are correlated across time (such as daily sales that are influenced by previous days), the assumption of independence is violated. This can lead to underestimated standard errors and overconfident conclusions, for example, wrongly attributing a sales jump to a recent campaign when it may just be a seasonal pattern. Moreover, applying linear regression to binary outcomes like predicting whether a person has a disease results in illogical predictions outside the 0–1 range (like probabilities of -0.2 or 1.3), violating assumptions of constant variance and normality. In such cases, logistic regression or other classification models are more appropriate.
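The out-of-range problem with binary outcomes can be shown directly with closed-form simple regression; the toy data below (0/1 labels against a single feature) are invented for illustration:

```python
x = [1, 2, 3, 4]
y = [0, 0, 1, 1]          # binary outcome

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Ordinary least squares slope and intercept for a single feature
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x

print(slope * 5 + intercept)   # "probability" of 1.5 at x = 5 -- impossible
print(slope * 0 + intercept)   # "probability" of -0.5 at x = 0 -- impossible
```

The fitted line keeps rising (or falling) without bound, so any input far enough from the data yields a "probability" outside [0, 1]; logistic regression avoids this by passing the linear combination through a sigmoid.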

Practitioners often miss subtle issues like interaction effects, where the impact of one feature depends on another (for example, the effect of education on income differing by gender), or the dangers of extrapolation, where predictions made beyond the observed data range (like estimating house prices for 10,000 sq ft homes) can be highly misleading.

To review, see:


Unit 5 Vocabulary


This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • adjusted R²
  • autocorrelation
  • extrapolation
  • heteroscedasticity
  • interaction effect
  • linearity assumption
  • linear regression
  • mean absolute error (MAE)
  • mean squared error (MSE)
  • model coefficient
  • non-constant variance
  • R-squared (R²)
  • root mean squared error (RMSE)
  • Scikit-learn