Unit 8: Model Evaluation and Validation


8a. Apply a train-test split on datasets and cross-validation techniques to assess model performance

  • Why is a train-test split essential for evaluating machine learning models, and what ratio is commonly used?
  • How does cross-validation, particularly k-fold cross-validation, improve the reliability of performance estimates compared to a simple train-test split?
  • What are the benefits and trade-offs of using stratified sampling during splitting, especially for imbalanced datasets?

To assess how well a machine learning model generalizes to new data, practitioners apply a train-test split, dividing the dataset into two subsets: one for training the model and the other for evaluating its performance. A typical split ratio is 80:20 or 70:30, where the majority is used for training and the remainder for testing. This process helps detect overfitting, where a model performs well on training data but poorly on unseen data. However, a single split might not reflect the true variability in the data. To address this, cross-validation is used, especially k-fold cross-validation, in which the dataset is divided into k equal parts and the model is trained and validated k times, each time using a different fold as the test set and the remaining folds for training. This yields a more robust and stable estimate of model performance.
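The k-fold construction described above can be sketched in plain Python. This is a minimal illustration of how the folds partition the data; library implementations such as scikit-learn's KFold also handle shuffling and other edge cases:

```python
def kfold_splits(n_samples, k):
    """Partition indices 0..n_samples-1 into k folds and return
    (train_indices, test_indices) pairs, one per fold."""
    # Distribute any remainder so fold sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        test_idx = folds[i]
        # Every other fold becomes training data for this round
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train_idx, test_idx))
    return splits

# 10 samples, 5 folds: each sample is tested exactly once across the 5 rounds
splits = kfold_splits(10, 5)
```

Note that every sample appears in exactly one test fold, which is why averaging the per-fold metric uses all of the data for evaluation without ever testing on training data within a round.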

Stratified sampling ensures that each fold maintains the same class distribution as the original dataset, which is especially crucial for imbalanced datasets (such as fraud detection, where positive cases are rare). Without stratification, some folds might lack minority-class samples entirely, skewing the results. Key metrics such as accuracy, precision, and F1-score should be calculated on each fold and averaged to represent model quality.
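The idea can be sketched as grouping indices by class and dealing them round-robin into folds, so each fold preserves the class ratio. This is a simplified version; real implementations such as scikit-learn's StratifiedKFold are more careful about ordering and shuffling:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold keeps roughly
    the same class proportions as the full dataset."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal each class's samples round-robin across the folds
    for indices in by_class.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

# 8 majority vs. 2 minority samples: with stratification,
# each of the 2 folds still receives one minority sample
labels = [0] * 8 + [1] * 2
folds = stratified_folds(labels, 2)
```

With a plain (unstratified) split of this tiny dataset, it would be easy for one fold to end up with no minority samples at all, making metrics like recall on that fold meaningless.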

Cross-validation is often confused with hyperparameter tuning. While its primary purpose is model assessment, it also plays a role in model selection and optimization. Remember the difference between validation sets, which are used during model development, and test sets, which are kept untouched until final evaluation.

To review, see:


8b. Identify overfitting and underfitting in models

  • What are the key differences between overfitting and underfitting, and how do they affect model performance?
  • How can performance metrics on training vs. validation data help detect overfitting or underfitting?
  • What visualization tools and diagnostic techniques help identify these issues during model training?

In machine learning, recognizing overfitting and underfitting is essential to ensure that models generalize well to unseen data. Overfitting occurs when a model learns not just the underlying patterns but also the noise (random variation, measurement errors, or irrelevant information that does not represent the true patterns the model should learn) in the training data, resulting in high accuracy on the training set but poor performance on the validation or test set. This often happens when a model is too complex, for example one with too many features or overly deep trees, and memorizes the data rather than generalizing. On the other hand, underfitting arises when a model is too simple to capture the patterns in the data, leading to poor performance on both training and test sets. A classic sign of underfitting is high bias, while overfitting is associated with high variance.

One common diagnostic technique is to plot learning curves showing training and validation loss as a function of training size or epochs. In overfitting, the training loss is low, but the validation loss is high and diverging. In underfitting, both losses remain high. Comparing metrics like accuracy, mean squared error (MSE), or F1-score across training and validation datasets can also reveal these issues.
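The train-versus-validation comparison can be reduced to a simple heuristic. The thresholds below are illustrative assumptions for accuracy-style scores (higher is better), not standard values:

```python
def diagnose(train_score, val_score, gap_threshold=0.1, low_threshold=0.7):
    """Rough diagnosis from accuracy-style scores on train and validation sets.
    Thresholds are illustrative; appropriate values depend on the problem."""
    if train_score < low_threshold and val_score < low_threshold:
        return "underfitting"   # both scores poor: high bias
    if train_score - val_score > gap_threshold:
        return "overfitting"    # large train/validation gap: high variance
    return "good fit"

# High training accuracy but much lower validation accuracy -> overfitting
print(diagnose(0.99, 0.70))   # overfitting
# Both scores poor -> underfitting
print(diagnose(0.55, 0.52))   # underfitting
```

This captures the core pattern from the learning curves: underfitting shows up as both scores being poor, while overfitting shows up as a large gap between them.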

Poor test performance is often mistakenly attributed to model failure when, in fact, it may indicate overfitting due to a lack of regularization or underfitting due to insufficient model complexity. Know the importance of balancing bias and variance using tools such as cross-validation, simpler models, or early stopping, and use visual aids to make the concepts intuitive.

To review, see:


8c. Apply techniques such as L1 and L2 regularization and early stopping to avoid overfitting and improve model performance

  • How do L1 and L2 regularization help reduce overfitting by penalizing model complexity?
  • What is the difference between early stopping and traditional regularization methods in controlling overfitting?
  • How can loss curves be interpreted to apply these techniques effectively during model training?

To prevent overfitting, machine learning practitioners apply techniques like L1 regularization, L2 regularization, and early stopping to control model complexity. L1 regularization (also called lasso) adds a penalty proportional to the absolute value of the model's weights, encouraging sparsity and effectively eliminating irrelevant features. L2 regularization (also called ridge) adds a penalty based on the squared magnitude of weights, shrinking them without necessarily removing any, making it ideal when all features contribute small effects. Both are forms of regularization, a strategy to constrain model parameters and reduce variance, leading to better generalization on unseen data.
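The two penalties can be written directly as additions to the base loss. This is a minimal sketch of the penalized objective; in practice, the penalty is folded into the training loss and its gradient rather than computed after the fact:

```python
def penalized_loss(base_loss, weights, l1_lambda=0.0, l2_lambda=0.0):
    """Base loss (e.g. MSE) plus L1 and L2 penalties on the weights.
    L1 sums absolute values (encourages sparsity, as in lasso);
    L2 sums squared values (shrinks weights smoothly, as in ridge)."""
    l1_penalty = l1_lambda * sum(abs(w) for w in weights)
    l2_penalty = l2_lambda * sum(w * w for w in weights)
    return base_loss + l1_penalty + l2_penalty

weights = [2.0, -3.0]
# L1 adds 0.1 * (|2| + |-3|) = 0.5 to the base loss
print(penalized_loss(1.0, weights, l1_lambda=0.1))   # 1.5
# L2 adds 0.1 * (4 + 9) = 1.3 to the base loss
print(penalized_loss(1.0, weights, l2_lambda=0.1))   # 2.3
```

Because larger weights cost more under either penalty, the optimizer is pushed toward smaller (or, with L1, exactly zero) weights, which is what constrains model complexity.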

Another powerful technique is early stopping, where training is halted once the model's validation loss starts increasing, even if training loss continues to decrease. This indicates the model is beginning to memorize noise. By monitoring loss curves (graphs showing training and validation loss over epochs), you can visually detect the point where overfitting begins. A sharp divergence between the two curves signals that training should stop.
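A common patience-based version of this rule can be sketched as follows. The function scans a sequence of per-epoch validation losses and stops once the loss has failed to improve for a given number of epochs (the patience value of 2 here is an illustrative choice):

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the epoch at which training would halt: the first epoch
    where validation loss has not improved for `patience` epochs.
    In practice, the model weights from the best epoch are restored."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # validation loss has been rising: stop here
    return len(val_losses) - 1  # never triggered: trained to the end

# Validation loss bottoms out at epoch 2, then rises; with patience=2,
# training halts at epoch 4
stop = early_stop_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 1.1], patience=2)
```

Deep learning frameworks offer this as a built-in callback (for example, Keras's EarlyStopping), typically with options to restore the best weights seen so far.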

Minimizing training loss is often confused with overall model improvement. The goal is low validation error, not perfect training accuracy. Remember that a combination of regularization (L1/L2), smaller models, and early stopping typically produces robust, generalizable models, especially on noisy or high-dimensional datasets.

To review, see:


Unit 8 Vocabulary


This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • early stopping
  • k-fold cross-validation
  • L1 regularization
  • L2 regularization
  • lasso
  • loss curve
  • noise
  • overfitting
  • ridge
  • stratified sampling
  • train-test split
  • underfitting
  • validation set