Model Complexity
| Site: | Saylor Academy |
| Course: | CS207: Fundamentals of Machine Learning |
| Book: | Model Complexity |
| Printed by: | Guest user |
| Date: | Wednesday, April 15, 2026, 7:12 PM |
A model predicting house prices might perform exceptionally well on historical data but struggle to accurately predict new properties. Similarly, in fraud detection, a model trained on past fraudulent activities might fail to detect new, unknown forms of fraud. These examples illustrate how overfitting can hinder a model's ability to generalize to new, unseen data. In this section, you will explore strategies to mitigate overfitting, such as controlling model complexity, applying regularization techniques, and interpreting loss curves.
As you read, reflect on these questions: What does it mean if training loss decreases while validation loss increases? How can early stopping or tuning hyperparameters help mitigate overfitting? How do loss curves help distinguish between underfitting and overfitting?
The previous unit introduced the following model, which miscategorized a lot of trees in the test set:

The preceding model contains a lot of complex shapes. Would a simpler model handle new data better? Suppose you replace the complex model with a ridiculously simple model: a straight line.

The simple model generalizes better than the complex model on new data. That is, the simple model made better predictions on the test set than the complex model.
Simplicity has been beating complexity for a long time. In fact, the preference for simplicity dates back to ancient Greece. Centuries later, a fourteenth-century friar named William of Occam formalized the preference for simplicity in a philosophy known as Occam's razor. This philosophy remains an essential underlying principle of many sciences, including machine learning.
Exercises: Check your understanding
- You are developing a physics equation. Which of the following formulas conforms more closely to Occam's razor?
  - A formula with twelve variables.
  - A formula with three variables.

  Answer: A formula with three variables.
  Three variables is more Occam-friendly than twelve variables.
- You're on a brand-new machine learning project, about to select your first features. How many features should you pick?
  - Pick 1–3 features that seem to have strong predictive power.
  - Pick as many features as you can, so you can start observing which features have the strongest predictive power.
  - Pick 4–6 features that seem to have strong predictive power.

  Answer: Pick 1–3 features that seem to have strong predictive power.
  It's best for your data collection pipeline to start with only a few features. This helps you confirm that the ML model works as intended. Also, when you build a baseline from a couple of features, you'll feel like you're making progress!
Regularization
Machine learning models must simultaneously meet two conflicting goals:
- Fit data well.
- Fit data as simply as possible.
One approach to keeping a model simple is to penalize complex models; that is, to force the model to become simpler during training. Penalizing complex models is one form of regularization.
Loss and complexity
So far, this course has suggested that the only goal when training a model is to minimize loss; that is:
\(\text{minimize(loss)}\)
As you've seen, models focused solely on minimizing loss tend to overfit. A better training optimization algorithm minimizes some combination of loss and complexity:
\(\text{minimize(loss + complexity)}\)
Unfortunately, loss and complexity are typically inversely related. As complexity increases, loss decreases. As complexity decreases, loss increases. You should find a reasonable middle ground where the model makes good predictions on both the training data and real-world data. That is, your model should find a reasonable compromise between loss and complexity.
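This compromise can be made concrete with a tiny numeric sketch (everything here is a made-up toy, not part of the course material): a one-weight model whose fit term is minimized at w = 1 but whose complexity term is minimized at w = 0, so the combined objective settles somewhere in between.

```python
def regularized_objective(w, lam=1.0):
    """Combined objective for a toy one-weight model.

    The fit term (1 - w)^2 wants w = 1; the complexity
    term w^2 wants w = 0. The sum rewards a compromise.
    """
    fit_loss = (1.0 - w) ** 2
    complexity = w ** 2
    return fit_loss + lam * complexity

# Search a grid of candidate weights for the minimizer.
candidates = [i / 100 for i in range(101)]
best_w = min(candidates, key=regularized_objective)
print(best_w)  # 0.5 -- halfway between "best fit" (1.0) and "simplest" (0.0)
```

With an equal weighting (lam = 1), the minimizer sits exactly halfway; increasing lam would push the compromise closer to zero.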
What is complexity?
You've already seen a few different ways of quantifying loss. How would you quantify complexity? Start your exploration through the following exercise:
Exercise: Check your intuition
So far, we've been pretty vague about what complexity actually is. Which of the following ideas do you think would be reasonable complexity metrics?
1. Complexity is a function of the square of the model's weights.
2. Complexity is a function of the biases of all the features in the model.
3. Complexity is a function of the model's weights.

Answer: 1 and 3
- A function of the square of the model's weights: yes, you can measure some models' complexity this way. This metric is called L2 regularization.
- A function of the model's weights: yes, this is one way to measure some models' complexity. This metric is called L1 regularization.
Source: Google for Developers, https://developers.google.com/machine-learning/crash-course/overfitting/model-complexity
This work is licensed under a Creative Commons Attribution 4.0 License.
Regularization
L2 regularization is a popular regularization metric, which uses the following formula:
\(L_2\text{ regularization } = {w_1^2 + w_2^2 + ... + w_n^2}\)
For example, the following table shows the calculation of L2 regularization for a model with six weights:
| Weight | Value | Squared value |
|---|---|---|
| w1 | 0.2 | 0.04 |
| w2 | -0.5 | 0.25 |
| w3 | 5.0 | 25.0 |
| w4 | -1.2 | 1.44 |
| w5 | 0.3 | 0.09 |
| w6 | -0.1 | 0.01 |
| | Total | 26.83 |
Notice that weights close to zero don't affect L2 regularization much, but large weights can have a huge impact. For example, in the preceding calculation:
- A single weight (w3) contributes about 93% of the total complexity.
- The other five weights collectively contribute only about 7% of the total complexity.
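A quick check of the table's arithmetic (weights taken from the table above):

```python
# Weights from the preceding table.
weights = [0.2, -0.5, 5.0, -1.2, 0.3, -0.1]

# L2 regularization: the sum of the squared weights.
l2_penalty = sum(w ** 2 for w in weights)
print(round(l2_penalty, 2))  # 26.83

# The single large weight w3 = 5.0 dominates the total.
share_of_w3 = 5.0 ** 2 / l2_penalty
print(round(100 * share_of_w3))  # 93 (percent)
```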
L2 regularization encourages weights toward 0, but never pushes weights all the way to zero.
Exercises: Check your understanding
- If you use L2 regularization while training a model, what will typically happen to the overall complexity of the model?
  - The overall complexity of the model will probably stay constant.
  - The overall complexity of the model will probably increase.
  - The overall complexity of the model will probably drop.

  Answer: The overall complexity of the model will probably drop.
  Since L2 regularization encourages weights toward 0, the overall complexity will probably drop.
- If you use L2 regularization while training a model, some features will be removed from the model.
  - True
  - False

  Answer: False.
  L2 regularization encourages weights toward zero but never pushes them all the way to zero, so no feature is ever completely removed.
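The "never reaches zero" behavior can be checked numerically. In this sketch, a weight is updated by plain gradient descent on the L2 penalty alone (the data-loss term is omitted to isolate the regularization effect); each step scales the weight by a constant factor, so it decays toward zero without ever reaching it.

```python
# Gradient descent on just the L2 penalty lam * w^2.
w = 1.0
learning_rate, lam = 0.1, 0.5
for _ in range(1000):
    w -= learning_rate * (2 * lam * w)  # d/dw (lam * w^2) = 2 * lam * w

print(w)       # tiny, but still strictly positive
assert w > 0   # the weight never becomes exactly zero
```

Each update multiplies w by 0.9, so after 1000 steps the weight is vanishingly small yet nonzero; an L1 penalty, by contrast, subtracts a fixed amount per step and can zero a weight exactly.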
Regularization rate (lambda)
As noted, training attempts to minimize some combination of loss and complexity:
\(\text{minimize(loss} + \text{ complexity)}\)
Model developers tune the overall impact of complexity on model training by multiplying its value by a scalar called the regularization rate. The Greek character lambda typically symbolizes the regularization rate.
That is, model developers aim to do the following:
\(\text{minimize(loss} + \lambda \text{ complexity)}\)
A high regularization rate:
- Strengthens the influence of regularization, thereby reducing the chances of overfitting.
- Tends to produce a histogram of model weights with the following characteristics:
  - a normal distribution
  - a mean weight of 0
A low regularization rate:
- Lowers the influence of regularization, thereby increasing the chances of overfitting.
- Tends to produce a histogram of model weights with a flat distribution.
For example, the histogram of model weights for a high regularization rate might look as shown in Figure 18.
In contrast, a low regularization rate tends to yield a flatter histogram, as shown in Figure 19.
Picking the regularization rate
The ideal regularization rate produces a model that generalizes well to new, previously unseen data. Unfortunately, that ideal value is data-dependent, so you must do some tuning.
Early stopping: an alternative to complexity-based regularization
Early stopping is a regularization method that doesn't involve a calculation of complexity. Instead, early stopping simply means ending training before the model fully converges. For example, you end training when the loss curve for the validation set starts to increase (slope becomes positive).
Although early stopping usually increases training loss, it can decrease test loss.
Early stopping is a quick, but rarely optimal, form of regularization. The resulting model is very unlikely to be as good as a model trained thoroughly on the ideal regularization rate.
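A minimal early-stopping loop might look like the following sketch; `train_one_epoch` and `validation_loss` are hypothetical callables standing in for a real training pipeline.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=3):
    """Stop when validation loss fails to improve for `patience` epochs."""
    best_val = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val = validation_loss()
        if val < best_val:
            best_val = val
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has started climbing; stop early
    return epoch + 1, best_val

# Simulated validation losses that bottom out at 0.35 and then rise.
losses = iter([1.0, 0.6, 0.4, 0.35, 0.37, 0.4, 0.45, 0.5, 0.6])
epochs_run, best = train_with_early_stopping(lambda: None, lambda: next(losses))
print(epochs_run, best)  # 7 0.35
```

In a real pipeline you would also snapshot the model weights whenever `best_val` improves and restore that snapshot after stopping.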
Finding equilibrium between learning rate and regularization rate
Learning rate and regularization rate tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero; a high regularization rate pulls weights towards zero.
If the regularization rate is high with respect to the learning rate, the weak weights tend to produce a model that makes poor predictions. Conversely, if the learning rate is high with respect to the regularization rate, the strong weights tend to produce an overfit model.
Your goal is to find the equilibrium between learning rate and regularization rate, which can be challenging. Worst of all, once you find that elusive balance, you may ultimately have to change the learning rate; and when you change the learning rate, you'll again have to find the ideal regularization rate.
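The tension described above is visible in a single L2-regularized update step (a sketch only; `data_gradient` is a hypothetical stand-in for the gradient of the loss term): the regularization rate shrinks the weight toward zero, while the learning rate scales how hard the data gradient pulls it away.

```python
def sgd_step(w, data_gradient, learning_rate, reg_rate):
    """One gradient step on loss + reg_rate * w^2."""
    shrink = 1 - 2 * learning_rate * reg_rate  # L2 decay toward zero
    return shrink * w - learning_rate * data_gradient

# Same weight and same data gradient, opposite balances of the two rates:
grew   = sgd_step(1.0, data_gradient=-0.5, learning_rate=0.5,  reg_rate=0.01)
shrank = sgd_step(1.0, data_gradient=-0.5, learning_rate=0.01, reg_rate=0.5)
print(grew)    # > 1.0: learning rate dominates; the weight moves away from zero
print(shrank)  # < 1.0: regularization dominates; the weight moves toward zero
```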
Interpreting Loss Curves
Machine learning would be much simpler if all your loss curves looked like this the first time you trained your model:

Unfortunately, loss curves are often challenging to interpret. Use your intuition about loss curves to solve the exercises on this page.
Exercise 1: Oscillating loss curve

What three things could you do to try to improve the loss curve shown in Figure 21?
1. Increase the number of examples in the training set.
2. Reduce the training set to a tiny number of trustworthy examples.
3. Reduce the learning rate.
4. Check your data against a data schema to detect bad examples, and then remove the bad examples from the training set.
5. Increase the learning rate.

Answer: 2, 3 & 4
- Reduce the training set to a tiny number of trustworthy examples. Although this technique sounds artificial, it is actually a good idea. Assuming that the model converges on the small set of trustworthy examples, you can then gradually add more examples, perhaps discovering which examples cause the loss curve to oscillate.
- Reduce the learning rate. Yes, reducing the learning rate is often a good idea when debugging a training problem.
- Check your data against a data schema to detect bad examples, and then remove the bad examples from the training set. Yes, this is a good practice for all models.
Exercise 2. Loss curve with a sharp jump

Which two of the following statements identify possible reasons for the exploding loss shown in Figure 22?
1. The input data contains one or more NaNs, for example, a value caused by a division by zero.
2. The regularization rate is too high.
3. The input data contains a burst of outliers.
4. The learning rate is too low.

Answer: 1 & 3
- The input data contains one or more NaNs. This is more common than you might expect.
- The input data contains a burst of outliers. Sometimes, due to improper shuffling of batches, a batch might contain a lot of outliers.
Exercise 3. Test loss diverges from training loss

Which one of the following statements best identifies the reason for this difference between the loss curves of the training and test sets?
- The model is overfitting the training set.
- The learning rate is too high.
Answer: The model is overfitting the training set.
Yes, it probably is. Possible solutions:
- Make the model simpler, possibly by reducing the number of features.
- Increase the regularization rate.
- Ensure that the training set and test set are statistically equivalent.
Exercise 4. Loss curve gets stuck

Which one of the following statements is the most likely explanation for the erratic loss curve shown in Figure 24?
- The training set contains repetitive sequences of examples.
- The training set contains too many features.
- The regularization rate is too high.
Answer: The training set contains repetitive sequences of examples.
This is a possibility. Ensure that you are shuffling examples sufficiently.