Model Complexity
| Site: | Saylor Academy |
| Course: | CS207: Fundamentals of Machine Learning |
| Book: | Model Complexity |
| Printed by: | Guest user |
| Date: | Wednesday, April 15, 2026, 7:12 PM |
A model predicting house prices might perform exceptionally well on historical data but struggle to accurately predict new properties. Similarly, in fraud detection, a model trained on past fraudulent activities might fail to detect new, unknown forms of fraud. These examples illustrate how overfitting can hinder a model's ability to generalize to new, unseen data. In this section, you will explore strategies to mitigate overfitting, such as controlling model complexity, applying regularization techniques, and interpreting loss curves.
As you read, reflect on these questions: What does it mean if training loss decreases while validation loss increases? How can early stopping or tuning hyperparameters help mitigate overfitting? How do loss curves help distinguish between underfitting and overfitting?
The previous unit introduced the following model, which miscategorized a lot of trees in the test set:

The preceding model contains a lot of complex shapes. Would a simpler model handle new data better? Suppose you replace the complex model with a ridiculously simple model: a straight line.

The simple model generalizes better than the complex model on new data. That is, the simple model made better predictions on the test set than the complex model.
Simplicity has been beating complexity for a long time. In fact, the preference for simplicity dates back to ancient Greece. Centuries later, a fourteenth-century friar named William of Occam formalized the preference for simplicity in a philosophy known as Occam's razor. This philosophy remains an essential underlying principle of many sciences, including machine learning.
Exercises: Check your understanding
- You are developing a physics equation. Which of the following formulas conforms more closely to Occam's razor?
  - A formula with twelve variables.
  - A formula with three variables.

  Answer: A formula with three variables.
  Three variables is more Occam-friendly than twelve variables.
- You're on a brand-new machine learning project, about to select your first features. How many features should you pick?
  - Pick 1–3 features that seem to have strong predictive power.
  - Pick as many features as you can, so you can start observing which features have the strongest predictive power.
  - Pick 4–6 features that seem to have strong predictive power.

  Answer: Pick 1–3 features that seem to have strong predictive power.
  It's best for your data collection pipeline to start with only a few features. This helps you confirm that the ML model works as intended. Also, when you build a baseline from a couple of features, you'll feel like you're making progress!
Regularization
Machine learning models must simultaneously meet two conflicting goals:
- Fit data well.
- Fit data as simply as possible.
One approach to keeping a model simple is to penalize complex models; that is, to force the model to become simpler during training. Penalizing complex models is one form of regularization.
Loss and complexity
So far, this course has suggested that the only goal when training a model is to minimize loss; that is:
\(\text{minimize(loss)}\)
As you've seen, models focused solely on minimizing loss tend to overfit. A better training optimization algorithm minimizes some combination of loss and complexity:
\(\text{minimize(loss + complexity)}\)
Unfortunately, loss and complexity are typically inversely related. As complexity increases, loss decreases. As complexity decreases, loss increases. You should find a reasonable middle ground where the model makes good predictions on both the training data and real-world data. That is, your model should find a reasonable compromise between loss and complexity.
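This compromise can be made concrete with a tiny numeric sketch (everything here is a made-up toy, not part of the course material): a one-weight model whose fit term is minimized at w = 1 but whose complexity term is minimized at w = 0, so the combined objective settles somewhere in between.

```python
def regularized_objective(w, lam=1.0):
    """Combined objective for a toy one-weight model.

    The fit term (1 - w)^2 wants w = 1; the complexity
    term w^2 wants w = 0. The sum rewards a compromise.
    """
    fit_loss = (1.0 - w) ** 2
    complexity = w ** 2
    return fit_loss + lam * complexity

# Search a grid of candidate weights for the minimizer.
candidates = [i / 100 for i in range(101)]
best_w = min(candidates, key=regularized_objective)
print(best_w)  # 0.5 -- halfway between "best fit" (1.0) and "simplest" (0.0)
```

With an equal weighting (lam = 1), the minimizer sits exactly halfway; increasing lam would push the compromise closer to zero.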
What is complexity?
You've already seen a few different ways of quantifying loss. How would you quantify complexity? Start your exploration through the following exercise:
Exercise: Check your intuition
So far, we've been pretty vague about what complexity actually is. Which of the following ideas do you think would be reasonable complexity metrics?
1. Complexity is a function of the square of the model's weights.
2. Complexity is a function of the biases of all the features in the model.
3. Complexity is a function of the model's weights.

Answer: 1 and 3
- A function of the square of the model's weights: yes, you can measure some models' complexity this way. This metric is called L2 regularization.
- A function of the model's weights: yes, this is one way to measure some models' complexity. This metric is called L1 regularization.
Source: Google for Developers, https://developers.google.com/machine-learning/crash-course/overfitting/model-complexity
This work is licensed under a Creative Commons Attribution 4.0 License.
Regularization
L2 regularization is a popular regularization metric, which uses the following formula:
\(L_2\text{ regularization } = {w_1^2 + w_2^2 + ... + w_n^2}\)
For example, the following table shows the calculation of L2 regularization for a model with six weights:
| Weight | Value | Squared value |
|---|---|---|
| w1 | 0.2 | 0.04 |
| w2 | -0.5 | 0.25 |
| w3 | 5.0 | 25.0 |
| w4 | -1.2 | 1.44 |
| w5 | 0.3 | 0.09 |
| w6 | -0.1 | 0.01 |
| | Total | 26.83 |
Notice that weights close to zero don't affect L2 regularization much, but large weights can have a huge impact. For example, in the preceding calculation:
- A single weight (w3) contributes about 93% of the total complexity.
- The other five weights collectively contribute only about 7% of the total complexity.
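A quick check of the table's arithmetic (weights taken from the table above):

```python
# Weights from the preceding table.
weights = [0.2, -0.5, 5.0, -1.2, 0.3, -0.1]

# L2 regularization: the sum of the squared weights.
l2_penalty = sum(w ** 2 for w in weights)
print(round(l2_penalty, 2))  # 26.83

# The single large weight w3 = 5.0 dominates the total.
share_of_w3 = 5.0 ** 2 / l2_penalty
print(round(100 * share_of_w3))  # 93 (percent)
```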
L2 regularization encourages weights toward 0, but never pushes weights all the way to zero.
Exercises: Check your understanding
- If you use L2 regularization while training a model, what will typically happen to the overall complexity of the model?
  - The overall complexity of the model will probably stay constant.
  - The overall complexity of the model will probably increase.
  - The overall complexity of the model will probably drop.

  Answer: The overall complexity of the model will probably drop.
  Since L2 regularization encourages weights toward 0, the overall complexity will probably drop.
- If you use L2 regularization while training a model, some features will be removed from the model.
  - True
  - False

  Answer: False.
  L2 regularization encourages weights toward zero but never pushes them all the way to zero, so no feature is ever completely removed.
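The "never reaches zero" behavior can be checked numerically. In this sketch, a weight is updated by plain gradient descent on the L2 penalty alone (the data-loss term is omitted to isolate the regularization effect); each step scales the weight by a constant factor, so it decays toward zero without ever reaching it.

```python
# Gradient descent on just the L2 penalty lam * w^2.
w = 1.0
learning_rate, lam = 0.1, 0.5
for _ in range(1000):
    w -= learning_rate * (2 * lam * w)  # d/dw (lam * w^2) = 2 * lam * w

print(w)       # tiny, but still strictly positive
assert w > 0   # the weight never becomes exactly zero
```

Each update multiplies w by 0.9, so after 1000 steps the weight is vanishingly small yet nonzero; an L1 penalty, by contrast, subtracts a fixed amount per step and can zero a weight exactly.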
Regularization rate (lambda)
As noted, training attempts to minimize some combination of loss and complexity:
\(\text{minimize(loss} + \text{ complexity)}\)
Model developers tune the overall impact of complexity on model training by multiplying its value by a scalar called the regularization rate. The Greek character lambda typically symbolizes the regularization rate.
That is, model developers aim to do the following:
\(\text{minimize(loss} + \lambda \text{ complexity)}\)
A high regularization rate:
- Strengthens the influence of regularization, thereby reducing the chances of overfitting.
- Tends to produce a histogram of model weights with the following characteristics:
  - a normal distribution
  - a mean weight of 0
A low regularization rate:
- Lowers the influence of regularization, thereby increasing the chances of overfitting.
- Tends to produce a histogram of model weights with a flat distribution.
For example, the histogram of model weights for a high regularization rate might look as shown in Figure 18.
In contrast, a low regularization rate tends to yield a flatter histogram, as shown in Figure 19.
Picking the regularization rate
The ideal regularization rate produces a model that generalizes well to new, previously unseen data. Unfortunately, that ideal value is data-dependent, so you must do some tuning.
Early stopping: an alternative to complexity-based regularization
Early stopping is a regularization method that doesn't involve a calculation of complexity. Instead, early stopping simply means ending training before the model fully converges. For example, you end training when the loss curve for the validation set starts to increase (slope becomes positive).
Although early stopping usually increases training loss, it can decrease test loss.
Early stopping is a quick, but rarely optimal, form of regularization. The resulting model is very unlikely to be as good as a model trained thoroughly on the ideal regularization rate.
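A minimal early-stopping loop might look like the following sketch; `train_one_epoch` and `validation_loss` are hypothetical callables standing in for a real training pipeline.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=3):
    """Stop when validation loss fails to improve for `patience` epochs."""
    best_val = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val = validation_loss()
        if val < best_val:
            best_val = val
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has started climbing; stop early
    return epoch + 1, best_val

# Simulated validation losses that bottom out at 0.35 and then rise.
losses = iter([1.0, 0.6, 0.4, 0.35, 0.37, 0.4, 0.45, 0.5, 0.6])
epochs_run, best = train_with_early_stopping(lambda: None, lambda: next(losses))
print(epochs_run, best)  # 7 0.35
```

In a real pipeline you would also snapshot the model weights whenever `best_val` improves and restore that snapshot after stopping.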
Finding equilibrium between learning rate and regularization rate
Learning rate and regularization rate tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero; a high regularization rate pulls weights towards zero.
If the regularization rate is high with respect to the learning rate, the weak weights tend to produce a model that makes poor predictions. Conversely, if the learning rate is high with respect to the regularization rate, the strong weights tend to produce an overfit model.
Your goal is to find the equilibrium between learning rate and regularization rate, which can be challenging. Worst of all, once you find that elusive balance, you may ultimately have to change the learning rate; and when you change the learning rate, you'll again have to find the ideal regularization rate.
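The tension described above is visible in a single L2-regularized update step (a sketch only; `data_gradient` is a hypothetical stand-in for the gradient of the loss term): the regularization rate shrinks the weight toward zero, while the learning rate scales how hard the data gradient pulls it away.

```python
def sgd_step(w, data_gradient, learning_rate, reg_rate):
    """One gradient step on loss + reg_rate * w^2."""
    shrink = 1 - 2 * learning_rate * reg_rate  # L2 decay toward zero
    return shrink * w - learning_rate * data_gradient

# Same weight and same data gradient, opposite balances of the two rates:
grew   = sgd_step(1.0, data_gradient=-0.5, learning_rate=0.5,  reg_rate=0.01)
shrank = sgd_step(1.0, data_gradient=-0.5, learning_rate=0.01, reg_rate=0.5)
print(grew)    # > 1.0: learning rate dominates; the weight moves away from zero
print(shrank)  # < 1.0: regularization dominates; the weight moves toward zero
```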
Interpreting Loss Curves
Machine learning would be much simpler if all your loss curves looked like this the first time you trained your model:

Unfortunately, loss curves are often challenging to interpret. Use your intuition about loss curves to solve the exercises on this page.
Exercise 1: Oscillating loss curve

What three things could you do to try to improve the loss curve shown in Figure 21?
1. Increase the number of examples in the training set.
2. Reduce the training set to a tiny number of trustworthy examples.
3. Reduce the learning rate.
4. Check your data against a data schema to detect bad examples, and then remove the bad examples from the training set.
5. Increase the learning rate.

Answer: 2, 3 & 4
- Reduce the training set to a tiny number of trustworthy examples. Although this technique sounds artificial, it is actually a good idea. Assuming that the model converges on the small set of trustworthy examples, you can then gradually add more examples, perhaps discovering which examples cause the loss curve to oscillate.
- Reduce the learning rate. Yes, reducing the learning rate is often a good idea when debugging a training problem.
- Check your data against a data schema to detect bad examples, and then remove the bad examples from the training set. Yes, this is a good practice for all models.
Exercise 2. Loss curve with a sharp jump

Which two of the following statements identify possible reasons for the exploding loss shown in Figure 22?
1. The input data contains one or more NaNs, for example, a value caused by a division by zero.
2. The regularization rate is too high.
3. The input data contains a burst of outliers.
4. The learning rate is too low.

Answer: 1 & 3
- The input data contains one or more NaNs. This is more common than you might expect.
- The input data contains a burst of outliers. Sometimes, due to improper shuffling of batches, a batch might contain a lot of outliers.
Exercise 3. Test loss diverges from training loss

Which one of the following statements best identifies the reason for this difference between the loss curves of the training and test sets?
- The model is overfitting the training set.
- The learning rate is too high.
Answer: The model is overfitting the training set.
Yes, it probably is. Possible solutions:
- Make the model simpler, possibly by reducing the number of features.
- Increase the regularization rate.
- Ensure that the training set and test set are statistically equivalent.
Exercise 4. Loss curve gets stuck

Which one of the following statements is the most likely explanation for the erratic loss curve shown in Figure 24?
- The training set contains repetitive sequences of examples.
- The training set contains too many features.
- The regularization rate is too high.
Answer: The training set contains repetitive sequences of examples.
This is a possibility. Ensure that you are shuffling examples sufficiently.