Accuracy, Recall, Precision, and Related Metrics

Site: Saylor University
Course: CS207: Fundamentals of Machine Learning
Book: Accuracy, Recall, Precision, and Related Metrics
Date: Wednesday, April 15, 2026, 7:11 PM


True and false positives and negatives are used to calculate several useful metrics for evaluating models. Which evaluation metrics are most meaningful depends on the specific model and the specific task, the cost of different misclassifications, and whether the dataset is balanced or imbalanced.

All of the metrics in this section are calculated at a single fixed threshold, and change when the threshold changes. Very often, the user tunes the threshold to optimize one of these metrics.


Accuracy

Accuracy is the proportion of all classifications that were correct, whether positive or negative. It is mathematically defined as:

\(\text{Accuracy} =
\frac{\text{correct classifications}}{\text{total classifications}}
= \frac{TP+TN}{TP+TN+FP+FN}\)

In the spam classification example, accuracy measures the fraction of all emails correctly classified.

A perfect model would have zero false positives and zero false negatives and therefore an accuracy of 1.0, or 100%.

Because it incorporates all four outcomes from the confusion matrix (TP, FP, TN, FN), accuracy can serve as a coarse-grained measure of model quality on a balanced dataset, one with similar numbers of examples in each class. For this reason, it is often the default evaluation metric for generic or unspecified models carrying out generic or unspecified tasks.

However, when the dataset is imbalanced, or when one kind of mistake (FN or FP) is more costly than the other, as is the case in most real-world applications, it's better to optimize for one of the other metrics instead.

For heavily imbalanced datasets, where one class appears very rarely, say 1% of the time, a model that predicts negative 100% of the time would score 99% on accuracy, despite being useless.
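This failure mode is easy to see in a small sketch. The counts below are hypothetical, chosen to match the 1%-positive scenario just described:

```python
# Hypothetical confusion-matrix counts for a 1%-positive dataset,
# scored by a model that always predicts negative.
tp, fp = 0, 0   # the model never predicts positive
fn = 10         # all 10 actual positives are missed
tn = 990        # all 990 actual negatives are "correctly" rejected

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.99, even though the model detects nothing
```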

Recall, or true positive rate

The true positive rate (TPR), or the proportion of all actual positives that were classified correctly as positives, is also known as recall.

Recall is mathematically defined as:

\(\text{Recall (or TPR)} =
\frac{\text{correctly classified actual positives}}{\text{all actual positives}}
= \frac{TP}{TP+FN}\)

False negatives are actual positives that were misclassified as negatives, which is why they appear in the denominator. In the spam classification example, recall measures the fraction of spam emails that were correctly classified as spam. This is why another name for recall is probability of detection: it answers the question "What fraction of spam emails are detected by this model?"

A hypothetical perfect model would have zero false negatives and therefore a recall (TPR) of 1.0, which is to say, a 100% detection rate.

In an imbalanced dataset where the number of actual positives is very low, recall is a more meaningful metric than accuracy because it measures the model's ability to correctly identify positive instances. For applications like disease prediction, correctly identifying the positive cases is crucial: a false negative typically has more serious consequences than a false positive.
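As a minimal sketch, recall can be computed directly from the TP and FN counts. The disease-screening numbers below are hypothetical:

```python
def recall(tp, fn):
    """Recall (TPR): fraction of actual positives classified as positive."""
    return tp / (tp + fn)

# Hypothetical screening counts: 40 sick patients correctly flagged,
# 10 sick patients missed.
print(recall(tp=40, fn=10))  # 0.8
```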


False positive rate

The false positive rate (FPR) is the proportion of all actual negatives that were classified incorrectly as positives; it is also known as the probability of false alarm. It is mathematically defined as:

\(\text{FPR} =
\frac{\text{incorrectly classified actual negatives}}
{\text{all actual negatives}}
= \frac{FP}{FP+TN}\)

False positives are actual negatives that were misclassified, which is why they appear in the denominator. In the spam classification example, FPR measures the fraction of legitimate emails that were incorrectly classified as spam, or the model's rate of false alarms.

A perfect model would have zero false positives and therefore a FPR of 0.0, which is to say, a 0% false alarm rate.

In an imbalanced dataset where the number of actual negatives is very, very low, say 1-2 examples in total, FPR is less meaningful and less useful as a metric.
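A corresponding sketch for FPR, using hypothetical spam-filter counts:

```python
def false_positive_rate(fp, tn):
    """FPR: fraction of actual negatives misclassified as positive."""
    return fp / (fp + tn)

# Hypothetical counts: 5 legitimate emails flagged as spam out of
# 100 legitimate emails total.
print(false_positive_rate(fp=5, tn=95))  # 0.05
```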


Precision

Precision is the proportion of all the model's positive classifications that are actually positive. It is mathematically defined as:

\(\text{Precision} =
\frac{\text{correctly classified actual positives}}
{\text{everything classified as positive}}
= \frac{TP}{TP+FP}\)

In the spam classification example, precision measures the fraction of emails classified as spam that were actually spam.

A hypothetical perfect model would have zero false positives and therefore a precision of 1.0.

In an imbalanced dataset where the number of actual positives is very, very low, say 1-2 examples in total, precision is less meaningful and less useful as a metric.

Precision improves as false positives decrease, while recall improves when false negatives decrease. But as seen in the previous section, increasing the classification threshold tends to decrease the number of false positives and increase the number of false negatives, while decreasing the threshold has the opposite effects. As a result, precision and recall often show an inverse relationship, where improving one of them worsens the other.
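The tradeoff shows up in a small sketch that sweeps the threshold over hypothetical model scores and labels:

```python
# Hypothetical model scores and true labels (1 = positive).
scores = [0.1, 0.3, 0.35, 0.5, 0.6, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   1,   1  ]

def precision_recall(threshold):
    """Precision and recall at one fixed classification threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

print(precision_recall(0.3))  # low threshold: recall 1.0, precision ~0.67
print(precision_recall(0.7))  # high threshold: precision 1.0, recall 0.5
```

Raising the threshold from 0.3 to 0.7 removes both false positives but also misses two actual positives, trading recall for precision.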

 

What does NaN mean in the metrics?

NaN, or "not a number," appears when dividing by 0, which can happen with any of these metrics. When TP and FP are both 0, for example, the formula for precision has 0 in the denominator, resulting in NaN. While in some cases NaN can indicate perfect performance and could be replaced by a score of 1.0, it can also come from a model that is practically useless. A model that never predicts positive, for example, would have 0 TPs and 0 FPs and thus a calculation of its precision would result in NaN.
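A sketch of guarding for this case in code (the function is illustrative, not part of any particular library):

```python
import math

def precision(tp, fp):
    """Precision with an explicit guard for the zero-denominator case."""
    if tp + fp == 0:
        return float("nan")  # undefined: no positive predictions were made
    return tp / (tp + fp)

print(math.isnan(precision(0, 0)))  # True: a never-positive model
print(precision(3, 2))              # 0.6
```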


Choice of metric and tradeoffs

The metrics you choose to prioritize when evaluating the model and choosing a threshold depend on the costs, benefits, and risks of the specific problem. In the spam classification example, it often makes sense to prioritize recall (catching all the spam emails) or precision (ensuring that emails labeled as spam are in fact spam), or some balance of the two, above some minimum accuracy level.

Model Evaluation Metrics and Usage Guidance

  Accuracy: Use as a rough indicator of model training progress/convergence for balanced datasets. For model performance, use only in combination with other metrics. Avoid for imbalanced datasets; consider using another metric.

  Recall (true positive rate): Use when false negatives are more expensive than false positives.

  False positive rate: Use when false positives are more expensive than false negatives.

  Precision: Use when it's very important for positive predictions to be accurate.

 

(Optional, advanced) F1 score

The F1 score is the harmonic mean (a kind of average) of precision and recall.

Mathematically, it is given by:

\(\text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
  = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}\)

This metric balances the importance of precision and recall, and is preferable to accuracy for class-imbalanced datasets. When precision and recall both have perfect scores of 1.0, F1 will also have a perfect score of 1.0. More broadly, when precision and recall are close in value, F1 will be close to their value. When precision and recall are far apart, F1 will be similar to whichever metric is worse.
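Both forms of the formula can be checked with a sketch on hypothetical counts:

```python
def f1_score(tp, fp, fn):
    """F1 from confusion-matrix counts: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# With tp=3, fp=2, fn=1: precision = 3/5 = 0.6, recall = 3/4 = 0.75,
# and their harmonic mean is 2 * 0.6 * 0.75 / (0.6 + 0.75) = 2/3.
print(f1_score(tp=3, fp=2, fn=1))  # 0.666...
```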


 

Exercise: Check your understanding

  1. A model outputs 5 TP, 6 TN, 3 FP, and 2 FN. Calculate the recall.

    1. 0.714
    2. 0.625
    3. 0.455

    Answer: 0.714

    Recall is calculated as \(\frac{TP}{TP+FN}=\frac{5}{7}\approx 0.714\)

  2. A model outputs 3 TP, 4 TN, 2 FP, and 1 FN. Calculate the precision.

    1. 0.75
    2. 0.6
    3. 0.429

    Answer: 0.6

    Precision is calculated as \(\frac{TP}{TP+FP}=\frac{3}{5}\)

  3. You're building a binary classifier that checks photos of insect traps for whether a dangerous invasive species is present. If the model detects the species, the entomologist (insect scientist) on duty is notified. Early detection of this insect is critical to preventing an infestation. A false alarm (false positive) is easy to handle: the entomologist sees that the photo was misclassified and marks it as such. Assuming an acceptable accuracy level, which metric should this model be optimized for?

    1. Precision
    2. Recall
    3. False positive rate (FPR)

    Answer: Recall

    In this scenario, false alarms (FP) are low-cost, and false negatives are highly costly, so it makes sense to maximize recall, or the probability of detection.


Source: Google for Developers, https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall
This work is licensed under a Creative Commons Attribution 4.0 License.

ROC and AUC

The previous section presented a set of model metrics, all calculated at a single classification threshold value. But if you want to evaluate a model's quality across all possible thresholds, you need different tools.


Receiver-operating characteristic curve (ROC)

The ROC curve is a visual representation of model performance across all thresholds. The long version of the name, receiver operating characteristic, is a holdover from WWII radar detection.

The ROC curve is drawn by calculating the true positive rate (TPR) and false positive rate (FPR) at every possible threshold (in practice, at selected intervals), then graphing TPR over FPR. A perfect model, which at some threshold has a TPR of 1.0 and a FPR of 0.0, can be represented by either a point at (0, 1) if all other thresholds are ignored, or by the following:

Figure 1. ROC and AUC of a hypothetical perfect model: TPR (y-axis) against FPR (x-axis), a line from (0,1) to (1,1).
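A sketch of how the curve's points are produced, computing one (FPR, TPR) pair per distinct score threshold. Scores and labels are hypothetical:

```python
# Hypothetical model scores; label 1 marks a positive example.
scores = [0.2, 0.4, 0.6, 0.8]
labels = [0,   1,   0,   1  ]

def roc_points(scores, labels):
    """One (FPR, TPR) point for each distinct threshold, highest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        preds = [s >= t for s in scores]
        tp = sum(p and y == 1 for p, y in zip(preds, labels))
        fp = sum(p and y == 0 for p, y in zip(preds, labels))
        points.append((fp / neg, tp / pos))
    return points

print(roc_points(scores, labels))
# [(0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```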

Area under the curve (AUC)

The area under the ROC curve (AUC) represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.

The perfect model above, enclosing a square with sides of length 1, has an area under the curve (AUC) of 1.0. This means there is a 100% probability that the model will correctly rank a randomly chosen positive example higher than a randomly chosen negative example, independent of where the threshold is set.


In more concrete terms, a spam classifier with AUC of 1.0 always assigns a random spam email a higher probability of being spam than a random legitimate email. The actual classification of each email depends on the threshold that you choose.
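The ranking interpretation can be checked directly: over all (positive, negative) pairs, count how often the positive example receives the higher score. The scores below are hypothetical, and ties (absent here) are ignored for simplicity:

```python
from itertools import product

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0,   0,   1,    1  ]  # 1 = positive

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# Fraction of (positive, negative) pairs ranked correctly.
pairs = list(product(pos, neg))
auc = sum(p > n for p, n in pairs) / len(pairs)
print(auc)  # 0.75: one pair (0.35 vs. 0.4) is ranked incorrectly
```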

For a binary classifier, a model that does exactly as well as random guesses or coin flips has a ROC that is a diagonal line from (0,0) to (1,1). The AUC is 0.5, representing a 50% probability of correctly ranking a random positive and negative example.

In the spam classifier example, a spam classifier with AUC of 0.5 assigns a random spam email a higher probability of being spam than a random legitimate email only half the time.

Figure 2. ROC and AUC of completely random guesses: TPR against FPR, a diagonal line from (0,0) to (1,1).

(Optional, advanced) Precision-recall curve

AUC and ROC work well for comparing models when the dataset is roughly balanced between classes. When the dataset is imbalanced, precision-recall curves (PRCs) and the area under those curves may offer a better comparative visualization of model performance. Precision-recall curves are created by plotting precision on the y-axis and recall on the x-axis across all thresholds.



AUC and ROC for choosing model and threshold

AUC is a useful measure for comparing the performance of two different models, as long as the dataset is roughly balanced. The model with greater area under the curve is generally the better one.

Figure 3. ROC and AUC of two hypothetical models: (a) a model with AUC = 0.65 and (b) a model with AUC = 0.93. The curve on the right, with the greater AUC, represents the better of the two models.

The points on a ROC curve closest to (0,1) represent a range of the best-performing thresholds for the given model. As discussed in the Thresholds, Confusion matrix and Choice of metric and tradeoffs sections, the threshold you choose depends on which metric is most important to the specific use case. Consider the points A, B, and C in the following diagram, each representing a threshold:

Figure 4. A ROC curve with AUC = 0.84, showing three thresholds labeled A, B, and C (in order) on the part of the curve closest to (0,1).

If false positives (false alarms) are highly costly, it may make sense to choose a threshold that gives a lower FPR, like the one at point A, even if TPR is reduced. Conversely, if false positives are cheap and false negatives (missed true positives) highly costly, the threshold for point C, which maximizes TPR, may be preferable. If the costs are roughly equivalent, point B may offer the best balance between TPR and FPR.
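When the two costs are balanced, one common heuristic is to pick the threshold whose ROC point lies closest to the ideal corner (0, 1). A sketch with hypothetical (FPR, TPR) points, not the exact coordinates of the figure above:

```python
import math

# Hypothetical (FPR, TPR) points, one per candidate threshold.
points = {"A": (0.10, 0.60), "B": (0.25, 0.80), "C": (0.50, 0.90)}

def distance_to_ideal(fpr, tpr):
    """Euclidean distance from a ROC point to the ideal corner (0, 1)."""
    return math.hypot(fpr, 1 - tpr)

best = min(points, key=lambda k: distance_to_ideal(*points[k]))
print(best)  # "B" is closest to (0, 1) for these made-up points
```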


Exercise: Check your understanding

In practice, ROC curves are much less regular than the illustrations given above. 

Which of the following models, represented by their ROC curve and AUC, has the best performance?
  1. ROC curve that is approximately a straight line from (0,0) to
           (1,1), with a few zig-zags. The curve has an AUC of 0.508.

  2. ROC curve that arcs rightward and then upward from
                (0,0) to (1,1). The curve has an AUC of 0.31.

  3. ROC curve that zig-zags up and to the right from (0,0) to (1,1).
           The curve has an AUC of 0.623.

  4. ROC curve that arcs upward and then rightward from (0,0) to
           (1,1). The curve has an AUC of 0.77.

Answer: (4)

This model has the highest AUC, which corresponds with the best performance.

Which of the following models performs worse than chance?
  1. ROC curve that is a diagonal straight line from
                (0,0) to (1,1). The curve has an AUC of 0.5.

  2. ROC curve that arcs rightward and then upward from
                (0,0) to (1,1). The curve has an AUC of 0.32.

  3. ROC curve that is approximately a straight line from
                     (0,0) to (1,1), with a few zig-zags. The curve has an
                     AUC of 0.508.

  4. ROC curve that is composed of two perpendicular lines: a vertical
      line from (0,0) to (0,1) and a horizontal line from (0,1) to (1,1).
      This curve has an AUC of 1.0.

Answer: (2)
This model has an AUC lower than 0.5, which means it performs worse than chance.

(Optional, advanced) Bonus question

Imagine a situation where it's better to allow some spam to reach the inbox than to send a business-critical email to the spam folder. You've trained a spam classifier for this situation where the positive class is spam and the negative class is not-spam. Which of the following points on the ROC curve for your classifier is preferable?

A ROC curve with AUC = 0.84 shows three points close to (0,1): point A at approximately (0.25, 0.75); point B at approximately (0.30, 0.90), the point that maximizes TPR while minimizing FPR; and point C at approximately (0.4, 0.95).
  1. Point A
  2. Point B
  3. Point C
Answer: Point A
In this use case, it's better to minimize false positives, even if true positives also decrease.