Unit 6: Supervised Learning – Classification
6a. Implement logistic regression models
- What role does the sigmoid function play in converting linear outputs to probabilities in logistic regression?
- How does logistic regression handle binary classification differently from linear regression?
- Why is log loss used instead of mean squared error to train logistic regression models?
Logistic regression is a statistical method for binary classification: rather than predicting continuous values directly, it models the probability of class membership. To do this, it uses the sigmoid function, σ(z) = 1 / (1 + e^(-z)), which maps any real-valued input to an output between 0 and 1 along a smooth S-shaped curve. This transformation is essential when predicting binary outcomes, such as spam vs. non-spam emails. For instance, a model output of 2.5 becomes σ(2.5) ≈ 0.92, or a 92% probability of the positive class. Unlike linear regression, which predicts continuous values and may yield invalid probabilities (such as -0.3 or 1.2), logistic regression guarantees outputs that are valid for classification.
During training, the model minimizes log loss (also known as cross-entropy loss), a loss function designed for classification that measures the difference between predicted probabilities and actual class labels. Log loss penalizes incorrect predictions in proportion to their confidence: a highly confident but wrong prediction (such as predicting 0.99 for a true label of 0) is punished far more severely than a slightly incorrect one. Mean squared error (MSE) is a poor fit for classification because it leads to slower convergence and treats all errors equally, regardless of confidence. A threshold (typically 0.5) converts predicted probabilities into class labels: if σ(z) ≥ 0.5, the prediction is class 1; otherwise, class 0. This threshold may be adjusted in contexts like medical diagnosis, where reducing false negatives matters more than maximizing overall accuracy. A common mistake is applying linear regression to classification tasks.
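The ideas above can be sketched in a few lines of Python. This is a minimal illustration of the sigmoid, log loss, and thresholding steps, not a full training loop; the function names are our own.

```python
import math

def sigmoid(z):
    """Map any real-valued input to (0, 1) along the logistic S-curve."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p):
    """Cross-entropy for a single prediction; confident mistakes cost
    far more than mildly wrong ones."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# The example from the text: a raw model output of 2.5 maps to ~0.92.
p = sigmoid(2.5)
print(round(p, 2))  # → 0.92

# Predicting 0.99 for a true label of 0 is penalized much more
# heavily than predicting 0.6 for the same label.
print(log_loss(0, 0.99) > log_loss(0, 0.6))  # → True

# A threshold of 0.5 converts the probability into a class label.
label = 1 if p >= 0.5 else 0  # → class 1
```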
To review, see:
6b. Explain the fundamentals of classification, including thresholding and confusion metrics, to assess model performance
- How does classification fundamentally differ from regression, and what types of problems require classification?
- Why is thresholding necessary in logistic regression, and how does adjusting the threshold impact false positives vs. false negatives?
- How do the four components of a confusion matrix (true positives, false positives, true negatives, and false negatives) quantify different types of prediction errors?
Classification differs from regression by predicting categorical rather than continuous outcomes; for example, classifying emails as "spam" or "not spam" rather than predicting a numerical spam score. Problems like medical diagnosis, fraud detection, and sentiment analysis are better suited for classification because they involve discrete labels. In logistic regression, the model outputs probabilities, which must be converted into class labels using a threshold (commonly 0.5). Adjusting this threshold shifts the balance between false positives (FP) and false negatives (FN). Lowering the threshold increases sensitivity (more true cases detected) but may trigger more FPs, while raising it reduces FPs but risks missing actual positives (more FNs).
The confusion matrix breaks down prediction results into four categories: true positives (TP) are correctly predicted positive cases, true negatives (TN) are correctly predicted negatives, false positives (FP) are negative cases incorrectly predicted as positive, and false negatives (FN) are positive cases the model missed.
These values help calculate key evaluation metrics: precision (TP / (TP + FP)) measures how reliable positive predictions are, while recall (TP / (TP + FN)) measures how well the model captures all actual positives. For example, if a spam filter identifies 80 spam emails correctly (TP), marks 5 legitimate emails as spam (FP), and misses 15 spam emails (FN), then precision = 80 / (80 + 5) ≈ 94% and recall = 80 / (80 + 15) ≈ 84%.
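The spam-filter example above works out as follows; this is a minimal sketch using the counts given in the text (80 TP, 5 FP, 15 FN).

```python
# Spam-filter example from the text: 80 true positives,
# 5 false positives, 15 false negatives.
tp, fp, fn = 80, 5, 15

precision = tp / (tp + fp)  # reliability of positive predictions
recall = tp / (tp + fn)     # coverage of actual positives

print(f"precision = {precision:.0%}")  # → precision = 94%
print(f"recall = {recall:.0%}")        # → recall = 84%
```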
Precision and recall are often confused. Precision asks, "When the model predicts positive, how often is it right?", while recall asks, "Of all actual positives, how many did the model catch?" It is also easy to overlook the business context when choosing a threshold; in cancer detection, for example, a false negative can be far more costly than a false positive. Visual tools like ROC curves help illustrate the trade-offs involved in threshold selection across different operating points.
To review, see:
6c. Explain classification models using metrics like accuracy, precision, recall, and F1-score
- Why is accuracy misleading for imbalanced datasets (such as fraud detection with 99% legitimate transactions)?
- How do precision and recall measure conflicting aspects of model performance?
- Why does the F1 score provide a more reliable metric than accuracy when class distribution is skewed?
- How do ROC curves visualize the trade-off between true positives and false positives across classification thresholds?
Evaluating classification models requires more than just accuracy, which is calculated as (TP + TN) / total predictions. Although intuitive, accuracy becomes misleading for imbalanced datasets. For example, in fraud detection, if only 1% of transactions are fraudulent, a model labeling everything as legitimate achieves 99% accuracy but fails to identify any fraud. This limitation underscores the need for more nuanced metrics.
Precision, defined as TP / (TP + FP), evaluates the reliability of positive predictions: how often the model is correct when it predicts a positive case like fraud. Recall, or TP / (TP + FN), measures coverage: how well the model captures actual positive cases. These two metrics often conflict: increasing recall by capturing more positives may increase false positives, thereby lowering precision.
To balance these trade-offs, the F1 score provides a harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). It is especially useful when class imbalance exists and both types of errors matter.
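Applying the F1 formula to the spam-filter counts used earlier (80 TP, 5 FP, 15 FN) gives a quick worked example:

```python
# Precision and recall from the spam-filter example in 6b.
precision = 80 / (80 + 5)   # ≈ 0.94
recall = 80 / (80 + 15)     # ≈ 0.84

# Harmonic mean balances the two; it is low if either one is low.
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 2))  # → 0.89
```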
Receiver operating characteristic (ROC) curves help visualize model performance across different thresholds by plotting the true positive rate (recall) against the false positive rate (FPR = FP / (FP + TN)). The area under the curve (AUC) summarizes this curve into a single value: 0.5 indicates random guessing, while 1.0 represents perfect classification.
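A small sketch makes the ROC quantities concrete. The scores and labels below are invented toy data; the AUC is computed via its rank-statistic interpretation (the probability that a randomly chosen positive outranks a randomly chosen negative), which equals the area under the ROC curve.

```python
# Hypothetical model scores (higher = "more likely positive").
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

# True positive rate (recall) and false positive rate at one threshold.
thr = 0.5
tp = sum(s >= thr and y == 1 for s, y in zip(scores, labels))
fp = sum(s >= thr and y == 0 for s, y in zip(scores, labels))
fn = sum(s < thr and y == 1 for s, y in zip(scores, labels))
tn = sum(s < thr and y == 0 for s, y in zip(scores, labels))
tpr = tp / (tp + fn)  # recall at this threshold
fpr = fp / (fp + tn)  # FP / (FP + TN)

# AUC = P(random positive scores higher than random negative);
# 0.5 would be random guessing, 1.0 perfect separation.
pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
auc = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))
```

Sweeping `thr` from high to low traces out the full ROC curve, one (FPR, TPR) point per threshold.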
To review, see:
Unit 6 Vocabulary
This vocabulary list includes terms you will need to know to successfully complete the final exam.
- accuracy
- area under the curve (AUC)
- confusion matrix
- F1 score
- false negative (FN)
- false positive (FP)
- logistic regression
- log loss
- precision
- recall
- receiver operating characteristic (ROC) curve
- sigmoid function
- true negative (TN)
- true positive (TP)