Unit 7: Unsupervised Learning – Clustering


7a. Explain clustering and its types

  • What is clustering in unsupervised learning, and how does it differ from classification?
  • What are the four main types of clustering algorithms, and what characteristics of data or tasks guide their selection?
  • How does density-based clustering handle irregular shapes and noise compared to centroid-based methods like K-Means?

Clustering is an unsupervised learning technique that groups similar data points based on inherent patterns without predefined labels. Unlike classification, which assigns inputs to known categories (like spam or not spam), clustering discovers natural groupings in unlabeled data, making it ideal for exploratory tasks such as customer segmentation, anomaly detection, or image compression.

There are four primary types of clustering algorithms, each suited to different kinds of data and patterns. Centroid-based methods like K-Means clustering form spherical clusters around central points (centroids) and are efficient for large datasets with evenly sized, well-separated clusters. However, K-Means is sensitive to outliers and initial centroid placement and fails on irregular shapes.

Hierarchical clustering builds tree-like structures called dendrograms through agglomerative (bottom-up) or divisive (top-down) approaches, capturing nested relationships but scaling poorly with large data.
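The agglomerative approach can be sketched with SciPy on a small made-up dataset (the points and the choice of Ward linkage here are illustrative, not from the text):

```python
# Agglomerative (bottom-up) hierarchical clustering on a toy dataset.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 9.1]])

# Ward linkage merges the pair of clusters that least increases total variance;
# Z encodes the full dendrogram (every merge, bottom to top).
Z = linkage(X, method="ward")

# Cut the dendrogram to obtain a flat assignment into 3 clusters:
# points 0-1, points 2-3, and point 4 form separate groups.
labels = fcluster(Z, t=3, criterion="maxclust")
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the tree itself, which is how the nested relationships are usually inspected.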

Distribution-based clustering, such as Gaussian mixture models (GMM), assumes the data comes from a combination of probability distributions, offering flexibility for elliptical clusters but requiring careful parameter tuning.
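A short sketch of a GMM on synthetic elongated clusters (the data generation and component count are assumptions for demonstration):

```python
# Fit a 2-component Gaussian mixture to two elongated, elliptical clusters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
a = rng.normal([0, 0], [2.0, 0.3], size=(100, 2))  # stretched along x
b = rng.normal([6, 6], [0.3, 2.0], size=(100, 2))  # stretched along y
X = np.vstack([a, b])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)

# Unlike K-Means, GMM gives soft assignments: each row is the probability
# of the point belonging to each component, summing to 1.
probs = gmm.predict_proba(X)
```

The full covariance matrices let each component adapt to an elliptical shape that K-Means, with its spherical assumption, would split incorrectly.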

Density-based clustering algorithms like DBSCAN group data into dense regions separated by sparse areas and excel at handling non-globular clusters and noise, though they may struggle with varying densities.
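The contrast with K-Means is easiest to see on a non-globular shape. A minimal sketch using scikit-learn's two-moons generator (the `eps` and `min_samples` values are illustrative choices):

```python
# DBSCAN on two interleaving half-moons, a shape K-Means cannot separate.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples is the density threshold
# a neighborhood must meet for a point to count as a core point.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # -1 marks points flagged as noise

# Number of clusters found, excluding the noise label.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

DBSCAN recovers the two crescents because it follows dense regions rather than distances to a central point; K-Means with k=2 would cut straight across both moons.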

Many people apply K-Means by default, even on unsuitable datasets, and obtain poor results. Visual inspection and evaluation metrics such as the silhouette score or the Davies-Bouldin index are therefore important for assessing cluster quality. Consider density-based methods early when working with spatial or geographic data, as they are robust to noise and non-linear boundaries.

To review, see:


7b. Implement the K-Means clustering algorithm using Python

  • What are the main steps of the K-Means clustering algorithm, and how does the algorithm converge?
  • How do we choose the optimal number of clusters (k) using techniques like the elbow method or silhouette score?
  • What are the common implementation pitfalls of K-Means, and how do outliers, feature scaling, and cluster shapes impact its performance?

The K-Means clustering algorithm is one of the most widely used unsupervised learning techniques for discovering inherent groupings in unlabeled data. It partitions the dataset into k clusters by minimizing the inertia (the sum of squared distances between each point and its assigned centroid). The algorithm begins with the random initialization of centroids, then follows an iterative process: (1) assign each data point to the nearest centroid based on Euclidean distance, (2) update each centroid to be the mean of the points in its cluster, and (3) repeat until convergence, where cluster assignments stabilize. K-Means assumes that clusters are roughly spherical, of similar size, and evenly distributed, conditions that are often violated in real-world datasets. It therefore struggles with non-globular clusters, varying densities, and outliers, which can drastically skew centroids and distort clustering results.
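The three steps above can be sketched from scratch in a few lines of NumPy (the toy two-blob dataset, seed, and k=2 are assumptions for demonstration):

```python
# Minimal from-scratch K-Means illustrating the assign/update loop.
import numpy as np

rng = np.random.default_rng(42)
# Two well-separated toy blobs of 50 points each.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
k = 2

# Random initialization: pick k distinct data points as starting centroids.
centroids = X[rng.choice(len(X), k, replace=False)]
for _ in range(100):
    # (1) assign each point to its nearest centroid (Euclidean distance)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # (2) update each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # (3) stop at convergence, i.e. when the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

# Inertia: sum of squared distances from each point to its centroid.
inertia = ((X - centroids[labels]) ** 2).sum()
```

Production code would use `sklearn.cluster.KMeans` instead, which adds smarter initialization (k-means++) and multiple restarts, but the loop above is the whole algorithm.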

To determine an appropriate value for k, you can use the elbow method, where the inertia is plotted against a range of k values to identify the point beyond which adding more clusters yields diminishing returns. Another helpful tool is the silhouette score, which measures how similar a point is to its own cluster compared to other clusters; a higher average score suggests more well-defined clusters. In practice, feature scaling (transforming variables to have similar scales or ranges, typically by standardizing them to have zero mean and unit variance) is critical, as K-Means is sensitive to variable magnitudes. Unscaled features can bias the clustering toward variables with larger numeric ranges. The algorithm is also non-deterministic due to random centroid initialization, so using multiple runs (n_init) helps avoid poor local minima. In Python, the KMeans class from scikit-learn simplifies implementation, offering options for centroid initialization (init), the number of restarts (n_init), and the maximum number of iterations (max_iter). A typical implementation involves importing libraries like pandas, matplotlib, and sklearn.cluster.KMeans, fitting the model on a preprocessed dataset, and visualizing clusters with scatter plots color-coded by labels.
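A compact sketch of the k-selection workflow with scikit-learn, scanning candidate values and comparing silhouette scores (the synthetic four-blob dataset and the scanned range are assumptions for demonstration):

```python
# Choosing k: scale features, scan candidate k values, compare silhouette scores.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Four well-separated synthetic blobs, so the "right" answer is k=4.
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.8,
                  random_state=0)
X = StandardScaler().fit_transform(X)  # scale features before clustering

scores = {}
for k in range(2, 8):
    # n_init=10 restarts guard against poor random initializations.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)  # higher = better separated
    # km.inertia_ would feed an elbow plot over the same range of k.

best_k = max(scores, key=scores.get)
```

Plotting `km.inertia_` against k gives the elbow curve; plotting `scores` gives the silhouette curve, and the two usually agree on a small set of plausible k values.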

Many people confuse K-Means with classification because it produces labels after training; these are not ground-truth classes but inferred groupings. Moreover, be careful not to default to K-Means when data characteristics suggest that density-based or hierarchical methods would yield better insights.

To review, see:


7c. Analyze clustering results to identify patterns in data

  • How do we interpret cluster centroids and feature distributions to understand the defining characteristics of each cluster?
  • What visual tools can help us evaluate the quality of clusters, and how do intra-cluster compactness and inter-cluster separation reflect clustering effectiveness?
  • How can we use clustering insights to support decision-making in real-world scenarios such as customer segmentation, market targeting, or anomaly detection?

Analyzing clustering results is critical in deriving actionable insights from unsupervised learning. After fitting a K-Means model, the primary tool for interpretation is the cluster centroid, which represents the average profile of each group. By examining the feature values of centroids (for example, high income and low age in one customer cluster), we identify defining characteristics and compare them across clusters. Effective analysis requires evaluating intra-cluster compactness (how tightly grouped the data points are within a cluster) and inter-cluster separation (how distinct clusters are from one another). Visualization tools such as scatter plots, pair plots, and parallel coordinate plots help uncover meaningful groupings and detect overlaps. A low inertia (sum of squared distances to centroids) suggests the model has formed cohesive clusters, but this metric alone is insufficient.
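Centroid inspection is easiest when the scaled centroids are mapped back to original units. A minimal sketch, using a hypothetical two-feature customer table (the feature names, values, and k=3 are invented for illustration):

```python
# Inspect centroids in original feature units to profile each cluster.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features: [age, annual income in $k].
X = np.array([[25, 30], [27, 32], [45, 90], [48, 95], [62, 40], [60, 38]],
             dtype=float)

scaler = StandardScaler()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    scaler.fit_transform(X))

# Undo the scaling so each centroid reads as an average customer profile,
# e.g. one row shows a high-income middle-aged cluster.
profiles = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=["age", "annual_income_k"],
)
```

Comparing the rows of `profiles` across clusters is what turns raw labels into statements like "high income, mid-age segment", which is the starting point for the business interpretation discussed below.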

We often use silhouette analysis to assess how well each point fits within its assigned cluster versus others, with values close to +1 indicating well-separated clusters and values near 0 or negative values suggesting ambiguity or misclassification.
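Per-point silhouette values come from `silhouette_samples`; a short sketch on a synthetic dataset (the blob parameters and the 0.1 ambiguity cutoff are illustrative choices):

```python
# Per-point silhouette values to flag ambiguous cluster assignments.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# One value per point, in [-1, 1]: near +1 = well placed, near 0 = on a
# cluster boundary, negative = likely assigned to the wrong cluster.
s = silhouette_samples(X, labels)
ambiguous = np.where(s < 0.1)[0]  # indices worth reviewing individually
```

Sorting `s` within each cluster and plotting the bars gives the classic silhouette plot, which reveals both weak points and whole clusters that are poorly separated.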

In practice, look for patterns such as clusters representing niche customer segments, fraudulent transactions, or geographic groupings. For example, in retail, a cluster might reveal high-spending but infrequent buyers, enabling targeted promotions. Beginners often stop at labeling clusters without exploring the business context or domain implications. Clustering is exploratory, and interpreting its results involves collaboration with subject matter experts to validate hypotheses and take informed actions. The value of clustering lies not merely in grouping data but in identifying interpretable patterns that inform real-world decisions.

To review, see:


Unit 7 Vocabulary


This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • agglomerative
  • centroid
  • cluster centroid
  • clustering
  • convergence
  • DBSCAN
  • density-based clustering
  • distribution-based clustering
  • divisive
  • elbow method
  • feature scaling
  • Gaussian mixture model (GMM)
  • hierarchical clustering
  • inertia
  • inter-cluster separation
  • intra-cluster compactness
  • K-Means clustering
  • silhouette analysis
  • silhouette score