Clustering
| Site: | Saylor Academy |
| Course: | CS207: Fundamentals of Machine Learning |
| Book: | Clustering |
| Date: | Wednesday, April 15, 2026, 7:12 PM |
Description
What is clustering?
These two resources address clustering, which groups data points based on their similarity. Pay attention to how different clustering methods assign data points: each algorithm uses a different strategy to define clusters, and understanding their strengths and weaknesses will help you choose the right one for your data.
Customer segmentation in marketing often involves clustering customers based on purchasing behavior, and anomaly detection in network security may involve using clustering techniques to identify unusual patterns in network traffic. As you explore clustering, try applying different algorithms to datasets from various domains to see how they perform. Understanding how each algorithm works, along with its strengths and limitations, will let you choose the best approach for your data, extract valuable insights, and solve real-world problems.
Suppose you are working with a dataset that includes patient information from a healthcare system. The dataset is complex and includes both categorical and numeric features. You want to find patterns and similarities in the dataset. How might you approach this task?
Clustering is an unsupervised machine learning technique designed to group unlabeled examples based on their similarity to each other. (If the examples are labeled, this kind of grouping is called classification.) Consider a hypothetical patient study designed to evaluate a new treatment protocol. During the study, patients report how many times per week they experience symptoms and the severity of the symptoms. Researchers can use clustering analysis to group patients with similar treatment responses into clusters. Figure 1 demonstrates one possible grouping of simulated data into three clusters.

Looking at the unlabeled data on the left of Figure 1, you could guess that the data forms three clusters, even without a formal definition of similarity between data points. In real-world applications, however, you need to explicitly define a similarity measure, or the metric used to compare samples, in terms of the dataset's features. When examples have only a couple of features, visualizing and measuring similarity is straightforward. But as the number of features increases, combining and comparing features becomes less intuitive and more complex. Different similarity measures may be more or less appropriate for different clustering scenarios, and this course will address choosing an appropriate similarity measure in later sections: Manual similarity measures and Similarity measure from embeddings.
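For numeric features like those in the patient example, a similarity measure can be made concrete as a distance between feature vectors. A minimal sketch in NumPy, assuming Euclidean distance and invented patient values (symptoms per week, severity):

```python
import numpy as np

# Hypothetical patient feature vectors: [symptoms per week, severity 1-10].
# A simple similarity measure for numeric features is Euclidean distance:
# a smaller distance means the examples are more similar.
a = np.array([3.0, 2.0])   # patient A
b = np.array([4.0, 2.5])   # patient B
c = np.array([12.0, 9.0])  # patient C

def euclidean_distance(x, y):
    """Distance between two feature vectors; 0 means identical."""
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean_distance(a, b))  # small: A and B respond similarly
print(euclidean_distance(a, c))  # large: A and C respond differently
```

With more features, the same function applies unchanged, but as the text notes, deciding how to weight and combine features becomes the hard part.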
After clustering, each group is assigned a unique label called a cluster ID. Clustering is powerful because it can simplify large, complex datasets with many features to a single cluster ID.
Clustering use cases
Clustering is useful in a variety of industries. Some common applications for clustering:
- Market segmentation
- Social network analysis
- Search result grouping
- Medical imaging
- Image segmentation
- Anomaly detection
Some specific examples of clustering:
- The Hertzsprung-Russell diagram shows clusters of stars when plotted by luminosity and temperature.
- Gene sequencing that shows previously unknown genetic similarities and dissimilarities between species has led to the revision of taxonomies previously based on appearances.
- The Big 5 model of personality traits was developed by clustering words that describe personality into 5 groups. The HEXACO model uses 6 clusters instead of 5.
Imputation
When some examples in a cluster have missing feature data, you can infer the missing data from other examples in the cluster. This is called imputation. For example, less popular videos can be clustered with more popular videos to improve video recommendations.
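A minimal sketch of cluster-based imputation, assuming numeric features, invented data, and cluster IDs that some clustering step has already produced:

```python
import numpy as np

# Hypothetical dataset: each row is an example, columns are features,
# and NaN marks a missing value.
X = np.array([
    [1.0, 2.0],
    [1.2, np.nan],   # missing feature
    [0.9, 1.8],
    [8.0, 9.0],
    [8.2, 9.4],
])
# Assume a clustering algorithm has already assigned these cluster IDs.
cluster_ids = np.array([0, 0, 0, 1, 1])

def impute_from_cluster(X, cluster_ids):
    """Replace each NaN with the mean of that feature within its cluster."""
    X = X.copy()
    for cid in np.unique(cluster_ids):
        members = X[cluster_ids == cid]
        col_means = np.nanmean(members, axis=0)  # per-feature cluster means
        for r in np.where(cluster_ids == cid)[0]:
            missing = np.isnan(X[r])
            X[r, missing] = col_means[missing]
    return X

X_filled = impute_from_cluster(X, cluster_ids)
print(X_filled[1])  # the NaN becomes the cluster mean of 2.0 and 1.8, i.e. 1.9
```

The key idea is that the fill value comes only from similar examples (the same cluster), not from the dataset-wide average.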
Data compression
As discussed, the relevant cluster ID can replace other features for all examples in that cluster. This substitution reduces the number of features and therefore also reduces the resources needed to store, process, and train models on that data. For very large datasets, these savings become significant.
To give an example, a single YouTube video can have feature data including:
- viewer location, time, and demographics
- comment timestamps, text, and user IDs
- video tags
Clustering YouTube videos replaces this set of features with a single cluster ID, thus compressing the data.
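The storage savings can be sketched with back-of-the-envelope arithmetic; the sizes below are hypothetical, not measurements from YouTube:

```python
# Illustrative compression arithmetic: store one small cluster ID per
# example instead of the full feature vector (all numbers are invented).
n_examples = 1_000_000
n_features = 50

full = n_examples * n_features * 4      # float32 features, in bytes
compressed = n_examples * 2             # uint16 cluster ID per example
centroids = 1000 * n_features * 4       # plus a small table of centroids

print(full, compressed + centroids)
print(full / (compressed + centroids))  # compression ratio
```

Even after accounting for the centroid table, replacing 50 features with one ID cuts storage by roughly two orders of magnitude in this made-up scenario.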
Privacy preservation
You can preserve privacy somewhat by clustering users and associating user data with cluster IDs instead of user IDs. To give one possible example, say you want to train a model on YouTube users' watch history. Instead of passing user IDs to the model, you could cluster users and pass only the cluster ID. This keeps individual watch histories from being attached to individual users. Note that the cluster must contain a sufficiently large number of users in order to preserve privacy.
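The minimum-size requirement can be enforced with a simple threshold check, in the spirit of k-anonymity. A sketch with an invented user-to-cluster mapping and an assumed threshold:

```python
from collections import Counter

# Hypothetical mapping from user IDs to cluster IDs, produced by some
# clustering step; clusters below a minimum size are suppressed so that
# a tiny cluster cannot identify an individual.
user_to_cluster = {
    "user_a": 0, "user_b": 0, "user_c": 0,
    "user_d": 1,                      # cluster 1 contains only one user
    "user_e": 2, "user_f": 2, "user_g": 2,
}
MIN_CLUSTER_SIZE = 3  # assumed privacy threshold

sizes = Counter(user_to_cluster.values())

def anonymize(user_id):
    """Return the user's cluster ID, or None if the cluster is too small."""
    cid = user_to_cluster[user_id]
    return cid if sizes[cid] >= MIN_CLUSTER_SIZE else None

print(anonymize("user_a"))  # 0 -> large enough, safe to pass to the model
print(anonymize("user_d"))  # None -> suppressed, cluster too small
```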
Source: Google for Developers, https://developers.google.com/machine-learning/clustering/overview
This work is licensed under a Creative Commons Attribution 4.0 License.
Clustering Algorithms
Machine learning datasets can have millions of examples, but not all clustering algorithms scale efficiently. Many clustering algorithms compute the similarity between all pairs of examples, which means their runtime increases as the square of the number of examples \(n\), denoted as \(O(n^2)\) in complexity notation. \(O(n^2)\) algorithms are not practical for datasets with millions of examples.
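The gap between quadratic and linear growth can be seen from rough operation counts (illustrative arithmetic, not benchmarks):

```python
# An algorithm that compares all pairs of examples performs n*(n-1)/2
# comparisons, while a linear-time algorithm does a constant amount of
# work per example; the per-example constant below is invented.
def pairwise_ops(n):
    """All-pairs comparisons for n examples: O(n^2)."""
    return n * (n - 1) // 2

def linear_ops(n, work_per_example=200):
    """Constant work per example: O(n)."""
    return n * work_per_example

for n in (1_000, 1_000_000):
    print(f"n={n:>9}: pairwise={pairwise_ops(n):>15}  linear={linear_ops(n):>12}")
```

At a thousand examples the two are comparable; at a million, the pairwise count is thousands of times larger, which is why quadratic algorithms stop being practical.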
The k-means algorithm has a complexity of \(O(n)\), meaning that the algorithm scales linearly with \(n\). This algorithm will be the focus of this course.
Types of clustering
For an exhaustive list of different approaches to clustering, see "A Comprehensive Survey of Clustering Algorithms" by Xu, D. & Tian, Y., Ann. Data Sci. (2015) 2: 165. Each approach is best suited to a particular data distribution. This course briefly discusses four common approaches.
Centroid-based clustering
The centroid of a cluster is the arithmetic mean of all the points in the cluster. Centroid-based clustering organizes the data into non-hierarchical clusters. Algorithms of this type are efficient but sensitive to initial conditions and outliers. K-means is the most widely used centroid-based algorithm; it requires users to define the number of centroids, k, and works well with clusters of roughly equal size.
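A minimal k-means sketch in NumPy (Lloyd's algorithm: assign each example to its nearest centroid, then move each centroid to the mean of its assigned examples); the toy data and parameters are invented:

```python
import numpy as np

# Toy 2-D data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[3, 3], scale=0.3, size=(50, 2)),
    rng.normal(loc=[0, 3], scale=0.3, size=(50, 2)),
])

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid for every example.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its old centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(X, k=3)
print(np.bincount(labels, minlength=3))  # cluster sizes
```

Note that each iteration costs one pass over the examples per centroid, which is where the \(O(n)\) scaling mentioned above comes from; the sensitivity to initial conditions shows up in the random centroid initialization.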
Density-based clustering
Density-based clustering connects contiguous areas of high example density into clusters. This allows for the discovery of any number of clusters of any shape. Outliers are not assigned to clusters. These algorithms have difficulty with clusters of different density and data with high dimensions.
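A minimal DBSCAN-style sketch, assuming invented `eps` and `min_pts` parameters and toy data; note how the lone far-away point keeps the outlier label -1:

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=3):
    """Grow clusters from dense regions; points in sparse regions stay -1."""
    n = len(X)
    labels = np.full(n, -1)  # -1 = unassigned / outlier
    # Precompute neighborhoods. This pairwise step is itself O(n^2);
    # production implementations use spatial indexes to speed it up.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core point
        # Grow a new cluster outward from core point i.
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:
                    frontier.extend(neighbors[j])  # core points keep expanding
        cluster += 1
    return labels

# Two dense blobs plus one isolated outlier (toy data).
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
              [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1],
              [10, 10]])
print(dbscan(X))  # two clusters; the point at (10, 10) remains -1
```

Because clusters grow by connectivity rather than toward centroids, they can take any shape and their number is not fixed in advance, matching the description above.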
Distribution-based clustering
This clustering approach assumes data is composed of probabilistic distributions, such as Gaussian distributions. In Figure 3, the distribution-based algorithm clusters data into three Gaussian distributions. As distance from the distribution's center increases, the probability that a point belongs to the distribution decreases. The bands show that decrease in probability. When you're not comfortable assuming a particular underlying distribution of the data, you should use a different algorithm.
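The "probability decreases with distance from the center" idea can be sketched directly, assuming invented 1-D Gaussian components and assigning each point to the component under which it is most probable:

```python
import numpy as np

# Hypothetical 1-D Gaussian components (means and spreads are invented).
means = np.array([0.0, 5.0, 10.0])
stds = np.array([1.0, 1.0, 1.5])

def gaussian_pdf(x, mu, sigma):
    """Probability density of a normal distribution at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def assign(x):
    """Assign x to the Gaussian under which it is most probable."""
    return int(np.argmax(gaussian_pdf(x, means, stds)))

print(assign(0.3))  # near the first center  -> cluster 0
print(assign(9.0))  # near the third center  -> cluster 2
# Density falls off with distance from a component's center:
print(gaussian_pdf(2.5, 0.0, 1.0) < gaussian_pdf(0.5, 0.0, 1.0))  # True
```

Full distribution-based algorithms (such as Gaussian mixture models fit with expectation-maximization) also learn the means, spreads, and mixing weights from the data rather than assuming them, as done here for brevity.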
Hierarchical clustering
Hierarchical clustering creates a tree of clusters and, not surprisingly, is well suited to hierarchical data, such as taxonomies. See "Comparison of 61 Sequenced Escherichia coli Genomes" by Oksana Lukjancenko, Trudy Wassenaar & Dave Ussery for an example. Any number of clusters can be chosen by cutting the tree at the right level.
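A minimal bottom-up (agglomerative) sketch using single linkage: start with one cluster per point, repeatedly merge the two closest clusters, and stop when the desired count remains, which corresponds to cutting the tree at that level. Data and linkage choice are illustrative:

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Single-linkage agglomerative clustering down to n_clusters groups."""
    clusters = [[i] for i in range(len(X))]  # start: one cluster per point
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-linkage
        # distance (minimum distance between any two of their members).
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dists[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

# Toy data: two tight groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1],
              [4.0, 4.0], [4.1, 4.0], [4.0, 4.2]])
clusters = agglomerative(X, n_clusters=2)
print(clusters)  # the two groups, as lists of example indices
```

The full sequence of merges is the tree; stopping earlier (more clusters) or later (fewer) is the "cut at the right level" choice described above.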