8.3: Hierarchical Clustering
In this section, you will learn about hierarchical clustering and, in particular, agglomerative clustering. In contrast to K-means, this method does not require you to know the number of clusters in advance; instead, you can read a suitable number off the dendrogram that the algorithm produces. Because the algorithm works by repeatedly merging clusters of points, you need a notion of the distance between two sets of points (that is, the "linkage"). You should already know how to compute the Euclidean distance between two points. This article also points out that there are many other ways to compute the distance between points (Manhattan, maximum, Mahalanobis, etc.). Any of these point distances can be used to define the distance between two sets of points. For example, single linkage takes the distance between the two closest points, one from each set. Complete linkage takes the distance between the two most distant points, one from each set. Average linkage takes the average of the distances over all pairs of points, one from each set, and so on. Read through this article to get an overview of hierarchical clustering.
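The three linkage rules described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the article: the function names and the two small point sets are made up for the example, and Euclidean distance is assumed as the underlying point distance.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distance between every point in a and every point in b."""
    diff = a[:, None, :] - b[None, :, :]      # shape (len(a), len(b), dims)
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (len(a), len(b))

def single_linkage(a, b):
    """Distance between the two closest points, one from each set."""
    return pairwise_distances(a, b).min()

def complete_linkage(a, b):
    """Distance between the two most distant points, one from each set."""
    return pairwise_distances(a, b).max()

def average_linkage(a, b):
    """Mean distance over all pairs of points, one from each set."""
    return pairwise_distances(a, b).mean()

# Hypothetical example sets, chosen so the distances are easy to check by hand.
set_a = np.array([[0.0, 0.0], [1.0, 0.0]])
set_b = np.array([[4.0, 0.0], [6.0, 0.0]])

single = single_linkage(set_a, set_b)      # pair (1, 0) and (4, 0)
complete = complete_linkage(set_a, set_b)  # pair (0, 0) and (6, 0)
average = average_linkage(set_a, set_b)    # mean of the distances 4, 6, 3, 5

print(single, complete, average)
```

Swapping the Euclidean formula in pairwise_distances for another point distance (Manhattan, for instance) changes every linkage value consistently, which is why the choice of point distance and the choice of linkage are treated as separate decisions.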
Here is a visual introduction to hierarchical clustering that walks you through a practical example.
Work through this example to draw a line from the agglomerative clustering algorithm to its equivalent Python implementation using scikit-learn. Pay attention to how the data sets are created and how they relate to each iteration as the clusters gradually form. Use the dendrogram to determine the best number of clusters and compare your result to the distribution of the original data. Try to take the extra step of generating a scatter plot using your visualization knowledge.
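As a rough companion to that exercise, the sketch below builds a merge tree with SciPy, picks a cluster count from the largest jump in merge height, and then fits scikit-learn's AgglomerativeClustering. It is not the code from the linked example: the synthetic blobs, their centers, the choice of Ward linkage, and the "largest gap" heuristic are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Assumed synthetic data: three well-separated blobs of 30 points each.
X, y_true = make_blobs(
    n_samples=90,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Each row of Z records one merge: the two clusters joined, the merge
# height (column 2), and the size of the new cluster.
Z = linkage(X, method="ward")

# Dendrogram structure without drawing; drop no_plot=True to plot it
# with matplotlib instead.
tree = dendrogram(Z, no_plot=True)

# Heuristic: cut the tree inside the largest gap between successive merge
# heights. After merge j (0-indexed) there are len(X) - (j + 1) clusters.
heights = Z[:, 2]
gaps = np.diff(heights)
n_clusters = int(len(X) - (np.argmax(gaps) + 1))

# Fit the scikit-learn model with the cluster count read off the tree.
model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
labels = model.fit_predict(X)

print("suggested number of clusters:", n_clusters)
```

To compare against the original distribution, a scatter plot of X colored by labels next to one colored by y_true is usually enough; note that the cluster numbering itself is arbitrary, so only the grouping should match.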
This section continues the example presented in the previous section on K-means. In addition to discussing code for implementing agglomerative clustering, it also includes applications of various accuracy measures useful for analyzing clustering performance.
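The description above does not say which measures the example uses, so the following sketch assumes two common ones available in scikit-learn: the adjusted Rand index, which needs the true labels, and the silhouette score, which does not. The data here is synthetic and stands in for the example's data set.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Assumed synthetic data with known group membership.
X, y_true = make_blobs(
    n_samples=150,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=0.6,
    random_state=42,
)

# Agglomerative clustering with scikit-learn's default (Ward) linkage.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Adjusted Rand index: agreement with the true labels, corrected for
# chance; 1.0 means identical groupings regardless of label numbering.
ari = adjusted_rand_score(y_true, labels)

# Silhouette score: ranges from -1 to 1 and uses only the data and the
# predicted labels; higher values mean tighter, better-separated clusters.
sil = silhouette_score(X, labels)

print(f"adjusted Rand index: {ari:.3f}")
print(f"silhouette score:    {sil:.3f}")
```

Measures that rely on true labels are only usable when such labels exist, as they do for synthetic or benchmark data; on unlabeled data, internal measures like the silhouette score are the practical option.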