Applying Clustering

This section continues the example presented in the previous section on K-means. In addition to discussing code for implementing agglomerative clustering, it also includes applications of various accuracy measures useful for analyzing clutering performance. 

Agglomerative Clustering

The following algorithm is the simplest and easiest to understand among all the clustering algorithms without a fixed number of clusters.

The algorithm is fairly simple:

  1. We start by assigning each observation to its own cluster
  2. Then sort the pairwise distances between the centers of clusters in descending order
  3. Take the nearest two neighbor clusters and merge them together, and recompute the centers
  4. Repeat steps 2 and 3 until all the data is merged into one cluster

The process of searching for the nearest cluster can be conducted with different methods of bounding the observations:

  1. Single linkage d(C_i, C_j) = min_{x_i \in C_i, x_j \in C_j} ||x_i - x_j||
  2. Complete linkage d(C_i, C_j) = max_{x_i \in C_i, x_j \in C_j} ||x_i - x_j||
  3. Average linkage d(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x_i \in C_i} \sum_{x_j \in C_j} ||x_i - x_j||
  4. Centroid linkage d(C_i, C_j) = ||\mu_i - \mu_j||

The 3rd one is the most effective in computation time since it does not require recomputing the distances every time the clusters are merged.

The results can be visualized as a beautiful cluster tree (dendrogram) to help recognize the moment the algorithm should be stopped to get optimal results. There are plenty of Python tools to build these dendrograms for agglomerative clustering.

Let's consider an example with the clusters we got from K-means:

from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

X = np.zeros((150, 2))

np.random.seed(seed=42)
X[:50, 0] = np.random.normal(loc=0.0, scale=.3, size=50)
X[:50, 1] = np.random.normal(loc=0.0, scale=.3, size=50)

X[50:100, 0] = np.random.normal(loc=2.0, scale=.5, size=50)
X[50:100, 1] = np.random.normal(loc=-1.0, scale=.2, size=50)

X[100:150, 0] = np.random.normal(loc=-1.0, scale=.2, size=50)
X[100:150, 1] = np.random.normal(loc=2.0, scale=.5, size=50)

# pdist will calculate the upper triangle of the pairwise distance matrix
distance_mat = pdist(X)
# linkage — is an implementation if agglomerative algorithm
Z = hierarchy.linkage(distance_mat, 'single')
plt.figure(figsize=(10, 5))
dn = hierarchy.dendrogram(Z, color_threshold=0.5)