Agglomerative Clustering

Work through this example to connect the agglomerative clustering algorithm to an equivalent Python implementation using scikit-learn. Pay attention to how the data set is created and how each iteration gradually merges points into clusters. Use the dendrogram to determine the best number of clusters and compare your result to the distribution of the original data. Try to take the extra step of generating a scatter plot using your visualization knowledge.

In Unsupervised Machine Learning, many clustering algorithms are used to group objects for analysis and finding patterns. One commonly known technique is Agglomerative Clustering, where objects that are close to each other are placed in the same group. In the beginning, every object is its own single cluster (a leaf), and the algorithm keeps merging clusters until a single cluster (the root) remains. The merging process forms a tree-like structure called a dendrogram.

Agglomerative clustering is a common type of Hierarchical Clustering that is also called Agglomerative Nesting (AGNES). It follows a bottom-up approach while clustering objects.


Algorithm steps

The algorithm includes the following steps (a minimal sketch of the loop in Python follows the list):

  • Find the distances between all clusters.
  • Merge the pair of clusters with the minimum distance into a new cluster.
  • Recompute the distances between clusters, now including the new cluster.
  • Repeat until only a single cluster is left.
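
As a rough illustration of these steps, here is a minimal, non-optimized sketch of the merge loop in plain NumPy. It assumes single-linkage distance (the minimum distance between members of two clusters), whereas the scikit-learn example later in this section uses the library's default Ward linkage; the function name naive_agglomerative is a placeholder of our choosing.

import numpy as np

def naive_agglomerative(points):
    # every point starts as its own cluster (a leaf)
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # step 1: find the pair of clusters with the minimum distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # step 2: merge that pair into a new cluster
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # step 3: drop the merged pair and repeat with the new cluster included
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

Run on the six sample points used in the scikit-learn example below (taken as A through F in order), this sketch reproduces the merge order shown in the next section: E and F first, then A and B, then D, then C, and finally the two remaining groups.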


Method representation


The example below clusters six objects, A through F, and tracks the set of clustered objects as the algorithm runs:

  • Start: every object is its own cluster; clustered objects: { }
  • Merge 1: E and F → {E,F}
  • Merge 2: A and B → {A,B}
  • Merge 3: {E,F} and {D} → {{E,F},{D}}
  • Merge 4: {{E,F},{D}} and {C} → {{{E,F},{D}},{C}}
  • Merge 5: {{{E,F},{D}},{C}} and {A,B} → {{{{E,F},{D}},{C}},{A,B}}, the root



A dendrogram is used to represent the hierarchical relationship between objects. The height at which two branches join shows the order in which the objects were clustered together. In the above example, the first cluster is formed by the closest objects, E and F, so the link merging them has the shortest height. Afterward, objects A and B are merged by a link of the second-shortest height.


Distance measurement

The distance between clusters can be measured using various metrics, including the Euclidean, Manhattan, and Minkowski distances. The most commonly used is the Euclidean distance. The distance between two points (x1, y1) and (x2, y2) is found using the Euclidean formula:

d = sqrt((x1 - x2)^2 + (y1 - y2)^2)

The formula is illustrated by the worked example below:

Euclidean Distance Example
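
As a concrete check, here is a minimal sketch of this calculation in Python, using two points borrowed from the sample data later in this section (the choice of points is ours):

import numpy as np

# two points borrowed from the sample data set used below
p1 = np.array([1.3, 4.8])
p2 = np.array([2.3, 5.5])

# d = sqrt((1.3 - 2.3)^2 + (4.8 - 5.5)^2) = sqrt(1.0 + 0.49) = sqrt(1.49)
d = np.sqrt(np.sum((p1 - p2) ** 2))
print(d)  # ~1.2207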


Agglomerative clustering with Scikit-Learn

Firstly, import all the required libraries. Then, generate a 2D array holding the coordinates of all the data points. After you initialize the AgglomerativeClustering model, call the fit method on it with the array as its argument. Lastly, plot the dendrogram to see the clustering results.

The AgglomerativeClustering class takes distance_threshold and n_clusters as parameters.

distance_threshold is the linkage distance threshold at or above which clusters will not be merged; it marks the level at which to cut the dendrogram tree.

n_clusters is the number of clusters to find. The two parameters are mutually exclusive: when distance_threshold is set, n_clusters must be None, and vice versa, as sketched below.
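
A minimal sketch of the two configurations (the variable names are our own):

from sklearn.cluster import AgglomerativeClustering

# build the full merge tree, e.g., for plotting a dendrogram
full_tree = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

# or ask directly for a fixed number of flat clusters
fixed_k = AgglomerativeClustering(n_clusters=2)

The full example below uses the first configuration to build the complete tree.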

# import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram

# generate a coordinates array for six samples (A through F in the example above)
X = np.array([[1.3, 4.8], [2.3, 5.5], [3.6, 1.3], [6.1, 5.1], [6.2, 2.5], [6.7, 3.4]])

# create the clustering model; distance_threshold=0 with n_clusters=None builds the full tree
clustering_model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

# call fit method with array of sample coordinates passed as a parameter
trained_model = clustering_model.fit(X)


# A method for generating dendrogram
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    # stack children, merge distances, and sample counts into scipy's linkage-matrix format
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)


# plot dendrogram to visualize clusters
plot_dendrogram(trained_model)
plt.show()

Clustering Result

Note: To run the above code, use scikit-learn version >= 0.22.
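
As the extra step suggested in the introduction, here is a minimal sketch of a scatter plot of the resulting clusters. Cutting the tree at n_clusters=2 is our own reading of the dendrogram's top-level split, not a value given in the original example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.3, 4.8], [2.3, 5.5], [3.6, 1.3], [6.1, 5.1], [6.2, 2.5], [6.7, 3.4]])

# cut the hierarchy into two flat clusters and get one label per sample
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Agglomerative clustering with n_clusters=2")
plt.show()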


Source: Eman Ijaz, https://www.educative.io/answers/what-is-agglomerative-clustering
Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
