Applying Clustering

Accuracy Metrics

Unlike classification, clustering results are hard to evaluate. First, a metric should not depend on the label values themselves, only on the quality of the split. Second, we usually do not have the true labels of the observations when we use clustering.

There are internal and external goodness metrics. External metrics use the information about the known true split, while internal metrics do not use any external information and assess the goodness of clusters based only on the initial data. The optimal number of clusters is usually defined with respect to some internal metrics.

All the metrics described below are implemented in sklearn.metrics.

Adjusted Rand Index (ARI)

Here, we assume that the true labels of objects are known. This metric does not depend on the labels' values, only on how the data are split into clusters. Let n be the number of observations in a sample. Let a be the number of observation pairs that have the same label and are located in the same cluster, and let b be the number of observation pairs that have different labels and are located in different clusters. The Rand Index can be calculated using the following formula:

\Large \text{RI} = \frac{2(a + b)}{n(n-1)}.

In other words, it evaluates the share of observation pairs for which the two splits (the initial one and the clustering result) are consistent; the denominator n(n-1)/2 is the total number of pairs. The Rand Index (RI) thus evaluates the similarity of two splits of the same sample. For this index to be close to zero for random clustering outcomes, whatever n and the number of clusters, it has to be rescaled, hence the Adjusted Rand Index:

\Large \text{ARI} = \frac{\text{RI} - E[\text{RI}]}{\max(\text{RI}) - E[\text{RI}]}.

This metric is symmetric and does not depend on label permutation. Therefore, this index is a measure of similarity between different splits of a sample. ARI takes on values in the [−1,1] range. Values close to zero indicate that the splits are independent (negative values mean less agreement than expected by chance), and positive values indicate that the splits are consistent (they match exactly when ARI=1).
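As a small illustration of the definitions above, the sketch below counts the agreeing pairs by hand to get RI and compares it with `adjusted_rand_score` from `sklearn.metrics`. The helper `rand_index` and the toy label vectors are made up for this example.

```python
from itertools import combinations

from sklearn.metrics import adjusted_rand_score


def rand_index(labels_true, labels_pred):
    """Plain Rand Index: share of observation pairs on which two splits agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # Counts both a (same label, same cluster) and
        # b (different labels, different clusters).
        agree += same_true == same_pred
    return agree / len(pairs)


y_true = [0, 0, 0, 1, 1, 1]   # toy "true" labels
y_pred = [1, 1, 0, 0, 2, 2]   # toy clustering result
y_perm = [2, 2, 1, 1, 0, 0]   # same split as y_pred with cluster ids renamed

ri = rand_index(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
ari_perm = adjusted_rand_score(y_true, y_perm)      # label permutation
ari_swapped = adjusted_rand_score(y_pred, y_true)   # arguments swapped
ari_perfect = adjusted_rand_score(y_true, [5, 5, 5, 9, 9, 9])

print(f"RI = {ri:.4f}, ARI = {ari:.4f}")
print(f"ARI after renaming clusters = {ari_perm:.4f}")
print(f"ARI with swapped arguments  = {ari_swapped:.4f}")
print(f"ARI for an identical split  = {ari_perfect:.4f}")
```

Here a = 2 and b = 8 out of 15 pairs, so RI = 10/15, while the chance-corrected ARI is noticeably lower. Renaming the clusters or swapping the two arguments leaves ARI unchanged.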

Adjusted Mutual Information (AMI)

This metric is similar to ARI. It is also symmetric and does not depend on the labels' values and permutation. It is defined through the entropy function and interprets a sample split as a discrete distribution (the probability of being assigned to a cluster equals the share of objects in it). The MI index is defined as the mutual information of the two distributions corresponding to the two splits of the sample into clusters. Intuitively, the mutual information measures the share of information common to both splits, i.e., how much knowing one of them decreases the uncertainty about the other.

The AMI is obtained from MI in the same way that ARI is obtained from RI: by correcting for the value expected by chance. This removes the MI index's tendency to grow with the number of clusters. The AMI is at most 1. Values close to zero (they can be slightly negative) mean the splits are independent, and those close to 1 mean they are similar (with a complete match at AMI=1).
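The effect of the adjustment can be checked on random labelings. In the sketch below, the sample size, the seed, and the numbers of clusters are arbitrary choices for the demonstration. It compares `mutual_info_score` with `adjusted_mutual_info_score` as the number of random clusters grows.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, mutual_info_score

rng = np.random.default_rng(0)           # fixed seed, chosen arbitrarily
y_true = rng.integers(0, 10, size=500)   # hypothetical "true" labels, 10 classes

scores = {}
for k in (2, 10, 50):
    # Random split into at most k clusters, independent of y_true
    y_rand = rng.integers(0, k, size=500)
    scores[k] = (
        mutual_info_score(y_true, y_rand),
        adjusted_mutual_info_score(y_true, y_rand),
    )
    print(f"k={k:2d}  MI={scores[k][0]:.4f}  AMI={scores[k][1]:.4f}")

# A split compared with itself is a complete match
ami_self = adjusted_mutual_info_score(y_true, y_true)
print(f"AMI(y_true, y_true) = {ami_self:.4f}")
```

The raw MI of independent splits should grow with k even though the splits share no real information, whereas the AMI should stay near zero.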

Homogeneity, completeness, V-measure

Formally, these metrics are also defined based on the entropy function and the conditional entropy function, interpreting the sample splits as discrete distributions:

\Large h = 1 - \frac{H(C\mid K)}{H(C)}, c = 1 - \frac{H(K\mid C)}{H(K)},

where K is a clustering result, and C is the initial split. Therefore, h evaluates whether each cluster is composed of objects of a single class, and c measures whether all objects of the same class end up in the same cluster. These metrics are not symmetric. Both lie in the [0,1] range, and values closer to 1 indicate more accurate clustering results. Unlike ARI and AMI, these metrics are not adjusted for chance and thus depend on the number of clusters. A random clustering result will not yield values close to zero when the number of clusters is large and the number of objects is small. In such a case, it would be more reasonable to use ARI. However, with a large number of observations (more than 100) and fewer than 10 clusters, this issue is less critical and can be ignored.

V-measure combines h and c as their harmonic mean:

\Large v = 2\frac{hc}{h+c}.

It is symmetric and measures how consistent two clustering results are.
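A toy case makes the difference between the three quantities concrete. In the sketch below, the label vectors are invented for illustration: each class is cut into two clusters, so every cluster is pure but no class is kept together.

```python
from sklearn.metrics import (
    completeness_score,
    homogeneity_score,
    v_measure_score,
)

classes = [0, 0, 0, 0, 1, 1, 1, 1]    # initial split C
clusters = [0, 0, 1, 1, 2, 2, 3, 3]   # clustering result K: each class cut in two

h = homogeneity_score(classes, clusters)    # 1.0: every cluster holds one class
c = completeness_score(classes, clusters)   # 0.5: H(K|C) = log 2, H(K) = log 4
v = v_measure_score(classes, clusters)      # 2 * h * c / (h + c) = 2/3

# Swapping the arguments swaps the roles of h and c; v does not change.
h_swapped = homogeneity_score(clusters, classes)
v_swapped = v_measure_score(clusters, classes)

print(f"h = {h:.3f}, c = {c:.3f}, v = {v:.3f}")
print(f"h with swapped arguments = {h_swapped:.3f}, v = {v_swapped:.3f}")
```

Homogeneity with swapped arguments equals completeness, which is why h and c are not symmetric individually while their harmonic mean is.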


Silhouette

In contrast to the metrics described above, the silhouette coefficient does not require knowledge of the true labels of the objects. It lets us estimate the quality of the clustering using only the initial, unlabeled sample and the clustering result. To start with, the silhouette coefficient is computed for each observation. Let a be the mean distance between an object and the other objects of its own cluster, and let b be the mean distance from the object to the objects of the nearest cluster (different from the one the object belongs to). Then the silhouette measure for this object is

\Large s = \frac{b - a}{\max(a, b)}.

The silhouette of a sample is the mean of the silhouette values of its objects. The silhouette thus shows how much the mean distance between objects of the same cluster differs from the mean distance to objects of the nearest other cluster. This coefficient takes values in the [−1,1] range. Values close to -1 correspond to bad clustering results, values around 0 indicate overlapping clusters, and values close to 1 correspond to dense, well-defined clusters. Therefore, the higher the silhouette value, the better the clustering result.
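To make the formula tangible, the sketch below computes a and b by hand for one object and checks the result against `silhouette_samples` and `silhouette_score`. The six one-dimensional points are invented for this example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two obvious clusters on a line: {0, 1, 2} and {10, 11, 12}
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Silhouette of the first object (x = 0), computed by hand
a = np.mean([1.0, 2.0])           # mean distance to the other objects of its cluster
b = np.mean([10.0, 11.0, 12.0])   # mean distance to the objects of the nearest cluster
s_manual = (b - a) / max(a, b)    # (11 - 1.5) / 11

s_all = silhouette_samples(X, labels)   # one silhouette value per object
s_mean = silhouette_score(X, labels)    # mean over all objects

print(f"manual s for x=0: {s_manual:.4f}, sklearn: {s_all[0]:.4f}")
print(f"sample silhouette: {s_mean:.4f}")
```

Note that a excludes the object itself, so it averages over the two other members of the cluster rather than all three.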

With the help of silhouette, we can identify the optimal number of clusters k (if we don't know it already from the data) by taking the number of clusters that maximizes the silhouette coefficient.
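One possible way to carry out this selection is sketched below. The synthetic blobs, the range of k, and the use of K-means are assumptions made for the demonstration; any clustering algorithm that takes the number of clusters could be substituted.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: four well-separated blobs (an assumption for the demo)
centers = [[-10, -10], [-10, 10], [10, -10], [10, 10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0,
                  random_state=0)

silhouettes = {}
for k in range(2, 9):   # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={silhouettes[k]:.3f}")

# Number of clusters with the highest mean silhouette
best_k = max(silhouettes, key=silhouettes.get)
print("best k:", best_k)
```

On data this cleanly separated, best_k should coincide with the number of generated blobs. On real data the maximum is often less pronounced, so it is worth looking at the whole curve rather than only its argmax.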

To conclude, let's take a look at how these metrics perform on the handwritten digits dataset bundled with scikit-learn (load_digits, 8x8 images of the digits 0 to 9):

from sklearn import metrics
from sklearn import datasets
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation, SpectralClustering

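# 8x8 digit images flattened to 64 features, plus the true digit labels
digits = datasets.load_digits()
X, y = digits.data, digits.target

# One estimator per row of the results table. The constructor arguments of
# AffinityPropagation and AgglomerativeClustering and the affinity setting
# of SpectralClustering are assumptions.
algorithms = []
algorithms.append(KMeans(n_clusters=10, random_state=1))
algorithms.append(AffinityPropagation())
algorithms.append(SpectralClustering(n_clusters=10, random_state=1,
                                     affinity='nearest_neighbors'))
algorithms.append(AgglomerativeClustering(n_clusters=10))

rows = []
for algo in algorithms:
    algo.fit(X)
    # External metrics compare labels_ with y; silhouette needs only X
    rows.append({
        'ARI': metrics.adjusted_rand_score(y, algo.labels_),
        'AMI': metrics.adjusted_mutual_info_score(y, algo.labels_),
        'Homogeneity': metrics.homogeneity_score(y, algo.labels_),
        'Completeness': metrics.completeness_score(y, algo.labels_),
        'V-measure': metrics.v_measure_score(y, algo.labels_),
        'Silhouette': metrics.silhouette_score(X, algo.labels_)})

results = pd.DataFrame(data=rows, columns=['ARI', 'AMI', 'Homogeneity',
                                           'Completeness', 'V-measure',
                                           'Silhouette'],
                       index=['K-means', 'Affinity',
                              'Spectral', 'Agglomerative'])

print(results)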


               ARI       AMI       Homogeneity  Completeness  V-measure  Silhouette
K-means        0.662295  0.736567  0.735448     0.742972      0.739191   0.182097
Affinity       0.175174  0.612460  0.958907     0.486901      0.645857   0.115197
Spectral       0.756461  0.852040  0.831691     0.876614      0.853562   0.182729
Agglomerative  0.794003  0.866832  0.857513     0.879096      0.868170