This section continues the example presented in the previous section on K-means. In addition to discussing code for implementing agglomerative clustering, it also covers various accuracy measures useful for analyzing clustering performance.
Choosing the Number of Clusters for K-means
In contrast to supervised learning tasks such as classification and regression, clustering requires more effort to choose the optimization criterion. Usually, when working with k-means, we optimize the sum of squared distances between the observations and their centroids:

$$J(C) = \sum_{k=1}^K \sum_{i \in C_k} \|x_i - \mu_k\|^2 \rightarrow \min\limits_C,$$

where $C$ is a set of clusters with cardinality $K$, and $\mu_k$ is the centroid of cluster $C_k$.
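As a sanity check, this objective is exactly what scikit-learn exposes as the inertia_ attribute of a fitted KMeans model. Below is a minimal sketch, using a hypothetical toy array X_toy (an assumption for illustration, not data from this example), that computes $J(C)$ by hand and compares it to the reported value.

import numpy as np
from sklearn.cluster import KMeans

# X_toy is a hypothetical toy dataset, used only for this check
X_toy = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X_toy)

# J(C): sum of squared distances from each point to its assigned centroid
J = sum(np.sum((X_toy[kmeans.labels_ == k] - center) ** 2)
        for k, center in enumerate(kmeans.cluster_centers_))

print(J, kmeans.inertia_)  # the two values should match (up to float error)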
This definition seems reasonable -- we want our observations to be as close to their centroids as possible. But, there is a problem -- the optimum is reached when the number of centroids is equal to the number of observations, so you would end up with every single observation as its own separate cluster.
In order to avoid that case, we should choose the number of clusters after which the function $J(C_k)$ decreases less rapidly. More formally,

$$D(k) = \frac{|J(C_k) - J(C_{k+1})|}{|J(C_{k-1}) - J(C_k)|} \rightarrow \min\limits_k.$$
Let's look at an example.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is the feature matrix defined in the previous section's example
inertia = []
for k in range(1, 8):
    kmeans = KMeans(n_clusters=k, random_state=1).fit(X)
    inertia.append(np.sqrt(kmeans.inertia_))

plt.plot(range(1, 8), inertia, marker='s')
plt.xlabel('$k$')
plt.ylabel('$J(C_k)$');

We see that $J(C_k)$ decreases significantly until the number of clusters reaches 3 and then changes much less after that. This means that the optimal number of clusters is 3.
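Rather than judging the elbow purely by eye, we can also evaluate the $D(k)$ criterion above directly on the values collected in the snippet. This is only a sketch and reuses the inertia list defined there.

# D(k) from the criterion above, reusing the `inertia` list collected
# in the previous snippet (inertia[i] corresponds to J(C_{i+1}))
D = {k: abs(inertia[k - 1] - inertia[k]) / abs(inertia[k - 2] - inertia[k - 1])
     for k in range(2, 7)}
print(D)
print('suggested k:', min(D, key=D.get))  # expected to be near 3 here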
Issues
Inherently, K-means is NP-hard. For $d$ dimensions, $k$ clusters, and $n$ observations, we will find a solution in $O(n^{dk+1})$ time. There are some heuristics to deal with this; an example is MiniBatch K-means, which takes portions (batches) of the data instead of fitting the whole dataset and then moves the centroids by taking the average of the previous steps. Compare the implementations of K-means and MiniBatch K-means in the scikit-learn documentation.
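As a rough illustration of that trade-off, the sketch below fits both estimators on synthetic blobs (an assumption for this example, not the data used above) and compares their run time and final inertia; exact numbers will vary by machine and dataset.

import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# synthetic data, assumed only for this comparison
X_blobs, _ = make_blobs(n_samples=100_000, centers=5, random_state=1)

for Model in (KMeans, MiniBatchKMeans):
    start = time.time()
    model = Model(n_clusters=5, random_state=1).fit(X_blobs)
    print(f'{Model.__name__}: {time.time() - start:.2f} s, '
          f'inertia={model.inertia_:.0f}')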
The scikit-learn implementation of the algorithm has its benefits, such as the possibility to state the number of initializations with the n_init parameter, which enables us to identify more robust centroids. Moreover, these runs can be done in parallel to decrease the computation time.
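For instance, here is a sketch, again on synthetic blobs assumed only for illustration, showing that increasing n_init typically lowers (or at least does not worsen) the best-of-runs inertia, since KMeans keeps the run with the lowest inertia.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data, assumed only for this illustration
X_blobs, _ = make_blobs(n_samples=1000, centers=5, cluster_std=3.0,
                        random_state=17)

# KMeans runs n_init initializations and keeps the lowest-inertia result
for n_init in (1, 10, 30):
    kmeans = KMeans(n_clusters=5, n_init=n_init, random_state=17).fit(X_blobs)
    print(f'n_init={n_init}: inertia={kmeans.inertia_:.1f}')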