8.2: K-means Clustering
The K-means algorithm attempts to optimally partition a data set into K clusters. The number of clusters, K, must be supplied as input to the algorithm (although many variations exist that attempt to estimate K). The main concept to grasp is the centroid of a set of training vectors. Assuming each training vector contains d features (that is, d-dimensional training vectors), a mean vector (or "centroid") for a set of vectors is formed by computing the empirical mean of each component separately. In this way the familiar scalar mean generalizes naturally to vector data.
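As a minimal sketch of the centroid idea, the following computes the component-wise mean of three 2-dimensional vectors with NumPy (the specific vectors are made up for illustration):

```python
import numpy as np

# Three 2-dimensional training vectors (d = 2), chosen for illustration.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# The centroid is the component-wise empirical mean:
# each feature is averaged independently across the vectors.
centroid = X.mean(axis=0)
print(centroid)  # [3. 4.]
```

Averaging along `axis=0` collapses the set of vectors into one vector, which is exactly the generalization of the scalar mean described above.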
You are now in a position to draw a direct line between the algorithm and its associated Python implementation. This particular example creates a training set using random data so that it becomes obvious how the algorithm works.
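One possible from-scratch implementation on random data is sketched below. The two-blob data set, the seed, and the helper name `kmeans` are assumptions for this sketch, not the tutorial's exact code; the loop alternates the standard assignment and update steps of Lloyd's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: two well-separated 2-D blobs,
# so the clustering the algorithm should find is obvious.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

def kmeans(X, k, n_iters=100):
    # Initialize centroids by picking k distinct training vectors.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans(X, k=2)
print(np.sort(centroids[:, 0]))
```

With this data the recovered centroids land near the blob centers at 0 and 5, which makes it easy to see the algorithm working.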
This tutorial is an excellent exercise for your Python coding skills because it shows how to implement the K-means algorithm from scratch and then how to implement it using scikit-learn. You will also learn how to evaluate clustering performance as a function of the parameter K. This is an important step because choosing the number of clusters is the biggest unknown in this algorithm.
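A common way to evaluate performance as a function of K is the "elbow" approach: fit scikit-learn's `KMeans` for several values of K and compare the inertia (within-cluster sum of squares). The synthetic three-blob data set below is an assumption for this sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters (an assumption for this sketch).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Inertia (within-cluster sum of squares) always decreases as K grows;
# the "elbow" where the improvement levels off suggests a good K.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

Because inertia decreases monotonically with K, the elbow (here at K = 3) marks the point of diminishing returns rather than a global optimum.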
Here is an example of applying K-means to cluster customer data. Study the code in depth to learn how to use visualization for interpreting the clustering results.
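A sketch of this workflow is shown below. The three customer segments, the feature names, and the output filename are assumptions made for illustration; real customer data would be loaded in place of the synthetic array:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the plot is written to a file
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical customer features: annual income (k$) and a spending score.
# Three synthetic segments stand in for a real customer data set.
segment_centers = [(30, 20), (60, 50), (90, 80)]
X = np.vstack([rng.normal(loc=c, scale=5.0, size=(40, 2))
               for c in segment_centers])

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Visualize: color each customer by cluster label and mark the centroids,
# which is what makes the segmentation interpretable.
plt.scatter(X[:, 0], X[:, 1], c=km.labels_, cmap="viridis", s=15)
plt.scatter(*km.cluster_centers_.T, c="red", marker="x", s=100)
plt.xlabel("Annual income (k$)")
plt.ylabel("Spending score")
plt.savefig("customer_clusters.png")
print(np.sort(km.cluster_centers_[:, 0]))
```

Plotting the labels together with the centroids is the key visualization step: each colored region can then be read off as a customer segment.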
This tutorial introduces several examples, including an analysis of handwritten digits, and then applies PCA to reduce the dimensionality of the data set. Observe how it connects with the PCA programming concepts introduced in the previous unit.
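The digits pipeline can be sketched as follows: project the 64-dimensional digit images to two principal components, then cluster the projected points with K = 10 (one cluster per digit class). The choice of two components is an assumption made so the result can be plotted:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()            # 1797 samples, 64 features (8x8 images)
X = digits.data

# Reduce 64 dimensions to 2 with PCA (as in the previous unit),
# then run K-means on the projected data with K = 10.
X2 = PCA(n_components=2).fit_transform(X)
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X2)

print(X.shape, X2.shape, len(set(km.labels_)))
```

Note that the cluster labels are arbitrary integers, not digit identities; matching clusters to digits requires a separate step (for example, majority vote against the true labels).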