CS250 Study Guide

Unit 8: Data Mining II – Clustering Techniques

8a. Explain unsupervised learning concepts 

  • What distinguishes unsupervised learning from supervised learning?
  • What kind of applications are well suited for unsupervised learning?
  • What is the main goal of a clustering algorithm?

Some classification problems provide a training set that consists only of input data, without any classification labels (or "output targets") attached. Under these circumstances, algorithms are required to partition the input data into subsets that, ideally, reveal important structure in the data. Whereas supervised learning algorithms require preassigned targets, unsupervised learning algorithms do not. For example, recognizing images of license plate numbers or distinguishing between images of apples and oranges would be considered supervised learning problems because the classes are known in advance. On the other hand, classifying elements within images of natural scenery would be an example of an unsupervised learning problem, because elements not previously categorized could arise within an image.

Clustering algorithms offer a highly successful approach to solving unsupervised classification problems. They let the data reveal what the categories are in a principled way (rather than the reverse). Hence, a major theme across all clustering techniques is the attempt to group similar data points into the same subset. Under ideal circumstances, the derived clusters will have small intra-set distances and large inter-set distances. Some clustering techniques require the computation of a centroid (or "mean vector"), a vector whose components are the empirical means of each component (or "variable") computed across all observations. Finally, since clustering algorithms are driven purely by the data rather than by preassigned labels, one potential drawback is that the resulting clusters may not correspond to definable targets whose meaning can be interpreted.
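For instance, a centroid is just the per-variable mean taken across all observations. A minimal sketch using NumPy on a small hypothetical data matrix:

    import numpy as np

    # Hypothetical data: 4 observations, each with 2 variables
    X = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 0.0],
                  [3.0, 2.0]])

    # The centroid is the empirical mean of each variable
    # computed across all observations (i.e., down each column)
    centroid = X.mean(axis=0)
    print(centroid)  # [3. 2.]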

To review, see Unsupervised Learning.

 

8b. Apply methods in the scikit-learn module to perform unsupervised learning

  • What are the main clustering algorithms covered in this course?
  • What is the syntax for invoking these clustering algorithms in scikit-learn?
  • What input parameters and output attributes are applicable to these algorithms?

While many clustering approaches exist, two of the most well-known algorithms are K-means clustering and agglomerative clustering. Objects for implementing them can be instantiated using scikit-learn by invoking cluster.KMeans and cluster.AgglomerativeClustering. The n_clusters input parameter specifies the number of clusters to be computed. After applying the fit method, cluster memberships of the training data can be determined using the labels_ output attribute.
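A minimal sketch of this syntax, using a small hypothetical dataset:

    import numpy as np
    from sklearn import cluster

    # Hypothetical 2-D data with two well-separated groups
    X = np.array([[1, 1], [1, 2], [2, 1],
                  [8, 8], [8, 9], [9, 8]])

    # K-means: specify n_clusters, fit, then read cluster memberships
    km = cluster.KMeans(n_clusters=2, n_init=10, random_state=0)
    km.fit(X)
    print(km.labels_)   # e.g., [0 0 0 1 1 1]

    # Agglomerative clustering follows the same pattern
    ac = cluster.AgglomerativeClustering(n_clusters=2)
    ac.fit(X)
    print(ac.labels_)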

The K-means algorithm requires the number of clusters as an input. If the number of clusters is unknown, then it is common to test a series of cluster counts to numerically determine the optimal cluster composition. The inertia_ attribute is useful for this purpose, as it measures the sum of squared distances of observations to their closest cluster center. The agglomerative clustering algorithm also accepts the number of clusters as an input parameter. The set-to-set distance used to decide which clusters to merge can be chosen via the linkage parameter, which allows for typical choices such as 'ward', 'complete', 'average', or 'single', where 'ward' is the default value. The affinity parameter can be used to choose the distance metric, which can be set to typical values such as 'euclidean', 'manhattan', 'l1', or 'l2'; it also allows for precomputed metrics. However, if Ward linkage has been chosen, the affinity parameter must be set to 'euclidean'.
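As an illustration, one might sweep n_clusters and inspect inertia_ to look for an "elbow" where the improvement levels off. The data below are hypothetical, and note that recent scikit-learn releases rename the affinity parameter to metric:

    import numpy as np
    from sklearn import cluster

    X = np.random.RandomState(0).rand(50, 2)   # hypothetical data

    # Elbow method: inertia_ decreases as n_clusters grows;
    # look for the bend where further clusters add little
    for k in range(1, 6):
        km = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, km.inertia_)

    # Agglomerative clustering with an explicit linkage and metric
    # ('affinity' is called 'metric' in newer scikit-learn versions)
    ac = cluster.AgglomerativeClustering(n_clusters=3,
                                         linkage='complete',
                                         affinity='manhattan')
    print(ac.fit(X).labels_)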

To review, see K-means Clustering and Hierarchical Clustering.

 

8c. Explain similarities and differences between hierarchical clustering and K-means clustering 

  • How is K-means clustering similar to agglomerative clustering?
  • How does K-means clustering differ from agglomerative clustering?
  • How do these differences translate into application and syntax?

Both K-means and agglomerative clustering algorithms are distance-based techniques. K-means clustering is an iterative algorithm that uses point-to-centroid distances to define cluster membership, where each cluster is represented by its centroid. Agglomerative clustering uses set distances (linkages) as the criteria for merging clusters. Furthermore, the agglomerative algorithm constructs a tree (or dendrogram) as clusters are successively merged according to a distance matrix.

The most generic forms of the K-means algorithm require specifying the number of clusters. You should be familiar with the inertia_ output attribute for helping to determine a sensible cluster number. The agglomerative clustering algorithm accepts the number of clusters as an input; however, it can also be used to determine the number of clusters. This can be done, for example, by computing the full tree (by setting the compute_full_tree parameter to True). The distances_ output attribute can then be used to choose a distance_threshold so that a cutoff can be applied to determine the number of clusters. The associated dendrogram can help visualize the cutoff point.
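A minimal sketch of this workflow on hypothetical data (in scikit-learn, distances_ is populated when a distance_threshold is supplied or compute_distances=True, and n_clusters must then be None):

    import numpy as np
    from sklearn import cluster

    X = np.random.RandomState(0).rand(20, 2)   # hypothetical data

    # Build the full tree and record every merge distance
    # (distance_threshold=0 keeps all points separate but fills distances_)
    ac = cluster.AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=0.0,
                                         compute_full_tree=True)
    ac.fit(X)
    print(ac.distances_)     # merge distances, useful for picking a cutoff

    # Re-fit with a chosen cutoff; the threshold now implies the cluster count
    ac2 = cluster.AgglomerativeClustering(n_clusters=None,
                                          distance_threshold=0.5)
    ac2.fit(X)
    print(ac2.n_clusters_)   # number of clusters implied by the cutoff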

To review, see Hierarchical Clustering.

 

8d. Validate models using clustering techniques 

  • How can clustering methods be applied to make meaningful predictions?
  • Which clustering technique allows for the use of the predict method?
  • How can supervised learning play a role after a set of clusters have been determined?

After applying a clustering technique, the silhouette_score function from sklearn.metrics compares the mean intra-cluster distance with the mean nearest-cluster distance, allowing for a principled way of evaluating the clusters. In general, once the set of clusters has been decided upon and labels have been assigned (that is, once the model has been fit), the unsupervised problem has been transformed into a supervised classification problem. Hence, you should be comfortable applying any supervised learning method introduced in this course as a classification scheme for the derived clusters.
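A minimal sketch on hypothetical data:

    import numpy as np
    from sklearn import cluster
    from sklearn.metrics import silhouette_score

    X = np.random.RandomState(0).rand(40, 2)   # hypothetical data

    km = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # Scores range from -1 to 1; values near 1 indicate tight,
    # well-separated clusters
    print(silhouette_score(X, km.labels_))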

The KMeans class allows for the use of the predict method, where cluster membership for a given test vector is decided by choosing the closest centroid. This effectively reduces the class membership problem to a 1-nearest neighbor problem over the centroids. Agglomerative clustering does not allow for the use of the predict method. However, you should feel comfortable taking the resulting clusters and, for example, computing a centroid vector for each one. Additionally, you should be prepared to apply the k-nearest neighbor algorithm using a set of clusters generated by either K-means or agglomerative clustering.
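A minimal sketch of both approaches on hypothetical data:

    import numpy as np
    from sklearn import cluster
    from sklearn.neighbors import KNeighborsClassifier

    X = np.random.RandomState(0).rand(30, 2)   # hypothetical training data
    X_test = np.array([[0.5, 0.5]])            # hypothetical test point

    # K-means supports predict: the test vector is assigned to the
    # nearest centroid
    km = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.predict(X_test))

    # Agglomerative clustering has no predict method, but its labels_
    # can serve as targets for a supervised classifier such as k-NN
    ac = cluster.AgglomerativeClustering(n_clusters=3).fit(X)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, ac.labels_)
    print(knn.predict(X_test))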

To review, see Training and Testing.

 

Unit 8 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • affinity
  • centroid
  • cluster.AgglomerativeClustering
  • cluster.KMeans
  • clustering
  • compute_full_tree
  • dendrogram
  • distance matrix
  • distances_
  • distance_threshold
  • inertia_
  • labels_
  • linkage
  • n_clusters
  • silhouette_score
  • unsupervised classification