Topic outline

  • Unit 8: Data Mining II – Clustering Techniques

    This unit extends the material in the previous unit to clustering techniques, which are useful for creating pattern classification models where the input classes are unknown (which we call unsupervised learning). When you finish this unit, you will be able to create programs capable of training and testing unsupervised learning models. As in the previous unit, we will implement these techniques using the scikit-learn module.

    The clustering of input feature vectors can be accomplished in several different ways. This unit focuses on two techniques: K-means, which requires some knowledge of the number of classes, and hierarchical clustering, which allows the input data to gradually define the number of classes. Both methodologies have their place within the field of data science.

    Completing this unit should take you approximately 5 hours.

    • Upon successful completion of this unit, you will be able to:

      • explain unsupervised learning concepts;
      • apply methods in the scikit-learn module to perform unsupervised learning;
      • explain similarities and differences between hierarchical clustering and K-means clustering; and
      • validate models using clustering techniques.
    • 8.1: Unsupervised Learning

      • Now that you have had a chance to understand and implement supervised data mining techniques, we can move on to unsupervised techniques. Unsupervised learning assumes no labels for training observations. We let the data tell us what its classification should be. This can be done using many approaches, but we will focus on clustering techniques in this course.

      • We will continue to use scikit-learn for implementations. The module contains several clustering methods; this unit focuses on K-means and agglomerative clustering. Follow along with the code for implementing these methods and begin getting used to the new syntax, previewed in the sketch below. The meaning of the clustering instructions will become clearer as the next sections unfold.
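
        As a preview, here is a minimal sketch of both estimators, assuming a tiny toy data set invented for illustration (the module documentation covers many more options):

        ```python
        import numpy as np
        from sklearn.cluster import KMeans, AgglomerativeClustering

        # Tiny toy data set with two obvious groups (values invented for illustration).
        X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
                      [8.0, 8.2], [7.9, 8.1], [8.1, 7.9]])

        # K-means needs the number of clusters up front.
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
        print("K-means labels:      ", km.labels_)

        # Agglomerative clustering merges points bottom-up into a hierarchy.
        agg = AgglomerativeClustering(n_clusters=2).fit(X)
        print("Agglomerative labels:", agg.labels_)
        ```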

    • 8.2: K-means Clustering

      • The K-means algorithm attempts to optimally partition a data set into K clusters. The number of clusters, K, must be supplied to the algorithm (although many variations exist that attempt to estimate K). The main concept to grasp is the centroid of a set of training vectors. Assuming each training vector contains d features (that is, d-dimensional training vectors), a mean vector (or "centroid") for a set of vectors can be formed by computing the empirical mean of each component separately. This is how the familiar scalar mean generalizes to vector-valued data, as the sketch below illustrates.
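
        Here is a minimal sketch of the centroid computation, assuming a small set of vectors invented for illustration:

        ```python
        import numpy as np

        # Five 3-dimensional training vectors (values invented for illustration).
        X = np.array([[1.0, 2.0, 3.0],
                      [2.0, 3.0, 4.0],
                      [0.0, 1.0, 2.0],
                      [4.0, 5.0, 6.0],
                      [3.0, 4.0, 5.0]])

        # The centroid is the component-wise empirical mean: one scalar mean per feature.
        centroid = X.mean(axis=0)
        print(centroid)  # [2. 3. 4.]
        ```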

      • You are now in a position to draw a direct line between the algorithm and its associated Python implementation. This particular example creates a training set using random data so that it becomes obvious how the algorithm works.
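
        In the same spirit, a minimal sketch of K-means on randomly generated data might look like this (the parameters here are illustrative, not those of the linked example):

        ```python
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs

        # Generate random training data with 3 planted clusters.
        X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

        km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
        print("Cluster centroids:\n", km.cluster_centers_)
        print("First 10 labels:", km.labels_[:10])
        ```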

      • This tutorial is an excellent exercise for your Python coding skills because it shows how to implement the K-means algorithm from scratch and then implement it using scikit-learn. You will also learn how to evaluate clustering performance as a function of the parameter K. This is an important step because the number of clusters is the biggest unknown in this algorithm; one common heuristic is sketched below.
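
        One common way to evaluate performance as a function of K is the elbow heuristic: sweep K and plot the within-cluster sum of squares (inertia). A minimal sketch, assuming synthetic blob data rather than the tutorial's data set:

        ```python
        import matplotlib.pyplot as plt
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs

        X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

        # Sweep K and record the inertia; the "elbow" in this curve
        # is a common heuristic for choosing the number of clusters.
        ks = range(1, 10)
        inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                    for k in ks]

        plt.plot(ks, inertias, marker="o")
        plt.xlabel("number of clusters K")
        plt.ylabel("inertia")
        plt.show()
        ```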

      • Here is an example of applying K-means to cluster customer data. Study the code in depth to learn how to use visualization for interpreting the clustering results.
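
        The visualization typically boils down to a scatter plot color-coded by cluster label. Here is a minimal sketch using synthetic data as a stand-in for the tutorial's actual customer data set:

        ```python
        import matplotlib.pyplot as plt
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs

        # Synthetic stand-in for two customer features (the tutorial uses real data).
        X, _ = make_blobs(n_samples=200, centers=4, random_state=1)

        km = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)

        # Color each point by its assigned cluster and mark the centroids.
        plt.scatter(X[:, 0], X[:, 1], c=km.labels_, s=20)
        plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
                    c="red", marker="x", s=100)
        plt.show()
        ```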

      • This tutorial introduces examples, including the analysis of handwritten digits, and then applies PCA to reduce the dimensionality of the data set. Observe how it connects with programming concepts introduced in the previous unit dealing with PCA.
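
        A minimal sketch of the digits-plus-PCA pipeline, with illustrative parameter choices rather than the tutorial's exact settings:

        ```python
        from sklearn.cluster import KMeans
        from sklearn.datasets import load_digits
        from sklearn.decomposition import PCA

        # 64-dimensional handwritten digit images, reduced to 2 components with PCA.
        X, y = load_digits(return_X_y=True)
        X2 = PCA(n_components=2, random_state=0).fit_transform(X)

        # Cluster in the reduced space; 10 clusters, one hoped-for per digit.
        km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X2)
        print("Inertia in PCA space:", km.inertia_)
        ```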

    • 8.3: Hierarchical Clustering

      • In this section, you will learn about hierarchical clustering and, in particular, agglomerative clustering. In contrast to K-means, this methodology does not require you to know the number of clusters in advance. This information is generated from a dendrogram created by the algorithm. As clusters of points are created, notions of the distance between two sets (that is, the "linkage") must be understood when applying this algorithm. You should already know how to compute the Euclidean distance between two points. This article also points out that there are many other ways to compute the distance between points (Manhattan, maximum, Mahalanobis, etc.). These point distances can in turn be used to compute the distance between two sets of points. For example, single linkage defines the set distance as the distance between the two closest points, complete linkage uses the two most distant points, average linkage takes the mean of all pairwise distances between the two sets, and so on. Read through this article to get an overview of hierarchical clustering; a small numerical sketch of these linkage definitions follows.
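
        A minimal sketch of the three linkage definitions, assuming two small point sets invented for illustration:

        ```python
        import numpy as np
        from scipy.spatial.distance import cdist

        # Two small point sets on a line (values invented for illustration).
        A = np.array([[0.0, 0.0], [1.0, 0.0]])
        B = np.array([[3.0, 0.0], [5.0, 0.0]])

        # All pairwise Euclidean distances between the two sets.
        D = cdist(A, B)

        print("single linkage (closest pair):      ", D.min())   # 2.0
        print("complete linkage (farthest pair):   ", D.max())   # 5.0
        print("average linkage (mean of all pairs):", D.mean())  # 3.5
        ```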

      • Here is a visual introduction to hierarchical clustering that walks you through a practical example.

      • Work through this example to draw a line from the agglomerative clustering algorithm to its equivalent Python implementation using scikit-learn. Pay attention to how the data sets are created and how they relate to each iteration as the clusters gradually form. Use the dendrogram to determine the best number of clusters and compare your result to the distribution of the original data. Try to take the extra step of generating a scatter plot using your visualization knowledge.
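
        A minimal sketch of this workflow, assuming synthetic blob data: SciPy draws the dendrogram to help you pick the number of clusters, and scikit-learn then fits the model with that number:

        ```python
        import matplotlib.pyplot as plt
        from scipy.cluster.hierarchy import dendrogram, linkage
        from sklearn.cluster import AgglomerativeClustering
        from sklearn.datasets import make_blobs

        X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

        # Draw a dendrogram to inspect where the cluster merges happen ...
        Z = linkage(X, method="ward")
        dendrogram(Z)
        plt.show()

        # ... then fit the agglomerative model with the chosen number of clusters
        # and visualize the result as a scatter plot.
        labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
        plt.scatter(X[:, 0], X[:, 1], c=labels, s=20)
        plt.show()
        ```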

      • This section continues the example presented in the previous section on K-means. In addition to discussing code for implementing agglomerative clustering, it also includes applications of various accuracy measures useful for analyzing clustering performance.
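
        Two widely used measures are the silhouette score (which needs no true labels) and the adjusted Rand index (which compares against known labels). A minimal sketch on synthetic data:

        ```python
        from sklearn.cluster import AgglomerativeClustering, KMeans
        from sklearn.datasets import make_blobs
        from sklearn.metrics import adjusted_rand_score, silhouette_score

        X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

        for name, model in [("K-means", KMeans(n_clusters=3, n_init=10, random_state=0)),
                            ("Agglomerative", AgglomerativeClustering(n_clusters=3))]:
            labels = model.fit_predict(X)
            # Silhouette uses only the data; adjusted Rand compares to true labels.
            print(name,
                  "silhouette:", round(silhouette_score(X, labels), 3),
                  "ARI:", round(adjusted_rand_score(y_true, labels), 3))
        ```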

    • 8.4: Training and Testing

      • It is time to put together concepts from this and the previous unit. This tutorial uses k-NN as the classifier, given clustering results from the K-means algorithm. In essence, an unsupervised method is used as the input to a method that requires supervised data.
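
        A minimal sketch of this two-step idea, assuming synthetic data: K-means produces pseudo-labels, and k-NN is then trained on them as if they were supervised labels:

        ```python
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier

        X, _ = make_blobs(n_samples=400, centers=3, random_state=0)
        X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

        # Step 1: unsupervised -- let K-means invent labels for the training data.
        pseudo_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_train)

        # Step 2: supervised -- train k-NN on the K-means labels, then classify new points.
        knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, pseudo_labels)
        print("Predicted clusters for test points:", knn.predict(X_test)[:10])
        ```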

      • This tutorial is a culminating project that combines the concepts (clustering, dimensionality reduction, cluster evaluation, and visualization) presented in this unit. Work through the programming examples to ensure you have a complete understanding.

    • Unit 8 Assessment

      • Take this assessment to see how well you understood this unit.

        • This assessment does not count towards your grade. It is just for practice!
        • You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
        • You can take this assessment as many times as you want, whenever you want.