8.2: K-means Clustering
The K-means algorithm attempts to optimally partition a data set into K clusters. The number of clusters, K, must be supplied as input to the algorithm (although many variations exist that attempt to estimate K). The main concept to grasp is the centroid of a set of training vectors. Assuming each training vector contains d features (that is, d-dimensional training vectors), a mean vector (or "centroid") for a set of vectors is formed by computing the empirical mean of each component separately. In this way the familiar scalar mean generalizes naturally to vector data.
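As a minimal sketch of the centroid idea, the following computes the component-wise mean of three 2-dimensional vectors with NumPy (the specific vectors are made up for illustration):

```python
import numpy as np

# Three 2-dimensional training vectors (d = 2), chosen for illustration.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# The centroid is the component-wise empirical mean:
# each feature is averaged independently across the vectors.
centroid = X.mean(axis=0)
print(centroid)  # [3. 4.]
```

Averaging along `axis=0` collapses the set of vectors into one vector, which is exactly the generalization of the scalar mean described above.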
You are now in a position to draw a direct line between the algorithm and its associated Python implementation. This particular example creates a training set using random data so that it becomes obvious how the algorithm works.
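One possible from-scratch implementation on random data is sketched below. The two-blob data set, the seed, and the helper name `kmeans` are assumptions for this sketch, not the tutorial's exact code; the loop alternates the standard assignment and update steps of Lloyd's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: two well-separated 2-D blobs,
# so the clustering the algorithm should find is obvious.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

def kmeans(X, k, n_iters=100):
    # Initialize centroids by picking k distinct training vectors.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans(X, k=2)
print(np.sort(centroids[:, 0]))
```

With this data the recovered centroids land near the blob centers at 0 and 5, which makes it easy to see the algorithm working.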
This tutorial is an excellent exercise for your Python coding skills because it shows how to implement the K-means algorithm from scratch and then how to implement it using scikit-learn. You will also learn how to evaluate clustering performance as a function of the parameter K. This is an important step because choosing the number of clusters is the biggest unknown in this algorithm.
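A common way to evaluate performance as a function of K is the "elbow" approach: fit scikit-learn's `KMeans` for several values of K and compare the inertia (within-cluster sum of squares). The synthetic three-blob data set below is an assumption for this sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters (an assumption for this sketch).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Inertia (within-cluster sum of squares) always decreases as K grows;
# the "elbow" where the improvement levels off suggests a good K.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

Because inertia decreases monotonically with K, the elbow (here at K = 3) marks the point of diminishing returns rather than a global optimum.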
Here is an example of applying K-means to cluster customer data. Study the code in depth to learn how to use visualization for interpreting the clustering results.
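A sketch of this workflow is shown below. The three customer segments, the feature names, and the output filename are assumptions made for illustration; real customer data would be loaded in place of the synthetic array:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the plot is written to a file
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical customer features: annual income (k$) and a spending score.
# Three synthetic segments stand in for a real customer data set.
segment_centers = [(30, 20), (60, 50), (90, 80)]
X = np.vstack([rng.normal(loc=c, scale=5.0, size=(40, 2))
               for c in segment_centers])

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Visualize: color each customer by cluster label and mark the centroids,
# which is what makes the segmentation interpretable.
plt.scatter(X[:, 0], X[:, 1], c=km.labels_, cmap="viridis", s=15)
plt.scatter(*km.cluster_centers_.T, c="red", marker="x", s=100)
plt.xlabel("Annual income (k$)")
plt.ylabel("Spending score")
plt.savefig("customer_clusters.png")
print(np.sort(km.cluster_centers_[:, 0]))
```

Plotting the labels together with the centroids is the key visualization step: each colored region can then be read off as a customer segment.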
This tutorial introduces several examples, including an analysis of handwritten digits, and then applies PCA to reduce the dimensionality of the data set. Observe how it connects with the PCA programming concepts introduced in the previous unit.
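The digits pipeline can be sketched as follows: project the 64-dimensional digit images to two principal components, then cluster the projected points with K = 10 (one cluster per digit class). The choice of two components is an assumption made so the result can be plotted:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()            # 1797 samples, 64 features (8x8 images)
X = digits.data

# Reduce 64 dimensions to 2 with PCA (as in the previous unit),
# then run K-means on the projected data with K = 10.
X2 = PCA(n_components=2).fit_transform(X)
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X2)

print(X.shape, X2.shape, len(set(km.labels_)))
```

Note that the cluster labels are arbitrary integers, not digit identities; matching clusters to digits requires a separate step (for example, majority vote against the true labels).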