Putting It All Together

This tutorial is a culminating project that combines the concepts (clustering, dimensionality reduction, cluster evaluation, and visualization) presented in this unit. Work through the programming examples to ensure you have a complete understanding.

Introduction

The main feature of unsupervised learning algorithms, when compared to classification and regression methods, is that input data are unlabeled (i.e. no labels or classes are given) and that the algorithm learns the structure of the data without any assistance. This creates two main differences. First, it allows us to process large amounts of data because the data does not need to be manually labeled. Second, it is difficult to evaluate the quality of an unsupervised algorithm due to the absence of an explicit goodness metric as used in supervised learning.

One of the most common tasks in unsupervised learning is dimensionality reduction. On one hand, dimensionality reduction may help with data visualization (e.g. the t-SNE method) while, on the other hand, it may help deal with the multicollinearity of your data and prepare the data for a supervised learning method (e.g. decision trees).


Intuition, theories, and application issues

Principal Component Analysis is one of the easiest, most intuitive, and most frequently used methods for dimensionality reduction, projecting data onto its orthogonal feature subspace.


More generally speaking, all observations can be considered as an ellipsoid in a subspace of an initial feature space, and the new basis set in this subspace is aligned with the ellipsoid axes. This assumption lets us remove highly correlated features since basis set vectors are orthogonal. In the general case, the resulting ellipsoid dimensionality matches the initial space dimensionality, but the assumption that our data lies in a subspace with a smaller dimension allows us to cut off the "excessive" space with the new projection (subspace). We accomplish this in a 'greedy' fashion, sequentially selecting each of the ellipsoid axes by identifying where the dispersion is maximal.

"To deal with hyper-planes in a 14 dimensional space, visualize a 3D space and say 'fourteen' very loudly. Everyone does it." - Geoffrey Hinton

Let's take a look at the mathematical formulation of this process:

In order to decrease the dimensionality of our data from n to k with k \leq n, we sort our list of axes in order of decreasing dispersion and take the top-k of them.

We begin by computing the dispersion and the covariance of the initial features. This is usually done with the covariance matrix. According to the definition, the covariance of two features is computed as follows: cov(X_i, X_j) = E[(X_i - \mu_i) (X_j - \mu_j)] = E[X_i X_j] - \mu_i \mu_j, where \mu_i is the expected value of the i-th feature. It is worth noting that the covariance is symmetric, and the covariance of a feature with itself is equal to its dispersion.

Therefore, the covariance matrix is symmetric, with the dispersions of the corresponding features on the diagonal. The off-diagonal values are the covariances of the corresponding pairs of features. In terms of matrices where \mathbf{X} is the matrix of observations, the covariance matrix is as follows:

\Sigma = E[(\mathbf{X} - E[\mathbf{X}]) (\mathbf{X} - E[\mathbf{X}])^{T}]
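As a quick sketch (an illustrative addition, not part of the original article), we can check these properties numerically: the covariance matrix of a small random sample, computed from the centered observations, is symmetric, carries the per-feature dispersions on its diagonal, and agrees with NumPy's np.cov.

# Illustrative sketch: the covariance matrix of a small random sample
import numpy as np

rng = np.random.RandomState(0)
X_sample = rng.normal(size=(100, 3))                  # 100 observations, 3 features

X_c = X_sample - X_sample.mean(axis=0)                # subtract the per-feature mean
cov = X_c.T @ X_c / (X_sample.shape[0] - 1)

print(np.allclose(cov, cov.T))                                      # symmetric
print(np.allclose(np.diag(cov), X_sample.var(axis=0, ddof=1)))      # diagonal = dispersions
print(np.allclose(cov, np.cov(X_sample, rowvar=False)))             # agrees with np.cov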

Quick recap: matrices, as linear operators, have eigenvalues and eigenvectors. They are very convenient because they describe the parts of our space that do not rotate but only stretch when we apply a linear operator to them: eigenvectors remain in the same direction but are stretched by the corresponding eigenvalue. Formally, a matrix M with eigenvector w_i and eigenvalue \lambda_i satisfies the equation M w_i = \lambda_i w_i.
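To make the recap concrete, here is a tiny sketch (again an illustrative addition; the matrix is arbitrary) that verifies the defining equation with np.linalg.eigh, the routine meant for symmetric matrices such as covariance matrices.

# Illustrative sketch: verifying M w_i = lambda_i w_i for a small symmetric matrix
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 3.0]])                       # an arbitrary symmetric matrix

eigenvalues, eigenvectors = np.linalg.eigh(M)    # eigh is intended for symmetric matrices

for lam, w in zip(eigenvalues, eigenvectors.T):  # the eigenvectors are the columns
    print(np.allclose(M @ w, lam * w))           # True for every (eigenvalue, eigenvector) pair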

For a centered sample \mathbf{X}, the covariance matrix is proportional to \mathbf{X}^{T} \mathbf{X}. According to the Rayleigh quotient, the maximum variation of our sample lies along the eigenvector of this matrix that corresponds to the maximal eigenvalue. Therefore, the principal components we aim to retain from the data are just the eigenvectors corresponding to the top-k largest eigenvalues of the matrix.

The next steps are easier to digest. We multiply the matrix of our data X by these components to get the projection of our data onto the orthogonal basis of the chosen components. If the number of components is smaller than the dimensionality of the initial space, remember that we lose some information when applying this transformation.
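Before moving on to the examples, here is a short end-to-end sketch of these steps (an illustrative addition, not the original code): center the data, eigendecompose the covariance matrix, keep the top-k eigenvectors, and project. The result should coincide with scikit-learn's PCA up to the signs of the components.

# Sketch: PCA "by hand" via the covariance matrix, compared against sklearn
import numpy as np
from sklearn import datasets, decomposition

X = datasets.load_iris().data
k = 2

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort the axes by decreasing dispersion and keep the top-k eigenvectors
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:k]]          # shape: (n_features, k)

# 4. Project the data onto the chosen components
X_manual = X_centered @ components

# Compare with sklearn (component signs may differ, so compare absolute values)
X_sklearn = decomposition.PCA(n_components=k).fit_transform(X)
print(np.allclose(np.abs(X_manual), np.abs(X_sklearn)))   # True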


Examples

Fisher's iris dataset

Let's start by importing all of the essential modules and trying out the iris example from the scikit-learn documentation.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style='white')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from sklearn import decomposition
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D

# Loading the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Let's create a beautiful 3d-plot
fig = plt.figure(1, figsize=(6, 5))
plt.clf()
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=48, azim=134)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Change the order of labels, so that they match
y_clr = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y_clr, 
           cmap=plt.cm.nipy_spectral)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([]);



Now let's see how PCA will improve the results of a simple model that is not able to correctly fit all of the training data:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Train, test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, 
                                                    stratify=y, 
                                                    random_state=42)

# Decision trees with depth = 2
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict_proba(X_test)
print('Accuracy: {:.5f}'.format(accuracy_score(y_test, 
                                                preds.argmax(axis=1))))

Accuracy: 0.88889

Let's try this again, but, this time, let's reduce the dimensionality to 2 dimensions:

# Using PCA from sklearn's decomposition module
pca = decomposition.PCA(n_components=2)
X_centered = X - X.mean(axis=0)
pca.fit(X_centered)
X_pca = pca.transform(X_centered)

# Plotting the results of PCA
plt.plot(X_pca[y == 0, 0], X_pca[y == 0, 1], 'bo', label='Setosa')
plt.plot(X_pca[y == 1, 0], X_pca[y == 1, 1], 'go', label='Versicolour')
plt.plot(X_pca[y == 2, 0], X_pca[y == 2, 1], 'ro', label='Virginica')
plt.legend(loc=0);

# Test-train split and apply PCA
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=.3, 
                                                    stratify=y, 
                                                    random_state=42)

clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict_proba(X_test)
print('Accuracy: {:.5f}'.format(accuracy_score(y_test, 
                                                preds.argmax(axis=1))))

Accuracy: 0.91111

The accuracy did not increase significantly in this case, but, with other datasets that have a large number of dimensions, PCA can drastically improve the accuracy of decision trees and ensemble methods.

Now let's check out the percent of variance that can be explained by each of the selected components.

for i, component in enumerate(pca.components_):
    print("{} component: {}% of initial variance".format(i + 1, 
          round(100 * pca.explained_variance_ratio_[i], 2)))
    print(" + ".join("%.3f x %s" % (value, name)
                     for value, name in zip(component,
                                            iris.feature_names)))

1 component: 92.46% of initial variance
0.361 x sepal length (cm) + -0.085 x sepal width (cm) + 0.857 x petal length (cm) + 0.358 x petal width (cm)
2 component: 5.31% of initial variance
0.657 x sepal length (cm) + 0.730 x sepal width (cm) + -0.173 x petal length (cm) + -0.075 x petal width (cm)

Handwritten numbers dataset

Let's look at the handwritten numbers dataset that we used before in the 3rd lesson.

digits = datasets.load_digits()
X = digits.data
y = digits.target

Let's start by visualizing our data by fetching the first 10 digits. The digits are represented as 8 x 8 matrices of pixel intensities. Every matrix is flattened into a vector of 64 numbers, which gives us the feature representation of the data.

plt.figure(figsize=(16, 6))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X[i,:].reshape([8,8]), cmap='gray');



Our data has 64 dimensions, but we are going to reduce it to just 2; even with only 2 dimensions, the digits visibly separate into clusters.

pca = decomposition.PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print('Projecting %d-dimensional data to 2D' % X.shape[1])

plt.figure(figsize=(12,10))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, 
            edgecolor='none', alpha=0.7, s=40,
            cmap=plt.get_cmap('nipy_spectral', 10))
plt.colorbar()
plt.title('Handwritten digits. PCA projection');


Projecting 64-dimensional data to 2D
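As a small follow-up sketch (an addition, not part of the original walkthrough), the explained_variance_ratio_ attribute can also tell us how many components are needed to keep a given share of the variance of the 64-dimensional digits data; the 90% threshold below is an arbitrary choice, and X and the imports are reused from the cells above.

# Sketch: how many principal components retain 90% of the variance of the digits data?
cumulative = np.cumsum(decomposition.PCA().fit(X).explained_variance_ratio_)
n_components_90 = int(np.argmax(cumulative >= 0.9)) + 1
print('Components needed to keep 90% of the variance:', n_components_90)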


Source: Sergey Korolev, https://web.archive.org/web/20210804210309/https://mlcourse.ai/articles/topic7-unsupervised/
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.
