This tutorial is a culminating project that combines the concepts (clustering, dimensionality reduction, cluster evaluation, and visualization) presented in this unit. Work through the programming examples to ensure you have a complete understanding.
Introduction
The main feature of unsupervised learning algorithms, when compared to classification and regression methods, is that input data are unlabeled (i.e. no labels or classes are given) and that the algorithm learns the structure of the data without any assistance. This creates two main differences. First, it allows us to process large amounts of data because the data does not need to be manually labeled. Second, it is difficult to evaluate the quality of an unsupervised algorithm due to the absence of an explicit goodness metric as used in supervised learning.
One of the most common tasks in unsupervised learning is dimensionality reduction. On one hand, dimensionality reduction may help with data visualization (e.g. the t-SNE method); on the other hand, it may help deal with multicollinearity in your data and prepare the data for a supervised learning method (e.g. decision trees).
Intuition, theories, and application issues
Principal Component Analysis is one of the easiest, most intuitive, and most frequently used methods for dimensionality reduction, projecting data onto its orthogonal feature subspace.
More generally speaking, all observations can be considered as an ellipsoid in a subspace of an initial feature space, and the new basis set in this subspace is aligned with the ellipsoid axes. This assumption lets us remove highly correlated features since basis set vectors are orthogonal. In the general case, the resulting ellipsoid dimensionality matches the initial space dimensionality, but the assumption that our data lies in a subspace with a smaller dimension allows us to cut off the "excessive" space with the new projection (subspace). We accomplish this in a 'greedy' fashion, sequentially selecting each of the ellipsoid axes by identifying where the dispersion is maximal.
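The ellipsoid picture is easy to reproduce on synthetic data. The sketch below is an added illustration (with made-up toy data, not part of the original text): it samples correlated 2D Gaussian points and lets scikit-learn's PCA recover the two orthogonal axes of the ellipsoid together with the dispersion along each.

# Added sketch: recover the "ellipsoid axes" of correlated 2D Gaussian data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
cov = np.array([[3.0, 2.0],
                [2.0, 2.0]])                  # strongly correlated features
X_toy = rng.multivariate_normal(mean=[0, 0], cov=cov, size=500)

pca_toy = PCA(n_components=2).fit(X_toy)
print(pca_toy.components_)                    # orthogonal axes of the ellipsoid
print(pca_toy.explained_variance_)            # dispersion along each axis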
"To deal with hyper-planes in a 14 dimensional space, visualize a 3D space and say 'fourteen' very loudly. Everyone does it." - Geoffrey Hinton
Let's take a look at the mathematical formulation of this process:
In order to decrease the dimensionality of our data from $n$ to $k$ with $k \leq n$, we sort our list of axes in order of decreasing dispersion and take the top-$k$ of them.
We begin by computing the dispersion and the covariance of the initial features. This is usually done with the covariance matrix. According to the covariance definition, the covariance of two features $X_i$ and $X_j$ is computed as follows: $cov(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)] = E[X_i X_j] - \mu_i \mu_j$, where $\mu_i$ is the expected value of the $i$th feature. It is worth noting that the covariance is symmetric, and the covariance of a vector with itself is equal to its dispersion.

Therefore the covariance matrix is symmetric with the dispersion of the corresponding features on the diagonal. Non-diagonal values are the covariances of the corresponding pair of features. In terms of matrices where $\mathbf{X}$ is the matrix of observations, the covariance matrix is as follows:

$$\Sigma = E[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^{T}]$$
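As a quick sanity check, the covariance matrix of the (centered) iris features can be computed directly and compared with numpy's built-in estimator. This snippet is an addition to the text, not part of the original example.

# Added sketch: compute the covariance matrix by hand and compare with numpy.
import numpy as np
from sklearn import datasets

X = datasets.load_iris().data                 # (150, 4) matrix of observations
X_centered = X - X.mean(axis=0)               # subtract the mean of each feature

# Unbiased sample covariance: X_c^T X_c / (n - 1)
cov_manual = X_centered.T @ X_centered / (X.shape[0] - 1)
cov_numpy = np.cov(X, rowvar=False)           # numpy's estimate of the same matrix

print(np.allclose(cov_manual, cov_numpy))     # True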
Quick recap: matrices, as linear operators, have eigenvalues and eigenvectors. They are very convenient because they describe parts of our space that do not rotate and only stretch when we apply linear operators on them; eigenvectors remain in the same direction but are stretched by a corresponding eigenvalue. Formally, a matrix $M$ with eigenvector $w_i$ and eigenvalue $\lambda_i$ satisfies the equation $M w_i = \lambda_i w_i$.
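To make the recap concrete, here is a small added sketch that numerically verifies the eigenvector equation for a symmetric matrix (the values of $M$ are arbitrary and chosen only for illustration).

# Added sketch: verify M w = lambda w with numpy.
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])                    # a small symmetric matrix

eigenvalues, eigenvectors = np.linalg.eigh(M) # eigh handles symmetric matrices
for lam, w in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(M @ w, lam * w))        # True for every pair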
The covariance matrix for a centered sample $\mathbf{X}$ can be written, up to a constant factor, as the product $\mathbf{X}^{T} \mathbf{X}$. According to the Rayleigh quotient, the maximum variation of our sample lies along the eigenvector of this matrix that corresponds to the maximal eigenvalue. Therefore, the principal components we aim to retain from the data are just the eigenvectors corresponding to the top-$k$ largest eigenvalues of the matrix.
The next steps are easier to digest. We multiply the matrix of our data by these components to get the projection of our data onto the orthogonal basis of the chosen components. If the number of components was smaller than the initial space dimensionality, remember that we will lose some information upon applying this transformation.
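To tie these steps together, here is an added end-to-end sketch that performs PCA on the iris data with plain numpy, following the recipe above, and checks the projection against scikit-learn's implementation. The sign of each column may differ, since eigenvectors are only defined up to a sign flip.

# Added sketch: PCA "by hand" on the iris data.
import numpy as np
from sklearn import datasets, decomposition

X = datasets.load_iris().data
k = 2                                              # target dimensionality

# 1. Center the data and build the covariance matrix
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (X.shape[0] - 1)

# 2. Eigendecomposition; keep the eigenvectors of the k largest eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(cov)
top_k = eigenvectors[:, np.argsort(eigenvalues)[::-1][:k]]

# 3. Project the centered data onto the chosen orthogonal basis
X_manual = X_centered @ top_k

# Compare with sklearn (allowing for per-column sign flips)
X_sklearn = decomposition.PCA(n_components=k).fit_transform(X)
print(np.allclose(np.abs(X_manual), np.abs(X_sklearn)))   # True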
Examples
Fisher's iris dataset
Let's start by importing all of the essential modules and trying out the iris example from the scikit-learn documentation.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style='white')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn import decomposition
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D

# Loading the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Let's create a beautiful 3d-plot
fig = plt.figure(1, figsize=(6, 5))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

# Change the order of labels, so that they match
y_clr = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y_clr, cmap=plt.cm.nipy_spectral)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([]);

Now let's see how PCA will improve the results of a simple model that is not able to correctly fit all of the training data:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Train, test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3,
                                                     stratify=y,
                                                     random_state=42)

# Decision trees with depth = 2
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict_proba(X_test)
print('Accuracy: {:.5f}'.format(accuracy_score(y_test, preds.argmax(axis=1))))
Accuracy: 0.88889
Let's try this again, but, this time, let's reduce the dimensionality to 2 dimensions:
# Using PCA from sklearn PCA
pca = decomposition.PCA(n_components=2)
X_centered = X - X.mean(axis=0)
pca.fit(X_centered)
X_pca = pca.transform(X_centered)

# Plotting the results of PCA
plt.plot(X_pca[y == 0, 0], X_pca[y == 0, 1], 'bo', label='Setosa')
plt.plot(X_pca[y == 1, 0], X_pca[y == 1, 1], 'go', label='Versicolour')
plt.plot(X_pca[y == 2, 0], X_pca[y == 2, 1], 'ro', label='Virginica')
plt.legend(loc=0);

# Test-train split and apply PCA
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=.3,
                                                     stratify=y,
                                                     random_state=42)

clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict_proba(X_test)
print('Accuracy: {:.5f}'.format(accuracy_score(y_test, preds.argmax(axis=1))))
Accuracy: 0.91111
The accuracy did not increase significantly in this case, but, with other datasets with a high number of dimensions, PCA can drastically improve the accuracy of decision trees and other ensemble methods.
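When PCA is used as a preprocessing step for a supervised model, a convenient pattern is to chain both into a single scikit-learn Pipeline, so that the projection is fit only on the training folds during cross-validation. The snippet below is an added sketch of that pattern, not part of the original example.

# Added sketch: PCA + decision tree chained in one Pipeline.
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = datasets.load_iris(return_X_y=True)

pipe = Pipeline([
    ('pca', PCA(n_components=2)),
    ('tree', DecisionTreeClassifier(max_depth=2, random_state=42)),
])
print(cross_val_score(pipe, X_iris, y_iris, cv=5).mean())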
Now let's check out the percent of variance that can be explained by each of the selected components.
for i, component in enumerate(pca.components_):
    print("{} component: {}% of initial variance".format(i + 1,
          round(100 * pca.explained_variance_ratio_[i], 2)))
    print(" + ".join("%.3f x %s" % (value, name)
                     for value, name in zip(component, iris.feature_names)))
1 component: 92.46% of initial variance
0.361 x sepal length (cm) + -0.085 x sepal width (cm) + 0.857 x petal length (cm) + 0.358 x petal width (cm)
2 component: 5.31% of initial variance
0.657 x sepal length (cm) + 0.730 x sepal width (cm) + -0.173 x petal length (cm) + -0.075 x petal width (cm)
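Together, the first two components explain roughly 97.8% of the variance. A natural follow-up question is how many components are worth keeping; one common heuristic, sketched below as an addition to the original example, is to fit PCA with all components and inspect the cumulative explained variance ratio.

# Added sketch: use cumulative explained variance to pick n_components.
import numpy as np
from sklearn import datasets, decomposition

X_iris = datasets.load_iris().data
pca_full = decomposition.PCA().fit(X_iris)         # keep all four components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)

# Smallest number of components explaining at least 95% of the variance
print(np.argmax(cumulative >= 0.95) + 1)

For the iris data this prints 2, since the first two components already cover well over 95% of the variance.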
Handwritten numbers dataset
Let's look at the handwritten numbers dataset that we used before in the 3rd lesson.
digits = datasets.load_digits()
X = digits.data
y = digits.target
Let's start by visualizing our data. Fetch the first 10 digits. Each digit is represented by an 8 x 8 matrix of pixel color intensities. Every matrix is flattened into a vector of 64 numbers, which gives us the feature representation of the data.
# f, axes = plt.subplots(5, 2, sharey=True, figsize=(16,6))
plt.figure(figsize=(16, 6))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X[i,:].reshape([8,8]), cmap='gray');

Our data has 64 dimensions, but we are going to reduce it to only 2 and see that, even with just 2 dimensions, the digits separate into visible clusters.
pca = decomposition.PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print('Projecting %d-dimensional data to 2D' % X.shape[1])

plt.figure(figsize=(12,10))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y,
            edgecolor='none', alpha=0.7, s=40,
            cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.colorbar()
plt.title('MNIST. PCA projection');
Projecting 64-dimensional data to 2D
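Since only 2 of the 64 components are kept, most of the detail is discarded. As a rough visual check, pca.inverse_transform maps the 2D projection back into the 64-dimensional pixel space, which makes the information loss visible. This sketch is an addition and reuses pca, X_reduced, and X from the cell above.

# Added sketch: reconstruct digits from their 2-component projection.
X_restored = pca.inverse_transform(X_reduced)      # back to 64 dimensions

plt.figure(figsize=(16, 6))
for i in range(5):
    # Top row: original 8x8 images; bottom row: 2-component reconstructions
    plt.subplot(2, 5, i + 1)
    plt.imshow(X[i, :].reshape([8, 8]), cmap='gray')
    plt.subplot(2, 5, i + 6)
    plt.imshow(X_restored[i, :].reshape([8, 8]), cmap='gray');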
Source: Sergey Korolev, https://web.archive.org/web/20210804210309/https://mlcourse.ai/articles/topic7-unsupervised/ This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.