This tutorial works through examples, including an analysis of handwritten digits, and applies PCA to reduce the dimensionality of the data. Note how it connects with the programming concepts introduced in the previous unit on PCA.
Principal Component Analysis (PCA)
Intuition, theories, and application issues
Principal Component Analysis is one of the easiest, most intuitive, and most frequently used methods for dimensionality reduction, projecting data onto its orthogonal feature subspace.
More generally speaking, all observations can be considered as an ellipsoid in a subspace of an initial feature space, and the new basis set in this subspace is aligned with the ellipsoid axes. This assumption lets us remove highly correlated features since basis set vectors are orthogonal. In the general case, the resulting ellipsoid dimensionality matches the initial space dimensionality, but the assumption that our data lies in a subspace with a smaller dimension allows us to cut off the "excessive" space with the new projection (subspace). We accomplish this in a 'greedy' fashion, sequentially selecting each of the ellipsoid axes by identifying where the dispersion is maximal.
"To deal with hyper-planes in a 14 dimensional space, visualize a 3D space and say 'fourteen' very loudly. Everyone does it." - Geoffrey Hinton
Let's take a look at the mathematical formulation of this process:
In order to decrease the dimensionality of our data from $n$ to $k$ with $k \leq n$, we sort our list of axes in order of decreasing dispersion and take the top-$k$ of them.
We begin by computing the dispersion and the covariance of the initial features. This is usually done with the covariance matrix. According to the covariance definition, the covariance of two features $X_i$ and $X_j$ is computed as follows:

$$cov(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)] = E[X_i X_j] - \mu_i \mu_j,$$

where $\mu_i$ is the expected value of the $i$th feature. It is worth noting that the covariance is symmetric, and the covariance of a vector with itself is equal to its dispersion. Therefore the covariance matrix is symmetric with the dispersion of the corresponding features on the diagonal. Non-diagonal values are the covariances of the corresponding pair of features. In terms of matrices, where $\mathbf{X}$ is the matrix of observations, the covariance matrix is as follows:

$$\Sigma = E[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^{T}]$$
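The formula above is easy to verify numerically. Here is a minimal sketch (using a synthetic array and variable names of our choosing, not part of the original example) that computes the covariance matrix by hand and compares it with np.cov:

import numpy as np

rng = np.random.RandomState(0)
X_demo = rng.randn(100, 3)                 # 100 observations, 3 features

X_demo_centered = X_demo - X_demo.mean(axis=0)   # subtract each feature's mean
# Sample covariance matrix: (X - E[X])^T (X - E[X]) / (n - 1)
cov_manual = X_demo_centered.T @ X_demo_centered / (X_demo.shape[0] - 1)

# np.cov expects observations in rows when rowvar=False; it should agree
assert np.allclose(cov_manual, np.cov(X_demo, rowvar=False))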
Quick recap: matrices, as linear operators, have eigenvalues and eigenvectors. They are very convenient because they describe the parts of our space that do not rotate but only stretch when we apply a linear operator to them; eigenvectors remain in the same direction but are stretched by the corresponding eigenvalue. Formally, a matrix $M$ with eigenvector $w_i$ and eigenvalue $\lambda_i$ satisfies the equation $M w_i = \lambda_i w_i$.
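This eigen-equation can also be checked numerically; below is a small illustrative example with an arbitrarily chosen symmetric matrix (symmetric, like a covariance matrix):

import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # symmetric 2x2 matrix

# eigh is intended for symmetric matrices; eigenvectors are returned as columns
eigenvalues, eigenvectors = np.linalg.eigh(M)
for lam, w in zip(eigenvalues, eigenvectors.T):
    # M w = lambda w should hold for every eigenpair
    assert np.allclose(M @ w, lam * w)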
For a sample $\mathbf{X}$ with centered features, the covariance matrix can be written, up to a constant factor, as the product $\mathbf{X}^{T} \mathbf{X}$. According to the Rayleigh quotient, the maximum variation of our sample lies along the eigenvector of this matrix that corresponds to the maximum eigenvalue. Therefore, the principal components we aim to retain from the data are just the eigenvectors corresponding to the top-$k$ largest eigenvalues of the matrix.
The next steps are easier to digest. We multiply the matrix of our data by these components to get the projection of our data onto the orthogonal basis of the chosen components. If the number of components is smaller than the initial space dimensionality, keep in mind that we lose some information upon applying this transformation.
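Before moving on to library code, here is a rough from-scratch sketch of the whole procedure described above (centering, covariance, eigendecomposition, projection). The function name and the synthetic data are ours, for illustration only; the result should match a library implementation such as sklearn's PCA up to the signs of the components.

import numpy as np

def pca_project(X, k):
    """Project X onto its top-k principal components (illustrative sketch)."""
    X_centered = X - X.mean(axis=0)                  # center each feature
    cov = X_centered.T @ X_centered / (len(X) - 1)   # covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Take the eigenvectors (columns) for the k largest eigenvalues
    top_k = eigenvectors[:, np.argsort(eigenvalues)[::-1][:k]]
    return X_centered @ top_k                        # projection onto the new basis

# Example: reduce 5-dimensional synthetic data to 2 dimensions
X_demo = np.random.RandomState(17).randn(200, 5)
print(pca_project(X_demo, 2).shape)  # (200, 2)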
Examples
Fisher's iris dataset
Let's start by importing all of the essential modules and trying out the iris example from the scikit-learn documentation.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style='white')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from sklearn import decomposition
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D

# Loading the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Let's create a beautiful 3d-plot
fig = plt.figure(1, figsize=(6, 5))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Change the order of labels, so that they match
y_clr = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y_clr, cmap=plt.cm.nipy_spectral)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([]);

Now let's see how PCA will improve the results of a simple model that is not able to correctly fit all of the training data:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Train, test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3,
                                                    stratify=y,
                                                    random_state=42)

# Decision trees with depth = 2
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict_proba(X_test)
print('Accuracy: {:.5f}'.format(accuracy_score(y_test,
                                               preds.argmax(axis=1))))
Accuracy: 0.88889
Let's try this again, but, this time, let's reduce the dimensionality to 2 dimensions:
# Using PCA from sklearn PCA
pca = decomposition.PCA(n_components=2)
X_centered = X - X.mean(axis=0)
pca.fit(X_centered)
X_pca = pca.transform(X_centered)

# Plotting the results of PCA
plt.plot(X_pca[y == 0, 0], X_pca[y == 0, 1], 'bo', label='Setosa')
plt.plot(X_pca[y == 1, 0], X_pca[y == 1, 1], 'go', label='Versicolour')
plt.plot(X_pca[y == 2, 0], X_pca[y == 2, 1], 'ro', label='Virginica')
plt.legend(loc=0);

# Test-train split and apply PCA
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=.3,
                                                    stratify=y,
                                                    random_state=42)

clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict_proba(X_test)
print('Accuracy: {:.5f}'.format(accuracy_score(y_test,
                                               preds.argmax(axis=1))))
Accuracy: 0.91111
The accuracy did not increase significantly in this case, but with other datasets that have a high number of dimensions, PCA can drastically improve the accuracy of decision trees and ensemble methods.
Now let's check out the percent of variance that can be explained by each of the selected components.
for i, component in enumerate(pca.components_):
    print("{} component: {}% of initial variance".format(
        i + 1, round(100 * pca.explained_variance_ratio_[i], 2)))
    print(" + ".join("%.3f x %s" % (value, name)
                     for value, name in zip(component, iris.feature_names)))
1 component: 92.46% of initial variance
0.361 x sepal length (cm) + -0.085 x sepal width (cm) + 0.857 x petal length (cm) + 0.358 x petal width (cm)
2 component: 5.31% of initial variance
0.657 x sepal length (cm) + 0.730 x sepal width (cm) + -0.173 x petal length (cm) + -0.075 x petal width (cm)
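To make the information loss mentioned earlier concrete, we can map the 2-component projection back to the original 4-dimensional space with inverse_transform and measure the reconstruction error. This is a small additional check on the pca object fitted above, not part of the original example:

# Map the 2D projection back to 4D and compare with the centered original
X_restored = pca.inverse_transform(X_pca)
reconstruction_mse = np.mean((X_centered - X_restored) ** 2)
print('Mean squared reconstruction error: {:.4f}'.format(reconstruction_mse))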
Handwritten numbers dataset
Let's look at the handwritten numbers dataset that we used before in the 3rd lesson.
digits = datasets.load_digits() X = digits.data y = digits.target
Let's start by visualizing our data. Fetch the first 10 numbers. The numbers are represented by 8 x 8 matrices with the color intensity for each pixel. Every matrix is flattened into a vector of 64 numbers, which gives us the feature representation of the data.
# f, axes = plt.subplots(5, 2, sharey=True, figsize=(16, 6))
plt.figure(figsize=(16, 6))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X[i, :].reshape([8, 8]), cmap='gray');

Our data has 64 dimensions, but we are going to reduce it to only 2 and see that, even in just 2 dimensions, the digits clearly separate into clusters.
pca = decomposition.PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print('Projecting %d-dimensional data to 2D' % X.shape[1])

plt.figure(figsize=(12, 10))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y,
            edgecolor='none', alpha=0.7, s=40,
            cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.colorbar()
plt.title('MNIST. PCA projection');
Projecting 64-dimensional data to 2D
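As a quick sanity check (again, not part of the original example), we can ask how much of the total variance these two components actually capture, using the pca object fitted just above:

# Share of the total variance captured by the 2 retained components
print('First 2 components explain {:.1f}% of the variance'.format(
    100 * pca.explained_variance_ratio_.sum()))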
Indeed, with t-SNE, the picture looks better since PCA has a linear constraint while t-SNE does not. However, even with such a small dataset, the t-SNE algorithm takes significantly more time to complete than PCA.
%%time
from sklearn.manifold import TSNE
tsne = TSNE(random_state=17)
X_tsne = tsne.fit_transform(X)
plt.figure(figsize=(12,10))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y,
            edgecolor='none', alpha=0.7, s=40,
            cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.colorbar()
plt.title('MNIST. t-SNE projection');
CPU times: user 12.6 s, sys: 24 ms, total: 12.6 s
Wall time: 12.6 s
In practice, we would choose the number of principal components such that we can explain 90% of the initial data dispersion (via the explained_variance_ratio). Here, that means retaining 21 principal components; therefore, we reduce the dimensionality from 64 features to 21.
pca = decomposition.PCA().fit(X)
plt.figure(figsize=(10,7))
plt.plot(np.cumsum(pca.explained_variance_ratio_), color='k', lw=2)
plt.xlabel('Number of components')
plt.ylabel('Total explained variance')
plt.xlim(0, 63)
plt.yticks(np.arange(0, 1.1, 0.1))
plt.axvline(21, c='b')
plt.axhline(0.9, c='r')
plt.show();
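If you prefer not to read the threshold off the plot, the same number can be found programmatically. Below is a short sketch using the pca object fitted on the digits data above; the 0.9 threshold is the one chosen in the text:

# Smallest number of components whose cumulative explained variance reaches 90%
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_90 = np.argmax(cumulative_variance >= 0.9) + 1
print('{} components are needed to explain 90% of the variance'
      .format(n_components_90))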