Principal Component Analysis

Many approaches exist for reducing the dimension of feature vectors while preserving model performance. The subset selection approach is useful and widely applied. However, it may not reveal underlying relationships among the features or explain why certain features work well together while others do not. Doing so requires algorithms and computational recipes for combining the most relevant features. Principal Component Analysis (PCA) is arguably one of the most popular methodologies for achieving this goal.

Singular Value Decomposition (SVD)

Singular value decomposition is the key ingredient of principal component analysis.

The SVD of the N × p matrix \mathbf{X} has the form \mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T.

  • \mathbf{U} = (\mathbf{u}_1, \mathbf{u}_2, \cdots , \mathbf{u}_N) is an N × N orthogonal matrix. \mathbf{u}_j, j = 1, \cdots , N, form an orthonormal basis for the space spanned by the column vectors of \mathbf{X}.
  • \mathbf{V} = (\mathbf{v}_1, \mathbf{v}_2, \cdots , \mathbf{v}_p) is a p × p orthogonal matrix. \mathbf{v}_j, j = 1, \cdots , p, form an orthonormal basis for the space spanned by the row vectors of \mathbf{X}.
  • \mathbf{D} is an N × p rectangular matrix whose only nonzero entries lie on the diagonal of its leading p × p submatrix, diag(d_1, d_2, \cdots , d_p), with d_1 \geq d_2 \geq \cdots \geq d_p \geq 0. These are the singular values of \mathbf{X} (here N > p).
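The decomposition above can be checked numerically. The sketch below uses NumPy's `np.linalg.svd` on a small hypothetical data matrix (the sizes N = 6, p = 3 are arbitrary, chosen only for illustration) and verifies the shapes, the ordering of the singular values, and the reconstruction \mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T:

```python
import numpy as np

# Hypothetical data matrix X with N = 6 samples and p = 3 features.
rng = np.random.default_rng(0)
N, p = 6, 3
X = rng.standard_normal((N, p))

# Full SVD: U is N x N, Vt is p x p, and s holds the p singular
# values in descending order (N > p, so there are p of them).
U, s, Vt = np.linalg.svd(X, full_matrices=True)

assert U.shape == (N, N) and Vt.shape == (p, p)
assert np.all(s[:-1] >= s[1:])      # d_1 >= d_2 >= ... >= d_p >= 0

# Rebuild the N x p rectangular D and verify X = U D V^T.
D = np.zeros((N, p))
D[:p, :p] = np.diag(s)
assert np.allclose(X, U @ D @ Vt)
```

Note that `np.linalg.svd` returns \mathbf{V}^T (here `Vt`) rather than \mathbf{V}, and returns the singular values as a vector rather than as the rectangular matrix \mathbf{D}.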

The columns of \mathbf{V} (i.e., \mathbf{v}_j, j = 1, \cdots , p) are the eigenvectors of \mathbf{X}^T\mathbf{X}. They are called the principal component directions of \mathbf{X}.

The diagonal values in \mathbf{D} (i.e., d_j, j = 1, \cdots , p) are the square roots of the eigenvalues of \mathbf{X}^T\mathbf{X}.
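Both facts follow from \mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^T\mathbf{D}\mathbf{V}^T, and they can be confirmed numerically. The sketch below compares the SVD of a hypothetical matrix against an explicit eigendecomposition of \mathbf{X}^T\mathbf{X} (using `np.linalg.eigh`, since the matrix is symmetric); eigenvectors are only determined up to sign, so the comparison allows for that:

```python
import numpy as np

# Hypothetical data matrix, used only to illustrate the relationship.
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of the symmetric matrix X^T X. eigh returns
# eigenvalues in ascending order, so reverse to match the SVD order.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# The singular values d_j are the square roots of the eigenvalues...
assert np.allclose(s**2, eigvals)

# ...and the columns v_j of V are the eigenvectors (up to sign).
for j in range(3):
    v, w = Vt[j], eigvecs[:, j]
    assert np.allclose(v, w) or np.allclose(v, -w)
```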


The Two-Dimensional Projection

The best two-dimensional projection plane can be shown to be spanned by

  1. the linear combination of the variables that have maximum sample variance,
  2. the linear combination that has maximum variance subject to being uncorrelated with the first linear combination.

This extends to the k-dimensional projection: we can take the process further, seeking additional linear combinations that maximize the variance subject to being uncorrelated with all those already selected.
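In terms of the SVD, these linear combinations are given by the principal component directions \mathbf{v}_j, and the k-dimensional projection of centered data is \mathbf{Z} = \mathbf{X}\mathbf{V}_k. The sketch below, on a hypothetical centered data matrix, projects onto the first k = 2 directions and checks the two defining properties: the projected coordinates are mutually uncorrelated, and their sample variances are in decreasing order.

```python
import numpy as np

# Hypothetical data; PCA assumes column-centered X.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
Z = X @ Vt[:k].T                # N x k scores: the k-dim projection

# The projected coordinates are uncorrelated with each other:
# Z^T Z = diag(d_1^2, ..., d_k^2), so off-diagonal entries vanish.
cov = Z.T @ Z
assert np.allclose(cov, np.diag(np.diag(cov)), atol=1e-6)

# Their sample variances decrease: the first combination has the
# maximum variance, the second the maximum variance subject to
# being uncorrelated with the first.
variances = Z.var(axis=0)
assert variances[0] >= variances[1]
```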