Principal Component Analysis
Site: Saylor Academy
Course: CS250: Python for Data Science
Book: Principal Component Analysis
Description
Many approaches exist for reducing the dimension of feature vectors while still optimizing model evaluations. The subset selection approach is very useful and regularly applied. On the other hand, this approach may not reveal underlying relationships between the features or describe why certain features work well together while others do not. To do this, it is necessary to develop algorithms and compute recipes for mixing the most relevant features. Principal Component Analysis (PCA) is arguably one of the most popular methodologies for achieving this goal.
Overview
Principal Component Analysis (PCA) is a method of dimension reduction. It is not directly related to the prediction problem, but several regression methods depend on it directly. These regression methods, principal components regression (PCR) and partial least squares (PLS), will be considered later. This section sets up the motivation for dimension reduction.
Notation
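The input matrix $X$ of dimension $N \times p$:
$$X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N,1} & x_{N,2} & \cdots & x_{N,p} \end{pmatrix}$$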
The rows of the above matrix represent the cases or observations.
The columns represent the variables observed in each unit. These represent the characteristics.
Assume that the columns of X are centered, i.e., the estimated column mean is subtracted from each column.
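The centering assumption can be sketched in a few lines of NumPy. The helper name `center_columns` and the toy data are assumptions made for illustration only; they do not come from the lesson.

```python
import numpy as np

def center_columns(X):
    """Subtract the estimated mean of each column from that column."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)

# Hypothetical toy data: N = 4 observations (rows), p = 2 variables (columns).
X_raw = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [6.0, 40.0]])
X = center_columns(X_raw)
print(X.mean(axis=0))  # each column mean is now numerically zero
```

Every later step (SVD, covariance, projections) is applied to the centered matrix, not to the raw data.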
Objectives
- Introducing Principal Component Analysis.
- The precursor to Regression Techniques after Dimension Reduction.
Source: The Pennsylvania State University, https://online.stat.psu.edu/stat508/lesson/6 This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 License.
Singular Value Decomposition (SVD)
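Singular value decomposition is the key part of principal components analysis. The SVD of the $N \times p$ matrix $X$ has the form
$$X = U D V^T.$$
$U = (u_1, u_2, \ldots, u_N)$ is an $N \times N$ orthogonal matrix. When $X$ has full column rank, its first $p$ columns $u_j$, $j = 1, \ldots, p$, form an orthonormal basis for the space spanned by the column vectors of $X$.
$V = (v_1, v_2, \ldots, v_p)$ is a $p \times p$ orthogonal matrix. Under the same rank condition, its columns $v_j$, $j = 1, \ldots, p$, form an orthonormal basis for the space spanned by the row vectors of $X$.
$D$ is an $N \times p$ rectangular matrix with nonzero elements along the diagonal of its first $p \times p$ submatrix, $\mathrm{diag}(d_1, d_2, \ldots, d_p)$, where $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$ are the singular values of $X$, with $N > p$.
The columns of $V$ (i.e., $v_j$, $j = 1, \ldots, p$) are the eigenvectors of $X^T X$. They are called the principal component directions of $X$.
The diagonal values in $D$ (i.e., $d_j$, $j = 1, \ldots, p$) are the square roots of the eigenvalues of $X^T X$.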
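As a minimal sketch, the shapes and orthogonality properties of the decomposition $X = U D V^T$ can be checked with `numpy.linalg.svd`. The random data, the seed, and the sizes $N = 8$, $p = 3$ are arbitrary assumptions chosen only to have something to decompose.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, purely illustrative data
N, p = 8, 3
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)              # center the columns

# full_matrices=True returns U as N x N and V^T as p x p.
U, d, Vt = np.linalg.svd(X, full_matrices=True)
V = Vt.T

# Build the N x p rectangular matrix D with the singular values on its leading diagonal.
D = np.zeros((N, p))
D[:p, :p] = np.diag(d)

print(np.allclose(X, U @ D @ Vt))         # X = U D V^T
print(np.allclose(U.T @ U, np.eye(N)))    # U is orthogonal
print(np.allclose(V.T @ V, np.eye(p)))    # V is orthogonal
print(np.all(np.diff(d) <= 0))            # d_1 >= d_2 >= ... >= d_p
```

NumPy returns the singular values as a one-dimensional array `d` rather than as the matrix $D$, which is why $D$ is rebuilt explicitly here.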
The Two-Dimensional Projection
The two-dimensional plane can be shown to be spanned by
- the linear combination of the variables that has maximum sample variance,
- the linear combination that has maximum variance subject to being uncorrelated with the first linear combination.
It can be extended to the k-dimensional projection. We can take the process further, seeking additional linear combinations that maximize the variance subject to being uncorrelated with all those already selected.
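The two properties of the projection plane can be illustrated numerically. The sketch below assumes the two linear combinations are obtained from the first two right singular vectors of the centered data (as the later sections describe); the correlated data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)                  # illustrative data only
# Hypothetical correlated data: 200 observations of 4 variables.
A = rng.normal(size=(4, 4))
X = rng.normal(size=(200, 4)) @ A
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:2].T                                # N x 2 scores on the first two directions

# Sample covariance of the two linear combinations (divide by N; columns are centered).
C = Z.T @ Z / X.shape[0]
print(C)   # off-diagonal entries are numerically zero, and C[0, 0] >= C[1, 1]
```

Replacing `Vt[:2]` with `Vt[:k]` gives the k-dimensional extension: the covariance matrix stays diagonal and its diagonal entries decrease.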
Principal Components
Principal components analysis is one of the most common methods used for linear dimension reduction. The motivation behind dimension reduction is that an analysis becomes unwieldy with a large number of variables, while the additional variables may not add any new information. An analogy may be drawn with variance inflation factors (VIF) in multiple regression. If the VIF corresponding to a predictor is large, that predictor is typically left out of the model, as it contributes little new information. In addition, because of linear dependence among predictors, the regression matrix may become singular. In a multivariate situation, it may well happen that a few (or a large number of) variables have high interdependence. Linear combinations of the variables are then considered that are orthogonal to one another while preserving as much of the total variability within the sample as possible.
Suppose the data is 10-dimensional but needs to be reduced to 2-dimensional. The idea of principal component analysis is to use two directions that capture the variation in the data as much as possible.
[Keep in mind that the dimensions to which data needs to be reduced are usually not pre-fixed. After taking a look at the total proportion of variability captured, the reduction in dimension is determined. Here reduction of 10-dimensional space to 2-dimension space is for illustration only]
The sample covariance matrix of $X$ is given as:
$$S = \frac{X^T X}{N}.$$
If you do the eigen decomposition of $X^T X$:
$$X^T X = (U D V^T)^T (U D V^T) = V D^T U^T U D V^T = V D^2 V^T.$$
It turns out that if you have done the singular value decomposition, then you already have the eigenvalue decomposition for $X^T X$.
Here $D^2 = D^T D$ is the $p \times p$ diagonal matrix whose entries are the diagonal elements of $D$ squared, $d_1^2, \ldots, d_p^2$.
Also, we should point out that we can show using linear algebra that $X^T X$ is a positive semi-definite matrix. This means that all of the eigenvalues are guaranteed to be non-negative. The eigenvalues are in the matrix $D^2$. Since these values are squared, every diagonal element is non-negative.
The eigenvectors of $X^T X$, $v_j$, can be obtained either by doing an eigen decomposition of $X^T X$ or by doing a singular value decomposition of $X$. These $v_j$ are called principal component directions of $X$. If you project $X$ onto the principal component directions, you get the principal components.
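The two routes to the principal component directions can be compared directly. This is a sketch on made-up data with well-separated column scales (an assumption that keeps the eigenvalues distinct, so the eigenvectors are unique up to sign).

```python
import numpy as np

rng = np.random.default_rng(2)                   # illustrative data only
X = rng.normal(size=(50, 3)) @ np.diag([3.0, 1.5, 0.5])
X = X - X.mean(axis=0)

# Route 1: singular value decomposition of X.
_, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Route 2: eigen decomposition of the symmetric matrix X^T X.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]                # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(np.allclose(X.T @ X, V @ np.diag(d**2) @ V.T))  # X^T X = V D^2 V^T
print(np.allclose(eigvals, d**2))                     # eigenvalues are squared singular values
print(np.all(eigvals >= 0))                           # non-negative, as expected

# Eigenvectors are defined only up to sign, so compare |v_j . e_j| with 1.
alignment = np.abs(np.sum(V * eigvecs, axis=0))
print(np.allclose(alignment, 1.0))
```

In practice the SVD route is usually preferred because it avoids forming $X^T X$ explicitly, which can lose numerical precision.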
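- It's easy to see that $z_j = X v_j = u_j d_j$. Hence $u_j$ is simply the projection of the row vectors of $X$, i.e., the input predictor vectors, on the direction $v_j$, rescaled by $d_j$ (that is, $u_j = X v_j / d_j$ when $d_j > 0$). For example, writing $v_{k,1}$ for the $k$-th entry of $v_1$:
$$z_1 = X v_1 = \begin{pmatrix} x_{1,1} v_{1,1} + x_{1,2} v_{2,1} + \cdots + x_{1,p} v_{p,1} \\ x_{2,1} v_{1,1} + x_{2,2} v_{2,1} + \cdots + x_{2,p} v_{p,1} \\ \vdots \\ x_{N,1} v_{1,1} + x_{N,2} v_{2,1} + \cdots + x_{N,p} v_{p,1} \end{pmatrix}$$
- The principal components of $X$ are $z_j = d_j u_j$, $j = 1, \ldots, p$.
- The first principal component of $X$, $z_1$, has the largest sample variance amongst all normalized linear combinations of the columns of $X$.
- Subsequent principal components $z_j$ have maximum variance $d_j^2 / N$, subject to being orthogonal to the earlier ones.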
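The identities in this list, $z_j = X v_j = d_j u_j$ and $\mathrm{Var}(z_j) = d_j^2 / N$, can be verified on any centered matrix. The data below are random and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(3)                     # illustrative data only
N, p = 100, 4
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: U is N x p here
Z = X @ Vt.T                                       # column j is z_j = X v_j

print(np.allclose(Z, U * d))                       # z_j = d_j u_j (d scales each column of U)
print(np.allclose(Z.var(axis=0), d**2 / N))        # Var(z_j) = d_j^2 / N
print(np.allclose(Z.T @ Z, np.diag(d**2), atol=1e-6))  # the z_j are mutually orthogonal
```

`Z.var(axis=0)` divides by $N$ (NumPy's default `ddof=0`), which matches the covariance convention $S = X^T X / N$ used in the text.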
Why are we interested in this? Consider an extreme case (lower right) where your data all lie in one direction. Although two features represent the data, we can reduce the dimension of the dataset to one using a single linear combination of the features (as given by the first principal component).
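A small numerical version of this extreme case, assuming two hypothetical features that lie exactly on the line $x_2 = 2 x_1$:

```python
import numpy as np

# Hypothetical extreme case: the second feature is exactly twice the first.
x1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
X = np.column_stack([x1, 2.0 * x1])
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(d)                        # the second singular value is numerically zero

z1 = X @ Vt[0]                  # a single linear combination of the two features
X_back = np.outer(z1, Vt[0])    # map the one-dimensional scores back to two features
print(np.allclose(X, X_back))   # the single component reproduces the data
```

Because the second singular value vanishes, all of the variance sits in the first principal component and nothing is lost by dropping the second.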
Just to summarize the way you do dimension reduction: First, you have $X$, and you remove the column means. On $X$, you do a singular value decomposition and obtain matrices $U$, $D$, and $V$. Then you call every column vector in $V$ $v_j$, where $j = 1, \ldots, p$. This $v_j$ is called the direction of the principal component. Let's say your original dimension is 10, and we wish to reduce this to 2 dimensions. What would we do? We would take $v_1$ and $v_2$ and project $X$ onto $v_1$ and onto $v_2$. To project, multiply $X$ by $v_1$ to get a column vector of size $N$. $X$ times $v_2$ gives you another column vector of size $N$. These two column vectors are the reduced representation, so you have the $N \times 2$ matrix $(X v_1, X v_2) = (z_1, z_2)$.
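The recipe in this summary can be written as one short function. The name `reduce_dimension`, the sample size of 30, and the random data are assumptions for illustration; the steps themselves (center, SVD, multiply by the first $k$ columns of $V$) follow the paragraph above.

```python
import numpy as np

def reduce_dimension(X, k):
    """Center X, then project it onto its first k principal component directions."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                          # remove the column means
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                                   # p x k matrix (v_1, ..., v_k)
    return Xc @ V_k                                  # N x k matrix (X v_1, ..., X v_k)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 10))        # hypothetical 10-dimensional data, N = 30
Z = reduce_dimension(X, k=2)
print(Z.shape)                       # (30, 2)
```

Each row of `Z` is the two-dimensional representation of the corresponding observation in `X`.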
Principal Components Analysis (PCA)
Objective
Capture the intrinsic variability in the data.
Reduce the dimensionality of a data set, either to ease interpretation or as a way to avoid overfitting and to prepare for subsequent analysis.
The sample covariance matrix of $X$ is $S = X^T X / N$, since $X$ has zero mean.
The eigenvectors of $X^T X$ (i.e., $v_j$, $j = 1, \ldots, p$) are called principal component directions of $X$.
The first principal component direction $v_1$ has the following properties:
- $v_1$ is the eigenvector associated with the largest eigenvalue, $d_1^2$, of $X^T X$.
- $z_1 = X v_1$ has the largest sample variance amongst all normalized linear combinations of the columns of $X$.
- $z_1$ is called the first principal component of $X$. And, we have $\mathrm{Var}(z_1) = d_1^2 / N$.
The second principal component direction $v_2$ (the direction orthogonal to the first component that has the largest projected variance) is the eigenvector corresponding to the second largest eigenvalue, $d_2^2$, of $X^T X$, and so on. (The eigenvector for the $k$-th largest eigenvalue corresponds to the $k$-th principal component direction $v_k$.)
The $k$-th principal component of $X$, $z_k$, has maximum variance $d_k^2 / N$, subject to being orthogonal to the earlier ones.
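The maximum-variance property of the first principal component can be probed empirically, although a random search is only a sanity check and not a proof. The sketch below compares the variance along $v_1$ with the variance along 1000 random unit-length coefficient vectors; the data and the number of trial vectors are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)                # illustrative data only
N, p = 200, 5
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)

_, d, Vt = np.linalg.svd(X, full_matrices=False)
var_first = np.var(X @ Vt[0])                 # sample variance of z_1 = X v_1

# Hypothetical comparison set: 1000 random normalized linear combinations.
W = rng.normal(size=(p, 1000))
W /= np.linalg.norm(W, axis=0)                # scale each column to unit length
var_random = np.var(X @ W, axis=0)

print(np.isclose(var_first, d[0]**2 / N))     # Var(z_1) = d_1^2 / N
print(var_random.max() <= var_first)          # no random direction beats v_1
```

The normalization step matters: without the unit-length constraint the variance of a linear combination could be made arbitrarily large just by scaling the coefficients.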
Geometric Interpretation
Principal components analysis (PCA) projects the data along the directions where the data varies the most.
The first direction is decided by $v_1$, corresponding to the largest eigenvalue $d_1^2$.
The second direction is decided by $v_2$, corresponding to the second largest eigenvalue $d_2^2$.
The variance of the data along the principal component directions is associated with the magnitude of the eigenvalues.
Choice of How Many Components to Extract
Scree Plot – This is a useful visual aid that shows the amount of variance explained by each consecutive eigenvalue.
The choice of how many components to extract is fairly arbitrary.
When conducting principal components analysis prior to further analyses, it is risky to choose too small a number of components, which may fail to explain enough of the variability in the data.
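One common way to make the choice less arbitrary is to look at the proportion of total variance each component explains, which is also what a scree plot displays. The sketch below computes those proportions and the smallest number of components reaching a cumulative threshold. The function names, the simulated data, and the 0.90 cutoff are all assumptions for illustration, not recommendations from the lesson.

```python
import numpy as np

def explained_variance_ratio(X):
    """Proportion of total variance captured by each principal component."""
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    d = np.linalg.svd(Xc, compute_uv=False)    # singular values only, descending
    eigvals = d**2                             # eigenvalues of X^T X
    return eigvals / eigvals.sum()

def components_needed(ratios, threshold):
    """Smallest k whose cumulative proportion reaches the threshold."""
    k = int(np.searchsorted(np.cumsum(ratios), threshold)) + 1
    return min(k, len(ratios))                 # guard against rounding at threshold = 1

rng = np.random.default_rng(6)
# Hypothetical data: 2 strong latent directions spread over 6 variables, plus small noise.
latent = rng.normal(size=(300, 2)) * [5.0, 3.0]
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))

ratios = explained_variance_ratio(X)
print(np.round(ratios, 3))                     # values to plot against 1..p in a scree plot
print(components_needed(ratios, 0.90))         # 0.90 is an arbitrary illustrative cutoff
```

Plotting `ratios` against the component index gives the scree plot; a sharp drop followed by a flat tail suggests that the components after the drop add little.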