Principal Component Analysis

Many approaches exist for reducing the dimensionality of feature vectors while preserving model performance. Subset selection is useful and widely applied, but it may not reveal underlying relationships among the features or explain why certain features work well together while others do not. Doing so requires algorithms that construct new features by combining the most relevant original ones. Principal Component Analysis (PCA) is arguably the most popular methodology for achieving this goal.

Overview

Principal Component Analysis (PCA) is a method of dimension reduction. It is not directly related to the prediction problem, but several regression methods depend on it. Two such methods, principal component regression (PCR) and partial least squares (PLS), will be considered later. This lesson sets up the motivation for dimension reduction.
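To make the idea concrete, here is a minimal sketch of dimension reduction with PCA via the singular value decomposition, using synthetic data (the matrix sizes and variable names are illustrative, not from the lesson):

```python
import numpy as np

# Synthetic data: N = 100 observations, p = 5 variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))

# Center the columns, as PCA assumes.
Xc = X - X.mean(axis=0)

# SVD of the centered matrix: Xc = U diag(s) Vt.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first k principal directions (rows of Vt)
# to get a k-dimensional representation of each observation.
k = 2
Z = Xc @ Vt[:k].T  # scores, shape (100, 2)
```

The scores `Z` equal `U[:, :k] * s[:k]`, since the rows of `Vt` are orthonormal; the projection keeps the directions of greatest variance.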


Notation

The input matrix X of dimension N \times p:

X = \begin{pmatrix}
                    x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\
                    x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\
                    \vdots & \vdots & \ddots & \vdots \\
                    x_{N,1} & x_{N,2} & \cdots & x_{N,p}
                    \end{pmatrix}

The rows of the above matrix represent the cases or observations.

The columns represent the variables measured on each observation; these are the features or characteristics.

Assume that the columns of X are centered, i.e., the estimated column mean is subtracted from each column.
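The centering assumption can be checked with a short example (the data values below are made up for illustration):

```python
import numpy as np

# A small 3 x 2 input matrix.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Subtract the estimated column mean from each column.
Xc = X - X.mean(axis=0)

# Each column of the centered matrix now has mean zero.
```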


Objectives

Upon successful completion of this lesson, you should be able to:

  • Introduce Principal Component Analysis.
  • Explain its role as a precursor to regression techniques after dimension reduction.

Source: The Pennsylvania State University, https://online.stat.psu.edu/stat508/lesson/6
Creative Commons License This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 License.