29 Principal Component Analysis (PCA)
Optimal linear dimensionality reduction via variance maximization.
29.1 Definitions and Preprocessing
Data matrix \(\bX \in \fR^{n\times p}\) (\(n\) samples, \(p\) features) must be centered: \[ \begin{align} X_{ij} \leftarrow X_{ij} - \bar{x}_j, \quad \bar{x}_j = \frac{1}{n}\sum_{i=1}^n X_{ij}. \end{align} \] Rule: Throughout this section, \(\bX\) is assumed centered.
The matrix \(\bS = \frac{1}{n-1}\bX^T\bX \in \fR^{p\times p}\).
Diagonal: \(S_{jj}\) is the variance of feature \(j\).
Off-diagonal: \(S_{ij}\) is the covariance between features \(i\) and \(j\).
Property: \(\bS\) is symmetric positive semi-definite (SPSD).
29.2 PCA via SVD
Let \(\bX = \bU\bsigma\bV^T\) be the SVD of the centered data matrix.
Principal Directions (Loadings): Columns of \(\bV\) (eigenvectors of \(\bS\)).
Principal Components (Scores): \(\bZ = \bX\bV = \bU\bsigma\).
Variances: \(\lambda_i = \text{Var}(\bz_i) = \frac{\sigma_i^2}{n-1}\).
Numerical Stability: Never form \(\bX^T\bX\) explicitly for PCA. Squaring the data matrix doubles the condition number (\(\kappa(\bS) = \kappa(\bX)^2\)). Compute PCA directly via the SVD of \(\bX\).
29.3 Variance and Dimensionality Reduction
For the first \(k\) components: \[ \begin{align} \text{PVE}_k = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^p \sigma_i^2}. \end{align} \] Rule: Choose \(k\) to reach a target PVE (e.g., 95%) or identify the “elbow” in a scree plot.
The rank-\(k\) matrix \(\bX_k = \sum_{i=1}^k \sigma_i \bu_i \bv_i^T\) is the optimal \(k\)-dimensional linear approximation of the data in the Frobenius norm (Eckart-Young).
29.4 Implementation Details
Standardization: If features have different units (e.g., height in meters vs. weight in kg), PCA will be dominated by large-scale features. Standardize by dividing each centered column by its standard deviation: \(X_{ij} \leftarrow (X_{ij} - \bar{x}_j)/s_j\).
When \(p \gg n\), work with the \(n \times n\) Gram matrix \(\mathbf{K} = \bX\bX^T\). Principal directions are recovered via \(\bv_i = \bX^T \bu_i / \sigma_i\).
29.5 Exercises
Center \(\bX = \begin{pmatrix} 2 & 1 \\ -1 & 3 \\ -1 & -4 \end{pmatrix}\) and compute \(\bS\). Find the PVE for \(k=1\).
Prove that PC scores \(\bz_i, \bz_j\) are uncorrelated for \(i \neq j\).
Load the Iris dataset. Plot the data in the basis of the first two PCs. Do species cluster?
Verify that the sum of variances \(\sum \lambda_i\) equals the total variance \(\text{Tr}(\bS)\).
The score matrix is \(\bZ = \bX\bV\). The covariance of the scores is: \[ \begin{align} \text{Cov}(\bZ) &= \frac{1}{n-1} \bZ^T \bZ \\ &= \frac{1}{n-1} (\bX\bV)^T (\bX\bV) \\ &= \bV^T \left( \frac{1}{n-1} \bX^T \bX \right) \bV. \end{align} \] Since \(\frac{1}{n-1} \bX^T \bX = \bS\) and \(\bV\) contains the eigenvectors of \(\bS\): \[ \begin{align} \text{Cov}(\bZ) &= \bV^T \bS \bV \\ &= \bV^T (\bV \boldsymbol{\Lambda} \bV^T) \bV \\ &= \boldsymbol{\Lambda}. \end{align} \] Since \(\boldsymbol{\Lambda}\) is diagonal, the covariance between \(\bz_i\) and \(\bz_j\) is zero for \(i \neq j\).