29  Principal Component Analysis (PCA)

Optimal linear dimensionality reduction via variance maximization.

29.1 Definitions and Preprocessing

NoteDefinition: Data Centering

Data matrix \(\bX \in \fR^{n\times p}\) (\(n\) samples, \(p\) features) must be centered: \[ \begin{align} X_{ij} \leftarrow X_{ij} - \bar{x}_j, \quad \bar{x}_j = \frac{1}{n}\sum_{i=1}^n X_{ij}. \end{align} \] Rule: Throughout this section, \(\bX\) is assumed centered.

NoteDefinition: Sample Covariance Matrix

The matrix \(\bS = \frac{1}{n-1}\bX^T\bX \in \fR^{p\times p}\).

  • Diagonal: \(S_{jj}\) is the variance of feature \(j\).

  • Off-diagonal: \(S_{ij}\) is the covariance between features \(i\) and \(j\).

  • Property: \(\bS\) is symmetric positive semi-definite (SPSD).

29.2 PCA via SVD

NoteTheorem: Principal Directions and Scores

Let \(\bX = \bU\bsigma\bV^T\) be the SVD of the centered data matrix.

  • Principal Directions (Loadings): Columns of \(\bV\) (eigenvectors of \(\bS\)).

  • Principal Components (Scores): \(\bZ = \bX\bV = \bU\bsigma\).

  • Variances: \(\lambda_i = \text{Var}(\bz_i) = \frac{\sigma_i^2}{n-1}\).

Tip

Numerical Stability: Never form \(\bX^T\bX\) explicitly for PCA. Squaring the data matrix doubles the condition number (\(\kappa(\bS) = \kappa(\bX)^2\)). Compute PCA directly via the SVD of \(\bX\).

29.3 Variance and Dimensionality Reduction

NoteDefinition: Proportion of Variance Explained (PVE)

For the first \(k\) components: \[ \begin{align} \text{PVE}_k = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^p \sigma_i^2}. \end{align} \] Rule: Choose \(k\) to reach a target PVE (e.g., 95%) or identify the “elbow” in a scree plot.

NoteTheorem: Optimal Approximation

The rank-\(k\) matrix \(\bX_k = \sum_{i=1}^k \sigma_i \bu_i \bv_i^T\) is the optimal \(k\)-dimensional linear approximation of the data in the Frobenius norm (Eckart-Young).

29.4 Implementation Details

Tip

Standardization: If features have different units (e.g., height in meters vs. weight in kg), PCA will be dominated by large-scale features. Standardize by dividing each centered column by its standard deviation: \(X_{ij} \leftarrow (X_{ij} - \bar{x}_j)/s_j\).

NoteDefinition: Kernel PCA (High-Dimensional Case)

When \(p \gg n\), work with the \(n \times n\) Gram matrix \(\mathbf{K} = \bX\bX^T\). Principal directions are recovered via \(\bv_i = \bX^T \bu_i / \sigma_i\).

29.5 Exercises

WarningExercise
  1. Center \(\bX = \begin{pmatrix} 2 & 1 \\ -1 & 3 \\ -1 & -4 \end{pmatrix}\) and compute \(\bS\). Find the PVE for \(k=1\).

  2. Prove that PC scores \(\bz_i, \bz_j\) are uncorrelated for \(i \neq j\).

  3. Load the Iris dataset. Plot the data in the basis of the first two PCs. Do species cluster?

  4. Verify that the sum of variances \(\sum \lambda_i\) equals the total variance \(\text{Tr}(\bS)\).

The score matrix is \(\bZ = \bX\bV\). The covariance of the scores is: \[ \begin{align} \text{Cov}(\bZ) &= \frac{1}{n-1} \bZ^T \bZ \\ &= \frac{1}{n-1} (\bX\bV)^T (\bX\bV) \\ &= \bV^T \left( \frac{1}{n-1} \bX^T \bX \right) \bV. \end{align} \] Since \(\frac{1}{n-1} \bX^T \bX = \bS\) and \(\bV\) contains the eigenvectors of \(\bS\): \[ \begin{align} \text{Cov}(\bZ) &= \bV^T \bS \bV \\ &= \bV^T (\bV \boldsymbol{\Lambda} \bV^T) \bV \\ &= \boldsymbol{\Lambda}. \end{align} \] Since \(\boldsymbol{\Lambda}\) is diagonal, the covariance between \(\bz_i\) and \(\bz_j\) is zero for \(i \neq j\).