29 Principal Component Analysis (PCA)

Optimal linear dimensionality reduction via variance maximization.

29.1 Definitions and Preprocessing

Definition: Data Centering

Data matrix \(\bX \in \fR^{n\times p}\) (\(n\) samples, \(p\) features) must be centered: \[ \begin{align} X_{ij} \leftarrow X_{ij} - \bar{x}_j, \quad \bar{x}_j = \frac{1}{n}\sum_{i=1}^n X_{ij}. \end{align} \] Throughout this section, \(\bX\) is assumed centered.
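The centering step can be sketched in NumPy as follows. This is a minimal illustration, not a prescribed implementation: the helper name `center_columns` and the small data set `X_raw` are assumptions made up for this example.

```python
import numpy as np

def center_columns(X):
    """Subtract each column's mean so that every feature has mean zero."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)

# Hypothetical data: 4 samples (rows), 2 features (columns).
X_raw = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [6.0, 40.0]])

X_centered = center_columns(X_raw)

# Column means after centering; zero up to floating-point round-off.
col_means = X_centered.mean(axis=0)
```

Here the column means of `X_raw` are 3 and 25, so the first centered row is `[-2, -15]`.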

Definition: Sample Covariance Matrix

The matrix \(\bS = \frac{1}{n-1}\bX^T\bX \in \fR^{p\times p}\).

Diagonal: \(S_{jj}\) is the variance of feature \(j\).
Off-diagonal: \(S_{ij}\) is the covariance between features \(i\) and \(j\).
Property: \(\bS\) is symmetric positive semidefinite (SPSD).
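The three properties above can be checked numerically. The sketch below assumes random Gaussian data purely for illustration; the helper name `sample_covariance` is hypothetical, and `np.cov` with `rowvar=False` is used only as an independent reference.

```python
import numpy as np

def sample_covariance(X_centered):
    """S = X^T X / (n - 1) for an already centered data matrix."""
    n = X_centered.shape[0]
    return X_centered.T @ X_centered / (n - 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))      # illustrative data: n = 50, p = 3
X = X - X.mean(axis=0)            # center the columns
S = sample_covariance(X)

# Agreement with NumPy's covariance (rows are samples, hence rowvar=False).
matches_numpy = np.allclose(S, np.cov(X, rowvar=False))

# Diagonal entries are the sample variances (ddof=1 gives the 1/(n-1) factor).
diag_is_variance = np.allclose(np.diag(S), X.var(axis=0, ddof=1))

# Symmetry, and nonnegative eigenvalues up to round-off (positive semidefinite).
is_symmetric = np.allclose(S, S.T)
min_eigenvalue = np.linalg.eigvalsh(S).min()
```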

Exercise

Show that centering makes each column of \(\bX\) have mean zero.
Prove that the diagonal entries of \(\bS\) are sample variances.
Prove that \(\bS\) is symmetric positive semidefinite by computing \(\bz^T\bS\bz\).

29.2 PCA via SVD

Theorem: Principal Directions and Scores

Let \(\bX = \bU\bsigma\bV^T\) be the SVD of the centered data matrix.

Principal Directions (Loadings): Columns of \(\bV\) (eigenvectors of \(\bS\)).
Principal Components (Scores): \(\bZ = \bX\bV = \bU\bsigma\).
Variances: \(\lambda_i = \text{Var}(\bz_i) = \frac{\sigma_i^2}{n-1}\).

Proof

Since \(\bX=\bU\bsigma\bV^T\), \[ \begin{align} \bS = \frac{1}{n-1}\bX^T\bX = \bV\left(\frac{\bsigma^2}{n-1}\right)\bV^T. \end{align} \] Thus the columns of \(\bV\) diagonalize the covariance matrix. The scores satisfy \(\bZ=\bX\bV=\bU\bsigma\), so their sample covariance is diagonal with entries \(\sigma_i^2/(n-1)\).
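The theorem translates almost line by line into code. The following is a minimal sketch under stated assumptions: the function name `pca_svd` and the random test data are invented for illustration, and the thin SVD (`full_matrices=False`) is one reasonable choice rather than the only one.

```python
import numpy as np

def pca_svd(X):
    """Return principal directions V, scores Z, and variances lambda."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center the columns
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    n = Xc.shape[0]
    directions = Vt.T                             # columns of V
    scores = U * s                                # Z = U Sigma (equals Xc @ V)
    variances = s**2 / (n - 1)                    # lambda_i = sigma_i^2 / (n - 1)
    return directions, scores, variances

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated features
V, Z, lam = pca_svd(X)

# Reference quantities for checking the theorem's three claims.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (Xc.shape[0] - 1)
```

With these definitions, `S @ V` should match `V * lam` (the columns of \(\bV\) are eigenvectors of \(\bS\)), and the sample covariance of `Z` should be diagonal with entries `lam`.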

Remark

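(PCA stability) Never form \(\bX^T\bX\) explicitly for PCA. Forming the product squares the condition number (\(\kappa(\bS) = \kappa(\bX)^2\)), so small singular values lose roughly twice as many significant digits as they would in a direct computation. Compute PCA directly via the SVD of \(\bX\).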
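The identity \(\kappa(\bS) = \kappa(\bX)^2\) can be observed on a synthetic matrix with prescribed singular values. This is a sketch under assumptions: the matrix below is built for illustration only (it is not centered and merely stands in for \(\bX\)), and the singular values `1, 1e-2, 1e-4` are arbitrary choices. The scalar factor \(1/(n-1)\) does not change the condition number, so \(\bX^T\bX\) is used in place of \(\bS\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3

# Build X = U diag(sigma) V^T with orthonormal U, V and known singular values.
U, _ = np.linalg.qr(rng.normal(size=(n, p)))
V, _ = np.linalg.qr(rng.normal(size=(p, p)))
sigma = np.array([1.0, 1e-2, 1e-4])          # assumed values, kappa(X) = 1e4
X = (U * sigma) @ V.T

kappa_X = np.linalg.cond(X)                   # about 1e4
kappa_gram = np.linalg.cond(X.T @ X)          # about 1e8 = kappa_X**2

# Smallest singular value computed two ways.
sigma_svd = np.linalg.svd(X, compute_uv=False)
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]   # descending order
sigma_eig = np.sqrt(np.clip(eigvals, 0.0, None))

rel_err_svd = abs(sigma_svd[-1] - sigma[-1]) / sigma[-1]
rel_err_eig = abs(sigma_eig[-1] - sigma[-1]) / sigma[-1]
```

Comparing `rel_err_svd` with `rel_err_eig` shows how much accuracy in the smallest singular value is lost by going through \(\bX^T\bX\); the gap widens as \(\kappa(\bX)\) grows.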

Exercise

Starting from \(\bX=\bU\bsigma\bV^T\), compute \(\bS=(n-1)^{-1}\bX^T\bX\).
Show that the columns of \(\bV\) are eigenvectors of \(\bS\).
Show that the corresponding eigenvalues are \(\lambda_i=\sigma_i^2/(n-1)\).
Compute the score matrix \(\bZ=\bX\bV\) and prove that \(\operatorname{Cov}(\bZ)\) is diagonal.
Use the eigenvalue relation \(\lambda_i=\sigma_i^2/(n-1)\) to show that \(\kappa(\bS)=\kappa(\bX)^2\), and explain the stability warning in the remark above.

29.3 Variance and Dimensionality Reduction

Definition: Proportion of Variance Explained (PVE)

For the first \(k\) components: \[ \begin{align} \text{PVE}_k = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^p \sigma_i^2}. \end{align} \]
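As a small worked example of the formula, the sketch below computes the full PVE curve from a vector of singular values. The helper name `pve`, the singular values `3, 2, 1`, and the 90% threshold are all assumptions chosen for illustration.

```python
import numpy as np

def pve(singular_values):
    """Cumulative proportion of variance explained for k = 1, ..., len(s)."""
    s2 = np.asarray(singular_values, dtype=float) ** 2
    return np.cumsum(s2) / s2.sum()

sigma = np.array([3.0, 2.0, 1.0])     # hypothetical singular values
pve_curve = pve(sigma)                # [9/14, 13/14, 14/14]

# Smallest k whose PVE reaches an (arbitrarily chosen) 90% threshold.
k_90 = int(np.searchsorted(pve_curve, 0.90) + 1)
```

With these values, one component explains \(9/14 \approx 64\%\) of the variance and two components explain \(13/14 \approx 93\%\), so `k_90` is 2.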

Theorem: Optimal Approximation

The rank-\(k\) matrix \[ \begin{align} \bX_k = \sum_{i=1}^k \sigma_i \bu_i \bv_i^T \end{align} \] is the optimal \(k\)-dimensional linear approximation of the data in the Frobenius norm, by the Eckart-Young theorem.
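The truncated sum can be formed directly from the thin SVD. The sketch below is illustrative only: the function name `rank_k_approximation` and the random data are assumptions, and the comparison against a projection onto a random \(k\)-dimensional subspace is just one sanity check of optimality, not a proof.

```python
import numpy as np

def rank_k_approximation(Xc, k):
    """Return X_k = sum_{i<=k} sigma_i u_i v_i^T and all singular values."""
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return X_k, s

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
Xc = X - X.mean(axis=0)               # centered data matrix
k = 2

X_k, s = rank_k_approximation(Xc, k)
err_sq = np.linalg.norm(Xc - X_k, "fro") ** 2
tail_sq = np.sum(s[k:] ** 2)          # sum of discarded sigma_i^2

# A competing rank-k candidate: project onto a random k-dimensional subspace.
Q, _ = np.linalg.qr(rng.normal(size=(5, k)))
X_rand = Xc @ Q @ Q.T
err_rand_sq = np.linalg.norm(Xc - X_rand, "fro") ** 2
```

The squared error `err_sq` should equal `tail_sq`, and it should not exceed `err_rand_sq`.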

Exercise

Use the Eckart-Young theorem to prove the Optimal Approximation theorem.
Show that \(\|\bX-\bX_k\|_F^2=\sum_{i>k}\sigma_i^2\).
Show that \(\text{PVE}_k=1-\|\bX-\bX_k\|_F^2/\|\bX\|_F^2\).
Explain why a PVE threshold is a modeling choice, not a theorem.

29.4 Implementation Details

Remark

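(Standardization) If features have different units, PCA will be dominated by the features with the largest numerical variance, because variance depends on the unit of measurement. Standardize by dividing each centered column by its sample standard deviation \(s_j = \sqrt{S_{jj}}\): \(X_{ij} \leftarrow (X_{ij} - \bar{x}_j)/s_j\). PCA on the standardized data is PCA on the sample correlation matrix.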
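The effect of units can be seen in a two-feature example. Everything in this sketch is hypothetical: the features (a height in meters and a mass in grams), their distributions, and the helper names `standardize` and `first_direction` are assumptions for illustration.

```python
import numpy as np

def standardize(X):
    """Center each column and divide by its sample standard deviation."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    return Xc / X.std(axis=0, ddof=1)   # assumes no constant columns

def first_direction(Xc):
    """First principal direction (first right singular vector)."""
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(4)
height_m = rng.normal(1.7, 0.1, size=200)          # spread of about 0.1
mass_g = rng.normal(70000.0, 10000.0, size=200)    # spread of about 1e4
X = np.column_stack([height_m, mass_g])

v_raw = first_direction(X - X.mean(axis=0))   # almost entirely the mass axis
v_std = first_direction(standardize(X))       # both features weighted equally
X_std = standardize(X)
```

Without standardization the first direction is essentially the mass axis. With exactly two standardized features, the correlation matrix has eigenvectors \((1, \pm 1)/\sqrt{2}\), so both features receive equal weight. A constant column has \(s_j = 0\) and would need to be removed before calling `standardize`.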

Remark

(Interpretation) Principal components are variance directions, not causal factors. A large loading identifies a direction of variation in the data matrix; it does not by itself explain why that variation occurs.

Definition: Kernel PCA (High-Dimensional Case)

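When \(p \gg n\), avoid the \(p \times p\) covariance matrix and work with the \(n \times n\) Gram matrix \(\mathbf{K} = \bX\bX^T\), which is the kernel matrix of the linear kernel. Its unit eigenvectors \(\bu_i\) are the left singular vectors of \(\bX\), with eigenvalues \(\sigma_i^2\). For each \(\sigma_i > 0\), the principal direction is recovered via \(\bv_i = \bX^T \bu_i / \sigma_i\).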
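The recovery formula can be sketched for a case with far more features than samples. The sizes `n = 10`, `p = 500` and the random data are assumptions for illustration; the example keeps only the first \(n-1\) components because centering makes the rank of \(\bX\) at most \(n-1\).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 10, 500                         # illustrative sizes with p >> n
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                # centered data matrix

K = Xc @ Xc.T                          # n x n Gram matrix
evals, evecs = np.linalg.eigh(K)       # eigenvalues in ascending order
order = np.argsort(evals)[::-1]        # reorder to descending
evals, evecs = evals[order], evecs[:, order]

r = n - 1                              # centering leaves rank at most n - 1
sigma = np.sqrt(evals[:r])             # sigma_i = sqrt(eigenvalue of K)
U = evecs[:, :r]                       # left singular vectors u_i
V = Xc.T @ U / sigma                   # columns v_i = X^T u_i / sigma_i

# Reference singular values from a direct SVD of the centered matrix.
sigma_ref = np.linalg.svd(Xc, compute_uv=False)[:r]
```

The recovered columns of `V` should be orthonormal and satisfy \(\bX^T\bX\,\bv_i = \sigma_i^2 \bv_i\), without ever forming the \(500 \times 500\) matrix \(\bX^T\bX\).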

29.5 Exercises

Exercise

Center \(\bX = \begin{pmatrix} 2 & 1 \\ -1 & 3 \\ -1 & -4 \end{pmatrix}\) and compute \(\bS\). Find the PVE for \(k=1\).
Prove that PC scores \(\bz_i, \bz_j\) are uncorrelated for \(i \neq j\) using the Principal Directions and Scores theorem.
Load the Iris dataset. Plot the data in the basis of the first two PCs. Do species cluster?
Verify that the sum of variances \(\sum \lambda_i\) equals the total variance \(\text{Tr}(\bS)\).