23 Conjugate Gradient Method

The Conjugate Gradient (CG) method is a standard iterative solver for large, sparse, symmetric positive definite (SPD) linear systems. It belongs to the class of Krylov subspace methods, which seek an approximate solution within a space spanned by powers of the matrix applied to the initial residual.

Remark

(Assumption) CG requires \(\bA\) to be symmetric positive definite. If \(\bA\) is nonsymmetric, use GMRES or another nonsymmetric Krylov method. If \(\bA\) is symmetric indefinite, use methods designed for indefinite systems, such as MINRES.

For SPD \(\bA\), solving \(\bA\bx=\bb\) is equivalent to minimizing a convex quadratic energy. CG chooses search directions that do not interfere with previous progress, so each step is optimal over a growing Krylov subspace.

23.1 Krylov Subspaces

Definition: Krylov Subspace

The \(k\)-th Krylov subspace \(\mathcal{K}_k(\bA, \bb)\) is defined as the span of the first \(k\) vectors in the sequence \(\{\bb, \bA\bb, \bA^2\bb, ... \}\): \[ \begin{align} \mathcal{K}_k(\bA, \bb) = \text{span}\{ \bb, \bA\bb, \bA^2\bb, ..., \bA^{k-1}\bb \}. \end{align} \] Krylov methods are efficient because they only require the ability to compute matrix-vector products \(\bA\bv\), making them ideal for sparse matrices where \(\bA\) is never explicitly stored.

23.2 CG as a Projection Method

For SPD \(\bA\), the quadratic \[ \begin{align} \phi(\bx)=\frac{1}{2}\bx^T\bA\bx-\bb^T\bx \end{align} \] has gradient \(\nabla\phi(\bx)=\bA\bx-\bb\). Therefore \(\nabla\phi(\bx)=0\) exactly when \(\bA\bx=\bb\). Solving the linear system is the same as finding the unique minimizer of \(\phi\).

Remark

Error Minimization Properties: For an SPD system \(\bA\bx = \bb\), the CG method identifies the unique vector \(\bx_k \in \mathcal{K}_k(\bA, \bb)\) that minimizes the \(\bA\)-norm of the error: \[ \begin{align} \bx_k = \arg\min_{\by \in \mathcal{K}_k} \|\bx^* - \by\|_\bA, \end{align} \] where \(\bx^*\) is the exact solution and \(\|\be\|_\bA = \sqrt{\be^T\bA\be}\). This is equivalent to minimizing the quadratic energy functional \(\phi(\by) = \frac{1}{2} \by^T \bA \by - \by^T \bb\).

23.3 Conjugate Directions and A-Orthogonality

Remark

The Basis of Directions: To find this minimum efficiently, CG constructs a basis of search directions \(\{\bp_0, \bp_1, ..., \bp_{k-1}\}\) that are \(\bA\)-orthogonal (or conjugate), meaning \(\bp_i^T \bA \bp_j = 0\) for \(i \neq j\).

Global Optimality: Because directions are \(\bA\)-orthogonal, the \(k\)-th update \(\bx_k = \bx_{k-1} + \alpha_{k-1}\bp_{k-1}\) is guaranteed to be the best possible update in the entire subspace \(\mathcal{K}_k\). There is no need to revisit previous directions.
Short Recurrence: Due to the symmetry of \(\bA\), the new conjugate direction \(\bp_k\) can be computed using only the current residual and the single previous direction. This leads to the \(O(n)\) storage requirement.

Remark

(Residuals vs. directions) In exact arithmetic, CG residuals are mutually orthogonal in the Euclidean inner product, while search directions are mutually orthogonal in the \(\bA\)-inner product. These two orthogonality structures are what make the short recurrence possible.

Remark

Why not steepest descent? Steepest descent always moves in the negative gradient direction, but successive gradients can undo previous progress by zig-zagging across narrow energy valleys. CG corrects this by choosing conjugate directions adapted to the geometry of \(\bA\).

23.4 The CG Algorithm

Definition: Conjugate Gradient Algorithm

Initialize: \(\bx^{(0)} = \bzero, \br^{(0)} = \bb, \bp^{(0)} = \br^{(0)}\).\ Iteration:

(Step length)
(Update iterate)
(Update residual)
(Gram-Schmidt weight)
(Next direction)

Remark

Efficiency:

Matvecs: Only 1 matrix-vector product per iteration.
Memory: \(O(n)\) storage (requires only a few vectors). No need to store the basis.

Remark

(Practical stopping) A common stopping rule is \[ \begin{align} \frac{\|\br^{(k)}\|_2}{\|\bb\|_2} \leq `tol`. \end{align} \] For ill-conditioned systems, residual decrease may not translate directly into error decrease. The condition number controls that conversion.

23.5 Convergence Rate

Theorem: Convergence Bound

\[ \begin{align} \frac{\|\be^{(k)}\|_\bA}{\|\be^{(0)}\|_\bA} \leq 2 \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^k. \end{align} \]

Remark

While stationary methods depend on \((\kappa-1)/(\kappa+1)\), CG depends on the square root of the condition number. For \(\kappa=100\), CG reduces error \(5\times\) faster per step.

Remark

Superlinear Convergence: If eigenvalues are clustered, CG converges much faster than the worst-case bound suggests (e.g., \(k \ll n\) for clustered spectra).

Remark

(Polynomial view) CG chooses a degree-\(k\) polynomial \(p_k\) with \(p_k(0)=1\) so that \[ \begin{align} \be^{(k)} = p_k(\bA)\be^{(0)} \end{align} \] is small in the \(\bA\)-norm. Good convergence occurs when a low-degree polynomial can be small on the eigenvalues of \(\bA\).

23.6 Preconditioning

The convergence of CG depends on the condition number \(\kappa(\bA)\). Preconditioning transforms the system using a symmetric positive definite matrix \(\bM\) that approximates \(\bA\) but is much easier to invert.

Remark

Spectral Transformation: The goal of the preconditioner is to cluster the eigenvalues of \(\bM^{-1}\bA\) near 1. A perfectly preconditioned system (\(\bM = \bA\)) has all eigenvalues equal to 1 and converges in a single step. In practice, a good preconditioner reduces the effective condition number, significantly accelerating convergence.

Remark

(Preconditioning is not optional in hard problems) For large PDE systems, unpreconditioned CG can require too many iterations. The preconditioner is often the main algorithmic design choice.

Definition: Common Preconditioners

Jacobi: \(\bM = \text{diag}(\bA)\). (Cheap, moderate speedup).
Incomplete Cholesky (IC): \(\bM \approx \bL\bL^T\) with sparse factors. (Robust).
Multigrid: Near \(O(n)\) total complexity for PDE systems.

Exercise

Convergence Check: Solve a \(100 \times 100\) Poisson system via CG. Compare your error plot to the theoretical bound.
Spectral Clustering: Construct \(\bA\) with eigenvalues \(\{1, ..., 1, 100, 200\}\). Observe superlinear convergence in the first few steps.
Preconditioning: Compare iteration counts for a sparse system with and without an IC preconditioner.