numerics
Direct solvers (LU, Thomas) factor the matrix once and solve in \(O(n^2)\) per right-hand side; they are exact up to floating-point and preferred when \(n\) is moderate or many right-hand sides share the same coefficient matrix. Iterative solvers (Jacobi, Gauss-Seidel, CG, GMRES) never form a factorization; they produce a sequence of approximate solutions and are preferred for large sparse systems where factorization fill-in would be prohibitive.
Solve \(Tx = d\) where
\[ T = \begin{pmatrix} b_0 & c_0 & & \\ a_1 & b_1 & c_1 & \\ & a_2 & b_2 & \ddots \\ & & \ddots & \ddots & c_{n-2} \\ & & & a_{n-1} & b_{n-1} \end{pmatrix} \]
Tridiagonal systems arise in 1D finite differences, implicit time integration (Crank-Nicolson), cubic spline interpolation, and ADI splitting for 2D/3D parabolic equations.
The Thomas algorithm is Gaussian elimination specialized to tridiagonal structure. The LU factors have the form
\[ L = \begin{pmatrix} 1 \\ l_0 & 1 \\ & l_1 & \ddots \\ & & \ddots & 1 \end{pmatrix}, \quad U = \begin{pmatrix} u_0 & c_0 \\ & u_1 & c_1 \\ & & \ddots & \ddots \\ & & & u_{n-1} \end{pmatrix} \]
where the recurrences follow directly from \(A = LU\):
\[ l_{i-1} = \frac{a_i}{u_{i-1}}, \qquad u_i = b_i - l_{i-1}\, c_{i-1} \]
Forward sweep (eliminate sub-diagonal, modify \(d\) in place):
\[ l \leftarrow a_i / u_{i-1}, \quad u_i \leftarrow b_i - l\, c_{i-1}, \quad d_i \leftarrow d_i - l\, d_{i-1}, \quad i = 1, \ldots, n-1 \]
Backward substitution:
\[ x_{n-1} = d_{n-1} / u_{n-1}, \qquad x_i = (d_i - c_i\, x_{i+1}) / u_i, \quad i = n-2, \ldots, 0 \]
The total cost is \(8n - 7\) FLOPs — asymptotically optimal, since merely reading the \(O(n)\) input already takes \(O(n)\) work.
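The two sweeps translate directly into code. A minimal NumPy sketch (array layout as above, with `a[0]` and `c[n-1]` unused; no pivoting, so it relies on the stability conditions below):

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve the tridiagonal system T x = d in O(n).

    a: sub-diagonal (a[0] unused), b: diagonal,
    c: super-diagonal (c[n-1] unused), d: right-hand side.
    Inputs are length-n arrays; b and d are copied, not overwritten.
    """
    n = len(b)
    u = b.astype(float).copy()   # modified diagonal u_i
    y = d.astype(float).copy()   # modified right-hand side
    # forward sweep: eliminate the sub-diagonal
    for i in range(1, n):
        l = a[i] / u[i - 1]
        u[i] -= l * c[i - 1]
        y[i] -= l * y[i - 1]
    # backward substitution
    x = np.empty(n)
    x[-1] = y[-1] / u[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (y[i] - c[i] * x[i + 1]) / u[i]
    return x
```

For example, the 1D Poisson matrix (`b = 2`, `a = c = -1`) is solved exactly in one pass.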
The algorithm is stable without pivoting when \(T\) is strictly diagonally dominant ( \(|b_i| > |a_i| + |c_i|\) for all \(i\)) or symmetric positive definite. Under diagonal dominance the pivots satisfy \(|u_i| > |c_i|\) at every step, so they remain bounded away from zero.
The forward sweep is inherently sequential within a single system. For \(m\) independent tridiagonal systems (common in ADI methods), assign one GPU thread per system: the \(m\) sweeps are fully independent and the kernel achieves near-peak memory bandwidth.
Write \(A = M - N\) and iterate
\[ M x_{k+1} = N x_k + b \quad \Longrightarrow \quad x_{k+1} = \underbrace{M^{-1}N}_{B}\, x_k + M^{-1}b \]
The iteration converges from any starting point if and only if the spectral radius \(\rho(B) < 1\). Decompose \(A = D + L + U\) (diagonal, strict lower, strict upper):
| Method | \(M\) | \(N\) |
|---|---|---|
| Jacobi | \(D\) | \(-(L+U)\) |
| Gauss-Seidel | \(D+L\) | \(-U\) |
| SOR( \(\omega\)) | \(\frac{1}{\omega}D + L\) | \((\frac{1}{\omega}-1)D - U\) |
Each component of the new iterate is computed from the old iterate only:
\[ x_i^{(k+1)} = \frac{1}{A_{ii}}\!\left(b_i - \sum_{j \neq i} A_{ij}\, x_j^{(k)}\right), \quad i = 0, \ldots, n-1 \]
Because all \(x_i^{(k+1)}\) depend solely on \(x^{(k)}\), Jacobi is embarrassingly parallel — all updates are independent.
Convergence condition: Jacobi converges under strict diagonal dominance ( \(|A_{ii}| > \sum_{j \neq i}|A_{ij}|\)) or for SPD \(A\). The iteration matrix is \(B_J = -D^{-1}(L+U)\). For the model Poisson problem (5-point stencil on an \(n \times n\) grid):
\[ \rho(B_J) = \cos\!\left(\frac{\pi}{n+1}\right) \approx 1 - \frac{\pi^2}{2n^2} \]
This requires \(O(n^2)\) iterations to reduce the error by \(1/e\) — Jacobi is too slow as a standalone solver for large Poisson problems but remains a useful multigrid smoother.
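A vectorized sketch of the Jacobi update (illustrative; it assumes \(A\) is strictly diagonally dominant so the iteration converges):

```python
import numpy as np

def jacobi(A, b, x0=None, tol=1e-10, max_iter=10_000):
    """Jacobi iteration: x_{k+1} = D^{-1} (b - (L+U) x_k)."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    D = np.diag(A)                # diagonal entries of A
    R = A - np.diag(D)            # off-diagonal part L + U
    for _ in range(max_iter):
        x_new = (b - R @ x) / D   # every component uses only the old iterate
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return x_new
        x = x_new
    return x
```

Because `x_new` depends only on `x`, the whole update is a single matrix-vector product plus a pointwise divide — exactly the structure that makes Jacobi embarrassingly parallel.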
Each component uses the most recently computed values:
\[ x_i^{(k+1)} = \frac{1}{A_{ii}}\!\left(b_i - \sum_{j < i} A_{ij}\, x_j^{(k+1)} - \sum_{j > i} A_{ij}\, x_j^{(k)}\right) \]
The in-place update propagates fresh data immediately. For consistently ordered matrices — the model Poisson problem among them — Young's theory gives
\[ \rho(B_{GS}) = \rho(B_J)^2 \]
so Gauss-Seidel needs asymptotically half as many iterations as Jacobi. (The Stein-Rosenberg theorem makes the weaker guarantee that, when the Jacobi iteration matrix is nonnegative, the two methods converge or diverge together.)
SOR over-relaxes the Gauss-Seidel correction by \(\omega \in (0,2)\):
\[ x_i^{(k+1)} = x_i^{(k)} + \omega\!\left(\tilde{x}_i^{(GS)} - x_i^{(k)}\right) \]
For the model Poisson problem the optimal parameter is
\[ \omega_{\mathrm{opt}} = \frac{2}{1 + \sin(\pi/(n+1))} \approx 2 - \frac{2\pi}{n} \]
which reduces the spectral radius to
\[ \rho(B_{SOR,\,\omega_{\mathrm{opt}}}) = \omega_{\mathrm{opt}} - 1 \approx 1 - \frac{2\pi}{n} \]
requiring only \(O(n)\) iterations — a landmark improvement over the \(O(n^2)\) of plain Gauss-Seidel (D.M. Young, 1950).
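A component-wise sketch of the SOR sweep; setting `omega = 1` applies the full Gauss-Seidel correction, so the same routine serves both methods:

```python
import numpy as np

def sor(A, b, omega, x0=None, tol=1e-10, max_iter=10_000):
    """SOR: over-relax each Gauss-Seidel correction by omega in (0, 2)."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            # Gauss-Seidel value using fresh x[:i] and old x[i+1:]
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x_gs = (b[i] - sigma) / A[i, i]
            x[i] += omega * (x_gs - x[i])   # over-relaxed correction
        if np.linalg.norm(x - x_old, np.inf) < tol:
            break
    return x
```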
Parallelism: natural-ordering GS has a sequential dependency chain. For structured grids, red-black ordering (color node \((i,j)\) by parity of \(i+j\)) decouples all red updates and all black updates into two independent parallel sweeps, preserving the same convergence rate.
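For the 5-point stencil, the two-color sweep can be vectorized with parity masks. A sketch, assuming `u` carries fixed Dirichlet boundary values and `f` already includes the \(h^2\) scaling:

```python
import numpy as np

def red_black_gs(u, f, sweeps=1):
    """Red-black Gauss-Seidel for the 5-point Laplacian:
    u_ij = (sum of 4 neighbors + h^2 f_ij) / 4.  Interior nodes are
    colored by the parity of i + j; each color updates in one
    vectorized pass, since same-color nodes never touch each other."""
    i, j = np.meshgrid(np.arange(u.shape[0]), np.arange(u.shape[1]),
                       indexing="ij")
    interior = ((i > 0) & (i < u.shape[0] - 1) &
                (j > 0) & (j < u.shape[1] - 1))
    for _ in range(sweeps):
        for color in (0, 1):                       # red sweep, then black
            mask = interior & ((i + j) % 2 == color)
            avg = np.zeros_like(u)
            avg[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1]
                               + u[1:-1, :-2] + u[1:-1, 2:]) / 4.0
            u[mask] = avg[mask] + f[mask] / 4.0
    return u
```

Each colored half-sweep is a pure stencil application, which maps directly onto SIMD lanes or GPU threads.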
For an SPD matrix \(A\), solving \(Ax = b\) is equivalent to minimizing the quadratic functional
\[ \phi(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A\mathbf{x} - \mathbf{b}^T\mathbf{x} \]
since \(\nabla\phi = A\mathbf{x} - \mathbf{b} = -\mathbf{r}\). The residual is thus the negative gradient — the direction of steepest descent; CG follows a sequence of search directions that span the Krylov subspace
\[ \mathcal{K}_k(A,\mathbf{r}_0) = \operatorname{span}\!\bigl\{\mathbf{r}_0,\, A\mathbf{r}_0,\, \ldots,\, A^{k-1}\mathbf{r}_0\bigr\} \]
The iterate \(x_k\) is the unique minimizer of \(\|\mathbf{x}-\mathbf{x}^*\|_A\) over \(\mathbf{x}_0 + \mathcal{K}_k\).
Two vectors are A-conjugate if \(\mathbf{p}^T A\mathbf{q} = 0\). Minimizing \(\phi\) along an A-conjugate direction leaves progress along previous directions undisturbed. CG maintains A-conjugacy using only the immediately preceding direction — a consequence of the three-term Lanczos recurrence.
At iteration \(k\), exact line minimization along the search direction \(\mathbf{p}_k\) gives
\[ \alpha_k = \frac{\mathbf{r}_k^T\mathbf{r}_k}{\mathbf{p}_k^T A \mathbf{p}_k} \]
After updating \(\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{p}_k\) and \(\mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k A\mathbf{p}_k\), the new A-conjugate direction is
\[ \mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k\mathbf{p}_k, \qquad \beta_k = \frac{\mathbf{r}_{k+1}^T\mathbf{r}_{k+1}}{\mathbf{r}_k^T\mathbf{r}_k} \]
The dominant cost per iteration is the matrix-vector product \(A\mathbf{p}_k\) — \(O(n^2)\) for dense \(A\), \(O(\mathrm{nnz})\) for sparse.
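Putting the recurrences together, a minimal CG sketch for SPD \(A\):

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=None):
    """Conjugate gradients for SPD A: one matvec per iteration."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = np.zeros(n)
    r = b.copy()                      # r_0 = b - A x_0 with x_0 = 0
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rr / (p @ Ap)         # exact line minimization along p
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p     # new A-conjugate direction
        rr = rr_new
    return x
```

Note that only four vectors (`x`, `r`, `p`, `Ap`) are ever stored — the three-term recurrence keeps memory at \(O(n)\).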
CG terminates in at most \(n\) iterations in exact arithmetic. In practice:
\[ \frac{\|\mathbf{e}_k\|_A}{\|\mathbf{e}_0\|_A} \leq 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \]
where \(\kappa = \kappa_2(A) = \lambda_{\max}/\lambda_{\min}\).
| \(\kappa\) | Iterations for 6-digit accuracy (from the bound) |
|---|---|
| \(10\) | ~22 |
| \(100\) | ~72 |
| \(10^4\) | ~725 |
| \(10^6\) | ~7250 |
Eigenvalue clustering accelerates convergence: if all eigenvalues except a small cluster lie in a single interval, CG quickly annihilates the corresponding polynomial and then converges as if the problem had only as many distinct eigenvalues as the cluster size.
Replace \(A\) with \(M^{-1}A\) where \(M \approx A\) is cheap to invert. The PCG scalars change to
\[ \alpha_k = \frac{\mathbf{r}_k^T\mathbf{z}_k}{\mathbf{p}_k^T A\mathbf{p}_k}, \qquad \beta_k = \frac{\mathbf{r}_{k+1}^T\mathbf{z}_{k+1}}{\mathbf{r}_k^T\mathbf{z}_k} \]
where \(\mathbf{z}_k = M^{-1}\mathbf{r}_k\) is the preconditioner solve. Common preconditioners: diagonal (Jacobi), incomplete Cholesky (IC0), and algebraic multigrid (AMG, giving \(\kappa(M^{-1}A) \approx O(1)\) for Poisson-type problems).
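A sketch of PCG with the simplest choice, the diagonal (Jacobi) preconditioner \(M = \operatorname{diag}(A)\):

```python
import numpy as np

def pcg_jacobi(A, b, tol=1e-10, max_iter=None):
    """Preconditioned CG with M = diag(A): z = M^{-1} r each iteration."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    Minv = 1.0 / np.diag(A)           # diagonal preconditioner solve is O(n)
    x = np.zeros(n)
    r = b.copy()
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = Minv * r                  # preconditioner solve
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

Swapping in IC0 or AMG only changes the line computing `z`; the rest of the loop is untouched.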
`cg_matfree` accepts a `MatVecFn` callback instead of an explicit matrix, enabling implicit time stepping ( \(A = M + \Delta t\, K\), never assembled), spectral methods (apply \(A\) via FFT in \(O(n\log n)\)), and structured operators (Toeplitz, circulant) with \(O(n)\) storage.
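The callback idea can be sketched as follows (the names `cg_matvec` and `laplacian_1d` are illustrative, not the `cg_matfree` API itself); here the operator is the 1D Dirichlet Laplacian stencil applied in \(O(n)\) with no matrix stored:

```python
import numpy as np

def cg_matvec(matvec, b, tol=1e-10, max_iter=None):
    """CG where the operator is supplied as a callable x -> A @ x."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = np.zeros(n)
    r = b.copy()
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)                # the only place A is "used"
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def laplacian_1d(v):
    """Apply the tridiagonal stencil (-1, 2, -1) with zero Dirichlet ends."""
    out = 2.0 * v
    out[1:] -= v[:-1]
    out[:-1] -= v[1:]
    return out
```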
GMRES solves \(Ax = b\) for any invertible \(A\), including non-symmetric matrices where CG does not apply. At step \(k\) it finds
\[ x_k = \operatorname*{argmin}_{x \in x_0 + \mathcal{K}_k}\|Ax - b\|_2 \]
over the same Krylov subspace \(\mathcal{K}_k(A,r_0)\).
The Krylov basis is built by the Arnoldi process using Modified Gram-Schmidt (MGS). After \(k\) steps:
\[ A V_k = V_{k+1}\bar{H}_k \]
where \(V_k = [v_1 \mid \cdots \mid v_k]\) has orthonormal columns and \(\bar{H}_k \in \mathbb{R}^{(k+1)\times k}\) is the upper Hessenberg matrix of MGS coefficients. The MGS step at iteration \(j\) is:
\[ w \leftarrow Av_j, \qquad h_{i,j} = w^T v_i, \quad w \leftarrow w - h_{i,j} v_i \quad (i = 0,\ldots,j), \qquad h_{j+1,j} = \|w\|, \quad v_{j+1} = w / h_{j+1,j} \]
Using the Arnoldi relation and orthonormality of \(V_{k+1}\):
\[ \|Ax_k - b\|_2 = \|\bar{H}_k y_k - \beta e_1\|_2, \qquad \beta = \|r_0\|_2 \]
This is a small \((k+1) \times k\) least-squares problem solved incrementally via Givens rotations. At step \(j\), after applying the previous rotations to the new column, a new rotation annihilates the sub-diagonal entry \(H_{j+1,j}\):
\[ \begin{pmatrix} c_j & s_j \\ -s_j & c_j \end{pmatrix} \begin{pmatrix} H_{jj} \\ H_{j+1,j} \end{pmatrix} = \begin{pmatrix} d \\ 0 \end{pmatrix}, \qquad c_j = \frac{H_{jj}}{d},\quad s_j = \frac{H_{j+1,j}}{d},\quad d = \sqrt{H_{jj}^2 + H_{j+1,j}^2} \]
The rotated right-hand side vector \(g\) satisfies \(|g_{j+1}| = \|r_{j+1}\|_2\), giving the residual norm **without computing \(x_k\) explicitly** — used as the convergence monitor.
Once \(\|r_k\|_2 < \mathrm{tol}\), back-solve the \(k \times k\) upper triangular system \(H_{[k]}y = g_{[k]}\) and recover \(x_k = x_0 + V_k y\).
Full GMRES stores \(k\) vectors growing without bound. **Restarted GMRES( \(m\))** caps memory at \(m+1\) vectors of length \(n\): after \(m\) Arnoldi steps, update \(x_0\) from the current solution and restart.
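The pieces — MGS Arnoldi, Givens rotations, the free residual monitor, and restarting — assemble into a compact routine. A sketch of GMRES( \(m\)):

```python
import numpy as np

def gmres(A, b, m=30, tol=1e-10, max_restarts=50):
    """Restarted GMRES(m): MGS Arnoldi + Givens rotations, storing at
    most m+1 basis vectors of length n per restart cycle."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_restarts):
        r = b - A @ x
        beta = np.linalg.norm(r)
        if beta < tol:
            return x
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = r / beta
        g = np.zeros(m + 1)
        g[0] = beta                          # rotated right-hand side
        cs, sn = np.zeros(m), np.zeros(m)
        k = 0
        for j in range(m):
            # Arnoldi step with modified Gram-Schmidt
            w = A @ V[:, j]
            for i in range(j + 1):
                H[i, j] = w @ V[:, i]
                w -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] > 1e-14:
                V[:, j + 1] = w / H[j + 1, j]
            # apply the previous Givens rotations to column j
            for i in range(j):
                H[i, j], H[i + 1, j] = (
                    cs[i] * H[i, j] + sn[i] * H[i + 1, j],
                    -sn[i] * H[i, j] + cs[i] * H[i + 1, j])
            # new rotation annihilates the sub-diagonal entry H[j+1, j]
            d = np.hypot(H[j, j], H[j + 1, j])
            cs[j], sn[j] = H[j, j] / d, H[j + 1, j] / d
            H[j, j], H[j + 1, j] = d, 0.0
            g[j + 1] = -sn[j] * g[j]
            g[j] = cs[j] * g[j]
            k = j + 1
            if abs(g[k]) < tol:              # residual norm for free
                break
        y = np.linalg.solve(H[:k, :k], g[:k])   # triangular back-solve
        x = x + V[:, :k] @ y
        if abs(g[k]) < tol:
            return x
    return x
```

The inner loop never forms \(x_k\); only `|g[k]|` is checked, and the solution is reconstructed once at the end of each cycle.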
GMRES convergence is governed by a polynomial approximation problem: for normal \(A\) (for general diagonalizable \(A = V\Lambda V^{-1}\), an extra factor \(\kappa_2(V)\) appears),
\[ \frac{\|r_k\|_2}{\|r_0\|_2} \leq \min_{\substack{p_k \in \mathcal{P}_k \\ p_k(0)=1}} \max_{\lambda \in \sigma(A)} |p_k(\lambda)| \]
Convergence is fast when the spectrum \(\sigma(A)\) is clustered away from the origin; it can stagnate for many steps then converge suddenly if one eigenvalue is an outlier far from the main cluster.
Preconditioning replaces \(A\) by \(M^{-1}A\) (left) or \(AM^{-1}\) (right); right preconditioning is usually preferred because it minimizes the true residual norm \(\|r_k\|_2\) rather than the preconditioned norm \(\|M^{-1}r_k\|_2\).