1. Introduction
Solving linear systems \(A\mathbf{x} = \mathbf{b}\) is the computational backbone of scientific computing. This document provides a rigorous treatment of two solver methods and their parallel implementations:
- Conjugate Gradient (CG) — Krylov subspace method for SPD systems
- Thomas Algorithm — Direct solver for tridiagonal systems
2. Conjugate Gradient Method
2.1 Mathematical Foundation
The Minimization Perspective
For a symmetric positive definite (SPD) matrix \(A \in \mathbb{R}^{n \times n}\), solving \(A\mathbf{x} = \mathbf{b}\) is equivalent to minimizing the quadratic functional:
\[\phi(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}\]
Proof: Taking the gradient:
\[\nabla \phi(\mathbf{x}) = A\mathbf{x} - \mathbf{b}\]
Setting \(\nabla \phi = \mathbf{0}\) yields \(A\mathbf{x} = \mathbf{b}\). Since \(A\) is SPD, the Hessian \(\nabla^2 \phi = A\) is positive definite, confirming this is a minimum. \(\square\)
The residual \(\mathbf{r} = \mathbf{b} - A\mathbf{x}\) equals \(-\nabla\phi\): it is the direction of steepest descent of \(\phi\) at \(\mathbf{x}\).
A-Conjugacy and Optimality
Two vectors \(\mathbf{p}\) and \(\mathbf{q}\) are A-conjugate (or A-orthogonal) if:
\[\mathbf{p}^T A \mathbf{q} = 0\]
Key Theorem: Given \(n\) mutually A-conjugate directions \(\{\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_{n-1}\}\), the exact solution can be found in \(n\) steps by sequential line minimization along each direction.
Proof: The A-conjugate vectors form a basis for \(\mathbb{R}^n\). Express the error:
\[\mathbf{x}^* - \mathbf{x}_0 = \sum_{i=0}^{n-1} \alpha_i \mathbf{p}_i\]
Multiplying both sides by \(\mathbf{p}_j^T A\):
\[\mathbf{p}_j^T A (\mathbf{x}^* - \mathbf{x}_0) = \alpha_j \mathbf{p}_j^T A \mathbf{p}_j\]
Thus \(\alpha_j = \frac{\mathbf{p}_j^T A (\mathbf{x}^* - \mathbf{x}_0)}{\mathbf{p}_j^T A \mathbf{p}_j} = \frac{\mathbf{p}_j^T \mathbf{r}_0}{\mathbf{p}_j^T A \mathbf{p}_j}\)
where \(\mathbf{r}_0 = \mathbf{b} - A\mathbf{x}_0 = A(\mathbf{x}^* - \mathbf{x}_0)\). \(\square\)
Krylov Subspace Connection
The CG method generates A-conjugate directions from the Krylov subspace:
\[\mathcal{K}_k(A, \mathbf{r}_0) = \text{span}\{\mathbf{r}_0, A\mathbf{r}_0, A^2\mathbf{r}_0, \ldots, A^{k-1}\mathbf{r}_0\}\]
After \(k\) iterations, CG finds:
\[\mathbf{x}_k = \arg\min_{\mathbf{x} \in \mathbf{x}_0 + \mathcal{K}_k} \|\mathbf{x} - \mathbf{x}^*\|_A\]
where \(\|\mathbf{v}\|_A = \sqrt{\mathbf{v}^T A \mathbf{v}}\) is the A-norm (energy norm).
2.2 Algorithm Derivation
Starting from \(\mathbf{x}_0\) with residual \(\mathbf{r}_0 = \mathbf{b} - A\mathbf{x}_0\), set \(\mathbf{p}_0 = \mathbf{r}_0\).
Step size (exact line minimization): Minimize \(\phi(\mathbf{x}_k + \alpha \mathbf{p}_k)\):
\[\frac{d}{d\alpha}\phi(\mathbf{x}_k + \alpha \mathbf{p}_k) = \mathbf{p}_k^T(A\mathbf{x}_k + \alpha A\mathbf{p}_k - \mathbf{b}) = 0\]
\[\alpha_k = \frac{\mathbf{p}_k^T \mathbf{r}_k}{\mathbf{p}_k^T A \mathbf{p}_k}\]
Update:
\[\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k\]
\[\mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k A \mathbf{p}_k\]
New search direction (enforce A-conjugacy with all previous directions):
\[\mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k\]
where \(\beta_k = \frac{\mathbf{r}_{k+1}^T \mathbf{r}_{k+1}}{\mathbf{r}_k^T \mathbf{r}_k}\)
Remarkable property: Due to the Krylov structure, we only need the previous direction—not all previous directions—to maintain A-conjugacy.
2.3 The CG Algorithm
Input:  A (SPD), b, x₀, tol, max_iter
Output: x (approximate solution)

x = x₀
r = b - A*x
p = r
ρ_old = rᵀr
for k = 0, 1, 2, ... until convergence:
    q = A*p                    // Matrix-vector product
    α = ρ_old / (pᵀq)          // Step length
    x = x + α*p                // Update solution
    r = r - α*q                // Update residual
    ρ_new = rᵀr                // New residual norm squared
    if √ρ_new < tol: break     // Convergence check
    β = ρ_new / ρ_old          // Improvement ratio
    p = r + β*p                // New search direction
    ρ_old = ρ_new
2.4 Convergence Analysis
Theorem (CG Convergence Rate): After \(k\) iterations:
\[\frac{\|\mathbf{x}_k - \mathbf{x}^*\|_A}{\|\mathbf{x}_0 - \mathbf{x}^*\|_A} \leq 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k\]
where \(\kappa = \lambda_{\max}/\lambda_{\min}\) is the condition number of \(A\).
Proof sketch: CG minimizes over polynomials \(p_k\) of degree \(k\) with \(p_k(0) = 1\):
\[\|\mathbf{x}_k - \mathbf{x}^*\|_A = \min_{p_k} \|p_k(A)(\mathbf{x}_0 - \mathbf{x}^*)\|_A\]
Using Chebyshev polynomials (optimal for minimizing \(\max_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} |p_k(\lambda)|\)) yields the bound. \(\square\)
Practical implications:
| Condition Number | Iterations for 6-digit accuracy |
| --- | --- |
| \(\kappa = 10\) | ~23 |
| \(\kappa = 100\) | ~73 |
| \(\kappa = 10^4\) | ~730 |
| \(\kappa = 10^6\) | ~7300 |

The required iteration count grows like \(\sqrt{\kappa}\).
2.5 Operation Count
Per iteration:
| Operation | FLOPs | Memory Access |
| --- | --- | --- |
| \(\mathbf{q} = A\mathbf{p}\) | \(2n^2\) (dense) or \(O(nnz)\) (sparse) | \(O(n^2)\) or \(O(nnz)\) |
| \(\alpha = \rho/(\mathbf{p}^T\mathbf{q})\) | \(2n\) | \(2n\) reads |
| \(\mathbf{x} = \mathbf{x} + \alpha\mathbf{p}\) | \(2n\) | \(3n\) |
| \(\mathbf{r} = \mathbf{r} - \alpha\mathbf{q}\) | \(2n\) | \(3n\) |
| \(\rho = \mathbf{r}^T\mathbf{r}\) | \(2n\) | \(n\) reads |
| \(\mathbf{p} = \mathbf{r} + \beta\mathbf{p}\) | \(2n\) | \(3n\) |
Total per iteration: \(2n^2 + 10n\) FLOPs (dense), dominated by matvec.
3. Thomas Algorithm (Tridiagonal Solver)
3.1 Problem Structure
A tridiagonal system has the form:
\[\begin{pmatrix}
b_0 & c_0 \\
a_0 & b_1 & c_1 \\
& a_1 & b_2 & c_2 \\
& & \ddots & \ddots & \ddots \\
& & & a_{n-3} & b_{n-2} & c_{n-2} \\
& & & & a_{n-2} & b_{n-1}
\end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-2} \\ x_{n-1} \end{pmatrix}
=
\begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ \vdots \\ d_{n-2} \\ d_{n-1} \end{pmatrix}\]
Storage:
- \(\mathbf{a}\): lower diagonal, size \(n-1\) (elements \(a_0, \ldots, a_{n-2}\))
- \(\mathbf{b}\): main diagonal, size \(n\)
- \(\mathbf{c}\): upper diagonal, size \(n-1\) (elements \(c_0, \ldots, c_{n-2}\))
3.2 Derivation via LU Factorization
The Thomas algorithm is Gaussian elimination specialized for tridiagonal structure.
LU Decomposition: Factor \(A = LU\) where:
\[L = \begin{pmatrix}
1 \\
l_0 & 1 \\
& l_1 & 1 \\
& & \ddots & \ddots \\
& & & l_{n-2} & 1
\end{pmatrix}, \quad
U = \begin{pmatrix}
u_0 & c_0 \\
& u_1 & c_1 \\
& & u_2 & c_2 \\
& & & \ddots & \ddots \\
& & & & u_{n-1}
\end{pmatrix}\]
Deriving the factors: From \(A = LU\):
Row 0: \(b_0 = u_0\), so \(u_0 = b_0\)
Row \(i\) ( \(i \geq 1\)):
- Position \((i, i-1)\): \(a_{i-1} = l_{i-1} \cdot u_{i-1}\), so \(l_{i-1} = a_{i-1}/u_{i-1}\)
- Position \((i, i)\): \(b_i = l_{i-1} \cdot c_{i-1} + u_i\), so \(u_i = b_i - l_{i-1} \cdot c_{i-1}\)
Forward elimination (computing \(L^{-1}\mathbf{d}\)):
Let \(\mathbf{y} = U\mathbf{x}\), solve \(L\mathbf{y} = \mathbf{d}\):
\[y_0 = d_0\]
\[y_i = d_i - l_{i-1} \cdot y_{i-1} \quad \text{for } i = 1, \ldots, n-1\]
Back substitution (solving \(U\mathbf{x} = \mathbf{y}\)):
\[x_{n-1} = y_{n-1} / u_{n-1}\]
\[x_i = (y_i - c_i \cdot x_{i+1}) / u_i \quad \text{for } i = n-2, \ldots, 0\]
3.3 The Thomas Algorithm
Combining and simplifying (using \(\mathbf{b}\) and \(\mathbf{d}\) as working storage):
Input:  a[0..n-2], b[0..n-1], c[0..n-2], d[0..n-1]
Output: x[0..n-1]

// Create working copies
b_work = copy(b)
d_work = copy(d)

// Forward elimination
for i = 1 to n-1:
    w = a[i-1] / b_work[i-1]
    b_work[i] = b_work[i] - w * c[i-1]
    d_work[i] = d_work[i] - w * d_work[i-1]

// Back substitution
x[n-1] = d_work[n-1] / b_work[n-1]
for i = n-2 down to 0:
    x[i] = (d_work[i] - c[i] * x[i+1]) / b_work[i]
3.4 Complexity Analysis
Forward elimination:
- Loop iterations: \(n-1\)
- Per iteration: 1 division, 2 multiplications, 2 subtractions = 5 FLOPs
- Total: \(5(n-1)\) FLOPs
Back substitution:
- First element: 1 division
- Remaining \(n-1\) elements: 1 subtraction, 1 multiplication, 1 division = 3 FLOPs each
- Total: \(1 + 3(n-1) = 3n - 2\) FLOPs
Grand total: \(5(n-1) + 3n - 2 = 8n - 7\) FLOPs = \(O(n)\)
This is **optimal**—any algorithm must at least read the \(O(n)\) input values.
3.5 Stability Analysis
Theorem: The Thomas algorithm is stable if \(A\) is diagonally dominant:
\[|b_i| > |a_{i-1}| + |c_i|, \quad i = 0, \ldots, n-1\]
(with \(a_{-1} = c_{n-1} = 0\)).
Proof: Argue by induction that \(|b_{\text{work},i}| > |c_i|\). The base case is diagonal dominance of row 0: \(|b_{\text{work},0}| = |b_0| > |c_0|\). For the step, the inductive hypothesis gives \(|w| = |a_{i-1}|/|b_{\text{work},i-1}| < |a_{i-1}|/|c_{i-1}|\), so:
\[|b_{\text{work},i}| = |b_i - w \cdot c_{i-1}| \geq |b_i| - |w||c_{i-1}| > |b_i| - |a_{i-1}| > |c_i|\]
Thus the modified diagonal elements remain dominant and bounded away from zero, so no pivoting is needed. \(\square\)
Example (1D Laplacian): The system with \(a_i = c_i = -1\), \(b_i = 2\) is only weakly dominant in the interior ( \(|2| = |-1| + |-1|\)), but the first and last rows are strictly dominant, and this irreducible diagonal dominance is enough for the same argument to go through (as the pivots \(u_i = (i+2)/(i+1) > 1\) confirm), so Thomas is stable here.
4. CUDA Implementation
4.1 CG on GPU
The CG algorithm maps naturally to GPU by parallelizing each operation:
SolverResult cg(const Matrix& A, const Vector& b, Vector& x,
                real tol, idx max_iter, Exec exec) {
    const idx n = b.size();
    Vector r(n), p(n), Ap(n);
    if (exec == Exec::gpu) {
        A.to_gpu(); b.to_gpu(); x.to_gpu();
        r.to_gpu(); p.to_gpu(); Ap.to_gpu();
    }
    // r = b - A*x
    matvec(A, x, r, exec);
    scale(r, -1.0, exec);
    axpy(1.0, b, r, exec);
    // p = r (p is zero-initialized)
    axpy(1.0, r, p, exec);
    real rsold = dot(r, r, exec);
    SolverResult result{max_iter, sqrt(rsold), false};
    for (idx iter = 0; iter < max_iter; ++iter) {
        matvec(A, p, Ap, exec);
        real pAp = dot(p, Ap, exec);
        real alpha = rsold / pAp;
        axpy(alpha, p, x, exec);
        axpy(-alpha, Ap, r, exec);
        real rsnew = dot(r, r, exec);
        if (sqrt(rsnew) < tol) {
            result = {iter + 1, sqrt(rsnew), true};
            break;
        }
        real beta = rsnew / rsold;
        scale(p, beta, exec);
        axpy(1.0, r, p, exec);
        rsold = rsnew;
    }
    if (exec == Exec::gpu) x.to_cpu();
    return result;
}
GPU Kernels Used
Parallel dot product (reduction):
// Note: *result must be zeroed before each launch.
__global__ void k_dot(const real* x, const real* y, real* result, idx n) {
    __shared__ real sdata[BLOCK_SIZE];
    idx tid = threadIdx.x;
    idx i = blockIdx.x * blockDim.x + tid;
    // Each thread computes one partial product
    sdata[tid] = (i < n) ? x[i] * y[i] : 0;
    __syncthreads();
    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // Block result added atomically to the global result
    if (tid == 0) atomicAdd(result, sdata[0]);
}
Parallel AXPY ( \(\mathbf{y} = \alpha\mathbf{x} + \mathbf{y}\)):
__global__ void k_axpy(real alpha, const real* x, real* y, idx n) {
    idx i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += alpha * x[i];
}
Parallel matrix-vector product:
__global__ void k_matvec(const real* A, const real* x, real* y,
                         idx rows, idx cols) {
    idx i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows) {
        real sum = 0;
        for (idx j = 0; j < cols; ++j)
            sum += A[i * cols + j] * x[j];
        y[i] = sum;
    }
}
Performance Considerations
| Problem Size | CPU Wins | GPU Wins |
| --- | --- | --- |
| \(n < 500\) | ✓ (kernel overhead) | |
| \(n > 1000\) | | ✓ (parallelism) |
The crossover point depends on:
- GPU memory bandwidth
- Kernel launch latency (~5-20 μs)
- Number of CG iterations
4.2 Thomas Algorithm on GPU
The Thomas algorithm is inherently sequential (forward sweep depends on previous row). However, we can parallelize across multiple independent systems.
Batched Thomas Kernel
When solving \(k\) independent tridiagonal systems (common in ADI methods):
__global__ void k_thomas_batched(
    const real* a, const real* b, const real* c,
    const real* d, real* x,
    real* b_work, real* d_work,   // Workspace
    idx n, idx batch_size)
{
    // Each thread handles one complete system
    idx batch_idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (batch_idx >= batch_size) return;

    // Offsets into the packed arrays
    idx off_sub  = batch_idx * (n - 1);  // For a, c
    idx off_main = batch_idx * n;        // For b, d, x

    // Copy to workspace
    for (idx i = 0; i < n; ++i) {
        b_work[off_main + i] = b[off_main + i];
        d_work[off_main + i] = d[off_main + i];
    }

    // Forward elimination (sequential within thread)
    for (idx i = 1; i < n; ++i) {
        real w = a[off_sub + i - 1] / b_work[off_main + i - 1];
        b_work[off_main + i] -= w * c[off_sub + i - 1];
        d_work[off_main + i] -= w * d_work[off_main + i - 1];
    }

    // Back substitution (sequential within thread)
    x[off_main + n - 1] = d_work[off_main + n - 1] / b_work[off_main + n - 1];
    for (idx i = n - 1; i > 0; --i) {
        x[off_main + i - 1] = (d_work[off_main + i - 1]
            - c[off_sub + i - 1] * x[off_main + i]) / b_work[off_main + i - 1];
    }
}
Wrapper function:
void thomas_batched(const real* a, const real* b, const real* c,
                    const real* d, real* x, idx n, idx batch_size) {
    real *b_work, *d_work;
    cudaMalloc(&b_work, batch_size * n * sizeof(real));
    cudaMalloc(&d_work, batch_size * n * sizeof(real));
    int threads = 256;
    int blocks = (batch_size + threads - 1) / threads;
    k_thomas_batched<<<blocks, threads>>>(
        a, b, c, d, x, b_work, d_work, n, batch_size);
    cudaDeviceSynchronize();  // kernel must finish before workspace is freed
    cudaFree(b_work);
    cudaFree(d_work);
}
When Batched Thomas Helps
| Scenario | Recommendation |
| --- | --- |
| Single system | Use CPU Thomas |
| 100+ independent systems | Use GPU batched |
| ADI with \(n_y \times n_z\) pencils | GPU batched excels |
5. MPI Implementation (Distributed Memory)
5.1 Distributed CG
For large systems that don't fit in single-node memory, distribute the matrix and vectors across \(P\) MPI ranks.
Data Distribution
1D row distribution (assuming \(P\) divides \(n\)): Rank \(p\) owns rows \([p \cdot (n/P), (p+1) \cdot (n/P))\)
Rank 0: rows 0 to n/P - 1
Rank 1: rows n/P to 2n/P - 1
...
Rank P-1: rows (P-1)n/P to n - 1
Each rank stores:
- Local portion of \(A\): size \((n/P) \times n\) (full rows)
- Local portions of \(\mathbf{x}, \mathbf{b}, \mathbf{r}, \mathbf{p}\): size \(n/P\)
Distributed Operations
Matrix-vector product ( \(\mathbf{y} = A\mathbf{x}\)):
void distributed_matvec(const Matrix& A_local, const Vector& x_local,
                        Vector& y_local, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    idx local_rows = A_local.rows();
    idx n = local_rows * size;
    Vector x_full(n);
    // Every rank needs the full x to multiply its rows
    MPI_Allgather(x_local.data(), local_rows, MPI_DOUBLE,
                  x_full.data(), local_rows, MPI_DOUBLE, comm);
    for (idx i = 0; i < local_rows; ++i) {
        real sum = 0;
        for (idx j = 0; j < n; ++j)
            sum += A_local(i, j) * x_full[j];
        y_local[i] = sum;
    }
}
Distributed dot product:
real distributed_dot(const Vector& x_local, const Vector& y_local,
                     MPI_Comm comm) {
    real local_sum = 0;
    for (idx i = 0; i < x_local.size(); ++i)
        local_sum += x_local[i] * y_local[i];
    real global_sum;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global_sum;
}
Distributed AXPY (no communication needed):
void distributed_axpy(real alpha, const Vector& x_local, Vector& y_local) {
    for (idx i = 0; i < x_local.size(); ++i)
        y_local[i] += alpha * x_local[i];
}
Distributed CG Algorithm
SolverResult distributed_cg(const Matrix& A_local, const Vector& b_local,
                            Vector& x_local, real tol, idx max_iter,
                            MPI_Comm comm) {
    idx local_n = b_local.size();
    Vector r_local(local_n), p_local(local_n), Ap_local(local_n);
    // r = b - A*x
    distributed_matvec(A_local, x_local, r_local, comm);
    for (idx i = 0; i < local_n; ++i)
        r_local[i] = b_local[i] - r_local[i];
    p_local = r_local;
    real rsold = distributed_dot(r_local, r_local, comm);
    for (idx iter = 0; iter < max_iter; ++iter) {
        distributed_matvec(A_local, p_local, Ap_local, comm);
        real pAp = distributed_dot(p_local, Ap_local, comm);
        real alpha = rsold / pAp;
        distributed_axpy(alpha, p_local, x_local);
        distributed_axpy(-alpha, Ap_local, r_local);
        real rsnew = distributed_dot(r_local, r_local, comm);
        if (sqrt(rsnew) < tol) {
            return {iter + 1, sqrt(rsnew), true};
        }
        real beta = rsnew / rsold;
        for (idx i = 0; i < local_n; ++i)
            p_local[i] = r_local[i] + beta * p_local[i];
        rsold = rsnew;
    }
    return {max_iter, sqrt(rsold), false};
}
Communication Analysis
Per CG iteration:
| Operation | Communication |
| --- | --- |
| matvec | MPI_Allgather — \(O(n)\) data, \(O(\log P)\) latency |
| dot (×2) | MPI_Allreduce — \(O(1)\) data, \(O(\log P)\) latency |
| axpy (×3) | None |
Total per iteration:
- 1 Allgather: \(O(n/P \cdot P) = O(n)\) bytes
- 2 Allreduce: \(O(1)\) bytes each
Communication is dominated by the Allgather in matvec.
5.2 Distributed Thomas (Pipeline)
For a single tridiagonal system distributed across ranks, use pipelining:
Rank 0: rows 0 to n/P - 1
Rank 1: rows n/P to 2n/P - 1 (needs data from Rank 0)
...
Forward sweep: Each rank waits for modified values from previous rank:
void distributed_thomas_forward(Vector& b_local, Vector& d_local,
                                const Vector& a_local, const Vector& c_local,
                                MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    idx local_n = b_local.size();
    if (rank > 0) {
        // Receive the previous rank's final modified row (b, d, and its c)
        real recv_b, recv_d, recv_c;
        MPI_Recv(&recv_b, 1, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
        MPI_Recv(&recv_d, 1, MPI_DOUBLE, rank - 1, 1, comm, MPI_STATUS_IGNORE);
        MPI_Recv(&recv_c, 1, MPI_DOUBLE, rank - 1, 2, comm, MPI_STATUS_IGNORE);
        // Eliminate the coupling of our first row to that boundary row
        real w = a_local[0] / recv_b;
        b_local[0] -= w * recv_c;
        d_local[0] -= w * recv_d;
    }
    for (idx i = 1; i < local_n; ++i) {
        real w = a_local[i] / b_local[i - 1];
        b_local[i] -= w * c_local[i - 1];
        d_local[i] -= w * d_local[i - 1];
    }
    if (rank < size - 1) {
        MPI_Send(&b_local[local_n - 1], 1, MPI_DOUBLE, rank + 1, 0, comm);
        MPI_Send(&d_local[local_n - 1], 1, MPI_DOUBLE, rank + 1, 1, comm);
        MPI_Send(&c_local[local_n - 1], 1, MPI_DOUBLE, rank + 1, 2, comm);
    }
}
Note: Pipeline parallelism has limited scalability. For better parallel efficiency, consider:
- Cyclic Reduction: \(O(\log n)\) parallel depth
- Spike Algorithm: Domain decomposition with coupling solve
6. Hybrid MPI+CUDA Implementation
For multi-GPU clusters, combine MPI (inter-node) with CUDA (intra-node):
SolverResult hybrid_cg(const Matrix& A_local, const Vector& b_local,
                       Vector& x_local, real tol, idx max_iter,
                       MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    cudaSetDevice(rank % num_gpus_per_node);  // bind one GPU per rank
    A_local.to_gpu();
    b_local.to_gpu();
    x_local.to_gpu();
    // Inside the CG loop, the Allgather exchanges x between GPU buffers
    // directly (requires CUDA-aware MPI):
    MPI_Allgather(x_local.gpu_data(), local_n, MPI_DOUBLE,
                  x_full_gpu, local_n, MPI_DOUBLE, comm);
    // ... remaining iteration structure as in distributed_cg, with GPU
    // kernels for the local matvec, dot, and axpy operations ...
}
CUDA-aware MPI allows direct GPU-to-GPU transfers without staging through CPU memory.
7. Summary
| Method | Complexity | Parallelization | Best Use Case |
| --- | --- | --- | --- |
| CG (CPU) | \(O(n^2)\)/iter | OpenMP threads | Small-medium SPD |
| CG (GPU) | \(O(n^2)\)/iter | CUDA kernels | Large SPD ( \(n > 1000\)) |
| CG (MPI) | \(O(n^2/P)\)/iter + comm | Distributed | Huge SPD |
| Thomas (CPU) | \(O(n)\) | None (sequential) | Single tridiagonal |
| Thomas Batched (GPU) | \(O(n \cdot k/P_{GPU})\) | Thread per system | Many tridiagonals |
| Thomas (MPI) | \(O(n/P)\) + pipeline | Limited | Large single tridiagonal |
Key insights:
- CG parallelizes naturally; matvec dominates cost
- Thomas is inherently sequential; parallelize across systems, not within
- GPU advantage requires sufficient problem size to amortize launch overhead
- MPI CG needs careful attention to communication (Allgather is expensive)