Week 4: Linear Solvers — Theory and Parallel Implementation

1. Introduction

Solving linear systems \(A\mathbf{x} = \mathbf{b}\) is the computational backbone of scientific computing. This document provides a rigorous treatment of two solver methods and their parallel implementations:

  1. Conjugate Gradient (CG) — Krylov subspace method for SPD systems
  2. Thomas Algorithm — Direct solver for tridiagonal systems

2. Conjugate Gradient Method

2.1 Mathematical Foundation

The Minimization Perspective

For a symmetric positive definite (SPD) matrix \(A \in \mathbb{R}^{n \times n}\), solving \(A\mathbf{x} = \mathbf{b}\) is equivalent to minimizing the quadratic functional:

\[\phi(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}\]

Proof: Taking the gradient:

\[\nabla \phi(\mathbf{x}) = A\mathbf{x} - \mathbf{b}\]

Setting \(\nabla \phi = \mathbf{0}\) yields \(A\mathbf{x} = \mathbf{b}\). Since \(A\) is SPD, the Hessian \(\nabla^2 \phi = A\) is positive definite, confirming this is a minimum. \(\square\)

The residual \(\mathbf{r} = \mathbf{b} - A\mathbf{x}\) equals \(-\nabla\phi(\mathbf{x})\): it is the direction of steepest descent of \(\phi\).
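This equivalence is easy to sanity-check numerically: for a small SPD matrix, a central finite difference of \(\phi\) reproduces \(A\mathbf{x} - \mathbf{b}\) component by component. A minimal sketch (the helper names and the 2×2 test matrix are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// phi(x) = 0.5 x^T A x - b^T x for a dense n x n row-major matrix A.
double phi(const std::vector<double>& A, const std::vector<double>& b,
           const std::vector<double>& x) {
    std::size_t n = b.size();
    double v = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double Ax_i = 0.0;
        for (std::size_t j = 0; j < n; ++j) Ax_i += A[i * n + j] * x[j];
        v += 0.5 * x[i] * Ax_i - b[i] * x[i];
    }
    return v;
}

// Central finite difference of phi in coordinate i; should match
// (A x - b)_i, the i-th component of the gradient.
double grad_fd(const std::vector<double>& A, const std::vector<double>& b,
               std::vector<double> x, std::size_t i, double h = 1e-6) {
    x[i] += h;     double fp = phi(A, b, x);
    x[i] -= 2 * h; double fm = phi(A, b, x);
    return (fp - fm) / (2 * h);
}
```

Since \(\phi\) is quadratic, the central difference is exact up to rounding.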

A-Conjugacy and Optimality

Two vectors \(\mathbf{p}\) and \(\mathbf{q}\) are A-conjugate (or A-orthogonal) if:

\[\mathbf{p}^T A \mathbf{q} = 0\]

Key Theorem: Given \(n\) mutually A-conjugate directions \(\{\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_{n-1}\}\), the exact solution can be found in \(n\) steps by sequential line minimization along each direction.

Proof: The A-conjugate vectors form a basis for \(\mathbb{R}^n\). Express the error:

\[\mathbf{x}^* - \mathbf{x}_0 = \sum_{i=0}^{n-1} \alpha_i \mathbf{p}_i\]

Multiplying both sides by \(\mathbf{p}_j^T A\):

\[\mathbf{p}_j^T A (\mathbf{x}^* - \mathbf{x}_0) = \alpha_j \mathbf{p}_j^T A \mathbf{p}_j\]

Thus \(\alpha_j = \frac{\mathbf{p}_j^T A (\mathbf{x}^* - \mathbf{x}_0)}{\mathbf{p}_j^T A \mathbf{p}_j} = \frac{\mathbf{p}_j^T \mathbf{r}_0}{\mathbf{p}_j^T A \mathbf{p}_j}\)

where \(\mathbf{r}_0 = \mathbf{b} - A\mathbf{x}_0 = A(\mathbf{x}^* - \mathbf{x}_0)\). \(\square\)
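The theorem can be exercised on a small example: for a 2×2 SPD matrix, two line minimizations along A-conjugate directions land exactly on \(\mathbf{x}^*\). A sketch (helper names are ours; starting from \(\mathbf{x}_0 = \mathbf{0}\) so that \(\mathbf{r}_0 = \mathbf{b}\)):

```cpp
#include <cassert>
#include <cmath>
#include <array>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

Vec2 matvec2(const Mat2& A, const Vec2& v) {
    return {A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]};
}
double dot2(const Vec2& a, const Vec2& b) { return a[0] * b[0] + a[1] * b[1]; }

// Expand x* - x0 in A-conjugate directions using
// alpha_j = p_j^T r0 / (p_j^T A p_j), with x0 = 0 (so r0 = b).
Vec2 conjugate_directions_solve(const Mat2& A, const Vec2& b,
                                const std::array<Vec2, 2>& p) {
    Vec2 x{0.0, 0.0};
    for (const Vec2& pj : p) {
        double alpha = dot2(pj, b) / dot2(pj, matvec2(A, pj));
        x[0] += alpha * pj[0];
        x[1] += alpha * pj[1];
    }
    return x;
}
```

For \(A = \begin{pmatrix}4&1\\1&3\end{pmatrix}\), the directions \((1,0)\) and \((1,-4)\) are A-conjugate, and the two steps reach the exact solution \((1/11,\,7/11)\).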

Krylov Subspace Connection

The CG method generates A-conjugate directions from the Krylov subspace:

\[\mathcal{K}_k(A, \mathbf{r}_0) = \text{span}\{\mathbf{r}_0, A\mathbf{r}_0, A^2\mathbf{r}_0, \ldots, A^{k-1}\mathbf{r}_0\}\]

After \(k\) iterations, CG finds:

\[\mathbf{x}_k = \arg\min_{\mathbf{x} \in \mathbf{x}_0 + \mathcal{K}_k} \|\mathbf{x} - \mathbf{x}^*\|_A\]

where \(\|\mathbf{v}\|_A = \sqrt{\mathbf{v}^T A \mathbf{v}}\) is the A-norm (energy norm).

2.2 Algorithm Derivation

Starting from \(\mathbf{x}_0\) with residual \(\mathbf{r}_0 = \mathbf{b} - A\mathbf{x}_0\), set \(\mathbf{p}_0 = \mathbf{r}_0\).

Step size (exact line minimization): Minimize \(\phi(\mathbf{x}_k + \alpha \mathbf{p}_k)\):

\[\frac{d}{d\alpha}\phi(\mathbf{x}_k + \alpha \mathbf{p}_k) = \mathbf{p}_k^T(A\mathbf{x}_k + \alpha A\mathbf{p}_k - \mathbf{b}) = 0\]

\[\alpha_k = \frac{\mathbf{p}_k^T \mathbf{r}_k}{\mathbf{p}_k^T A \mathbf{p}_k}\]

Update:

\[\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k\]

\[\mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k A \mathbf{p}_k\]

New search direction (enforce A-conjugacy with all previous directions):

\[\mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k\]

where \(\beta_k = \frac{\mathbf{r}_{k+1}^T \mathbf{r}_{k+1}}{\mathbf{r}_k^T \mathbf{r}_k}\)

Remarkable property: Due to the Krylov structure, we only need the previous direction—not all previous directions—to maintain A-conjugacy.

2.3 The CG Algorithm

Input: A (SPD), b, x₀, tol, max_iter
Output: x (approximate solution)

r = b - A*x
p = r
ρ_old = rᵀr
for k = 0, 1, 2, ... until convergence:
    q = A*p                    // Matrix-vector product
    α = ρ_old / (pᵀq)          // Step length
    x = x + α*p                // Update solution
    r = r - α*q                // Update residual
    ρ_new = rᵀr                // New residual norm squared
    if √ρ_new < tol: break     // Convergence check
    β = ρ_new / ρ_old          // Improvement ratio
    p = r + β*p                // New search direction
    ρ_old = ρ_new
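The pseudocode translates line for line into a minimal serial implementation. A sketch for dense row-major storage, without preconditioning:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Plain CG for a dense SPD matrix A (row-major, n x n), following the
// pseudocode above: one matvec and two dot products per iteration.
std::vector<double> cg_dense(const std::vector<double>& A,
                             const std::vector<double>& b,
                             double tol = 1e-10, int max_iter = 1000) {
    const std::size_t n = b.size();
    std::vector<double> x(n, 0.0), r = b, p = b, q(n);  // x0 = 0
    double rho_old = 0.0;
    for (std::size_t i = 0; i < n; ++i) rho_old += r[i] * r[i];
    for (int k = 0; k < max_iter; ++k) {
        for (std::size_t i = 0; i < n; ++i) {            // q = A*p
            q[i] = 0.0;
            for (std::size_t j = 0; j < n; ++j) q[i] += A[i * n + j] * p[j];
        }
        double pq = 0.0;
        for (std::size_t i = 0; i < n; ++i) pq += p[i] * q[i];
        double alpha = rho_old / pq;                     // step length
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];                        // update solution
            r[i] -= alpha * q[i];                        // update residual
        }
        double rho_new = 0.0;
        for (std::size_t i = 0; i < n; ++i) rho_new += r[i] * r[i];
        if (std::sqrt(rho_new) < tol) break;             // convergence check
        double beta = rho_new / rho_old;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rho_old = rho_new;
    }
    return x;
}
```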

2.4 Convergence Analysis

Theorem (CG Convergence Rate): After \(k\) iterations:

\[\frac{\|\mathbf{x}_k - \mathbf{x}^*\|_A}{\|\mathbf{x}_0 - \mathbf{x}^*\|_A} \leq 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k\]

where \(\kappa = \lambda_{\max}/\lambda_{\min}\) is the condition number of \(A\).

Proof sketch: CG minimizes over polynomials \(p_k\) of degree \(k\) with \(p_k(0) = 1\):

\[\|\mathbf{x}_k - \mathbf{x}^*\|_A = \min_{p_k} \|p_k(A)(\mathbf{x}_0 - \mathbf{x}^*)\|_A\]

Using Chebyshev polynomials (optimal for minimizing \(\max_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} |p_k(\lambda)|\)) yields the bound. \(\square\)

Practical implications:

Using the bound \(2\rho^k \leq \epsilon\) with \(\rho = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)\), a 6-digit error reduction (\(\epsilon = 10^{-6}\)) needs roughly \(k \approx \frac{\sqrt{\kappa}}{2}\ln\frac{2}{\epsilon} \approx 7.25\sqrt{\kappa}\) iterations:

  Condition number      Iterations for 6-digit error reduction
  \(\kappa = 10\)       ~23
  \(\kappa = 100\)      ~73
  \(\kappa = 10^4\)     ~730
  \(\kappa = 10^6\)     ~7300
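These estimates can be evaluated directly from the convergence bound. Note this is an a-priori worst-case estimate; in practice CG often converges faster, e.g. when the eigenvalues cluster. A sketch:

```cpp
#include <cassert>
#include <cmath>

// Smallest k such that 2*((sqrt(kappa)-1)/(sqrt(kappa)+1))^k <= eps,
// i.e. k = ceil(ln(2/eps) / ln((sqrt(kappa)+1)/(sqrt(kappa)-1))).
// An a-priori estimate from the CG bound, not the actual iteration count.
int cg_iter_estimate(double kappa, double eps) {
    double s = std::sqrt(kappa);
    double rate = std::log((s + 1.0) / (s - 1.0)); // = -ln(rho) > 0
    return static_cast<int>(std::ceil(std::log(2.0 / eps) / rate));
}
```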

2.5 Operation Count

Per iteration:

  Operation                                        FLOPs                                     Memory access
  \(\mathbf{q} = A\mathbf{p}\)                     \(2n^2\) (dense) or \(O(nnz)\) (sparse)   \(O(n^2)\) or \(O(nnz)\)
  \(\alpha = \rho/(\mathbf{p}^T\mathbf{q})\)       \(2n\)                                    \(2n\) reads
  \(\mathbf{x} = \mathbf{x} + \alpha\mathbf{p}\)   \(2n\)                                    \(3n\)
  \(\mathbf{r} = \mathbf{r} - \alpha\mathbf{q}\)   \(2n\)                                    \(3n\)
  \(\rho = \mathbf{r}^T\mathbf{r}\)                \(2n\)                                    \(n\) reads
  \(\mathbf{p} = \mathbf{r} + \beta\mathbf{p}\)    \(2n\)                                    \(3n\)

Total per iteration: \(2n^2 + 10n\) FLOPs (dense), dominated by matvec.


3. Thomas Algorithm (Tridiagonal Solver)

3.1 Problem Structure

A tridiagonal system has the form:

\[\begin{pmatrix} b_0 & c_0 \\ a_0 & b_1 & c_1 \\ & a_1 & b_2 & c_2 \\ & & \ddots & \ddots & \ddots \\ & & & a_{n-3} & b_{n-2} & c_{n-2} \\ & & & & a_{n-2} & b_{n-1} \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-2} \\ x_{n-1} \end{pmatrix} = \begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ \vdots \\ d_{n-2} \\ d_{n-1} \end{pmatrix}\]

Storage:

  • \(\mathbf{a}\): lower diagonal, size \(n-1\) (elements \(a_0, \ldots, a_{n-2}\))
  • \(\mathbf{b}\): main diagonal, size \(n\)
  • \(\mathbf{c}\): upper diagonal, size \(n-1\) (elements \(c_0, \ldots, c_{n-2}\))

3.2 Derivation via LU Factorization

The Thomas algorithm is Gaussian elimination specialized for tridiagonal structure.

LU Decomposition: Factor \(A = LU\) where:

\[L = \begin{pmatrix} 1 \\ l_0 & 1 \\ & l_1 & 1 \\ & & \ddots & \ddots \\ & & & l_{n-2} & 1 \end{pmatrix}, \quad U = \begin{pmatrix} u_0 & c_0 \\ & u_1 & c_1 \\ & & u_2 & c_2 \\ & & & \ddots & \ddots \\ & & & & u_{n-1} \end{pmatrix}\]

Deriving the factors: From \(A = LU\):

Row 0: \(b_0 = u_0\), so \(u_0 = b_0\)

Row \(i\) ( \(i \geq 1\)):

  • Position \((i, i-1)\): \(a_{i-1} = l_{i-1} \cdot u_{i-1}\), so \(l_{i-1} = a_{i-1}/u_{i-1}\)
  • Position \((i, i)\): \(b_i = l_{i-1} \cdot c_{i-1} + u_i\), so \(u_i = b_i - l_{i-1} \cdot c_{i-1}\)

Forward elimination (computing \(L^{-1}\mathbf{d}\)):

Let \(\mathbf{y} = U\mathbf{x}\), solve \(L\mathbf{y} = \mathbf{d}\):

\[y_0 = d_0\]

\[y_i = d_i - l_{i-1} \cdot y_{i-1} \quad \text{for } i = 1, \ldots, n-1\]

Back substitution (solving \(U\mathbf{x} = \mathbf{y}\)):

\[x_{n-1} = y_{n-1} / u_{n-1}\]

\[x_i = (y_i - c_i \cdot x_{i+1}) / u_i \quad \text{for } i = n-2, \ldots, 0\]

3.3 The Thomas Algorithm

Combining and simplifying (using \(\mathbf{b}\) and \(\mathbf{d}\) as working storage):

Input: a[0..n-2], b[0..n-1], c[0..n-2], d[0..n-1]
Output: x[0..n-1]

// Create working copies
b_work = copy(b)
d_work = copy(d)

// Forward elimination
for i = 1 to n-1:
    w = a[i-1] / b_work[i-1]
    b_work[i] = b_work[i] - w * c[i-1]
    d_work[i] = d_work[i] - w * d_work[i-1]

// Back substitution
x[n-1] = d_work[n-1] / b_work[n-1]
for i = n-2 down to 0:
    x[i] = (d_work[i] - c[i] * x[i+1]) / b_work[i]
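A direct transcription of the pseudocode; a sketch that takes `b` and `d` by value so the caller's arrays survive:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Thomas algorithm: a holds the n-1 sub-diagonal entries, b the n
// diagonal entries, c the n-1 super-diagonal entries, d the RHS.
std::vector<double> thomas(const std::vector<double>& a,
                           std::vector<double> b,        // working copy
                           const std::vector<double>& c,
                           std::vector<double> d) {      // working copy
    const std::size_t n = b.size();
    for (std::size_t i = 1; i < n; ++i) {                // forward elimination
        double w = a[i - 1] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];                      // back substitution
    for (std::size_t i = n - 1; i-- > 0; )
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```

For the 1D Laplacian with \(n = 4\) and right-hand side \((0,0,0,5)\), the exact solution is \((1,2,3,4)\).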

3.4 Complexity Analysis

Forward elimination:

  • Loop iterations: \(n-1\)
  • Per iteration: 1 division, 2 multiplications, 2 subtractions = 5 FLOPs
  • Total: \(5(n-1)\) FLOPs

Back substitution:

  • First element: 1 division
  • Remaining \(n-1\) elements: 1 subtraction, 1 multiplication, 1 division = 3 FLOPs each
  • Total: \(1 + 3(n-1) = 3n - 2\) FLOPs

Grand total: \(5(n-1) + 3n - 2 = 8n - 7\) FLOPs = \(O(n)\)

This is optimal: any algorithm must at least read the \(O(n)\) input values.

3.5 Stability Analysis

Theorem: The Thomas algorithm is stable without pivoting if \(A\) is strictly diagonally dominant:

\[|b_i| > |a_{i-1}| + |c_i|, \quad i = 0, \ldots, n-1\]

(with \(a_{-1} = c_{n-1} = 0\)).

Proof (induction on \(i\)): Suppose \(|b_{\text{work},i-1}| > |c_{i-1}|\); for \(i = 1\) this holds since \(|b_0| > |c_0|\). Then

\[|w|\,|c_{i-1}| = \frac{|a_{i-1}|\,|c_{i-1}|}{|b_{\text{work},i-1}|} < |a_{i-1}|\]

so

\[|b_{\text{work},i}| = |b_i - w \cdot c_{i-1}| \geq |b_i| - |w||c_{i-1}| > |b_i| - |a_{i-1}| > |c_i|\]

Thus each modified diagonal entry remains dominant and bounded away from zero, the induction closes, and no pivot can vanish. \(\square\)

Example (1D Laplacian): The system with \(a_i = c_i = -1\), \(b_i = 2\) satisfies \(|2| > |-1| + |-1|\), so Thomas is stable.
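For this system the modified pivots obey \(u_0 = 2\), \(u_i = 2 - 1/u_{i-1}\), which closes to \(u_i = (i+2)/(i+1)\): the pivots decay toward 1 but stay strictly above it, confirming stability. A quick numerical check (the helper is illustrative):

```cpp
#include <cassert>
#include <cmath>

// Modified pivot u_i of the 1D Laplacian (a = c = -1, b = 2) under
// Thomas elimination: u_0 = 2, u_i = 2 - 1/u_{i-1} = (i+2)/(i+1).
double laplacian_pivot(int i) {
    double u = 2.0;
    for (int k = 1; k <= i; ++k) u = 2.0 - 1.0 / u;
    return u;
}
```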


4. CUDA Implementation

4.1 CG on GPU

The CG algorithm maps naturally to GPU by parallelizing each operation:

SolverResult cg(const Matrix& A, const Vector& b, Vector& x,
                real tol, idx max_iter, Exec exec) {
    const idx n = b.size();
    Vector r(n), p(n), Ap(n);
    if (exec == Exec::gpu) {
        // Transfer data to GPU
        A.to_gpu(); b.to_gpu(); x.to_gpu();
        r.to_gpu(); p.to_gpu(); Ap.to_gpu();
    }
    // r = b - A*x
    matvec(A, x, r, exec);         // GPU: parallel matvec kernel
    scale(r, -1.0, exec);          // GPU: parallel scale kernel
    axpy(1.0, b, r, exec);         // GPU: parallel axpy kernel
    copy(r, p, exec);              // p = r (GPU: device-to-device copy)
    real rsold = dot(r, r, exec);  // GPU: parallel reduction
    for (idx iter = 0; iter < max_iter; ++iter) {
        matvec(A, p, Ap, exec);    // Dominant cost
        real pAp = dot(p, Ap, exec);
        real alpha = rsold / pAp;
        axpy(alpha, p, x, exec);   // x += alpha*p
        axpy(-alpha, Ap, r, exec); // r -= alpha*Ap
        real rsnew = dot(r, r, exec);
        if (sqrt(rsnew) < tol) {
            if (exec == Exec::gpu) x.to_cpu();
            return {iter + 1, sqrt(rsnew), true};
        }
        real beta = rsnew / rsold;
        scale(p, beta, exec);      // p = beta*p
        axpy(1.0, r, p, exec);     // p += r
        rsold = rsnew;
    }
    if (exec == Exec::gpu) x.to_cpu();
    return {max_iter, sqrt(rsold), false};
}

GPU Kernels Used

Parallel dot product (reduction):

__global__ void k_dot(const real* x, const real* y, real* result, idx n) {
    __shared__ real sdata[BLOCK_SIZE];  // BLOCK_SIZE must equal blockDim.x
    idx tid = threadIdx.x;
    idx i = blockIdx.x * blockDim.x + tid;
    // Each thread computes one partial product (zero past the end)
    sdata[tid] = (i < n) ? x[i] * y[i] : 0;
    __syncthreads();
    // Tree reduction in shared memory (blockDim.x must be a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // Block result added atomically to *result, which the caller must
    // zero on the device before the launch
    if (tid == 0) atomicAdd(result, sdata[0]);
}

Parallel AXPY ( \(\mathbf{y} = \alpha\mathbf{x} + \mathbf{y}\)):

__global__ void k_axpy(real alpha, const real* x, real* y, idx n) {
    idx i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += alpha * x[i];
}

Parallel matrix-vector product:

__global__ void k_matvec(const real* A, const real* x, real* y,
                         idx rows, idx cols) {
    idx i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows) {
        real sum = 0;
        for (idx j = 0; j < cols; ++j)   // one row per thread
            sum += A[i * cols + j] * x[j];
        y[i] = sum;
    }
}

Performance Considerations

  Problem size    Winner
  \(n < 500\)     CPU (kernel launch overhead dominates)
  \(n > 1000\)    GPU (enough parallelism to amortize overhead)

The crossover point depends on:

  • GPU memory bandwidth
  • Kernel launch latency (~5-20 μs)
  • Number of CG iterations
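A back-of-envelope cost model makes the crossover concrete. All constants below are illustrative assumptions, not measurements of any particular hardware:

```cpp
#include <cassert>

// Toy per-iteration model: GPU pays a fixed launch overhead, then both
// backends are modeled as FLOP-rate limited on the dense matvec.
// Every constant here is an assumed, illustrative value.
struct CostModel {
    double cpu_gflops = 10.0;   // assumed sustained CPU rate
    double gpu_gflops = 500.0;  // assumed sustained GPU rate
    double launch_us  = 10.0;   // assumed per-iteration launch overhead

    bool gpu_faster(double n) const {
        double flops = 2.0 * n * n;                  // dense matvec
        double t_cpu = flops / (cpu_gflops * 1e3);   // microseconds
        double t_gpu = launch_us + flops / (gpu_gflops * 1e3);
        return t_gpu < t_cpu;
    }
};
```

With these assumed rates the crossover lands near \(n \approx 230\); measured crossovers vary widely with hardware and with how many kernels each iteration launches.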

4.2 Thomas Algorithm on GPU

The Thomas algorithm is inherently sequential (forward sweep depends on previous row). However, we can parallelize across multiple independent systems.

Batched Thomas Kernel

When solving \(k\) independent tridiagonal systems (common in ADI methods):

__global__ void k_thomas_batched(
    const real* a, const real* b, const real* c,
    const real* d, real* x,
    real* b_work, real* d_work,   // Workspace
    idx n, idx batch_size)
{
    // Each thread handles one complete system
    idx batch_idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (batch_idx >= batch_size) return;
    // Offsets into packed arrays
    idx off_sub  = batch_idx * (n - 1);  // For a, c
    idx off_main = batch_idx * n;        // For b, d, x
    // Copy to workspace so the inputs stay intact
    for (idx i = 0; i < n; ++i) {
        b_work[off_main + i] = b[off_main + i];
        d_work[off_main + i] = d[off_main + i];
    }
    // Forward elimination (sequential within thread)
    for (idx i = 1; i < n; ++i) {
        real w = a[off_sub + i - 1] / b_work[off_main + i - 1];
        b_work[off_main + i] -= w * c[off_sub + i - 1];
        d_work[off_main + i] -= w * d_work[off_main + i - 1];
    }
    // Back substitution (sequential within thread)
    x[off_main + n - 1] = d_work[off_main + n - 1] / b_work[off_main + n - 1];
    for (idx i = n - 1; i > 0; --i) {
        x[off_main + i - 1] = (d_work[off_main + i - 1]
            - c[off_sub + i - 1] * x[off_main + i]) / b_work[off_main + i - 1];
    }
}

Wrapper function:

void thomas_batched(const real* a, const real* b, const real* c,
                    const real* d, real* x, idx n, idx batch_size) {
    real *b_work, *d_work;
    cudaMalloc(&b_work, batch_size * n * sizeof(real));
    cudaMalloc(&d_work, batch_size * n * sizeof(real));
    int threads = 256;
    int blocks = (batch_size + threads - 1) / threads;
    k_thomas_batched<<<blocks, threads>>>(
        a, b, c, d, x, b_work, d_work, n, batch_size);
    cudaDeviceSynchronize();  // Wait for the kernel before freeing workspace
    cudaFree(b_work);
    cudaFree(d_work);
}

When Batched Thomas Helps

  Scenario                               Recommendation
  Single system                          Use CPU Thomas
  100+ independent systems               Use GPU batched
  ADI with \(n_y \times n_z\) pencils    GPU batched excels
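A serial host-side reference that uses the same packed layout as the kernel (system \(s\) starts at offset \(s(n-1)\) in `a`, `c` and \(sn\) in `b`, `d`, `x`) is useful for validating GPU output. A sketch:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Serial reference for the batched packed layout: system s occupies
// a[s*(n-1)..], c[s*(n-1)..] and b[s*n..], d[s*n..], x[s*n..].
void thomas_batched_ref(const std::vector<double>& a,
                        std::vector<double> b,          // working copy
                        const std::vector<double>& c,
                        std::vector<double> d,          // working copy
                        std::vector<double>& x,
                        std::size_t n, std::size_t batch) {
    x.assign(batch * n, 0.0);
    for (std::size_t s = 0; s < batch; ++s) {
        const std::size_t os = s * (n - 1), om = s * n; // sub/main offsets
        for (std::size_t i = 1; i < n; ++i) {           // forward sweep
            double w = a[os + i - 1] / b[om + i - 1];
            b[om + i] -= w * c[os + i - 1];
            d[om + i] -= w * d[om + i - 1];
        }
        x[om + n - 1] = d[om + n - 1] / b[om + n - 1];  // back substitution
        for (std::size_t i = n - 1; i-- > 0; )
            x[om + i] = (d[om + i] - c[os + i] * x[om + i + 1]) / b[om + i];
    }
}
```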

5. MPI Implementation (Distributed Memory)

5.1 Distributed CG

For large systems that don't fit in single-node memory, distribute the matrix and vectors across \(P\) MPI ranks.

Data Distribution

1D row distribution (assuming \(P\) divides \(n\) for simplicity): Rank \(p\) owns rows \([p \cdot (n/P), (p+1) \cdot (n/P))\)

Rank 0: rows 0 to n/P - 1
Rank 1: rows n/P to 2n/P - 1
...
Rank P-1: rows (P-1)n/P to n - 1

Each rank stores:

  • Local portion of \(A\): size \((n/P) \times n\) (full rows)
  • Local portions of \(\mathbf{x}, \mathbf{b}, \mathbf{r}, \mathbf{p}\): size \(n/P\)
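When \(P\) does not divide \(n\), a common convention gives the first \(n \bmod P\) ranks one extra row. A sketch of the index arithmetic (note that unequal block sizes require MPI_Allgatherv in place of MPI_Allgather):

```cpp
#include <cassert>

// Row range [first, first + count) owned by rank p when n rows are
// split over P ranks, giving the first n % P ranks one extra row.
struct RowRange { long first, count; };

RowRange block_rows(long n, long P, long p) {
    long base = n / P, rem = n % P;
    long count = base + (p < rem ? 1 : 0);
    long first = p * base + (p < rem ? p : rem);
    return {first, count};
}
```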

Distributed Operations

Matrix-vector product ( \(\mathbf{y} = A\mathbf{x}\)):

void distributed_matvec(const Matrix& A_local, const Vector& x_local,
                        Vector& y_local, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    idx local_rows = A_local.rows();
    idx n = local_rows * size;  // Total size (equal blocks assumed)
    // Gather the full x vector (each rank needs all of x)
    Vector x_full(n);
    MPI_Allgather(x_local.data(), local_rows, MPI_DOUBLE,
                  x_full.data(), local_rows, MPI_DOUBLE, comm);
    // Local matvec (each rank computes its own rows)
    for (idx i = 0; i < local_rows; ++i) {
        real sum = 0;
        for (idx j = 0; j < n; ++j)
            sum += A_local(i, j) * x_full[j];
        y_local[i] = sum;
    }
}

Distributed dot product:

real distributed_dot(const Vector& x_local, const Vector& y_local,
                     MPI_Comm comm) {
    // Local partial sum
    real local_sum = 0;
    for (idx i = 0; i < x_local.size(); ++i)
        local_sum += x_local[i] * y_local[i];
    // Global reduction
    real global_sum;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global_sum;
}

Distributed AXPY (no communication needed):

void distributed_axpy(real alpha, const Vector& x_local, Vector& y_local) {
    for (idx i = 0; i < x_local.size(); ++i)
        y_local[i] += alpha * x_local[i];
}

Distributed CG Algorithm

SolverResult distributed_cg(const Matrix& A_local, const Vector& b_local,
                            Vector& x_local, real tol, idx max_iter,
                            MPI_Comm comm) {
    idx local_n = b_local.size();
    Vector r_local(local_n), p_local(local_n), Ap_local(local_n);
    // r = b - A*x
    distributed_matvec(A_local, x_local, r_local, comm);
    for (idx i = 0; i < local_n; ++i)
        r_local[i] = b_local[i] - r_local[i];
    // p = r
    p_local = r_local;
    real rsold = distributed_dot(r_local, r_local, comm);
    for (idx iter = 0; iter < max_iter; ++iter) {
        distributed_matvec(A_local, p_local, Ap_local, comm);
        real pAp = distributed_dot(p_local, Ap_local, comm);
        real alpha = rsold / pAp;
        distributed_axpy(alpha, p_local, x_local);
        distributed_axpy(-alpha, Ap_local, r_local);
        real rsnew = distributed_dot(r_local, r_local, comm);
        if (sqrt(rsnew) < tol) {
            return {iter + 1, sqrt(rsnew), true};
        }
        real beta = rsnew / rsold;
        for (idx i = 0; i < local_n; ++i)
            p_local[i] = r_local[i] + beta * p_local[i];
        rsold = rsnew;
    }
    return {max_iter, sqrt(rsold), false};
}

Communication Analysis

Per CG iteration:

  Operation    Communication
  matvec       MPI_Allgather — \(O(n)\) data, \(O(\log P)\) latency
  dot (×2)     MPI_Allreduce — \(O(1)\) data, \(O(\log P)\) latency
  axpy (×3)    None

Total per iteration:

  • 1 Allgather: \(O(n/P \cdot P) = O(n)\) bytes
  • 2 Allreduce: \(O(1)\) bytes each

Communication is dominated by the Allgather in matvec.

5.2 Distributed Thomas (Pipeline)

For a single tridiagonal system distributed across ranks, use pipelining:

Rank 0: rows 0 to n/P - 1
Rank 1: rows n/P to 2n/P - 1 (needs data from Rank 0)
...

Forward sweep: Each rank waits for modified values from previous rank:

void distributed_thomas_forward(Vector& b_local, Vector& d_local,
                                const Vector& a_local, const Vector& c_local,
                                MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    idx local_n = b_local.size();
    // Convention: a_local[i] couples local row i to the row above it;
    // on ranks > 0, a_local[0] couples to the previous rank's last row.
    if (rank > 0) {
        // Receive the previous rank's modified diagonal, modified RHS,
        // and its last super-diagonal entry
        real recv[3];
        MPI_Recv(recv, 3, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
        real w = a_local[0] / recv[0];
        b_local[0] -= w * recv[2];  // recv[2] = previous rank's last c
        d_local[0] -= w * recv[1];
    }
    // Local forward elimination
    for (idx i = 1; i < local_n; ++i) {
        real w = a_local[i] / b_local[i - 1];
        b_local[i] -= w * c_local[i - 1];
        d_local[i] -= w * d_local[i - 1];
    }
    // Send boundary data to the next rank
    if (rank < size - 1) {
        real send[3] = {b_local[local_n - 1], d_local[local_n - 1],
                        c_local[local_n - 1]};
        MPI_Send(send, 3, MPI_DOUBLE, rank + 1, 0, comm);
    }
}

Note: Pipeline parallelism has limited scalability. For better parallel efficiency, consider:

  • Cyclic Reduction: \(O(\log n)\) parallel depth
  • Spike Algorithm: Domain decomposition with coupling solve

6. Hybrid MPI+CUDA Implementation

For multi-GPU clusters, combine MPI (inter-node) with CUDA (intra-node):

SolverResult hybrid_cg(const Matrix& A_local, const Vector& b_local,
                       Vector& x_local, real tol, idx max_iter,
                       MPI_Comm comm) {
    // Sketch only: num_gpus_per_node, local_n, and x_full_gpu are
    // assumed to be defined elsewhere.
    // Each MPI rank drives one GPU
    int rank;
    MPI_Comm_rank(comm, &rank);
    cudaSetDevice(rank % num_gpus_per_node);
    // Transfer local data to the GPU
    A_local.to_gpu();
    b_local.to_gpu();
    x_local.to_gpu();
    // ... CG iterations as in distributed_cg, but using:
    //   - GPU kernels for the local matvec/dot/axpy
    //   - CUDA-aware MPI for communication, e.g. a GPU-aware Allgather:
    MPI_Allgather(x_local.gpu_data(), local_n, MPI_DOUBLE,
                  x_full_gpu, local_n, MPI_DOUBLE, comm);
}

CUDA-aware MPI allows direct GPU-to-GPU transfers without staging through CPU memory.


7. Summary

  Method                  Complexity                     Parallelization      Best use case
  CG (CPU)                \(O(n^2)\)/iter                OpenMP threads       Small-medium SPD
  CG (GPU)                \(O(n^2)\)/iter                CUDA kernels         Large SPD (\(n > 1000\))
  CG (MPI)                \(O(n^2/P)\)/iter + comm       Distributed          Huge SPD
  Thomas (CPU)            \(O(n)\)                       None (sequential)    Single tridiagonal
  Thomas batched (GPU)    \(O(n \cdot k/P_{GPU})\)       Thread per system    Many tridiagonals
  Thomas (MPI)            \(O(n/P)\) + pipeline          Limited              Large single tridiagonal

Key insights:

  1. CG parallelizes naturally; matvec dominates cost
  2. Thomas is inherently sequential; parallelize across systems, not within
  3. GPU advantage requires sufficient problem size to amortize launch overhead
  4. MPI CG needs careful attention to communication (Allgather is expensive)