1. Introduction
Solving linear systems \(A\mathbf{x} = \mathbf{b}\) is the computational backbone of scientific computing. This document provides a rigorous treatment of two solver methods and their parallel implementations:
- Conjugate Gradient (CG) — Krylov subspace method for SPD systems
- Thomas Algorithm — Direct solver for tridiagonal systems
2. Conjugate Gradient Method
2.1 Mathematical Foundation
The Minimization Perspective
For a symmetric positive definite (SPD) matrix \(A \in \mathbb{R}^{n \times n}\), solving \(A\mathbf{x} = \mathbf{b}\) is equivalent to minimizing the quadratic functional:
\[\phi(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}\]
Proof: Taking the gradient:
\[\nabla \phi(\mathbf{x}) = A\mathbf{x} - \mathbf{b}\]
Setting \(\nabla \phi = \mathbf{0}\) yields \(A\mathbf{x} = \mathbf{b}\). Since \(A\) is SPD, the Hessian \(\nabla^2 \phi = A\) is positive definite, confirming this is a minimum. \(\square\)
The residual \(\mathbf{r} = \mathbf{b} - A\mathbf{x}\) equals \(-\nabla\phi\): it is the direction of steepest descent of \(\phi\) at \(\mathbf{x}\).
A-Conjugacy and Optimality
Two vectors \(\mathbf{p}\) and \(\mathbf{q}\) are A-conjugate (or A-orthogonal) if:
\[\mathbf{p}^T A \mathbf{q} = 0\]
Key Theorem: Given \(n\) mutually A-conjugate directions \(\{\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_{n-1}\}\), the exact solution can be found in \(n\) steps by sequential line minimization along each direction.
Proof: The A-conjugate vectors form a basis for \(\mathbb{R}^n\). Express the error:
\[\mathbf{x}^* - \mathbf{x}_0 = \sum_{i=0}^{n-1} \alpha_i \mathbf{p}_i\]
Multiplying both sides by \(\mathbf{p}_j^T A\):
\[\mathbf{p}_j^T A (\mathbf{x}^* - \mathbf{x}_0) = \alpha_j \mathbf{p}_j^T A \mathbf{p}_j\]
Thus \(\alpha_j = \frac{\mathbf{p}_j^T A (\mathbf{x}^* - \mathbf{x}_0)}{\mathbf{p}_j^T A \mathbf{p}_j} = \frac{\mathbf{p}_j^T \mathbf{r}_0}{\mathbf{p}_j^T A \mathbf{p}_j}\)
where \(\mathbf{r}_0 = \mathbf{b} - A\mathbf{x}_0 = A(\mathbf{x}^* - \mathbf{x}_0)\). \(\square\)
Krylov Subspace Connection
The CG method generates A-conjugate directions from the Krylov subspace:
\[\mathcal{K}_k(A, \mathbf{r}_0) = \text{span}\{\mathbf{r}_0, A\mathbf{r}_0, A^2\mathbf{r}_0, \ldots, A^{k-1}\mathbf{r}_0\}\]
After \(k\) iterations, CG finds:
\[\mathbf{x}_k = \arg\min_{\mathbf{x} \in \mathbf{x}_0 + \mathcal{K}_k} \|\mathbf{x} - \mathbf{x}^*\|_A\]
where \(\|\mathbf{v}\|_A = \sqrt{\mathbf{v}^T A \mathbf{v}}\) is the A-norm (energy norm).
2.2 Algorithm Derivation
Starting from \(\mathbf{x}_0\) with residual \(\mathbf{r}_0 = \mathbf{b} - A\mathbf{x}_0\), set \(\mathbf{p}_0 = \mathbf{r}_0\).
Step size (exact line minimization): Minimize \(\phi(\mathbf{x}_k + \alpha \mathbf{p}_k)\):
\[\frac{d}{d\alpha}\phi(\mathbf{x}_k + \alpha \mathbf{p}_k) = \mathbf{p}_k^T(A\mathbf{x}_k + \alpha A\mathbf{p}_k - \mathbf{b}) = 0\]
\[\alpha_k = \frac{\mathbf{p}_k^T \mathbf{r}_k}{\mathbf{p}_k^T A \mathbf{p}_k}\]
Update:
\[\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k\]
\[\mathbf{r}_{k+1} = \mathbf{r}_k - \alpha_k A \mathbf{p}_k\]
New search direction (enforce A-conjugacy with all previous directions):
\[\mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k\]
where \(\beta_k = \frac{\mathbf{r}_{k+1}^T \mathbf{r}_{k+1}}{\mathbf{r}_k^T \mathbf{r}_k}\)
Remarkable property: Due to the Krylov structure, we only need the previous direction—not all previous directions—to maintain A-conjugacy.
2.3 The CG Algorithm
Input:  A (SPD), b, x₀, tol, max_iter
Output: x (approximate solution)

x = x₀
r = b - A*x
p = r
ρ_old = rᵀr
for k = 0, 1, 2, ... until convergence:
    q = A*p                    // Matrix-vector product
    α = ρ_old / (pᵀq)          // Step length
    x = x + α*p                // Update solution
    r = r - α*q                // Update residual
    ρ_new = rᵀr                // New residual norm squared
    if √ρ_new < tol: break     // Convergence check
    β = ρ_new / ρ_old          // Improvement ratio
    p = r + β*p                // New search direction
    ρ_old = ρ_new
2.4 Convergence Analysis
Theorem (CG Convergence Rate): After \(k\) iterations:
\[\frac{\|\mathbf{x}_k - \mathbf{x}^*\|_A}{\|\mathbf{x}_0 - \mathbf{x}^*\|_A} \leq 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k\]
where \(\kappa = \lambda_{\max}/\lambda_{\min}\) is the condition number of \(A\).
Proof sketch: CG minimizes over polynomials \(p_k\) of degree \(k\) with \(p_k(0) = 1\):
\[\|\mathbf{x}_k - \mathbf{x}^*\|_A = \min_{p_k} \|p_k(A)(\mathbf{x}_0 - \mathbf{x}^*)\|_A\]
Using Chebyshev polynomials (optimal for minimizing \(\max_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} |p_k(\lambda)|\)) yields the bound. \(\square\)
Practical implications:
| Condition Number | Iterations for 6-digit accuracy |
| --- | --- |
| \(\kappa = 10\) | ~23 |
| \(\kappa = 100\) | ~73 |
| \(\kappa = 10^4\) | ~730 |
| \(\kappa = 10^6\) | ~7300 |

The required iteration count grows like \(\sqrt{\kappa}\).
2.5 Operation Count
Per iteration:
| Operation | FLOPs | Memory Access |
| --- | --- | --- |
| \(\mathbf{q} = A\mathbf{p}\) | \(2n^2\) (dense) or \(O(nnz)\) (sparse) | \(O(n^2)\) or \(O(nnz)\) |
| \(\alpha = \rho/(\mathbf{p}^T\mathbf{q})\) | \(2n\) | \(2n\) reads |
| \(\mathbf{x} = \mathbf{x} + \alpha\mathbf{p}\) | \(2n\) | \(3n\) |
| \(\mathbf{r} = \mathbf{r} - \alpha\mathbf{q}\) | \(2n\) | \(3n\) |
| \(\rho = \mathbf{r}^T\mathbf{r}\) | \(2n\) | \(n\) reads |
| \(\mathbf{p} = \mathbf{r} + \beta\mathbf{p}\) | \(2n\) | \(3n\) |
Total per iteration: \(2n^2 + 10n\) FLOPs (dense), dominated by matvec.
3. Thomas Algorithm (Tridiagonal Solver)
3.1 Problem Structure
A tridiagonal system has the form:
\[\begin{pmatrix}
b_0 & c_0 \\
a_0 & b_1 & c_1 \\
& a_1 & b_2 & c_2 \\
& & \ddots & \ddots & \ddots \\
& & & a_{n-3} & b_{n-2} & c_{n-2} \\
& & & & a_{n-2} & b_{n-1}
\end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-2} \\ x_{n-1} \end{pmatrix}
=
\begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ \vdots \\ d_{n-2} \\ d_{n-1} \end{pmatrix}\]
Storage:
- \(\mathbf{a}\): lower diagonal, size \(n-1\) (elements \(a_0, \ldots, a_{n-2}\))
- \(\mathbf{b}\): main diagonal, size \(n\)
- \(\mathbf{c}\): upper diagonal, size \(n-1\) (elements \(c_0, \ldots, c_{n-2}\))
3.2 Derivation via LU Factorization
The Thomas algorithm is Gaussian elimination specialized for tridiagonal structure.
LU Decomposition: Factor \(A = LU\) where:
\[L = \begin{pmatrix}
1 \\
l_0 & 1 \\
& l_1 & 1 \\
& & \ddots & \ddots \\
& & & l_{n-2} & 1
\end{pmatrix}, \quad
U = \begin{pmatrix}
u_0 & c_0 \\
& u_1 & c_1 \\
& & u_2 & c_2 \\
& & & \ddots & \ddots \\
& & & & u_{n-1}
\end{pmatrix}\]
Deriving the factors: From \(A = LU\):
Row 0: \(b_0 = u_0\), so \(u_0 = b_0\)
Row \(i\) ( \(i \geq 1\)):
- Position \((i, i-1)\): \(a_{i-1} = l_{i-1} \cdot u_{i-1}\), so \(l_{i-1} = a_{i-1}/u_{i-1}\)
- Position \((i, i)\): \(b_i = l_{i-1} \cdot c_{i-1} + u_i\), so \(u_i = b_i - l_{i-1} \cdot c_{i-1}\)
Forward elimination (computing \(L^{-1}\mathbf{d}\)):
Let \(\mathbf{y} = U\mathbf{x}\), solve \(L\mathbf{y} = \mathbf{d}\):
\[y_0 = d_0\]
\[y_i = d_i - l_{i-1} \cdot y_{i-1} \quad \text{for } i = 1, \ldots, n-1\]
Back substitution (solving \(U\mathbf{x} = \mathbf{y}\)):
\[x_{n-1} = y_{n-1} / u_{n-1}\]
\[x_i = (y_i - c_i \cdot x_{i+1}) / u_i \quad \text{for } i = n-2, \ldots, 0\]
3.3 The Thomas Algorithm
Combining and simplifying (using \(\mathbf{b}\) and \(\mathbf{d}\) as working storage):
Input:  a[0..n-2], b[0..n-1], c[0..n-2], d[0..n-1]
Output: x[0..n-1]

// Create working copies
b_work = copy(b)
d_work = copy(d)

// Forward elimination
for i = 1 to n-1:
    w = a[i-1] / b_work[i-1]
    b_work[i] = b_work[i] - w * c[i-1]
    d_work[i] = d_work[i] - w * d_work[i-1]

// Back substitution
x[n-1] = d_work[n-1] / b_work[n-1]
for i = n-2 down to 0:
    x[i] = (d_work[i] - c[i] * x[i+1]) / b_work[i]
3.4 Complexity Analysis
Forward elimination:
- Loop iterations: \(n-1\)
- Per iteration: 1 division, 2 multiplications, 2 subtractions = 5 FLOPs
- Total: \(5(n-1)\) FLOPs
Back substitution:
- First element: 1 division
- Remaining \(n-1\) elements: 1 subtraction, 1 multiplication, 1 division = 3 FLOPs each
- Total: \(1 + 3(n-1) = 3n - 2\) FLOPs
Grand total: \(5(n-1) + 3n - 2 = 8n - 7\) FLOPs = \(O(n)\)
This is **optimal**—any algorithm must at least read the \(O(n)\) input values.
3.5 Stability Analysis
Theorem: The Thomas algorithm is stable if \(A\) is diagonally dominant:
\[|b_i| > |a_{i-1}| + |c_i|, \quad i = 0, \ldots, n-1\]
(with \(a_{-1} = c_{n-1} = 0\)).
Proof: Argue by induction that \(|b_{\text{work},i}| > |c_i|\). The base case is diagonal dominance of row 0: \(|b_{\text{work},0}| = |b_0| > |c_0|\). For the step, the inductive hypothesis gives \(|w| = |a_{i-1}|/|b_{\text{work},i-1}| < |a_{i-1}|/|c_{i-1}|\), so:
\[|b_{\text{work},i}| = |b_i - w \cdot c_{i-1}| \geq |b_i| - |w||c_{i-1}| > |b_i| - |a_{i-1}| > |c_i|\]
Thus the modified diagonal elements remain dominant and bounded away from zero, so no pivoting is needed. \(\square\)
Example (1D Laplacian): The system with \(a_i = c_i = -1\), \(b_i = 2\) is only weakly dominant in the interior ( \(|2| = |-1| + |-1|\)), but the first and last rows are strictly dominant, and this irreducible diagonal dominance is enough for the same argument to go through (as the pivots \(u_i = (i+2)/(i+1) > 1\) confirm), so Thomas is stable here.
4. CUDA Implementation
4.1 CG on GPU
The CG algorithm maps naturally to GPU by parallelizing each operation:
SolverResult cg(const Matrix& A, const Vector& b, Vector& x,
                real tol, idx max_iter, Exec exec) {
    const idx n = b.size();
    Vector r(n), p(n), Ap(n);
    if (exec == Exec::gpu) {
        A.to_gpu(); b.to_gpu(); x.to_gpu();
        r.to_gpu(); p.to_gpu(); Ap.to_gpu();
    }
    // r = b - A*x
    matvec(A, x, r, exec);
    scale(r, -1.0, exec);
    axpy(1.0, b, r, exec);
    // p = r (p is zero-initialized)
    axpy(1.0, r, p, exec);
    real rsold = dot(r, r, exec);
    SolverResult result{max_iter, sqrt(rsold), false};
    for (idx iter = 0; iter < max_iter; ++iter) {
        matvec(A, p, Ap, exec);
        real pAp = dot(p, Ap, exec);
        real alpha = rsold / pAp;
        axpy(alpha, p, x, exec);
        axpy(-alpha, Ap, r, exec);
        real rsnew = dot(r, r, exec);
        if (sqrt(rsnew) < tol) {
            result = {iter + 1, sqrt(rsnew), true};
            break;
        }
        real beta = rsnew / rsold;
        scale(p, beta, exec);
        axpy(1.0, r, p, exec);
        rsold = rsnew;
    }
    if (exec == Exec::gpu) x.to_cpu();
    return result;
}
GPU Kernels Used
Parallel dot product (reduction):
// Note: *result must be zeroed before each launch.
__global__ void k_dot(const real* x, const real* y, real* result, idx n) {
    __shared__ real sdata[BLOCK_SIZE];
    idx tid = threadIdx.x;
    idx i = blockIdx.x * blockDim.x + tid;
    // Each thread computes one partial product
    sdata[tid] = (i < n) ? x[i] * y[i] : 0;
    __syncthreads();
    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // Block result added atomically to the global result
    if (tid == 0) atomicAdd(result, sdata[0]);
}
Parallel AXPY ( \(\mathbf{y} = \alpha\mathbf{x} + \mathbf{y}\)):
__global__ void k_axpy(real alpha, const real* x, real* y, idx n) {
    idx i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += alpha * x[i];
}
Parallel matrix-vector product:
__global__ void k_matvec(const real* A, const real* x, real* y,
                         idx rows, idx cols) {
    idx i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows) {
        real sum = 0;
        for (idx j = 0; j < cols; ++j)
            sum += A[i * cols + j] * x[j];
        y[i] = sum;
    }
}
Performance Considerations
| Problem Size | CPU Wins | GPU Wins |
| --- | --- | --- |
| \(n < 500\) | ✓ (kernel overhead) | |
| \(n > 1000\) | | ✓ (parallelism) |
The crossover point depends on:
- GPU memory bandwidth
- Kernel launch latency (~5-20 μs)
- Number of CG iterations
4.2 Thomas Algorithm on GPU
The Thomas algorithm is inherently sequential (forward sweep depends on previous row). However, we can parallelize across multiple independent systems.
Batched Thomas Kernel
When solving \(k\) independent tridiagonal systems (common in ADI methods):
__global__ void k_thomas_batched(
    const real* a, const real* b, const real* c,
    const real* d, real* x,
    real* b_work, real* d_work,   // Workspace
    idx n, idx batch_size)
{
    // Each thread handles one complete system
    idx batch_idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (batch_idx >= batch_size) return;

    // Offsets into the packed arrays
    idx off_sub  = batch_idx * (n - 1);  // For a, c
    idx off_main = batch_idx * n;        // For b, d, x

    // Copy to workspace
    for (idx i = 0; i < n; ++i) {
        b_work[off_main + i] = b[off_main + i];
        d_work[off_main + i] = d[off_main + i];
    }

    // Forward elimination (sequential within thread)
    for (idx i = 1; i < n; ++i) {
        real w = a[off_sub + i - 1] / b_work[off_main + i - 1];
        b_work[off_main + i] -= w * c[off_sub + i - 1];
        d_work[off_main + i] -= w * d_work[off_main + i - 1];
    }

    // Back substitution (sequential within thread)
    x[off_main + n - 1] = d_work[off_main + n - 1] / b_work[off_main + n - 1];
    for (idx i = n - 1; i > 0; --i) {
        x[off_main + i - 1] = (d_work[off_main + i - 1]
            - c[off_sub + i - 1] * x[off_main + i]) / b_work[off_main + i - 1];
    }
}
Wrapper function:
void thomas_batched(const real* a, const real* b, const real* c,
                    const real* d, real* x, idx n, idx batch_size) {
    real *b_work, *d_work;
    cudaMalloc(&b_work, batch_size * n * sizeof(real));
    cudaMalloc(&d_work, batch_size * n * sizeof(real));
    int threads = 256;
    int blocks = (batch_size + threads - 1) / threads;
    k_thomas_batched<<<blocks, threads>>>(
        a, b, c, d, x, b_work, d_work, n, batch_size);
    cudaDeviceSynchronize();  // kernel must finish before workspace is freed
    cudaFree(b_work);
    cudaFree(d_work);
}
When Batched Thomas Helps
| Scenario | Recommendation |
| --- | --- |
| Single system | Use CPU Thomas |
| 100+ independent systems | Use GPU batched |
| ADI with \(n_y \times n_z\) pencils | GPU batched excels |
5. MPI Implementation (Distributed Memory)
5.1 Distributed CG
For large systems that don't fit in single-node memory, distribute the matrix and vectors across \(P\) MPI ranks.
Data Distribution
1D row distribution (assuming \(P\) divides \(n\)): Rank \(p\) owns rows \([p \cdot (n/P), (p+1) \cdot (n/P))\)
Rank 0: rows 0 to n/P - 1
Rank 1: rows n/P to 2n/P - 1
...
Rank P-1: rows (P-1)n/P to n - 1
Each rank stores:
- Local portion of \(A\): size \((n/P) \times n\) (full rows)
- Local portions of \(\mathbf{x}, \mathbf{b}, \mathbf{r}, \mathbf{p}\): size \(n/P\)
Distributed Operations
Matrix-vector product ( \(\mathbf{y} = A\mathbf{x}\)):
void distributed_matvec(const Matrix& A_local, const Vector& x_local,
                        Vector& y_local, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    idx local_rows = A_local.rows();
    idx n = local_rows * size;
    Vector x_full(n);
    // Every rank needs the full x to multiply its rows
    MPI_Allgather(x_local.data(), local_rows, MPI_DOUBLE,
                  x_full.data(), local_rows, MPI_DOUBLE, comm);
    for (idx i = 0; i < local_rows; ++i) {
        real sum = 0;
        for (idx j = 0; j < n; ++j)
            sum += A_local(i, j) * x_full[j];
        y_local[i] = sum;
    }
}
Distributed dot product:
real distributed_dot(const Vector& x_local, const Vector& y_local,
                     MPI_Comm comm) {
    real local_sum = 0;
    for (idx i = 0; i < x_local.size(); ++i)
        local_sum += x_local[i] * y_local[i];
    real global_sum;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global_sum;
}
Distributed AXPY (no communication needed):
void distributed_axpy(real alpha, const Vector& x_local, Vector& y_local) {
    for (idx i = 0; i < x_local.size(); ++i)
        y_local[i] += alpha * x_local[i];
}
Distributed CG Algorithm
SolverResult distributed_cg(const Matrix& A_local, const Vector& b_local,
                            Vector& x_local, real tol, idx max_iter,
                            MPI_Comm comm) {
    idx local_n = b_local.size();
    Vector r_local(local_n), p_local(local_n), Ap_local(local_n);
    // r = b - A*x
    distributed_matvec(A_local, x_local, r_local, comm);
    for (idx i = 0; i < local_n; ++i)
        r_local[i] = b_local[i] - r_local[i];
    p_local = r_local;
    real rsold = distributed_dot(r_local, r_local, comm);
    for (idx iter = 0; iter < max_iter; ++iter) {
        distributed_matvec(A_local, p_local, Ap_local, comm);
        real pAp = distributed_dot(p_local, Ap_local, comm);
        real alpha = rsold / pAp;
        distributed_axpy(alpha, p_local, x_local);
        distributed_axpy(-alpha, Ap_local, r_local);
        real rsnew = distributed_dot(r_local, r_local, comm);
        if (sqrt(rsnew) < tol) {
            return {iter + 1, sqrt(rsnew), true};
        }
        real beta = rsnew / rsold;
        for (idx i = 0; i < local_n; ++i)
            p_local[i] = r_local[i] + beta * p_local[i];
        rsold = rsnew;
    }
    return {max_iter, sqrt(rsold), false};
}
Communication Analysis
Per CG iteration:
| Operation | Communication |
| --- | --- |
| matvec | MPI_Allgather — \(O(n)\) data, \(O(\log P)\) latency |
| dot (×2) | MPI_Allreduce — \(O(1)\) data, \(O(\log P)\) latency |
| axpy (×3) | None |
Total per iteration:
- 1 Allgather: \(O(n/P \cdot P) = O(n)\) bytes
- 2 Allreduce: \(O(1)\) bytes each
Communication is dominated by the Allgather in matvec.
5.2 Distributed Thomas (Pipeline)
For a single tridiagonal system distributed across ranks, use pipelining:
Rank 0: rows 0 to n/P - 1
Rank 1: rows n/P to 2n/P - 1 (needs data from Rank 0)
...
Forward sweep: Each rank waits for modified values from previous rank:
void distributed_thomas_forward(Vector& b_local, Vector& d_local,
                                const Vector& a_local, const Vector& c_local,
                                MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    idx local_n = b_local.size();
    if (rank > 0) {
        // Receive the previous rank's final modified row (b, d, and its c)
        real recv_b, recv_d, recv_c;
        MPI_Recv(&recv_b, 1, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
        MPI_Recv(&recv_d, 1, MPI_DOUBLE, rank - 1, 1, comm, MPI_STATUS_IGNORE);
        MPI_Recv(&recv_c, 1, MPI_DOUBLE, rank - 1, 2, comm, MPI_STATUS_IGNORE);
        // Eliminate the coupling of our first row to that boundary row
        real w = a_local[0] / recv_b;
        b_local[0] -= w * recv_c;
        d_local[0] -= w * recv_d;
    }
    for (idx i = 1; i < local_n; ++i) {
        real w = a_local[i] / b_local[i - 1];
        b_local[i] -= w * c_local[i - 1];
        d_local[i] -= w * d_local[i - 1];
    }
    if (rank < size - 1) {
        MPI_Send(&b_local[local_n - 1], 1, MPI_DOUBLE, rank + 1, 0, comm);
        MPI_Send(&d_local[local_n - 1], 1, MPI_DOUBLE, rank + 1, 1, comm);
        MPI_Send(&c_local[local_n - 1], 1, MPI_DOUBLE, rank + 1, 2, comm);
    }
}
Note: Pipeline parallelism has limited scalability. For better parallel efficiency, consider:
- Cyclic Reduction: \(O(\log n)\) parallel depth
- Spike Algorithm: Domain decomposition with coupling solve
6. Hybrid MPI+CUDA Implementation
For multi-GPU clusters, combine MPI (inter-node) with CUDA (intra-node):
SolverResult hybrid_cg(const Matrix& A_local, const Vector& b_local,
                       Vector& x_local, real tol, idx max_iter,
                       MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    cudaSetDevice(rank % num_gpus_per_node);  // bind one GPU per rank
    A_local.to_gpu();
    b_local.to_gpu();
    x_local.to_gpu();
    // Inside the CG loop, the Allgather exchanges x between GPU buffers
    // directly (requires CUDA-aware MPI):
    MPI_Allgather(x_local.gpu_data(), local_n, MPI_DOUBLE,
                  x_full_gpu, local_n, MPI_DOUBLE, comm);
    // ... remaining iteration structure as in distributed_cg, with GPU
    // kernels for the local matvec, dot, and axpy operations ...
}
CUDA-aware MPI allows direct GPU-to-GPU transfers without staging through CPU memory.
7. Summary
| Method | Complexity | Parallelization | Best Use Case |
| --- | --- | --- | --- |
| CG (CPU) | \(O(n^2)\)/iter | OpenMP threads | Small-medium SPD |
| CG (GPU) | \(O(n^2)\)/iter | CUDA kernels | Large SPD ( \(n > 1000\)) |
| CG (MPI) | \(O(n^2/P)\)/iter + comm | Distributed | Huge SPD |
| Thomas (CPU) | \(O(n)\) | None (sequential) | Single tridiagonal |
| Thomas Batched (GPU) | \(O(n \cdot k/P_{GPU})\) | Thread per system | Many tridiagonals |
| Thomas (MPI) | \(O(n/P)\) + pipeline | Limited | Large single tridiagonal |
Key insights:
- CG parallelizes naturally; matvec dominates cost
- Thomas is inherently sequential; parallelize across systems, not within
- GPU advantage requires sufficient problem size to amortize launch overhead
- MPI CG needs careful attention to communication (Allgather is expensive)