numerics 0.1.0
This page covers the three parallelism strategies used in this library: MPI for distributed-memory clusters, CUDA for GPU acceleration, and banded matrix solvers that exploit sparsity structure for \(O(nk^2)\) factorization cost. Each section describes the programming model, the key performance tradeoffs, and the library API.
MPI (Message Passing Interface) follows the SPMD (Single Program, Multiple Data) model: every process runs the same executable but works on a different portion of the data. Each process has completely private memory; no shared address space exists. All data exchange is explicit – the programmer controls every transfer.
This approach scales to thousands of nodes across high-speed interconnects because it imposes no hardware requirement for cache coherence or a shared address space.
MPI_Send and MPI_Recv are the building blocks. Their main hazard is deadlock: if both processes in a communicating pair call MPI_Recv before MPI_Send, both wait forever. MPI_Sendrecv resolves this by combining both operations atomically:
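A minimal sketch of a nearest-neighbour ring shift using MPI_Sendrecv (the ring_shift wrapper and buffer names are illustrative, not library API):

```cpp
#include <mpi.h>

// Each rank sends to its right neighbour and receives from its left
// neighbour in one atomic call, so no ordering of send/recv can deadlock.
void ring_shift(const double* send_buf, double* recv_buf, int count,
                MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int right = (rank + 1) % size;        // destination
    int left  = (rank - 1 + size) % size; // source
    MPI_Sendrecv(send_buf, count, MPI_DOUBLE, right, /*sendtag=*/0,
                 recv_buf, count, MPI_DOUBLE, left,  /*recvtag=*/0,
                 comm, MPI_STATUS_IGNORE);
}
```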
Non-blocking variants (MPI_Isend, MPI_Irecv) return immediately and allow computation to proceed while the message is in flight. MPI_Wait blocks until completion. This is the foundation for overlapping communication with computation, discussed below.
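A sketch of the non-blocking pattern: post both operations, overlap independent work, then wait before touching the buffers (the neighbour ranks and buffers are illustrative, as above):

```cpp
#include <mpi.h>

void exchange_and_compute(double* send_buf, double* recv_buf, int count,
                          int left, int right, MPI_Comm comm) {
    MPI_Request reqs[2];
    MPI_Irecv(recv_buf, count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(send_buf, count, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    // ... work that does not depend on the incoming message ...

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    // recv_buf is now safe to read; send_buf is safe to reuse.
}
```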
Collectives involve all processes in a communicator simultaneously. They are implemented with tree-based algorithms optimised by the MPI library:
| Collective | Description | Latency complexity |
|---|---|---|
| MPI_Bcast | Root sends data to all | \(O(\log p)\) |
| MPI_Reduce | Combine all data to root | \(O(\log p)\) |
| MPI_Allreduce | Reduce, result to all | \(O(\log p)\) |
| MPI_Gather | Collect to root | \(O(p)\) |
| MPI_Scatter | Distribute from root | \(O(p)\) |
| MPI_Allgather | Gather, result to all | \(O(p)\) |
Prefer MPI_Allreduce over MPI_Reduce + MPI_Bcast – the combined collective is typically faster due to fewer synchronisation barriers.
Each rank holds a contiguous block of \(\mathbf{x}\) and \(\mathbf{y}\). The global dot product is assembled via a local partial sum followed by an Allreduce:
\[ \mathbf{x}^T\mathbf{y} = \sum_r \sum_{j \in \text{local}_r} x_j y_j \]
The communication cost is one Allreduce – \(O(\log p)\) latency.
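A minimal sketch of this pattern, assuming a block distribution with n_local entries per rank (dist_dot is an illustrative helper, not the num::mpi::dot implementation):

```cpp
#include <mpi.h>

// Local partial sum followed by a single Allreduce; the result is
// available on every rank.
double dist_dot(const double* x, const double* y, int n_local,
                MPI_Comm comm) {
    double local = 0.0;
    for (int j = 0; j < n_local; ++j)
        local += x[j] * y[j];
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```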
For \(\mathbf{y} = A\mathbf{x}\) with \(A\) stored in a 1-D row distribution, each rank owns a contiguous block of rows of \(A\) and computes the matching block of \(\mathbf{y}\). Every row needs the full vector \(\mathbf{x}\), so the local blocks of \(\mathbf{x}\) are first assembled with an Allgather.
The communication cost is \(O(n)\) per process per matvec – the Allgather delivers the full vector to every rank.
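A sketch under the same assumptions (block sizes divide evenly, local rows stored row-major; dist_matvec is illustrative):

```cpp
#include <mpi.h>
#include <vector>

// Allgather assembles the full x on every rank, then each rank forms
// its own block of y from its local rows of A.
void dist_matvec(const double* A_local, // n_local x n, row-major
                 const double* x_local, // n_local entries
                 double* y_local,       // n_local entries
                 int n_local, int n, MPI_Comm comm) {
    std::vector<double> x_full(n);
    MPI_Allgather(x_local, n_local, MPI_DOUBLE,
                  x_full.data(), n_local, MPI_DOUBLE, comm);
    for (int i = 0; i < n_local; ++i) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j)
            sum += A_local[i * n + j] * x_full[j];
        y_local[i] = sum;
    }
}
```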
CG for \(A\mathbf{x} = \mathbf{b}\) requires only matvec, dot products, and AXPY operations. In distributed form, each of \(k\) iterations costs:

- two Allreduce calls (for \(\rho_{\text{new}} = \mathbf{r}^T\mathbf{r}\) and for the \(\mathbf{p}^T A\mathbf{p}\) term in \(\alpha = \rho / \mathbf{p}^T A\mathbf{p}\))
- one Allgather (to gather \(\mathbf{p}\) for the distributed matvec)

Total communication: \(O(k \log p)\) latency across all iterations.
MPI-3 introduced non-blocking collectives. A typical pattern in parallel CG issues the Allreduce for \(\rho\) immediately after computing the local partial sum, then performs independent local AXPY updates while the reduction proceeds:
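A sketch of this pattern, assuming MPI-3's MPI_Iallreduce is available; local_dot and axpy stand in for the local kernels and are hypothetical helpers:

```cpp
#include <mpi.h>

// Hypothetical local kernels: partial dot product and y += a*x.
double local_dot(const double* x, const double* y, int n);
void   axpy(int n, double a, const double* x, double* y);

void cg_overlap_step(const double* r, const double* p, double* x,
                     double alpha, int n_local, MPI_Comm comm) {
    double rho_local = local_dot(r, r, n_local);
    double rho = 0.0;
    MPI_Request req;
    // Start the reduction for rho_new, then overlap independent work.
    MPI_Iallreduce(&rho_local, &rho, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    axpy(n_local, alpha, p, x); // x += alpha*p: does not depend on rho

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    // rho is now valid on every rank; compute beta = rho_new / rho_old, etc.
}
```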
For strong-scaling regimes where communication latency dominates, this overlap can recover 20-40 % of iteration time.
API: num::mpi::dot, num::mpi::norm, num::mpi::allreduce_sum, num::mpi::broadcast
NVIDIA GPUs expose a three-level thread hierarchy: a grid of thread blocks, blocks of up to 1024 threads, and threads executed in warps of 32.
A kernel launch specifies the grid of thread blocks; each block is assigned to one SM (streaming multiprocessor) and its threads share that SM's shared memory.
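A minimal CUDA sketch of how the hierarchy appears in code – the launch configuration defines the grid, and each thread derives its global index from its block and thread coordinates:

```cpp
// One thread per vector element; blockIdx/blockDim/threadIdx encode
// the grid -> block -> thread hierarchy.
__global__ void axpy_kernel(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

// Launch: ceil(n / 256) blocks of 256 threads each, e.g.
//   axpy_kernel<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
```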
| Memory type | Scope | Capacity | Approximate bandwidth | Lifetime |
|---|---|---|---|---|
| Registers | Thread | ~255 per thread | – (zero latency) | Thread |
| Shared / L1 | Block | 48-164 KB per SM | ~100 TB/s | Block |
| Global | Grid | GBs | ~500 GB/s | Application |
| Constant | Grid | 64 KB | cached | Application |
Shared memory bandwidth (~100 TB/s) is roughly 200x higher than global memory bandwidth (~500 GB/s). The central optimization strategy for GPU kernels is to load data from global memory into shared memory once and reuse it many times within a block.
When threads in a warp read consecutive addresses, the hardware merges the 32 individual requests into a single 128-byte transaction. When threads read strided addresses, each request becomes a separate transaction – up to 32x more bandwidth consumed.
Row-major matrix layouts naturally produce coalesced row reads; column reads are strided and should be avoided in the innermost loop or loaded via shared memory.
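A sketch contrasting the two patterns for a row-major \(N \times N\) matrix (kernel names are illustrative):

```cpp
__global__ void read_row(const float* A, float* out, int N) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    // Coalesced: threads j, j+1, ... read adjacent elements of row 0,
    // merged into a few wide transactions per warp.
    if (j < N) out[j] = A[0 * N + j];
}

__global__ void read_col(const float* A, float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Strided: threads i, i+1, ... read elements N floats apart
    // (column 0), so each warp issues up to 32 separate transactions.
    if (i < N) out[i] = A[i * N + 0];
}
```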
The naive GPU matmul reads every element of A and B \(N\) times from global memory. With \(T \times T\) tiles loaded into shared memory, each element is read from global memory only \(N/T\) times – a factor-of-\(T\) reduction in traffic:
For \(T = 16\), a 1024x1024 multiplication reduces global memory traffic from \(\sim 16\) GB to \(\sim 1\) GB. The __syncthreads() barrier after loading each tile ensures all threads see the complete tile before the compute phase begins.
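A sketch of the tiled kernel described above, assuming square \(N \times N\) double-precision matrices with \(N\) a multiple of \(T\):

```cpp
#define T 16

// Launch with dim3 grid(N/T, N/T), dim3 block(T, T).
__global__ void matmul_tiled(const double* A, const double* B,
                             double* C, int N) {
    __shared__ double As[T][T];
    __shared__ double Bs[T][T];

    int row = blockIdx.y * T + threadIdx.y;
    int col = blockIdx.x * T + threadIdx.x;
    double sum = 0.0;

    for (int t = 0; t < N / T; ++t) {
        // Each thread loads one element of the A tile and one of B's.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * T + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * T + threadIdx.y) * N + col];
        __syncthreads(); // tile fully loaded before anyone computes

        for (int k = 0; k < T; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads(); // compute done before the next tile overwrites
    }
    C[row * N + col] = sum;
}
```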
Launching a CUDA kernel incurs roughly 5-20 µs of overhead (PCIe transfer setup, driver dispatch). For small operands the launch cost dominates and the GPU never wins. Crossover points measured on this library's kernels:
| Operation | GPU crossover point |
|---|---|
| Dot product | GPU never wins (memory-bound + launch overhead) |
| AXPY | ~64 K elements |
| Matvec | ~256x256 matrix |
| Matmul | GPU always wins (compute-bound) |
For production use, keep data on the GPU across multiple kernel calls to amortise transfer costs.
The Thomas algorithm solves a tridiagonal system of size \(n\) in \(O(n)\) operations, but its forward and backward sweeps are inherently sequential. It becomes profitable on the GPU when many independent systems are solved in batch, one thread per system:
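A sketch of the batched pattern, assuming a system-major layout and a scratch array for the modified coefficients (the layout and signature are illustrative, not necessarily those of num::cuda::thomas_batched):

```cpp
// Thread s solves system s. sub/diag/sup are the three bands, d the
// right-hand side (overwritten), x the solution; all hold k systems
// of size n laid out system-major. cp is k*n scratch.
__global__ void thomas_batched_kernel(int n, int k,
                                      const double* sub, const double* diag,
                                      const double* sup, double* d,
                                      double* x, double* cp) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= k) return;
    const double *a = sub + s * n, *b = diag + s * n, *c = sup + s * n;
    double *rhs = d + s * n, *cprime = cp + s * n, *sol = x + s * n;

    // Forward sweep: eliminate the subdiagonal.
    cprime[0] = c[0] / b[0];
    rhs[0] = rhs[0] / b[0];
    for (int i = 1; i < n; ++i) {
        double m = 1.0 / (b[i] - a[i] * cprime[i - 1]);
        cprime[i] = c[i] * m;
        rhs[i] = (rhs[i] - a[i] * rhs[i - 1]) * m;
    }
    // Back substitution.
    sol[n - 1] = rhs[n - 1];
    for (int i = n - 2; i >= 0; --i)
        sol[i] = rhs[i] - cprime[i] * sol[i + 1];
}
```

Note that with this system-major layout consecutive threads access addresses \(n\) doubles apart; a production kernel would interleave the systems (element \(i\) of all \(k\) systems adjacent) so each warp's loads coalesce.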
For \(k \geq 1024\) independent systems, warp occupancy is high and the kernel achieves near-peak utilisation. This is exactly the pattern arising in radiative transfer codes (one tridiagonal per atmospheric column per spectral band) and in banded matrix applications with large batch counts.
API: num::cuda::matvec, num::cuda::matmul, num::cuda::thomas_batched
A matrix \(A \in \mathbb{R}^{n \times n}\) is banded with lower bandwidth \(k_l\) and upper bandwidth \(k_u\) if
\[ A_{ij} = 0 \quad \text{for } i - j > k_l \text{ (below the band) or } j - i > k_u \text{ (above the band)} \]
Common special cases:

- diagonal: \(k_l = k_u = 0\)
- tridiagonal: \(k_l = k_u = 1\)
- pentadiagonal: \(k_l = k_u = 2\)
Exploiting band structure reduces storage from \(O(n^2)\) to \(O(n(k_l + k_u + 1))\) and factorization cost from \(O(n^3)\) to \(O(n k_l (k_l + k_u))\).
The standard compact format stores diagonals in the rows of a 2-D array with leading dimension \(\text{ldab} = 2k_l + k_u + 1\). Element \(A_{ij}\) within the band maps to
\[ \text{band}[k_l + k_u + i - j][j] = A_{ij} \]
using 0-based indexing. The array has \(\text{ldab}\) rows and \(n\) columns.
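A small helper making the mapping concrete (get_band is an illustrative accessor, not library API):

```cpp
#include <cassert>

// Read A(i, j) from the compact band array: column-major with
// ldab = 2*kl + ku + 1 rows, all indices 0-based.
double get_band(const double* band, int ldab, int kl, int ku,
                int i, int j) {
    assert(i - j <= kl && j - i <= ku); // must lie inside the band
    // Row (kl + ku + i - j) of column j; column j starts at j * ldab.
    return band[j * ldab + (kl + ku + i - j)];
}
```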
The upper \(k_l\) rows (indices \(0 \ldots k_l - 1\)) are reserved for fill-in during LU factorisation. With partial pivoting, the \(U\) factor can develop up to \(k_l\) additional super-diagonals beyond the original \(k_u\), so the storage must accommodate upper bandwidth \(k_l + k_u\).
Column-major storage ensures that all elements in a column are contiguous in memory, giving sequential access patterns during forward and back substitution.
The factorisation \(PA = LU\) proceeds column by column. For column \(j\):

1. Find the pivot: the largest-magnitude entry among the \(k_l + 1\) candidates \(A_{jj}, \ldots, A_{j+k_l,j}\).
2. Swap that row with row \(j\) (recorded in \(P\)).
3. Divide the entries below the pivot by it to form the multipliers of \(L\).
4. Apply the rank-1 update to the trailing window of at most \(k_l \times (k_l + k_u)\) entries inside the band.
Fill-in reaches at most \(k_l\) positions above the original upper diagonal, so \(U\) has bandwidth \(k_l + k_u\) – exactly the extra rows reserved in the band array.
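A simplified sketch of the elimination, omitting the pivot search (i.e. the no-pivoting variant that is safe for the diagonally dominant case discussed below); indexing follows the band format above, and the factors overwrite the band in place:

```cpp
#include <algorithm>

void banded_lu_nopivot(double* band, int ldab, int n, int kl, int ku) {
    // In-place band accessor (column-major, 0-based), as in get_band.
    auto at = [&](int i, int j) -> double& {
        return band[j * ldab + (kl + ku + i - j)];
    };
    for (int j = 0; j < n; ++j) {
        double pivot = at(j, j);
        int imax = std::min(j + kl, n - 1);   // last row inside the band
        int jmax = std::min(j + ku, n - 1);   // no pivoting: no fill past ku
        for (int i = j + 1; i <= imax; ++i) {
            double m = at(i, j) / pivot;      // multiplier L(i, j)
            at(i, j) = m;
            for (int c = j + 1; c <= jmax; ++c)
                at(i, c) -= m * at(j, c);     // rank-1 update in the band
        }
    }
}
```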
Complexity comparison:
| Method | Factorization | Solve | Storage |
|---|---|---|---|
| Dense LU | \(\tfrac{2}{3}n^3\) | \(2n^2\) | \(n^2\) |
| Banded LU | \(O\!\left(n k_l (k_l + k_u)\right)\) | \(O\!\left(n(k_l+k_u)\right)\) | \((2k_l+k_u+1)n\) |
| Thomas (tridiagonal) | \(8n\) | \(5n\) | \(4n\) |
For \(n = 10\,000\) with a pentadiagonal system (\(k_l = k_u = 2\)): dense LU costs \(\tfrac{2}{3}n^3 \approx 6.7 \times 10^{11}\) FLOPs and \(n^2 = 10^8\) stored entries, while banded LU costs on the order of \(n k_l (k_l + k_u) = 8 \times 10^4\) FLOPs and \((2k_l + k_u + 1)n = 7 \times 10^4\) entries – a factor of roughly \(10^7\) in work and \(10^3\) in memory.
Partial pivoting bounds the growth factor for dense LU at \(2^{n-1}\) in the worst case. For banded LU, the bound is much tighter:
\[ \rho \leq 2^{k_l} \]
because only \(k_l\) rows participate in each elimination step. For typical values \(k_l = 1\) or \(2\) the growth factor is bounded by 2 or 4, making banded LU extremely stable in practice.
For banded matrices that are diagonally dominant by columns ( \(|a_{jj}| \geq \sum_{i \neq j} |a_{ij}|\) for all \(j\)), pivoting is unnecessary: the diagonal is always the largest element in its column, there is no fill-in beyond the original bandwidth, and the factorisation is unconditionally stable. Laplacian discretisations and many radiative transfer systems fall into this category.
Given the factored \(PA = LU\), solving \(A\mathbf{x} = \mathbf{b}\) requires three sequential sweeps:

1. Permute: \(\mathbf{b}' = P\mathbf{b}\).
2. Forward substitution: solve \(L\mathbf{z} = \mathbf{b}'\).
3. Back substitution: solve \(U\mathbf{x} = \mathbf{z}\).
Total solve cost: \(O(n(k_l + k_u))\). For a pentadiagonal system with \(n = 10\,000\) this is roughly 40 000 FLOPs – negligible compared to any meaningful problem setup cost.
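A sketch of the two substitution sweeps for the no-pivoting factorisation above (so \(P = I\) and the permutation step disappears); the right-hand side is overwritten with the solution:

```cpp
#include <algorithm>

void banded_solve_nopivot(const double* band, int ldab, int n,
                          int kl, int ku, double* b) {
    auto at = [&](int i, int j) {
        return band[j * ldab + (kl + ku + i - j)];
    };
    // Forward substitution L z = b (L is unit lower triangular).
    for (int j = 0; j < n; ++j) {
        int imax = std::min(j + kl, n - 1);
        for (int i = j + 1; i <= imax; ++i)
            b[i] -= at(i, j) * b[j];
    }
    // Back substitution U x = z (U has bandwidth ku without pivoting).
    for (int j = n - 1; j >= 0; --j) {
        b[j] /= at(j, j);
        int imin = std::max(j - ku, 0);
        for (int i = imin; i < j; ++i)
            b[i] -= at(i, j) * b[j];
    }
}
```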
API: num::BandedMatrix, num::banded_lu, num::banded_solve, num::banded_matvec