# numerics
This page covers three performance strategies used in this library: MPI for distributed-memory clusters, CUDA for GPU acceleration, and banded matrix solvers that exploit sparsity structure for \(O(nk^2)\) factorization cost. Each section describes the programming model, the key performance tradeoffs, and the library API.
MPI (Message Passing Interface) follows the SPMD (Single Program, Multiple Data) model: every process runs the same executable but works on a different portion of the data. Each process has completely private memory; no shared address space exists. All data exchange is explicit — the programmer controls every transfer.
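A minimal SPMD sketch (assumes an MPI installation; compile with mpic++ and launch with mpirun):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id, 0 .. size-1
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes
    // Same executable everywhere; the rank selects each process's data slice.
    const int n  = 1000000;
    const int lo = rank * (n / size);
    const int hi = (rank == size - 1) ? n : lo + n / size;
    std::printf("rank %d of %d owns [%d, %d)\n", rank, size, lo, hi);
    MPI_Finalize();
    return 0;
}
```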
This approach scales to thousands of nodes across high-speed interconnects because it requires no hardware support for cache coherence or a shared address space.
MPI_Send and MPI_Recv are the building blocks. Their main hazard is deadlock: if every process in a communicating pair calls the blocking MPI_Recv before MPI_Send, all processes wait forever. MPI_Sendrecv avoids this by combining both operations in a single call whose internal ordering the library guarantees to be deadlock-free:
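A sketch of the pattern as a ring shift (`ring_shift` is an illustrative name, not a library function):

```cpp
#include <mpi.h>

// Shift a buffer one rank to the right around a ring. The combined call
// lets the MPI library order the send and receive internally, so the
// every-rank-receives-first deadlock cannot occur.
void ring_shift(const double* send_buf, double* recv_buf, int count) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int right = (rank + 1) % size;
    const int left  = (rank + size - 1) % size;
    MPI_Sendrecv(send_buf, count, MPI_DOUBLE, right, /*tag=*/0,
                 recv_buf, count, MPI_DOUBLE, left,  /*tag=*/0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```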
Non-blocking variants (MPI_Isend, MPI_Irecv) return immediately and allow computation to proceed while the message is in flight. MPI_Wait blocks until completion. This is the foundation for overlapping communication with computation, discussed below.
Collectives involve all processes in a communicator simultaneously. They are implemented with tree-based algorithms optimised by the MPI library:
| Collective | Description | Latency complexity |
|---|---|---|
| MPI_Bcast | Root sends data to all | \(O(\log p)\) |
| MPI_Reduce | Combine all data to root | \(O(\log p)\) |
| MPI_Allreduce | Reduce, result to all | \(O(\log p)\) |
| MPI_Gather | Collect to root | \(O(p)\) |
| MPI_Scatter | Distribute from root | \(O(p)\) |
| MPI_Allgather | Gather, result to all | \(O(p)\) |
Prefer MPI_Allreduce over MPI_Reduce + MPI_Bcast — the combined collective is typically faster due to fewer synchronisation barriers.
Each rank holds a contiguous block of \(\mathbf{x}\) and \(\mathbf{y}\). The global dot product is assembled via a local partial sum followed by an Allreduce:
\[ \mathbf{x}^T\mathbf{y} = \sum_r \sum_{j \in \text{local}_r} x_j y_j \]
The communication cost is one Allreduce — \(O(\log p)\) latency.
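The pattern behind num::mpi::dot might look like the following (a sketch consistent with the description above, not the library's actual source; `distributed_dot` is an illustrative name):

```cpp
#include <mpi.h>

// Global dot product over block-distributed vectors: local partial sum,
// then a single O(log p) Allreduce that leaves the result on every rank.
double distributed_dot(const double* x, const double* y, int n_local) {
    double local = 0.0;
    for (int j = 0; j < n_local; ++j)
        local += x[j] * y[j];
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}
```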
For \(\mathbf{y} = A\mathbf{x}\) with \(A\) stored in a 1-D row distribution, each rank owns a contiguous block of rows of \(A\) and computes the matching block of \(\mathbf{y}\). Because every row can touch every entry of \(\mathbf{x}\), each matvec first assembles the full vector with an Allgather and then performs the local row products.
The communication cost is \(O(n)\) per process per matvec.
CG for \(A\mathbf{x} = \mathbf{b}\) requires only matvec, dot products, and AXPY operations. In distributed form, each of \(k\) iterations costs:
- two Allreduce calls (for \(\rho_{\text{new}} = \mathbf{r}^T\mathbf{r}\) and \(\alpha = \rho / \mathbf{p}^T A\mathbf{p}\))
- one Allgather (to gather \(\mathbf{p}\) for the distributed matvec)

Total communication: \(O(k \log p)\) latency across all iterations.
MPI-3 introduced non-blocking collectives. A typical pattern in parallel CG issues the Allreduce for \(\rho\) immediately after computing the local partial sum, then performs independent local AXPY updates while the reduction proceeds:
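One overlapped step might be sketched like this (MPI-3; function and variable names are illustrative, not the library's):

```cpp
#include <mpi.h>

// One overlapped step inside the CG loop. x, p, r are this rank's blocks;
// alpha was computed earlier in the iteration and r has just been updated.
double overlapped_rho(double* x, const double* p, const double* r,
                      double alpha, int n_local) {
    double local_rho = 0.0;
    for (int j = 0; j < n_local; ++j)
        local_rho += r[j] * r[j];          // local partial sum for rho_new
    double rho;
    MPI_Request req;
    MPI_Iallreduce(&local_rho, &rho, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);  // reduction proceeds in background
    for (int j = 0; j < n_local; ++j)
        x[j] += alpha * p[j];              // independent AXPY hides latency
    MPI_Wait(&req, MPI_STATUS_IGNORE);     // rho = r^T r now valid everywhere
    return rho;
}
```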
For strong-scaling regimes where communication latency dominates, this overlap can recover 20–40 % of iteration time.
API: num::mpi::dot, num::mpi::norm, num::mpi::allreduce_sum, num::mpi::broadcast
NVIDIA GPUs expose a three-level execution hierarchy: threads, which are grouped into blocks, which in turn form a grid. Threads execute on CUDA cores, blocks are scheduled onto streaming multiprocessors (SMs), and the grid spans the whole device.
A kernel launch specifies a grid of thread blocks; each block is assigned to one SM and its threads share that SM's shared memory.
| Memory type | Scope | Capacity | Approximate bandwidth | Lifetime |
|---|---|---|---|---|
| Registers | Thread | ~255 per thread | — (effectively zero latency) | Thread |
| Shared / L1 | Block | 48–164 KB per SM | ~100 TB/s | Block |
| Global | Grid | GBs | ~500 GB/s | Application |
| Constant | Grid | 64 KB | cached | Application |
Shared memory bandwidth (~100 TB/s) is roughly 200× higher than global memory bandwidth (~500 GB/s). The central optimization strategy for GPU kernels is to load data from global memory into shared memory once and reuse it many times within a block.
When threads in a warp read consecutive addresses, the hardware merges the 32 individual requests into a single 128-byte transaction. When threads read strided addresses, each request becomes a separate transaction — up to 32× more bandwidth consumed.
Row-major matrix layouts naturally produce coalesced row reads; column reads are strided and should be avoided in the innermost loop or loaded via shared memory.
The naive GPU matmul reads every element of A and B \(N\) times from global memory. With \(T \times T\) tiles loaded into shared memory, each element is read only \(N/T\) times, cutting global traffic by a factor of \(T\):
For \(T = 16\), a 1024×1024 multiplication reduces global memory traffic from \(\sim 16\) GB to \(\sim 1\) GB. The __syncthreads() barrier after loading each tile ensures all threads see the complete tile before the compute phase begins.
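The tiling pattern can be sketched as a CUDA kernel (assumes \(N\) is divisible by the tile width; an illustrative sketch, not the library's num::cuda::matmul):

```cuda
constexpr int T = 16;  // tile width

// Each block computes one T x T tile of C, staging tiles of A and B through
// shared memory so every global element is read N/T times instead of N.
__global__ void matmul_tiled(const double* A, const double* B, double* C,
                             int N) {
    __shared__ double As[T][T], Bs[T][T];
    const int row = blockIdx.y * T + threadIdx.y;
    const int col = blockIdx.x * T + threadIdx.x;
    double acc = 0.0;
    for (int t = 0; t < N / T; ++t) {
        // Coalesced loads: adjacent threads read adjacent addresses.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * T + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * T + threadIdx.y) * N + col];
        __syncthreads();               // tile fully loaded before use
        for (int k = 0; k < T; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // done with tile before overwriting
    }
    C[row * N + col] = acc;
}
```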
Launching a CUDA kernel incurs roughly 5–20 µs of overhead (PCIe transfer setup, driver dispatch). For small operands the launch cost dominates and the GPU never wins. Crossover points measured on this library's kernels:
| Operation | GPU crossover point |
|---|---|
| Dot product | GPU never wins (memory-bound + launch overhead) |
| AXPY | ~64 K elements |
| Matvec | ~256×256 matrix |
| Matmul | GPU always wins (compute-bound) |
For production use, keep data on the GPU across multiple kernel calls to amortise transfer costs.
The Thomas algorithm solves a tridiagonal system of size \(n\) in \(O(n)\) operations — it is inherently sequential. The GPU makes it profitable by running one thread per independent system:
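Each thread executes the same \(O(n)\) sweep. In plain C++ the per-system solve looks like this (an illustrative single-system version, not the library's num::cuda::thomas_batched):

```cpp
#include <cstddef>
#include <vector>

// Thomas algorithm for one tridiagonal system (the work one GPU thread does).
// a: sub-diagonal (a[0] unused), b: main diagonal, c: super-diagonal
// (c[n-1] unused), d: right-hand side, overwritten with the solution.
// Assumes the system needs no pivoting (e.g. diagonally dominant).
void thomas_solve(std::vector<double>& a, std::vector<double>& b,
                  std::vector<double>& c, std::vector<double>& d) {
    const std::size_t n = b.size();
    // Forward elimination: remove the sub-diagonal in O(n).
    for (std::size_t i = 1; i < n; ++i) {
        const double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    // Back substitution in O(n).
    d[n - 1] /= b[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}
```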
For \(k \geq 1024\) independent systems, warp occupancy is high and the kernel achieves near-peak utilisation. This is exactly the pattern arising in radiative transfer codes (one tridiagonal per atmospheric column per spectral band) and in banded matrix applications with large batch counts.
API: num::cuda::matvec, num::cuda::matmul, num::cuda::thomas_batched
A matrix \(A \in \mathbb{R}^{n \times n}\) is banded with lower bandwidth \(k_l\) and upper bandwidth \(k_u\) if
\[ A_{ij} = 0 \quad \text{for } i - j > k_l \text{ (below the band) or } j - i > k_u \text{ (above the band)} \]
Common special cases:

- diagonal: \(k_l = k_u = 0\)
- tridiagonal: \(k_l = k_u = 1\)
- pentadiagonal: \(k_l = k_u = 2\)
Exploiting band structure reduces storage from \(O(n^2)\) to \(O(n(k_l + k_u + 1))\) and factorization cost from \(O(n^3)\) to \(O(n k_l (k_l + k_u))\).
The standard compact format stores diagonals in the rows of a 2-D array with leading dimension \(\text{ldab} = 2k_l + k_u + 1\). Element \(A_{ij}\) within the band maps to
\[ \text{band}[k_l + k_u + i - j][j] = A_{ij} \]
using 0-based indexing. The array has \(\text{ldab}\) rows and \(n\) columns.
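The mapping fits in one line (`band_row` is an illustrative name, not part of the library API):

```cpp
// Row index in the band array holding element A(i, j), 0-based.
// Valid only for entries inside the band: j - ku <= i <= j + kl.
// The array has ldab = 2*kl + ku + 1 rows; rows 0 .. kl-1 stay empty
// until LU fill-in appears.
int band_row(int kl, int ku, int i, int j) {
    return kl + ku + i - j;
}
```

For the main diagonal (\(i = j\)) this always yields row \(k_l + k_u\), regardless of the column.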
The upper \(k_l\) rows (indices \(0 \ldots k_l - 1\)) are reserved for fill-in during LU factorisation. With partial pivoting, the \(U\) factor can develop up to \(k_l\) additional super-diagonals beyond the original \(k_u\), so the storage must accommodate upper bandwidth \(k_l + k_u\).
Column-major storage ensures that all elements in a column are contiguous in memory, giving sequential access patterns during forward and back substitution.
The factorisation \(PA = LU\) proceeds column by column. For column \(j\): select as pivot the largest-magnitude entry among the \(k_l + 1\) candidates \(A_{jj}, \ldots, A_{j+k_l,\,j}\), swap its row into the pivot position, compute the \(k_l\) multipliers \(l_{ij} = a_{ij}/a_{jj}\) below the pivot, and apply the resulting rank-1 update to the trailing block inside the band.
Fill-in reaches at most \(k_l\) positions above the original upper diagonal, so \(U\) has bandwidth \(k_l + k_u\) — exactly the extra rows reserved in the band array.
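For the no-pivoting case (safe for the diagonally dominant matrices discussed below), the factorisation can be sketched directly on the band storage. This is an illustrative sketch under that assumption, not the library's num::banded_lu:

```cpp
#include <algorithm>
#include <vector>

// In-place banded LU without pivoting on LAPACK-style band storage:
// band[kl + ku + i - j][j] holds A(i, j). Multipliers overwrite the
// strictly lower band; without pivoting U keeps bandwidth ku, so the
// fill-in rows are never touched. Assumes diagonal dominance.
void banded_lu_nopivot(std::vector<std::vector<double>>& band,
                       int n, int kl, int ku) {
    auto at = [&](int i, int j) -> double& { return band[kl + ku + i - j][j]; };
    for (int j = 0; j < n; ++j)
        for (int i = j + 1; i <= std::min(j + kl, n - 1); ++i) {
            const double m = at(i, j) / at(j, j);  // multiplier l_ij
            at(i, j) = m;
            // Rank-1 update touches at most kl x ku entries in the band.
            for (int c = j + 1; c <= std::min(j + ku, n - 1); ++c)
                at(i, c) -= m * at(j, c);
        }
}
```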
Complexity comparison:
| Method | Factorization | Solve | Storage |
|---|---|---|---|
| Dense LU | \(\tfrac{2}{3}n^3\) | \(2n^2\) | \(n^2\) |
| Banded LU | \(O\!\left(n k_l (k_l + k_u)\right)\) | \(O\!\left(n(k_l+k_u)\right)\) | \((2k_l+k_u+1)n\) |
| Thomas (tridiagonal) | \(8n\) | \(5n\) | \(4n\) |
For \(n = 10\,000\) with a pentadiagonal system (\(k_l = k_u = 2\)):

- dense LU: \(\tfrac{2}{3}n^3 \approx 6.7 \times 10^{11}\) FLOPs and \(n^2 = 10^8\) stored entries (~800 MB in double precision)
- banded LU: \(n k_l (k_l + k_u) \approx 10^5\) FLOPs and \((2k_l + k_u + 1)n = 70\,000\) stored entries (~560 KB)

That is roughly seven orders of magnitude less factorization work and three orders of magnitude less storage.
Partial pivoting bounds the growth factor for dense LU at \(2^{n-1}\) in the worst case. For banded LU, the bound is much tighter:
\[ \rho \leq 2^{k_l} \]
because only \(k_l\) rows participate in each elimination step. For typical values \(k_l = 1\) or \(2\) the growth factor is bounded by 2 or 4, making banded LU extremely stable in practice.
For diagonally dominant banded matrices ( \(|a_{ii}| \geq \sum_{j \neq i} |a_{ij}|\) for all \(i\)), pivoting is unnecessary: the diagonal is always the largest element in its column, there is no fill-in beyond the original bandwidth, and the factorisation is unconditionally stable. Laplacian discretisations and many radiative transfer systems fall into this category.
Given the factored \(PA = LU\), solving \(A\mathbf{x} = \mathbf{b}\) requires three sequential sweeps: apply the row permutation to form \(P\mathbf{b}\), forward-substitute to solve \(L\mathbf{y} = P\mathbf{b}\), then back-substitute to solve \(U\mathbf{x} = \mathbf{y}\).
Total solve cost: \(O(n(k_l + k_u))\). For a pentadiagonal system with \(n = 10\,000\) this is roughly 40 000 FLOPs — negligible compared to any meaningful problem setup cost.
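The two substitution sweeps can be sketched on band storage (no pivoting, factors stored LAPACK-style as band[kl + ku + i - j][j]; an illustrative sketch, not the library's num::banded_solve):

```cpp
#include <algorithm>
#include <vector>

// Forward/back substitution for a banded LU without pivoting.
// The band array holds unit-lower multipliers below the diagonal and
// U on and above it. b is overwritten with the solution x.
void banded_solve_nopivot(const std::vector<std::vector<double>>& band,
                          std::vector<double>& b, int kl, int ku) {
    const int n = static_cast<int>(b.size());
    auto at = [&](int i, int j) { return band[kl + ku + i - j][j]; };
    // Forward sweep: L y = b (unit diagonal, bandwidth kl).
    for (int j = 0; j < n; ++j)
        for (int i = j + 1; i <= std::min(j + kl, n - 1); ++i)
            b[i] -= at(i, j) * b[j];
    // Back sweep: U x = y (bandwidth ku).
    for (int j = n - 1; j >= 0; --j) {
        b[j] /= at(j, j);
        for (int i = std::max(0, j - ku); i < j; ++i)
            b[i] -= at(i, j) * b[j];
    }
}
```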
API: num::BandedMatrix, num::banded_lu, num::banded_solve, num::banded_matvec