numerics 0.1.0
This page covers the three parallelism strategies used in this library: MPI for distributed-memory clusters, CUDA for GPU acceleration, and banded matrix solvers that exploit sparsity structure for \(O(nk^2)\) factorization cost. Each section describes the programming model, the key performance tradeoffs, and the library API.
MPI (Message Passing Interface) follows the SPMD (Single Program, Multiple Data) model: every process runs the same executable but works on a different portion of the data. Each process has completely private memory; no shared address space exists. All data exchange is explicit – the programmer controls every transfer.
This approach scales to thousands of nodes across high-speed interconnects because it imposes no hardware requirement for cache coherence or a shared address space.
MPI_Send and MPI_Recv are the building blocks. Their main hazard is deadlock: if both processes in a communicating pair call MPI_Recv before MPI_Send, both wait forever. MPI_Sendrecv resolves this by combining both operations atomically:
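A minimal sketch of a nearest-neighbour ring shift using MPI_Sendrecv (the ring_shift wrapper and buffer names are illustrative, not library API):

```cpp
#include <mpi.h>

// Each rank sends to its right neighbour and receives from its left
// neighbour in one atomic call, so no ordering of send/recv can deadlock.
void ring_shift(const double* send_buf, double* recv_buf, int count,
                MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int right = (rank + 1) % size;        // destination
    int left  = (rank - 1 + size) % size; // source
    MPI_Sendrecv(send_buf, count, MPI_DOUBLE, right, /*sendtag=*/0,
                 recv_buf, count, MPI_DOUBLE, left,  /*recvtag=*/0,
                 comm, MPI_STATUS_IGNORE);
}
```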
Non-blocking variants (MPI_Isend, MPI_Irecv) return immediately and allow computation to proceed while the message is in flight. MPI_Wait blocks until completion. This is the foundation for overlapping communication with computation, discussed below.
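A sketch of the non-blocking pattern: post both operations, overlap independent work, then wait before touching the buffers (the neighbour ranks and buffers are illustrative, as above):

```cpp
#include <mpi.h>

void exchange_and_compute(double* send_buf, double* recv_buf, int count,
                          int left, int right, MPI_Comm comm) {
    MPI_Request reqs[2];
    MPI_Irecv(recv_buf, count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(send_buf, count, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    // ... work that does not depend on the incoming message ...

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    // recv_buf is now safe to read; send_buf is safe to reuse.
}
```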
Collectives involve all processes in a communicator simultaneously. They are implemented with tree-based algorithms optimised by the MPI library:
| Collective | Description | Latency complexity |
|---|---|---|
| MPI_Bcast | Root sends data to all | \(O(\log p)\) |
| MPI_Reduce | Combine all data to root | \(O(\log p)\) |
| MPI_Allreduce | Reduce, result to all | \(O(\log p)\) |
| MPI_Gather | Collect to root | \(O(p)\) |
| MPI_Scatter | Distribute from root | \(O(p)\) |
| MPI_Allgather | Gather, result to all | \(O(p)\) |
Prefer MPI_Allreduce over MPI_Reduce + MPI_Bcast – the combined collective is typically faster due to fewer synchronisation barriers.
Each rank holds a contiguous block of \(\mathbf{x}\) and \(\mathbf{y}\). The global dot product is assembled via a local partial sum followed by an Allreduce:
\[ \mathbf{x}^T\mathbf{y} = \sum_r \sum_{j \in \text{local}_r} x_j y_j \]
The communication cost is one Allreduce – \(O(\log p)\) latency.
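A minimal sketch of this pattern, assuming a block distribution with n_local entries per rank (dist_dot is an illustrative helper, not the num::mpi::dot implementation):

```cpp
#include <mpi.h>

// Local partial sum followed by a single Allreduce; the result is
// available on every rank.
double dist_dot(const double* x, const double* y, int n_local,
                MPI_Comm comm) {
    double local = 0.0;
    for (int j = 0; j < n_local; ++j)
        local += x[j] * y[j];
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```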
For \(\mathbf{y} = A\mathbf{x}\) with \(A\) stored in a 1-D row distribution, each rank owns a contiguous block of rows of \(A\) and computes the matching block of \(\mathbf{y}\). Every row needs the full vector \(\mathbf{x}\), so the local blocks of \(\mathbf{x}\) are first assembled with an Allgather.
The communication cost is \(O(n)\) per process per matvec – the Allgather delivers the full vector to every rank.
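A sketch under the same assumptions (block sizes divide evenly, local rows stored row-major; dist_matvec is illustrative):

```cpp
#include <mpi.h>
#include <vector>

// Allgather assembles the full x on every rank, then each rank forms
// its own block of y from its local rows of A.
void dist_matvec(const double* A_local, // n_local x n, row-major
                 const double* x_local, // n_local entries
                 double* y_local,       // n_local entries
                 int n_local, int n, MPI_Comm comm) {
    std::vector<double> x_full(n);
    MPI_Allgather(x_local, n_local, MPI_DOUBLE,
                  x_full.data(), n_local, MPI_DOUBLE, comm);
    for (int i = 0; i < n_local; ++i) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j)
            sum += A_local[i * n + j] * x_full[j];
        y_local[i] = sum;
    }
}
```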
CG for \(A\mathbf{x} = \mathbf{b}\) requires only matvec, dot products, and AXPY operations. In distributed form, each of \(k\) iterations costs:

- two Allreduce calls (for \(\rho_{\text{new}} = \mathbf{r}^T\mathbf{r}\) and for the \(\mathbf{p}^T A\mathbf{p}\) term in \(\alpha = \rho / \mathbf{p}^T A\mathbf{p}\))
- one Allgather (to gather \(\mathbf{p}\) for the distributed matvec)

Total communication: \(O(k \log p)\) latency across all iterations.
MPI-3 introduced non-blocking collectives. A typical pattern in parallel CG issues the Allreduce for \(\rho\) immediately after computing the local partial sum, then performs independent local AXPY updates while the reduction proceeds:
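A sketch of this pattern, assuming MPI-3's MPI_Iallreduce is available; local_dot and axpy stand in for the local kernels and are hypothetical helpers:

```cpp
#include <mpi.h>

// Hypothetical local kernels: partial dot product and y += a*x.
double local_dot(const double* x, const double* y, int n);
void   axpy(int n, double a, const double* x, double* y);

void cg_overlap_step(const double* r, const double* p, double* x,
                     double alpha, int n_local, MPI_Comm comm) {
    double rho_local = local_dot(r, r, n_local);
    double rho = 0.0;
    MPI_Request req;
    // Start the reduction for rho_new, then overlap independent work.
    MPI_Iallreduce(&rho_local, &rho, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    axpy(n_local, alpha, p, x); // x += alpha*p: does not depend on rho

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    // rho is now valid on every rank; compute beta = rho_new / rho_old, etc.
}
```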
For strong-scaling regimes where communication latency dominates, this overlap can recover 20-40 % of iteration time.
API: num::mpi::dot, num::mpi::norm, num::mpi::allreduce_sum, num::mpi::broadcast
NVIDIA GPUs expose a three-level thread hierarchy: a grid of thread blocks, blocks of up to 1024 threads, and threads executed in warps of 32.
A kernel launch specifies the grid of thread blocks; each block is assigned to one SM (streaming multiprocessor) and its threads share that SM's shared memory.
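A minimal CUDA sketch of how the hierarchy appears in code – the launch configuration defines the grid, and each thread derives its global index from its block and thread coordinates:

```cpp
// One thread per vector element; blockIdx/blockDim/threadIdx encode
// the grid -> block -> thread hierarchy.
__global__ void axpy_kernel(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

// Launch: ceil(n / 256) blocks of 256 threads each, e.g.
//   axpy_kernel<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
```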
| Memory type | Scope | Capacity | Approximate bandwidth | Lifetime |
|---|---|---|---|---|
| Registers | Thread | ~255 per thread | – (zero latency) | Thread |
| Shared / L1 | Block | 48-164 KB per SM | ~100 TB/s | Block |
| Global | Grid | GBs | ~500 GB/s | Application |
| Constant | Grid | 64 KB | cached | Application |
Shared memory bandwidth (~100 TB/s) is roughly 200x higher than global memory bandwidth (~500 GB/s). The central optimization strategy for GPU kernels is to load data from global memory into shared memory once and reuse it many times within a block.
When threads in a warp read consecutive addresses, the hardware merges the 32 individual requests into a single 128-byte transaction. When threads read strided addresses, each request becomes a separate transaction – up to 32x more bandwidth consumed.
Row-major matrix layouts naturally produce coalesced row reads; column reads are strided and should be avoided in the innermost loop or loaded via shared memory.
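A sketch contrasting the two patterns for a row-major \(N \times N\) matrix (kernel names are illustrative):

```cpp
__global__ void read_row(const float* A, float* out, int N) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    // Coalesced: threads j, j+1, ... read adjacent elements of row 0,
    // merged into a few wide transactions per warp.
    if (j < N) out[j] = A[0 * N + j];
}

__global__ void read_col(const float* A, float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Strided: threads i, i+1, ... read elements N floats apart
    // (column 0), so each warp issues up to 32 separate transactions.
    if (i < N) out[i] = A[i * N + 0];
}
```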
The naive GPU matmul reads every element of A and B \(N\) times from global memory. With \(T \times T\) tiles loaded into shared memory, each element is read from global memory only \(N/T\) times – a factor-of-\(T\) reduction in traffic:
For \(T = 16\), a 1024x1024 multiplication reduces global memory traffic from \(\sim 16\) GB to \(\sim 1\) GB. The __syncthreads() barrier after loading each tile ensures all threads see the complete tile before the compute phase begins.
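A sketch of the tiled kernel described above, assuming square \(N \times N\) double-precision matrices with \(N\) a multiple of \(T\):

```cpp
#define T 16

// Launch with dim3 grid(N/T, N/T), dim3 block(T, T).
__global__ void matmul_tiled(const double* A, const double* B,
                             double* C, int N) {
    __shared__ double As[T][T];
    __shared__ double Bs[T][T];

    int row = blockIdx.y * T + threadIdx.y;
    int col = blockIdx.x * T + threadIdx.x;
    double sum = 0.0;

    for (int t = 0; t < N / T; ++t) {
        // Each thread loads one element of the A tile and one of B's.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * T + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * T + threadIdx.y) * N + col];
        __syncthreads(); // tile fully loaded before anyone computes

        for (int k = 0; k < T; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads(); // compute done before the next tile overwrites
    }
    C[row * N + col] = sum;
}
```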
Launching a CUDA kernel incurs roughly 5-20 µs of overhead (PCIe transfer setup, driver dispatch). For small operands the launch cost dominates and the GPU never wins. Crossover points measured on this library's kernels:
| Operation | GPU crossover point |
|---|---|
| Dot product | GPU never wins (memory-bound + launch overhead) |
| AXPY | ~64 K elements |
| Matvec | ~256x256 matrix |
| Matmul | GPU always wins (compute-bound) |
For production use, keep data on the GPU across multiple kernel calls to amortise transfer costs.
The Thomas algorithm solves a tridiagonal system of size \(n\) in \(O(n)\) operations, but its forward and backward sweeps are inherently sequential. It becomes profitable on the GPU when many independent systems are solved in batch, one thread per system:
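A sketch of the batched pattern, assuming a system-major layout and a scratch array for the modified coefficients (the layout and signature are illustrative, not necessarily those of num::cuda::thomas_batched):

```cpp
// Thread s solves system s. sub/diag/sup are the three bands, d the
// right-hand side (overwritten), x the solution; all hold k systems
// of size n laid out system-major. cp is k*n scratch.
__global__ void thomas_batched_kernel(int n, int k,
                                      const double* sub, const double* diag,
                                      const double* sup, double* d,
                                      double* x, double* cp) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= k) return;
    const double *a = sub + s * n, *b = diag + s * n, *c = sup + s * n;
    double *rhs = d + s * n, *cprime = cp + s * n, *sol = x + s * n;

    // Forward sweep: eliminate the subdiagonal.
    cprime[0] = c[0] / b[0];
    rhs[0] = rhs[0] / b[0];
    for (int i = 1; i < n; ++i) {
        double m = 1.0 / (b[i] - a[i] * cprime[i - 1]);
        cprime[i] = c[i] * m;
        rhs[i] = (rhs[i] - a[i] * rhs[i - 1]) * m;
    }
    // Back substitution.
    sol[n - 1] = rhs[n - 1];
    for (int i = n - 2; i >= 0; --i)
        sol[i] = rhs[i] - cprime[i] * sol[i + 1];
}
```

Note that with this system-major layout consecutive threads access addresses \(n\) doubles apart; a production kernel would interleave the systems (element \(i\) of all \(k\) systems adjacent) so each warp's loads coalesce.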
For \(k \geq 1024\) independent systems, warp occupancy is high and the kernel achieves near-peak utilisation. This is exactly the pattern arising in radiative transfer codes (one tridiagonal per atmospheric column per spectral band) and in banded matrix applications with large batch counts.
API: num::cuda::matvec, num::cuda::matmul, num::cuda::thomas_batched
A matrix \(A \in \mathbb{R}^{n \times n}\) is banded with lower bandwidth \(k_l\) and upper bandwidth \(k_u\) if
\[ A_{ij} = 0 \quad \text{for } i - j > k_l \text{ (below the band) or } j - i > k_u \text{ (above the band)} \]
Common special cases:

- diagonal: \(k_l = k_u = 0\)
- tridiagonal: \(k_l = k_u = 1\)
- pentadiagonal: \(k_l = k_u = 2\)
Exploiting band structure reduces storage from \(O(n^2)\) to \(O(n(k_l + k_u + 1))\) and factorization cost from \(O(n^3)\) to \(O(n k_l (k_l + k_u))\).
The standard compact format stores diagonals in the rows of a 2-D array with leading dimension \(\text{ldab} = 2k_l + k_u + 1\). Element \(A_{ij}\) within the band maps to
\[ \text{band}[k_l + k_u + i - j][j] = A_{ij} \]
using 0-based indexing. The array has \(\text{ldab}\) rows and \(n\) columns.
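A small helper making the mapping concrete (get_band is an illustrative accessor, not library API):

```cpp
#include <cassert>

// Read A(i, j) from the compact band array: column-major with
// ldab = 2*kl + ku + 1 rows, all indices 0-based.
double get_band(const double* band, int ldab, int kl, int ku,
                int i, int j) {
    assert(i - j <= kl && j - i <= ku); // must lie inside the band
    // Row (kl + ku + i - j) of column j; column j starts at j * ldab.
    return band[j * ldab + (kl + ku + i - j)];
}
```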
The upper \(k_l\) rows (indices \(0 \ldots k_l - 1\)) are reserved for fill-in during LU factorisation. With partial pivoting, the \(U\) factor can develop up to \(k_l\) additional super-diagonals beyond the original \(k_u\), so the storage must accommodate upper bandwidth \(k_l + k_u\).
Column-major storage ensures that all elements in a column are contiguous in memory, giving sequential access patterns during forward and back substitution.
The factorisation \(PA = LU\) proceeds column by column. For column \(j\):

1. Find the pivot: the largest-magnitude entry among the \(k_l + 1\) candidates \(A_{jj}, \ldots, A_{j+k_l,j}\).
2. Swap that row with row \(j\) (recorded in \(P\)).
3. Divide the entries below the pivot by it to form the multipliers of \(L\).
4. Apply the rank-1 update to the trailing window of at most \(k_l \times (k_l + k_u)\) entries inside the band.
Fill-in reaches at most \(k_l\) positions above the original upper diagonal, so \(U\) has bandwidth \(k_l + k_u\) – exactly the extra rows reserved in the band array.
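A simplified sketch of the elimination, omitting the pivot search (i.e. the no-pivoting variant that is safe for the diagonally dominant case discussed below); indexing follows the band format above, and the factors overwrite the band in place:

```cpp
#include <algorithm>

void banded_lu_nopivot(double* band, int ldab, int n, int kl, int ku) {
    // In-place band accessor (column-major, 0-based), as in get_band.
    auto at = [&](int i, int j) -> double& {
        return band[j * ldab + (kl + ku + i - j)];
    };
    for (int j = 0; j < n; ++j) {
        double pivot = at(j, j);
        int imax = std::min(j + kl, n - 1);   // last row inside the band
        int jmax = std::min(j + ku, n - 1);   // no pivoting: no fill past ku
        for (int i = j + 1; i <= imax; ++i) {
            double m = at(i, j) / pivot;      // multiplier L(i, j)
            at(i, j) = m;
            for (int c = j + 1; c <= jmax; ++c)
                at(i, c) -= m * at(j, c);     // rank-1 update in the band
        }
    }
}
```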
Complexity comparison:
| Method | Factorization | Solve | Storage |
|---|---|---|---|
| Dense LU | \(\tfrac{2}{3}n^3\) | \(2n^2\) | \(n^2\) |
| Banded LU | \(O\!\left(n k_l (k_l + k_u)\right)\) | \(O\!\left(n(k_l+k_u)\right)\) | \((2k_l+k_u+1)n\) |
| Thomas (tridiagonal) | \(8n\) | \(5n\) | \(4n\) |
For \(n = 10\,000\) with a pentadiagonal system (\(k_l = k_u = 2\)): dense LU costs \(\tfrac{2}{3}n^3 \approx 6.7 \times 10^{11}\) FLOPs and \(n^2 = 10^8\) stored entries, while banded LU costs on the order of \(n k_l (k_l + k_u) = 8 \times 10^4\) FLOPs and \((2k_l + k_u + 1)n = 7 \times 10^4\) entries – a factor of roughly \(10^7\) in work and \(10^3\) in memory.
Partial pivoting bounds the growth factor for dense LU at \(2^{n-1}\) in the worst case. For banded LU, the bound is much tighter:
\[ \rho \leq 2^{k_l} \]
because only \(k_l\) rows participate in each elimination step. For typical values \(k_l = 1\) or \(2\) the growth factor is bounded by 2 or 4, making banded LU extremely stable in practice.
For banded matrices that are diagonally dominant by columns ( \(|a_{jj}| \geq \sum_{i \neq j} |a_{ij}|\) for all \(j\)), pivoting is unnecessary: the diagonal is always the largest element in its column, there is no fill-in beyond the original bandwidth, and the factorisation is unconditionally stable. Laplacian discretisations and many radiative transfer systems fall into this category.
Given the factored \(PA = LU\), solving \(A\mathbf{x} = \mathbf{b}\) requires three sequential sweeps:

1. Permute: \(\mathbf{b}' = P\mathbf{b}\).
2. Forward substitution: solve \(L\mathbf{z} = \mathbf{b}'\).
3. Back substitution: solve \(U\mathbf{x} = \mathbf{z}\).
Total solve cost: \(O(n(k_l + k_u))\). For a pentadiagonal system with \(n = 10\,000\) this is roughly 40 000 FLOPs – negligible compared to any meaningful problem setup cost.
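A sketch of the two substitution sweeps for the no-pivoting factorisation above (so \(P = I\) and the permutation step disappears); the right-hand side is overwritten with the solution:

```cpp
#include <algorithm>

void banded_solve_nopivot(const double* band, int ldab, int n,
                          int kl, int ku, double* b) {
    auto at = [&](int i, int j) {
        return band[j * ldab + (kl + ku + i - j)];
    };
    // Forward substitution L z = b (L is unit lower triangular).
    for (int j = 0; j < n; ++j) {
        int imax = std::min(j + kl, n - 1);
        for (int i = j + 1; i <= imax; ++i)
            b[i] -= at(i, j) * b[j];
    }
    // Back substitution U x = z (U has bandwidth ku without pivoting).
    for (int j = n - 1; j >= 0; --j) {
        b[j] /= at(j, j);
        int imin = std::max(j - ku, 0);
        for (int i = imin; i < j; ++i)
            b[i] -= at(i, j) * b[j];
    }
}
```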
API: num::BandedMatrix, num::banded_lu, num::banded_solve, num::banded_matvec