# numerics
This page covers the theoretical limits that govern single-node performance, then traces the successive optimizations applied to dense matrix multiplication in this library: cache blocking, register blocking, and explicit SIMD vectorization. Each section connects the underlying hardware model to the implementation.
CPU clock speeds reached approximately 3–4 GHz around 2005 and have not risen meaningfully since. The root cause is power dissipation. The dynamic power consumed by a CMOS circuit scales as
\[ P \propto C V^2 f \]
where \(C\) is the switching capacitance of the circuit, \(V\) is the supply voltage, and \(f\) is the clock frequency. Higher frequency requires higher voltage to maintain signal integrity, so power scales roughly as \(f^3\). Doubling frequency would require roughly eight times the cooling capacity — a physical wall.
The industry response was to add more cores rather than faster cores. Exploiting those cores requires parallelism, which brings its own limits.
If a fraction \(s\) of a program is inherently serial — cannot be parallelized regardless of hardware — then with \(p\) processors the achievable speedup is
\[ S(p) = \frac{1}{s + \dfrac{1-s}{p}} \]
As \(p \to \infty\) the parallel term \((1-s)/p\) vanishes and the speedup saturates at
\[ S_{\max} = \frac{1}{s} \]
A program that is 5% serial (\(s = 0.05\)) can never exceed a 20× speedup, regardless of how many processors are used. Identifying and eliminating serial bottlenecks is therefore as important as adding parallelism.
A kernel's attainable performance is bounded by whichever limit it hits first: peak arithmetic throughput or memory bandwidth.
\[ \text{Perf} \leq \min\!\bigl(\text{Peak FLOP/s},\; I \times \text{BW}\bigr) \]
\(I\) is the arithmetic intensity — the ratio of floating-point operations performed to bytes transferred from main memory:
\[ I = \frac{\text{FLOPs}}{\text{Bytes transferred}} \]
Kernels below the "ridge point" (where the two bounds intersect) are memory-bandwidth bound; those above it are compute bound.
| Kernel | FLOPs | Bytes read | \(I\) (FLOP/byte) | Bound |
|---|---|---|---|---|
| AXPY \(\mathbf{y} \leftarrow \alpha\mathbf{x} + \mathbf{y}\) | \(2n\) | \(3n \times 8\) | \(\tfrac{1}{12}\) | Memory |
| Dot product \(\mathbf{x}^T\mathbf{y}\) | \(2n\) | \(2n \times 8\) | \(\tfrac{1}{8}\) | Memory |
| Matrix-vector \(A\mathbf{x}\) | \(2n^2\) | \((n^2+n) \times 8\) | \(\approx \tfrac{1}{4}\) | Memory |
| Matrix-matrix \(AB\) | \(2n^3\) | \(3n^2 \times 8\) | \(\tfrac{n}{12}\) | Compute (large \(n\)) |
Matrix-matrix multiplication is the only common dense linear algebra operation that becomes compute-bound as \(n\) grows. All vector operations and matvec are permanently memory-bound at practical sizes.
Data flows from registers through several cache levels to DRAM. Each level trades capacity for speed:
| Level | Typical size | Latency | Bandwidth |
|---|---|---|---|
| Registers | ~256 B | 0 cycles | — |
| L1 cache | 32–64 KB | 4 cycles | very high |
| L2 cache | 256 KB–4 MB | 12 cycles | high |
| L3 cache | 8–32 MB | 40 cycles | moderate |
| DRAM | 8–64 GB | 200+ cycles | ~50 GB/s |
A single L3 cache miss costs ~200 cycles — enough time to execute roughly 200 floating-point multiply-adds. Reducing cache misses is therefore frequently more impactful than reducing the FLOP count.
Data moves between levels in 64-byte cache lines (8 doubles). Loading one double from DRAM pays for a full 64-byte transfer; sequential access amortises that cost across 8 elements, while random access wastes 7 of them.
The naive triple loop over \(i \to j \to k\) has the inner access pattern:
| Access | Stride as \(k\) increases | Cache behaviour |
|---|---|---|
| \(A(i,k)\) | +8 bytes (sequential) | stays in L1 |
| \(B(k,j)\) | \(+N \times 8\) bytes (column jump) | one new cache line per step |
| \(C(i,j)\) | 0 (scalar accumulator) | register |
For \(N = 512\), each step of the inner \(k\)-loop jumps 4 KB through B — 64 cache lines past the previous access, so every step misses. The entire column of B is re-fetched from DRAM for every output element \(C(i,j)\). **B is read from DRAM \(O(N)\) times instead of once.**
For each of the \(N^2\) output elements the inner loop touches \(N\) distinct cache lines of B (one per row, all on different lines due to the column stride). Total B cache-line loads in the naive loop:
\[ \frac{N^2 \times N}{8} = \frac{N^3}{8} \approx 16.8 \times 10^6 \text{ lines} \quad (N = 512) \]
That is roughly 1 GB transferred from DRAM for a matrix that fits in 2 MB.
Cache blocking reduces this to
\[ \frac{N^3}{8 \, B_s} \]
where \(B_s\) is the tile side length. For \(B_s = 64\) and \(N = 512\) the count drops to approximately 262 000 lines — a 64× reduction.
Partition A, B, and C into \(B_s \times B_s\) sub-matrices and compute one tile of C at a time. The working set for three tiles is
\[ 3 \, B_s^2 \times 8 \text{ bytes} \]
Select \(B_s\) so this fits in the target cache level:
| Cache level | Typical size | Working set formula | Recommended \(B_s\) |
|---|---|---|---|
| L1 (32 KB) | 32 768 B | \(3 B_s^2 \times 8 \leq 32768\) | 32 |
| L1 (64 KB) | 65 536 B | — | 48 or 64 |
| L2 (256 KB) | 262 144 B | — | 64 or 96 |
| L2 (512 KB) | 524 288 B | — | 128 |
Round \(B_s\) to a multiple of the SIMD vector width (4 doubles for AVX-256, 8 for AVX-512) to assist the auto-vectoriser. ** \(B_s = 64\) is the safe default** for server CPUs with at least 256 KB L2.
The outer three loops iterate over tile indices. The ordering \(ii \to jj \to kk\) keeps the C tile \(C[ii{:}ii{+}B_s,\, jj{:}jj{+}B_s]\) resident in L2 across all \(kk\) iterations for a fixed \((ii, jj)\) pair. The tile is loaded once and written back once. Swapping to \(ii \to kk \to jj\) would evict the C tile between \(jj\) iterations, sacrificing half the benefit.
Inside each tile, the loop order matters too. The \(i \to k \to j\) order hoists the scalar \(A(i,k)\) into a register before the innermost \(j\)-loop:
\[ \mathbf{c}_{i,\,jj:jj+B_s} \mathrel{+}= A(i,k) \cdot \mathbf{b}_{k,\,jj:jj+B_s} \]
The innermost loop is a scaled vector addition (AXPY) over sequential memory in both B and C — the shape the compiler needs to auto-vectorise with SIMD. The alternative \(i \to j \to k\) ordering reintroduces column-stride access to B within the tile, reproducing the original problem at smaller scale.
Benchmarks compiled with -O2, single-threaded, on an 8-core machine with 4 MB of L2 per core:
| N | Naive | Blocked ( \(B_s = 64\)) | Speedup |
|---|---|---|---|
| 64 | 160 µs | 32 µs | 5.0× |
| 128 | 1 842 µs | 263 µs | 7.0× |
| 256 | 18 135 µs | 2 761 µs | 6.6× |
| 512 | 158 131 µs | 25 036 µs | 6.3× |
The fitted BigO constant drops from 1.18 to 0.19 — a consistent ~6× improvement with no parallelism and no explicit SIMD. Both variants remain \(O(N^3)\); only the constant changes, because the number of cache misses per FLOP falls dramatically.
API: num::matmul_blocked
Even with the B tile resident in L2, the cache-blocked micro-kernel still performs a load-modify-store of \(C(i,j)\) on every \(k\)-step. For a 64×64 cache tile with \(K = 512\) that is \(4096 \times 512 \approx 2\text{M}\) L2 load/store operations. The solution is to keep a small sub-tile of C entirely in registers (local variables), accumulate across the full \(kk\) block, then write C back to L2 only once per tile.
Declare a REG×REG array of double locals — for REG = 4 this fits in 16 scalar variables, mapped to 8 YMM registers under AVX-256 (two doubles per register when scalars are paired by the compiler). The structure is:
The \(kk\) loop remains outside the register tile, preserving B-tile L2 residency exactly as in cache blocking.
In isolation, scalar register blocking is 3× slower than plain cache blocking (measured BigO constant 0.57 vs 0.18). The explanation is vectorisation:
The cache-blocked micro-kernel's innermost \(j\)-loop runs for 64 iterations over sequential memory. The compiler auto-vectorises this as 16 AVX-256 FMA instructions. The register-blocked \(j\)-loop runs for only 4 iterations — too short to amortise loop overhead, and too small for the compiler to reliably vectorise. The savings on C memory traffic are real but smaller than the added overhead of the extra \(ir/jr\) loop nesting.
Register blocking only pays off when the tile width equals the SIMD vector width, so the "inner loop" collapses into a single vector instruction:
| Tile width | Scalar | AVX-256 | AVX-512 |
|---|---|---|---|
| Doubles per FMA | 1 | 4 | 8 |
| \(j\)-loop iterations | 4 | 1 | 1 |
| Loop overhead | 4 steps | 0 steps | 0 steps |
When the 4-iteration scalar loop becomes a single 256-bit vfmadd instruction, the loop overhead disappears entirely. A 4×4 register tile then exposes four independent FMA chains to the out-of-order scheduler, allowing the two FMA execution ports on modern x86 (Skylake and later) to stay fully occupied.
The BLAS micro-kernel from Goto (2008) is exactly this structure: four independent YMM accumulators (c0, c1, c2, c3) updated in parallel across the \(k\)-loop, written back to C once after the loop.
API: num::matmul_register_blocked
Two SIMD instruction sets are relevant to this library:
| ISA | Register width | Doubles per register | Key FMA instruction |
|---|---|---|---|
| AVX-256 (x86-64) | 256 bits | 4 | _mm256_fmadd_pd |
| ARM NEON (AArch64) | 128 bits | 2 | vfmaq_f64 |
The AVX-256 fused multiply-add _mm256_fmadd_pd computes \(\mathbf{c} \mathrel{+}= \mathbf{a} \times \mathbf{b}\) for 4 doubles simultaneously in a single instruction. With two FMA execution ports running at full speed, the theoretical peak throughput is
\[ \text{Peak GFLOP/s} = 2 \times 8 \times f_{\text{clock}} \]
where the factor of 2 counts the fused multiply and add, 8 is the number of doubles processed per cycle across both ports (4 doubles × 2 ports), and \(f_{\text{clock}}\) is the sustained clock frequency in GHz. ARM NEON's vfmaq_f64 processes 2 doubles per instruction — half the width of an AVX-256 FMA — so the per-cycle throughput of a NEON core (such as Apple's M-series) depends on how many FMA pipes it provides.
For the 4×4 register tile the \(k\)-loop body in AVX-256 is:
Four independent FMA chains allow the CPU's out-of-order engine to keep both FMA ports occupied every cycle. For NEON the same tile uses 8 vfmaq_f64 instructions per \(k\)-step (two NEON vectors cover the 4-wide column).
The CMake build detects the host architecture at configure time and defines one of:
- NUMERICS_HAS_AVX2 — x86-64 with AVX2 + FMA3 (enabled via -mavx2 -mfma)
- NUMERICS_HAS_NEON — AArch64 (NEON is mandatory, no flag needed)

The public entry point matmul_simd dispatches to the correct backend at compile time:
The fallback guarantees correct execution on any architecture, which is important for CI pipelines that build on heterogeneous machines. Note that -mfma is required on x86 to emit the fused instruction encoding; without it the compiler emits separate vmulpd + vaddpd pairs — same FLOP count, twice the instruction count.
The Jacobi SVD inner loop involves repeated inner products and Givens column rotations. In the SIMD path these two passes are fused: inner-product partial sums and the rotation update share the same loaded data, halving global-memory traffic for the column pair. The dispatch follows the same NUMERICS_HAS_AVX2 / NUMERICS_HAS_NEON / fallback pattern used by matmul.
All benchmarks are single-threaded, compiled with -O2:
| Kernel | BigO constant | vs naive |
|---|---|---|
| Naive \(ijk\) | 1.19 \(N^3\) | baseline |
| Cache-blocked | 0.18 \(N^3\) | 6.6× |
| SIMD (NEON) | 0.09 \(N^3\) | 13.9× |
| OpenBLAS equivalent | ~0.005 \(N^3\) | ~238× (~18× beyond SIMD) |
At \(N = 512\) the SIMD kernel runs in 11 498 µs versus 159 553 µs for naive — a wall-time reduction of 13.9×. The remaining gap to OpenBLAS is closed by wider tiles, B-panel packing (eliminating residual stride penalties), software prefetch, and multithreading.
Matrix-vector multiply is memory-bandwidth bound: every element of A is read exactly once. The SIMD path gives a modest 1.5–1.8× improvement because the bottleneck shifts from instruction throughput to memory bandwidth — no amount of vectorisation can exceed the available bandwidth.
| N | Scalar matvec | SIMD matvec | Speedup |
|---|---|---|---|
| 64 | 1 469 ns | 829 ns | 1.8× |
| 512 | 191 685 ns | 114 969 ns | 1.7× |
| 2048 | 3 744 813 ns | 2 445 977 ns | 1.5× |