numerics 0.1.0
This week we move from algorithmic complexity to machine complexity. Two programs can perform the same number of floating-point operations and yet run 10x apart in wall time, because one of them respects the cache hierarchy and the other does not.
Dense matrix multiplication is the canonical example. Its naive implementation is O(N^3) in FLOPs and in cache misses – and the cache misses dominate. A cache-blocked version achieves the same asymptotic complexity but cuts the constant by 6x without any parallelism or SIMD, purely through data reuse.
Every modern CPU has a multi-level cache between the processor registers and main memory. Each level trades capacity for latency:
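Representative figures – exact sizes and latencies vary by microarchitecture, so treat these as order-of-magnitude guides rather than a spec for any particular CPU:

| Level | Typical size | Typical latency |
|---|---|---|
| L1 | ~32 KB per core | ~4 cycles |
| L2 | ~256 KB – 1 MB per core | ~12 cycles |
| L3 | ~8 – 32 MB, shared | ~40 cycles |
| DRAM | gigabytes | ~200+ cycles |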
A cache miss occurs when data is not in cache and must be fetched from a slower level. A single L3 miss costs ~200 cycles – enough time to execute ~200 floating-point multiply-adds. Eliminating unnecessary misses is often more impactful than reducing FLOPs.
Data moves between levels in fixed-size chunks called cache lines (typically 64 bytes = 8 doubles). When you load one double from DRAM, you pay for a full 64-byte transfer. Sequential access patterns amortise that cost across 8 elements; random access wastes 7 of them.
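A minimal illustration of that difference, assuming a row-major N×N matrix stored in a std::vector<double> (function names are illustrative, not part of the library):

```cpp
#include <vector>

// Summing one row of length N touches ~N/8 cache lines (8 doubles per 64-byte line).
double sum_row(const std::vector<double>& M, int N, int i) {
    double s = 0.0;
    for (int j = 0; j < N; ++j) s += M[i * N + j];   // unit stride: every loaded byte is used
    return s;
}

// Summing one column touches N cache lines: each element sits on a different line,
// so 56 of every 64 bytes transferred are never used.
double sum_col(const std::vector<double>& M, int N, int j) {
    double s = 0.0;
    for (int i = 0; i < N; ++i) s += M[i * N + j];   // stride of N doubles per step
    return s;
}
```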
Arithmetic intensity (AI) measures how much computation you extract per byte moved from memory:
\[\text{AI} = \frac{\text{FLOPs}}{\text{bytes transferred}}\]
The roofline model says a kernel's performance is bounded by whichever limit it hits first:
\[\text{Performance} \leq \min\!\left(\text{Peak FLOP/s},\ \text{AI} \times \text{Memory BW}\right)\]
For naive NxN matrix multiply:
| Quantity | Value |
|---|---|
| FLOPs | \(2N^3\) (N^3 multiplies + N^3 adds) |
| Bytes read (naive) | \(\approx 2N^3 \times 8\) (each of the \(N^2\) outputs reads N elements of A and N of B with no cache reuse: \(2N^3\) element reads × 8 bytes) |
| Arithmetic intensity | \(\approx 0.125\) FLOP/byte |
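Dividing the two rows gives the intensity directly:
\[\text{AI} = \frac{2N^3\ \text{FLOPs}}{2N^3 \times 8\ \text{bytes}} = 0.125\ \text{FLOP/byte}\]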
0.125 FLOP/byte is deep in the memory-bandwidth-bound regime on every known CPU. A typical server has 50 GB/s of memory bandwidth and 1 TFLOP/s of peak compute – the ridge point (where the compute and memory limits meet) is at 1 TFLOP/s ÷ 50 GB/s = 20 FLOP/byte. We are 160x below it.
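Plugging the naive kernel into the roofline with those machine numbers makes the ceiling concrete:
\[\text{Performance} \leq \min\!\left(1\ \text{TFLOP/s},\ 0.125 \times 50\ \text{GB/s}\right) = 6.25\ \text{GFLOP/s}\]
About 0.6% of peak, no matter how well the arithmetic itself is scheduled.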
Blocking raises the arithmetic intensity toward that ridge point by reusing data already in cache rather than re-fetching it from DRAM.
All three matrices are stored row-major (C-style): element M(i, j) sits at address base + (i * cols + j) * sizeof(double).
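For reference, a minimal sketch of the naive kernel being analysed here – assuming contiguous row-major std::vector<double> storage; the signature is illustrative rather than the library's exact one:

```cpp
#include <vector>

// Naive i-j-k multiply: C = A * B, all N x N, row-major.
// Illustrative sketch; the real kernel may differ in interface.
void matmul_naive(const std::vector<double>& A,
                  const std::vector<double>& B,
                  std::vector<double>& C, int N) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;                        // C(i, j) accumulator in a register
            for (int k = 0; k < N; ++k) {
                // A(i, k): unit stride.  B(k, j): stride of N doubles per step.
                sum += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
}
```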
Trace the inner k-loop for a fixed output element C(i, j):
| Access | Address stride as k increases | Cache behaviour |
|---|---|---|
| A(i, k) | +8 bytes (sequential) | ✓ prefetchable, stays in L1 |
| B(k, j) | +N × 8 bytes (column jump) | ✗ one new cache line per step |
| C(i, j) | 0 (scalar accumulator) | ✓ held in a register |
B(k, j) for fixed j and varying k is a column of B – elements spaced N doubles apart. For N = 512, that stride is 4 KB per step, so every step lands on a new cache line. A 32 KB L1 cache holds only 512 such lines: by the time the k-loop finishes one column, the entire L1 has been cycled through for column j – and then the next j starts the same eviction pattern from scratch. B is read from DRAM O(N) times instead of once.
For N = 512 the naive loop reads B ~= 512 x (512 x 512 x 8) / 10^9 ~= 1 GB from DRAM. With 50 GB/s bandwidth that is 20 ms of stalls – on top of the actual arithmetic.
Instead of iterating over entire rows and columns, tile A, B and C into BLOCKxBLOCK sub-matrices and compute one tile at a time:
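A sketch of that loop structure, consistent with the matmul_blocked interface described below (runtime block_size, std::min boundary guards) but not necessarily line-for-line the library source. Note it accumulates with +=, so C must be zero-initialised by the caller:

```cpp
#include <algorithm>  // std::min
#include <vector>

// Cache-blocked multiply: C += A * B on BLOCK x BLOCK tiles, row-major, N x N.
// Illustrative sketch; C is assumed to be zero-initialised before the call.
void matmul_blocked(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C, int N, int block_size) {
    for (int ii = 0; ii < N; ii += block_size) {
        const int i_end = std::min(ii + block_size, N);
        for (int jj = 0; jj < N; jj += block_size) {
            const int j_end = std::min(jj + block_size, N);
            for (int kk = 0; kk < N; kk += block_size) {
                const int k_end = std::min(kk + block_size, N);
                // One tile product: C[ii:i_end, jj:j_end] += A[ii:i_end, kk:k_end] * B[kk:k_end, jj:j_end]
                for (int i = ii; i < i_end; ++i) {
                    for (int k = kk; k < k_end; ++k) {
                        const double a_ik = A[i * N + k];        // hoisted scalar (i-k-j order)
                        for (int j = jj; j < j_end; ++j) {
                            C[i * N + j] += a_ik * B[k * N + j]; // unit-stride AXPY
                        }
                    }
                }
            }
        }
    }
}
```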
The working set for three BLOCKxBLOCK tiles of doubles is:
\[\text{working set} = 3 \times \text{BLOCK}^2 \times 8\ \text{bytes}\]
| BLOCK | Working set | Fits in |
|---|---|---|
| 32 | 24 KB | L1 (32 KB) |
| 64 | 98 KB | L2 (256 KB) |
| 128 | 393 KB | L2 (512 KB) or L3 |
Pick BLOCK so the working set fits in the cache level you want to exploit. BLOCK = 64 is a robust default for server CPUs with 256 KB L2.
Within each tile, B's elements are accessed left-to-right along rows of a BLOCKxBLOCK sub-matrix – sequential, fully prefetchable, and staying in L2 for the duration of the tile computation. The column-stride problem disappears.
Why ii -> jj -> kk and not some other order?
For fixed (ii, jj), the C tile C[ii:i_end, jj:j_end] is updated by every kk block. With the ii-jj-kk ordering it stays in L2 across all kk iterations – loaded once, updated once per kk block (N/BLOCK times), written back once. If we swapped to ii-kk-jj the C tile would be evicted between jj iterations and reloaded from L2/L3 each time, losing half the benefit.
Why i -> k -> j and not i -> j -> k?
Compare the two orderings inside the tile:
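A side-by-side sketch of the two candidate inner orderings, using the same tile bounds as the sketch above (fragments, not standalone functions):

```cpp
// i-j-k inside the tile: the hot loop walks B down a column (stride N doubles).
for (int i = ii; i < i_end; ++i)
    for (int j = jj; j < j_end; ++j)
        for (int k = kk; k < k_end; ++k)
            C[i * N + j] += A[i * N + k] * B[k * N + j];   // B stride = N: cache-hostile

// i-k-j inside the tile: A(i, k) hoisted, B and C walked along rows (unit stride).
for (int i = ii; i < i_end; ++i)
    for (int k = kk; k < k_end; ++k) {
        const double a_ik = A[i * N + k];                  // lives in a register
        for (int j = jj; j < j_end; ++j)
            C[i * N + j] += a_ik * B[k * N + j];           // AXPY: auto-vectorisable
    }
```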
In i-j-k, B(k,j) for fixed j and varying k is again a column stride. We have replicated the original problem within the tile.
In i-k-j, A(i,k) is a scalar that the compiler hoists before the j loop. The innermost loop becomes a scaled vector addition (AXPY):
\[\mathbf{c}_{i,jj:j\_end} \mathrel{+}= a_{ik} \cdot \mathbf{b}_{k,jj:j\_end}\]
Both B(k,j) and C(i,j) are read/written left-to-right along rows – sequential memory access, unit stride. The compiler can auto-vectorise this loop with SIMD instructions (AVX-256 processes 4 doubles at once; AVX-512 processes 8).
Both functions are kept side-by-side so you can benchmark them directly. block_size is a runtime parameter so experiments at different tile sizes do not require recompilation.
The boundary guards (std::min) handle matrices whose dimensions are not exact multiples of block_size – no padding required.
Measured on an 8-core machine (L1d 64 KB, L2 4 MB per core). Compiled with -O2. Single-threaded, CPU only.
| N | Naive (µs) | Blocked (µs) | Speedup |
|---|---|---|---|
| 64 | 160 | 32 | 5.0x |
| 128 | 1842 | 263 | 7.0x |
| 256 | 18135 | 2761 | 6.6x |
| 512 | 158131 | 25036 | 6.3x |
The benchmark's fitted BigO coefficient drops from 1.18 to 0.19 – a consistent **~6.3x** improvement. Both are still O(N^3); the constant changes because the number of cache misses per FLOP falls dramatically.
Reproduce with the benchmark defined in benchmarks/bench_linalg.cpp:93-135.
Let N = 512, BLOCK = 64, cache line = 64 bytes (8 doubles).
For each of the N^2 output elements C(i, j), the inner k-loop reads the entire column B(:, j) – N elements spaced 4 KB apart, each on its own cache line. Even crediting the naive loop with perfect reuse of each line across the 8 adjacent columns that share it, the B traffic is at least:
\[\frac{N^2 \times N}{8} = \frac{512^3}{8} \approx 16.8 \times 10^6\ \text{lines}\]
Each cache line is 64 bytes -> 1.07 GB of B transferred from DRAM (for N=512, the matrix is only 2 MB). B is re-read ~500 times.
How often is each B tile loaded? For fixed (ii, jj, kk), the B tile B[kk:kk+BLOCK, jj:jj+BLOCK] is loaded once and used for all i in [ii, ii+BLOCK). It is loaded again the next time that (jj, kk) combination appears – i.e. for the next ii block – so each B tile is loaded N/BLOCK times in total.
Total B cache lines loaded:
\[\frac{(N/\text{BLOCK})^3 \times \text{BLOCK}^2}{8} = \frac{N^3}{8 \times \text{BLOCK}}\]
For N=512, BLOCK=64:
\[\frac{512^3}{8 \times 64} = \frac{16.8 \times 10^6}{64} \approx 262\,000\ \text{lines}\]
That is 64x fewer cache lines – matching the speedup order of magnitude. (The measured 6x rather than 64x reflects that the naive loop benefits from L2/L3 caching for small N, and the blocked loop has its own overhead for boundary handling and loop control.)
The goal is to fit three tiles in your target cache level: solving \(3 \times \text{BLOCK}^2 \times 8 \leq \text{cache size}\) for BLOCK gives \(\text{BLOCK} \leq \sqrt{\text{cache size}/24}\), which is where the last column below comes from.
| Cache level | Typical size | Optimal BLOCK |
|---|---|---|
| L1 (32 KB) | 32,768 bytes | 37 -> use 32 |
| L1 (64 KB) | 65,536 bytes | 52 -> use 48 or 64 |
| L2 (256 KB) | 262,144 bytes | 104 -> use 64 or 96 |
| L2 (512 KB) | 524,288 bytes | 148 -> use 128 |
Round down to a power of 2 or multiple of the SIMD width (4 doubles for AVX-256, 8 for AVX-512) to help the vectoriser.
In practice, BLOCK = 64 is the safe default for modern server CPUs with at least 256 KB L2. For a tuning exercise, benchmark matmul_blocked at BLOCK = 32, 48, 64, 96, 128 on your specific machine and pick the knee of the curve.
This week's implementation leaves significant performance on the table. The 0.19 N^3 constant compares to OpenBLAS's ~0.005 N^3 – roughly 40x gap remaining. That gap is closed in two further steps:
Instead of accumulating into C(i, j) in memory, accumulate into a small register tile (e.g. 4x4 or 8x4 doubles) declared as local variables. This eliminates the C load/store from the inner loop entirely and raises register reuse: the inner loop goes from updating C(i, j) through memory on every iteration to accumulating into double c00, c01, c02, c03, ... and writing back to C only after the micro-kernel completes.
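A hypothetical 1x4 register-tile micro-kernel illustrating the idea (a production kernel would use a larger tile, e.g. 8x4, and handle the remainder columns):

```cpp
// Computes C[i, jj:jj+4] += A[i, kk:k_end] * B[kk:k_end, jj:jj+4], row-major.
// The four accumulators live in registers; C is touched once, after the k loop.
static void microkernel_1x4(const double* A, const double* B, double* C,
                            int N, int i, int jj, int kk, int k_end) {
    double c0 = 0.0, c1 = 0.0, c2 = 0.0, c3 = 0.0;
    for (int k = kk; k < k_end; ++k) {
        const double a_ik = A[i * N + k];
        const double* b = &B[k * N + jj];
        c0 += a_ik * b[0];
        c1 += a_ik * b[1];
        c2 += a_ik * b[2];
        c3 += a_ik * b[3];
    }
    C[i * N + jj + 0] += c0;   // single write-back per accumulator
    C[i * N + jj + 1] += c1;
    C[i * N + jj + 2] += c2;
    C[i * N + jj + 3] += c3;
}
```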
The compiler sometimes auto-vectorises the j loop, but it cannot always prove safety (aliasing, alignment). Explicit SIMD intrinsics (AVX-256 on x86-64, NEON on ARMv8) guarantee vectorisation and unlock fused multiply-add (FMA) instructions:
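A minimal sketch of the vectorised inner AXPY with AVX2/FMA intrinsics (x86-64, compiled with -mavx2 -mfma; the function name and tail handling are illustrative):

```cpp
#include <immintrin.h>

// C[i, jj:jj+len] += a_ik * B[k, jj:jj+len], four doubles per fused multiply-add.
inline void axpy_row_fma(double a_ik, const double* b_row, double* c_row, int len) {
    const __m256d va = _mm256_set1_pd(a_ik);          // broadcast the hoisted scalar
    int j = 0;
    for (; j + 4 <= len; j += 4) {
        __m256d vb = _mm256_loadu_pd(b_row + j);      // unit-stride load from B's row
        __m256d vc = _mm256_loadu_pd(c_row + j);
        vc = _mm256_fmadd_pd(va, vb, vc);             // vc = va * vb + vc
        _mm256_storeu_pd(c_row + j, vc);
    }
    for (; j < len; ++j)                              // scalar tail for the remainder
        c_row[j] += a_ik * b_row[j];
}
```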
A 256-bit FMA processes 4 doubles at once; an AVX-512 FMA processes 8. With two FMA ports, a modern core can retire 2 x 8 = 16 double-precision FMAs – 32 FLOPs – per cycle.
i -> k -> j inner order hoists a scalar into a register and leaves the innermost loop as a sequential AXPY – the shape the compiler needs to auto-vectorise.

Exercises:

- Run BM_MatmulBlocked_CPU with block_size set to 32, 48, 64, 96, 128 and plot wall time vs block size. Identify the optimal value for your machine. Does it match the formula in Sec. 9?
- Add a matmul_blocked call to the CG solver benchmark and measure whether CG benefits (it uses matvec, not matmul – what does that tell you about where to apply blocking next?).
- Profile both kernels with perf stat -e cache-misses on Linux. Verify that the cache-miss count drops by roughly the factor predicted in Sec. 8.
- Add #pragma omp parallel for to the outer ii loop of matmul_blocked. What speedup do you observe on 4 threads? Is it linear? What limits it?