# Numerics
This week we move from algorithmic complexity to machine complexity. Two programs can perform the same number of floating-point operations and yet run 10× apart in wall time, because one of them respects the cache hierarchy and the other does not.
Dense matrix multiplication is the canonical example. Its naive implementation is O(N³) in FLOPs and in cache misses — and the cache misses dominate. A cache-blocked version achieves the same asymptotic complexity but cuts the constant by 6× without any parallelism or SIMD, purely through data reuse.
Every modern CPU has a multi-level cache between the processor registers and main memory. Each level trades capacity for latency: L1 is tiny (tens of KB) but roughly 4 cycles away, L2 is larger at ~12 cycles, L3 is shared and multi-megabyte at ~40 cycles, and DRAM is effectively unbounded but hundreds of cycles away.
A cache miss occurs when data is not in cache and must be fetched from a slower level. A single L3 miss costs ~200 cycles — enough time to execute ~200 floating-point multiply-adds. Eliminating unnecessary misses is often more impactful than reducing FLOPs.
Data moves between levels in fixed-size chunks called cache lines (typically 64 bytes = 8 doubles). When you load one double from DRAM, you pay for a full 64-byte transfer. Sequential access patterns amortise that cost across 8 elements; random access wastes 7 of them.
Arithmetic intensity (AI) measures how much computation you extract per byte moved from memory:
\[\text{AI} = \frac{\text{FLOPs}}{\text{bytes transferred}}\]
The roofline model says a kernel's performance is bounded by whichever limit it hits first:
\[\text{Performance} \leq \min\!\left(\text{Peak FLOP/s},\ \text{AI} \times \text{Memory BW}\right)\]
For naive N×N matrix multiply:
| Quantity | Value |
|---|---|
| FLOPs | \(2N^3\) (N³ multiplies + N³ adds) |
| Bytes read (naive) | \(\approx 2N^3 \times 8\) (each element of A and B is read N times) |
| Arithmetic intensity | \(\approx 0.125\) FLOP/byte |
0.125 FLOP/byte is deep in the memory-bandwidth-bound regime on every known CPU. A typical server has 50 GB/s of memory bandwidth and 1 TFLOP/s of peak compute — the ridge point (where compute and memory limits meet) is at 1000 GFLOP/s ÷ 50 GB/s = 20 FLOP/byte. We are 160× below it.
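The arithmetic above can be checked in a few lines. This is a quick sanity check with the illustrative numbers quoted in this section (1 TFLOP/s, 50 GB/s, AI = 0.125), not measured values:

```cpp
#include <algorithm>

// Roofline sanity check for the numbers quoted above (illustrative values):
// 1 TFLOP/s peak, 50 GB/s memory bandwidth, AI = 0.125 FLOP/byte.
constexpr double peak_gflops = 1000.0;  // peak compute, GFLOP/s
constexpr double bw_gbs      = 50.0;    // memory bandwidth, GB/s
constexpr double ai_naive    = 0.125;   // naive matmul arithmetic intensity

// Ridge point: the AI at which a kernel stops being bandwidth-bound.
constexpr double ridge = peak_gflops / bw_gbs;  // 20 FLOP/byte

// Roofline bound: performance <= min(peak, AI * bandwidth).
inline double roofline_bound(double intensity) {
    return std::min(peak_gflops, intensity * bw_gbs);  // GFLOP/s
}
```

At AI = 0.125 the bound is 6.25 GFLOP/s — 0.6% of the 1 TFLOP/s peak, which is why blocking pays off before any SIMD work does.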
Blocking raises the arithmetic intensity toward that ridge point by reusing data already in cache rather than re-fetching it from DRAM.
All three matrices are stored row-major (C-style): element M(i, j) sits at address base + (i * cols + j) * sizeof(double).
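As a concrete baseline, a minimal naive triple loop over that row-major layout might look like the following (the function name matmul_naive is illustrative):

```cpp
#include <cstddef>

// Naive row-major N x N matrix multiply: C = A * B.
// A sketch of the baseline discussed here; matmul_naive is an illustrative name.
void matmul_naive(const double* A, const double* B, double* C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            double acc = 0.0;  // C(i,j) accumulator, lives in a register
            for (std::size_t k = 0; k < N; ++k)
                // A: unit stride (+8 bytes per k); B: stride N doubles per k
                acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}
```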
Trace the inner k-loop for a fixed output element C(i, j):
| Access | Address stride as k increases | Cache behaviour |
|---|---|---|
| A(i, k) | +8 bytes (sequential) | ✓ prefetchable, stays in L1 |
| B(k, j) | +N × 8 bytes (column jump) | ✗ one new cache line per step |
| C(i, j) | 0 (scalar accumulator) | ✓ register |
B(k, j) for fixed j and varying k is a column of B — elements spaced N doubles apart. For N = 512, that stride is 4 KB per step, so every element lands on its own cache line. A 32 KB L1 cache holds 512 such lines. After 512 steps the entire L1 has been cycled through for column j — and then the next j starts the same eviction pattern from scratch. B is read from DRAM O(N) times instead of once.
For N = 512 the naive loop reads B ≈ 512 × (512 × 512 × 8) / 10⁹ ≈ 1 GB from DRAM. With 50 GB/s bandwidth that is 20 ms of stalls — on top of the actual arithmetic.
Instead of iterating over entire rows and columns, tile A, B and C into BLOCK×BLOCK sub-matrices and compute one tile at a time:
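A sketch of the blocked loop nest, assuming the row-major layout above (the function name matmul_blocked matches the benchmarks referenced later; this is a sketch of the approach, not the repository's exact code):

```cpp
#include <algorithm>
#include <cstddef>

// Cache-blocked matmul: ii-jj-kk over BLOCK x BLOCK tiles, i-k-j inside each
// tile. block_size is a runtime parameter; the std::min guards handle matrix
// dimensions that are not exact multiples of block_size. C must start zeroed.
void matmul_blocked(const double* A, const double* B, double* C,
                    std::size_t N, std::size_t block_size) {
    for (std::size_t ii = 0; ii < N; ii += block_size) {
        const std::size_t i_end = std::min(ii + block_size, N);
        for (std::size_t jj = 0; jj < N; jj += block_size) {
            const std::size_t j_end = std::min(jj + block_size, N);
            for (std::size_t kk = 0; kk < N; kk += block_size) {
                const std::size_t k_end = std::min(kk + block_size, N);
                // i-k-j inside the tile: hoist A(i,k) into a register,
                // then stream rows of B and C at unit stride (an AXPY).
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a_ik = A[i * N + k];
                        for (std::size_t j = jj; j < j_end; ++j)
                            C[i * N + j] += a_ik * B[k * N + j];
                    }
            }
        }
    }
}
```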
The working set for three BLOCK×BLOCK tiles of doubles is:
\[\text{working set} = 3 \times \text{BLOCK}^2 \times 8\ \text{bytes}\]
| BLOCK | Working set | Fits in |
|---|---|---|
| 32 | 24 KB | L1 (32 KB) |
| 64 | 96 KB | L2 (256 KB) |
| 128 | 384 KB | L2 (512 KB) or L3 |
Pick BLOCK so the working set fits in the cache level you want to exploit. BLOCK = 64 is a robust default for server CPUs with 256 KB L2.
Within each tile, B's elements are accessed left-to-right along rows of a BLOCK×BLOCK sub-matrix — sequential, fully prefetchable, and staying in L2 for the duration of the tile computation. The column-stride problem disappears.
Why ii → jj → kk and not some other order?
For fixed (ii, jj), the C tile C[ii:i_end, jj:j_end] is updated by every kk block. With the ii-jj-kk ordering it stays in L2 across all kk iterations — loaded once, updated N/BLOCK times (once per kk block), written back once. If we swapped to ii-kk-jj the C tile would be evicted between jj iterations and reloaded from L2/L3 each time, losing half the benefit.
Why i → k → j and not i → j → k?
Compare the two orderings inside the tile:
In i-j-k, B(k,j) for fixed j and varying k is again a column stride. We have replicated the original problem within the tile.
In i-k-j, A(i,k) is a scalar that the compiler hoists before the j loop. The innermost loop becomes a scaled vector addition (AXPY):
\[\mathbf{c}_{i,jj:j\_end} \mathrel{+}= a_{ik} \cdot \mathbf{b}_{k,jj:j\_end}\]
Both B(k,j) and C(i,j) are read/written left-to-right along rows — sequential memory access, unit stride. The compiler can auto-vectorise this loop with SIMD instructions (AVX-256 processes 4 doubles at once; AVX-512 processes 8).
Both functions are kept side-by-side so you can benchmark them directly. block_size is a runtime parameter so experiments at different tile sizes do not require recompilation.
The boundary guards (std::min) handle matrices whose dimensions are not exact multiples of block_size — no padding required.
Measured on an 8-core machine (L1d 64 KB, L2 4 MB per core). Compiled with -O2. Single-threaded, CPU only.
| N | Naive | Blocked | Speedup |
|---|---|---|---|
| 64 | 160 µs | 32 µs | 5.0× |
| 128 | 1842 µs | 263 µs | 7.0× |
| 256 | 18135 µs | 2761 µs | 6.6× |
| 512 | 158131 µs | 25036 µs | 6.3× |
The BigO coefficient drops from 1.18 → 0.19 — a consistent **~6.3×** improvement. Both are still O(N³); the constant changes because the number of cache misses per FLOP falls dramatically.
Reproduce with the benchmark source: benchmarks/bench_linalg.cpp:93–135
Let N = 512, BLOCK = 64, cache line = 64 bytes (8 doubles).
For each of N² output elements C(i,j), the inner k-loop reads the entire column B(:, j) — N elements, all on different cache lines (stride 4 KB >> 64 bytes). Total B cache lines loaded:
\[\frac{N^2 \times N}{8} = \frac{512^3}{8} \approx 16.8 \times 10^6\ \text{lines}\]
Each cache line is 64 bytes → 1.07 GB of B transferred from DRAM (for N=512, the matrix is only 2 MB). B is re-read ~500 times.
In the ii-jj-kk ordering, the B tile B[kk:k_end, jj:j_end] is loaded once per (ii, jj, kk) triple and reused for every row i in [ii, i_end). The same (jj, kk) tile is loaded again for each subsequent ii block, so each B tile is loaded N/BLOCK times in total.
Total B cache lines loaded:
\[\frac{(N/\text{BLOCK})^3 \times \text{BLOCK}^2}{8} = \frac{N^3}{8 \times \text{BLOCK}}\]
For N=512, BLOCK=64:
\[\frac{512^3}{8 \times 64} = \frac{16.8 \times 10^6}{64} \approx 262\,000\ \text{lines}\]
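The two counts and their ratio follow directly from the model. A few constexpr lines make the arithmetic checkable:

```cpp
#include <cstdint>

// Cache-line counts from the back-of-the-envelope model above:
// N = 512, BLOCK = 64, 8 doubles per 64-byte line. In the naive loop every
// B column element occupies its own line; the blocked loop reloads each
// B tile once per ii block.
constexpr std::uint64_t N = 512, BLOCK = 64, PER_LINE = 8;

constexpr std::uint64_t naive_lines   = N * N * N / PER_LINE;            // ~16.8M
constexpr std::uint64_t blocked_lines = N * N * N / (PER_LINE * BLOCK);  // ~262K
constexpr std::uint64_t reduction     = naive_lines / blocked_lines;     // = BLOCK
```

The reduction factor is exactly BLOCK — which is why larger tiles help, right up until the working set spills out of cache.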
That is 64× fewer cache lines — matching the speedup order of magnitude. (The measured 6× rather than 64× reflects that the naive loop benefits from L2/L3 caching for small N, and the blocked loop has its own overhead for boundary handling and loop control.)
The goal is to fit three tiles in your target cache level.
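Setting the three-tile working set equal to the cache capacity and solving for BLOCK gives the bound behind the table's "Optimal BLOCK" column:

\[3 \times \text{BLOCK}^2 \times 8 \leq \text{cache size} \quad\Longrightarrow\quad \text{BLOCK} \leq \sqrt{\frac{\text{cache size}}{24}}\]

For a 32 KB L1, \(\sqrt{32{,}768 / 24} \approx 37\).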
| Cache level | Typical size | Optimal BLOCK |
|---|---|---|
| L1 (32 KB) | 32,768 bytes | 37 → use 32 |
| L1 (64 KB) | 65,536 bytes | 52 → use 48 |
| L2 (256 KB) | 262,144 bytes | 104 → use 64 or 96 |
| L2 (512 KB) | 524,288 bytes | 148 → use 128 |
Round down to a power of 2 or multiple of the SIMD width (4 doubles for AVX-256, 8 for AVX-512) to help the vectoriser.
In practice, BLOCK = 64 is the safe default for modern server CPUs with at least 256 KB L2. For a tuning exercise, benchmark matmul_blocked at BLOCK = 32, 48, 64, 96, 128 on your specific machine and pick the knee of the curve.
This week's implementation leaves significant performance on the table. The 0.19 N³ constant compares to OpenBLAS's ~0.005 N³ — roughly 40× gap remaining. That gap is closed in two further steps:
Instead of accumulating into C(i,j) in memory, accumulate into a small register tile (e.g. 4×4 or 8×4 doubles) declared as local variables. This eliminates the C load/store in the inner loop entirely and raises register reuse: the inner loop accumulates into double c00, c01, c02, c03, … and writes back to C only after the micro-kernel completes.
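A sketch of such a micro-kernel, simplified to a 1×4 strip of C so the idea stays short (the name matmul_regtile is hypothetical, and the N % 4 == 0 assumption sidesteps remainder handling that a real kernel would need):

```cpp
#include <cstddef>

// Register-tile sketch: accumulate a 1x4 strip of C in local variables so
// the inner k loop touches no C memory at all. Assumes N % 4 == 0 for
// brevity; a real micro-kernel would also tile i and handle remainders.
void matmul_regtile(const double* A, const double* B, double* C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; j += 4) {
            double c0 = 0, c1 = 0, c2 = 0, c3 = 0;  // register accumulators
            for (std::size_t k = 0; k < N; ++k) {
                const double a = A[i * N + k];       // hoisted scalar
                c0 += a * B[k * N + j + 0];
                c1 += a * B[k * N + j + 1];
                c2 += a * B[k * N + j + 2];
                c3 += a * B[k * N + j + 3];
            }
            // Write back once, after the k loop completes.
            C[i * N + j + 0] = c0;  C[i * N + j + 1] = c1;
            C[i * N + j + 2] = c2;  C[i * N + j + 3] = c3;
        }
}
```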
The compiler sometimes auto-vectorises the j loop, but it cannot always prove safety (aliasing, alignment). Explicit SIMD intrinsics (AVX-256 on x86-64, NEON on ARMv8) guarantee vectorisation and unlock fused multiply-add (FMA) instructions:
An AVX-256 FMA processes 4 doubles simultaneously; an AVX-512 FMA processes 8. With two AVX-512 FMA ports per core, peak throughput on a modern CPU is 2 × 8 = 16 double-precision FMAs per cycle.
i → k → j inner order hoists a scalar into a register and leaves the innermost loop as a sequential AXPY — the shape the compiler needs to auto-vectorise.

Exercises:

1. Run BM_MatmulBlocked_CPU with block_size set to 32, 48, 64, 96, 128 and plot wall time vs block size. Identify the optimal value for your machine. Does it match the formula in §9?
2. Add a matmul_blocked call to the CG solver benchmark and measure whether CG benefits (it uses matvec, not matmul — what does that tell you about where to apply blocking next?).
3. Run perf stat -e cache-misses on Linux. Verify that the cache-miss count drops by roughly the factor predicted in §8.
4. Add #pragma omp parallel for to the outer ii loop of matmul_blocked. What speedup do you observe on 4 threads? Is it linear? What limits it?