Overview
Last week, register blocking in scalar C++ regressed: shrinking the inner j-loop from 64 to 4 iterations broke auto-vectorisation. The fix is to make those 4 iterations one vector instruction, processing all 4 elements simultaneously. That is what explicit SIMD delivers.
This week we implement two backends — AVX-256 for x86-64, ARM NEON for AArch64 — behind a single matmul_simd / matvec_simd API, selected at compile time.
Results on Apple M (ARM NEON, 2 doubles/register):
| N | Naive | Cache-blocked | SIMD (NEON) | vs Naive | vs Cache-blocked |
|---|---|---|---|---|---|
| 64 | 161 µs | 32 µs | 19 µs | 8.3× | 1.7× |
| 128 | 1849 µs | 263 µs | 154 µs | 12.0× | 1.7× |
| 256 | 18214 µs | 2756 µs | 1343 µs | 13.6× | 2.1× |
| 512 | 159553 µs | 24793 µs | 11498 µs | 13.9× | 2.2× |
BigO coefficient (ns per N³): 1.19 → 0.18 → 0.09 — another 2× from SIMD alone.
1. SIMD Fundamentals
SIMD (Single Instruction, Multiple Data) executes one instruction on a vector of elements simultaneously. The two relevant ISA extensions:
| ISA | Register width | Doubles per register | Instruction set |
|---|---|---|---|
| AVX-256 (x86-64) | 256 bits | 4 | _mm256_* (<immintrin.h>) |
| ARM NEON (AArch64) | 128 bits | 2 | v*q_f64 (<arm_neon.h>) |
| AVX-512 (x86-64) | 512 bits | 8 | _mm512_* |
On x86-64, the 256-bit registers are named YMM0–YMM15 (16 registers). On AArch64, the 128-bit SIMD registers are named V0–V31 (32 registers).
The key instruction in both cases is fused multiply-add (FMA):
AVX-256: _mm256_fmadd_pd(a, b, c) → c[0..3] += a[0..3] * b[0..3]
NEON: vfmaq_f64(c, a, b) → c[0..1] += a[0..1] * b[0..1]
One FMA instruction replaces what would be a 4-iteration (or 2-iteration) scalar loop, and the fused operation typically has no higher latency than a plain multiply.
2. The Micro-kernel
Why register blocking + SIMD work together
From week 7: register blocking's 4-iteration j-loop is too small for the compiler to vectorise effectively in scalar code. That same 4-iteration loop maps to one _mm256_fmadd_pd (AVX) or two vfmaq_f64 (NEON) instructions, so the "loop" disappears entirely, taking its per-iteration overhead with it.
The 4×4 register tile processes one k-step as:
// One step of the k-loop (AVX-256 pseudocode):
b = load B[k, jr..jr+3] // 1 vmovupd — 4 doubles
c0 += broadcast(A[ir+0, k]) * b // 1 vbroadcastsd + 1 vfmadd
c1 += broadcast(A[ir+1, k]) * b // 1 vbroadcastsd + 1 vfmadd
c2 += broadcast(A[ir+2, k]) * b // 1 vbroadcastsd + 1 vfmadd
c3 += broadcast(A[ir+3, k]) * b // 1 vbroadcastsd + 1 vfmadd
4 independent FMAs per k-step. With two FMA execution ports on modern x86 CPUs (Skylake and later), the CPU can issue 2 FMAs per cycle → theoretical throughput of 2 × 4 = 8 doubles/cycle.
For NEON, the 4×4 tile is split into two halves (lo = j..j+1, hi = j+2..j+3):
// One step of the k-loop (NEON pseudocode):
blo = load B[k, jr..jr+1] // 2 doubles
bhi = load B[k, jr+2..jr+3] // 2 doubles
c0lo += vdup(A[ir+0,k]) * blo // vfmaq_f64
c0hi += vdup(A[ir+0,k]) * bhi // vfmaq_f64
// ... ×4 rows = 8 vfmaq_f64 per k-step
8 independent NEON FMAs per k-step, each processing 2 doubles.
AVX-256 tile (<tt>src/core/matrix_simd.cpp</tt>)
static inline void avx_tile_4x4(const Matrix& A, const Matrix& B, Matrix& C,
                                idx ir, idx jr, idx kk, idx k_lim)
{
    const idx N = B.cols();
    real* Crow = C.data() + ir * N;
    __m256d c0 = _mm256_loadu_pd(Crow + 0 * N + jr);
    __m256d c1 = _mm256_loadu_pd(Crow + 1 * N + jr);
    __m256d c2 = _mm256_loadu_pd(Crow + 2 * N + jr);
    __m256d c3 = _mm256_loadu_pd(Crow + 3 * N + jr);
    for (idx k = kk; k < k_lim; ++k) {
        __m256d b = _mm256_loadu_pd(B.data() + k * N + jr);
        c0 = _mm256_fmadd_pd(_mm256_set1_pd(A(ir+0, k)), b, c0);
        c1 = _mm256_fmadd_pd(_mm256_set1_pd(A(ir+1, k)), b, c1);
        c2 = _mm256_fmadd_pd(_mm256_set1_pd(A(ir+2, k)), b, c2);
        c3 = _mm256_fmadd_pd(_mm256_set1_pd(A(ir+3, k)), b, c3);
    }
    // write all four accumulator rows back — storing only c0 would drop
    // three quarters of the tile's results
    _mm256_storeu_pd(Crow + 0 * N + jr, c0);
    _mm256_storeu_pd(Crow + 1 * N + jr, c1);
    _mm256_storeu_pd(Crow + 2 * N + jr, c2);
    _mm256_storeu_pd(Crow + 3 * N + jr, c3);
}
c0..c3 are four independent YMM accumulators, so the CPU's out-of-order engine can keep all four FMA dependency chains in flight simultaneously, hiding the FMA latency.
ARM NEON tile (<tt>src/core/matrix_simd.cpp</tt>)
static inline void neon_tile_4x4(const Matrix& A, const Matrix& B, Matrix& C,
                                 idx ir, idx jr, idx kk, idx k_lim)
{
    const idx N = B.cols();
    real* Crow = C.data() + ir * N;
    float64x2_t c0lo = vld1q_f64(Crow + 0*N + jr);
    float64x2_t c0hi = vld1q_f64(Crow + 0*N + jr + 2);
    // ... c1lo/c1hi, c2lo/c2hi, c3lo/c3hi loaded the same way
    for (idx k = kk; k < k_lim; ++k) {
        float64x2_t blo = vld1q_f64(B.data() + k*N + jr);
        float64x2_t bhi = vld1q_f64(B.data() + k*N + jr + 2);
        float64x2_t a0 = vdupq_n_f64(A(ir+0, k));
        c0lo = vfmaq_f64(c0lo, a0, blo);
        c0hi = vfmaq_f64(c0hi, a0, bhi);
        // ... rows 1-3 analogous: broadcast A(ir+r, k), two FMAs each
    }
    vst1q_f64(Crow + 0*N + jr,     c0lo);
    vst1q_f64(Crow + 0*N + jr + 2, c0hi);
    // ... stores for rows 1-3 analogous
}
AArch64 has 32 128-bit NEON registers. The 4×4 tile uses 8 for C, 2 for B, 4 for A broadcasts = 14 total. Plenty of headroom.
3. matvec_simd: Dot Product with Horizontal Reduction
Matrix-vector multiply is simpler: each row is a dot product.
AVX-256 version
for (idx i = 0; i < M; ++i) {
    __m256d acc = _mm256_setzero_pd();
    idx j = 0;
    for (; j + 4 <= N; j += 4) {
        __m256d a  = _mm256_loadu_pd(A.data() + i * N + j);
        __m256d xv = _mm256_loadu_pd(x.data() + j);
        acc = _mm256_fmadd_pd(a, xv, acc);
    }
    // horizontal reduction: collapse 4 partial sums into 1
    __m128d lo  = _mm256_castpd256_pd128(acc);
    __m128d hi  = _mm256_extractf128_pd(acc, 1);
    __m128d sum = _mm_add_pd(lo, hi);
    sum = _mm_hadd_pd(sum, sum);
    real tail = 0.0;
    for (; j < N; ++j)          // scalar tail for N % 4 != 0
        tail += A(i, j) * x[j];
    y[i] = _mm_cvtsd_f64(sum) + tail;
}
The horizontal reduction collapses 4 partial sums into 1 at the end of each row — one reduction per row, not per element.
NEON version
float64x2_t acc = vdupq_n_f64(0.0);
idx j = 0;
for (; j + 2 <= N; j += 2) {
    acc = vfmaq_f64(acc, vld1q_f64(A.data() + i*N + j),
                         vld1q_f64(x.data() + j));
}
real result = vgetq_lane_f64(acc, 0) + vgetq_lane_f64(acc, 1);
for (; j < N; ++j)              // scalar tail for odd N
    result += A(i, j) * x[j];
vgetq_lane_f64 extracts each lane — two scalar adds instead of a full shuffle sequence.
matvec benchmark results
| N | Scalar matvec (CPU) | SIMD matvec | Speedup |
|---|---|---|---|
| 64 | 1469 ns (21.4 GB/s) | 829 ns (38.0 GB/s) | 1.8× |
| 128 | 8288 ns (15.0 GB/s) | 4641 ns (26.7 GB/s) | 1.8× |
| 256 | 34741 ns (14.2 GB/s) | 24035 ns (20.5 GB/s) | 1.4× |
| 512 | 191685 ns (10.2 GB/s) | 114969 ns (17.1 GB/s) | 1.7× |
| 2048 | 3744813 ns (8.4 GB/s) | 2445977 ns (12.8 GB/s) | 1.5× |
matvec is memory-bandwidth-bound (every element of A is read exactly once and never reused), so the SIMD speedup is modest (~1.5–1.8×): the bottleneck is the memory bus, not SIMD efficiency.
4. Compile-Time Dispatch
Architecture detection (<tt>CMakeLists.txt</tt>)
include(CheckCXXCompilerFlag)
if(CMAKE_SYSTEM_PROCESSOR MATCHES "x86_64|AMD64|i686")
check_cxx_compiler_flag("-mavx2 -mfma" COMPILER_SUPPORTS_AVX2)
if(COMPILER_SUPPORTS_AVX2)
target_compile_options(numerics PUBLIC -mavx2 -mfma)
target_compile_definitions(numerics PUBLIC NUMERICS_HAS_AVX2)
endif()
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64|aarch64|ARM64|AARCH64")
target_compile_definitions(numerics PUBLIC NUMERICS_HAS_NEON)
endif()
-mavx2 -mfma tells the compiler to emit YMM instructions and enables FMA3 instruction encoding. Without -mfma, _mm256_fmadd_pd compiles to separate MUL + ADD instructions — two instructions instead of one.
On AArch64, NEON is mandatory and always enabled; no flags needed.
Dispatch function (<tt>src/core/matrix_simd.cpp</tt>)
void matmul_simd(const Matrix& A, const Matrix& B, Matrix& C, idx block_size)
{
#if defined(NUMERICS_HAS_AVX2)
matmul_avx(A, B, C, block_size);
#elif defined(NUMERICS_HAS_NEON)
matmul_neon(A, B, C, block_size);
#else
matmul_blocked(A, B, C, block_size);
#endif
}
The fallback ensures the library compiles and runs correctly on any architecture — important for CI/CD on heterogeneous build farms.
5. Boundary Handling
The tile functions process complete 4×4 blocks. When M or N is not a multiple of 4, boundary rows and columns fall through to scalar loops:
idx ir = ii;
for (; ir + 4 <= i_lim; ir += 4) {
    idx jr = jj;
    for (; jr + 4 <= j_lim; jr += 4)
        simd_tile(A, B, C, ir, jr, kk, k_lim);
    for (; jr < j_lim; ++jr)
        ...                         // scalar loop over the leftover columns
}
for (; ir < i_lim; ++ir)
    ...                             // scalar loop over the leftover rows
For benchmark sizes that are powers of 2 (64, 128, 256, 512) and a tile width of 4, the boundary path is never reached — all elements fall into full tiles. For production use, the scalar tail handles all cases correctly with no padding required.
6. Complete Benchmark Results
Measured on Apple M (AArch64, NEON), -O2, single-threaded.
Benchmark Time Complexity
------------------------------------------------------------
BM_Matmul_CPU/64 161 µs
BM_Matmul_CPU/512 159553 µs
BM_Matmul_CPU_BigO 1.19 N³ (baseline)
BM_MatmulBlocked_CPU/64 32 µs
BM_MatmulBlocked_CPU/512 24793 µs
BM_MatmulBlocked_CPU_BigO 0.18 N³ (6.6× vs naive)
BM_MatmulSIMD_CPU/64 19 µs
BM_MatmulSIMD_CPU/128 154 µs
BM_MatmulSIMD_CPU/256 1343 µs
BM_MatmulSIMD_CPU/512 11498 µs
BM_MatmulSIMD_CPU_BigO 0.09 N³ (13.9× vs naive, 2.2× vs blocked)
Progression table
Naive i-j-k 1.19 N³ (baseline)
+ cache blocking 0.18 N³ (6.6× gain)
+ SIMD (NEON) 0.09 N³ (2.2× gain on top — 13.9× total)
─────────────────────────────────────────────────────────────────────
OpenBLAS equivalent ~0.005 N³ (remaining gap: ~18×)
The remaining gap to OpenBLAS is closed by:
- Wider tiles — NEON's 2-doubles-wide register is half of AVX-256's 4; on x86 the gains would be proportionally larger.
- B panel packing — packing the B tile into a contiguous buffer before the k-loop eliminates stride penalties and enables prefetch.
- Software prefetch — __builtin_prefetch hints load the next B tile into L1 while the CPU is computing the current one.
- Multithreading — OpenMP parallel for on the ii loop.
7. Key Takeaways
- SIMD turns a 4-iteration loop into one instruction. The register tile from week 7 was the right structure; it just needed vector width to pay off.
- AVX-256 and NEON have the same tile structure, different widths. AVX processes 4 doubles per FMA; NEON processes 2. The code is identical except for the intrinsic names and register counts.
- Compile-time dispatch keeps the API clean. Users call matmul_simd and get the best available backend without #ifdefs scattered through application code.
- matvec is bandwidth-bound, not compute-bound. SIMD gives ~1.6× here because the bottleneck is memory bandwidth, not FLOPs. No amount of vectorisation can exceed memory bandwidth.
- FMA requires an explicit compiler flag on x86. -mfma enables the fused instruction encoding. Omitting it causes the compiler to emit separate MUL + ADD — same FLOP count, 2× instruction count.
8. Exercises
- On x86, compile matmul_simd with and without -mfma. Disassemble the inner loop with objdump -d. Count vfmadd instructions vs vmul + vadd pairs.
- Try a 2×8 register tile for AVX (2 rows, 8 columns = 2 YMM regs per row). Does it beat the 4×4 tile? What limits the tile height?
- Add a matmul_simd benchmark with block_size = 32. Does L1 vs L2 residency change the result now that the inner loop is SIMD-accelerated?
- Implement a SIMD dot for Vector using the same horizontal-reduce pattern from matvec_simd. Benchmark against the scalar dot.
- *(Advanced)* Add #pragma omp parallel for to the outer ii loop of matmul_simd. What speedup do you observe on 4 cores? Is it linear? What limits scaling?