numerics
Loading...
Searching...
No Matches
Week 8: Explicit SIMD — AVX-256 and ARM NEON

Overview

Last week we saw that register blocking in scalar C++ showed a regression because shrinking the inner j-loop from 64 to 4 iterations broke auto-vectorisation. The correct fix is to make those 4 iterations one vector instruction — processing all 4 elements simultaneously. That is what explicit SIMD delivers.

This week we implement two backends — AVX-256 for x86-64, ARM NEON for AArch64 — behind a single matmul_simd / matvec_simd API, selected at compile time.

Results on Apple M (ARM NEON, 2 doubles/register):

N Naive Cache-blocked SIMD (NEON) vs Naive vs Cache-blocked
64 161 µs 32 µs 19 µs 8.3× 1.7×
128 1849 µs 263 µs 154 µs 12.0× 1.7×
256 18214 µs 2756 µs 1343 µs 13.6× 2.1×
512 159553 µs 24793 µs 11498 µs 13.9× 2.2×

BigO coefficient: 1.19 → 0.18 → 0.09 N³ — another 2× from SIMD alone.


1. SIMD Fundamentals

SIMD (Single Instruction, Multiple Data) executes one instruction on a vector of elements simultaneously. The two relevant ISA extensions:

ISA Register width Doubles per register Instruction set
AVX-256 (x86-64) 256 bits 4 _mm256_* (<immintrin.h>)
ARM NEON (AArch64) 128 bits 2 v*q_f64 (<arm_neon.h>)
AVX-512 (x86-64) 512 bits 8 _mm512_*

On x86-64, the 256-bit registers are named YMM0–YMM15 (16 registers). On AArch64, the 128-bit SIMD registers are named V0–V31 (32 registers).

The key instruction in both cases is fused multiply-add (FMA):

AVX-256: _mm256_fmadd_pd(a, b, c) → c[0..3] += a[0..3] * b[0..3]
NEON: vfmaq_f64(c, a, b) → c[0..1] += a[0..1] * b[0..1]

One FMA instruction replaces what would be a 4-iteration (or 2-iteration) scalar loop — at no additional latency cost.


2. The Micro-kernel

Why register blocking + SIMD work together

From week 7: register blocking's 4-iteration j-loop in scalar code is too small to vectorise effectively. The same 4-iteration loop maps to one _mm256_fmadd_pd (AVX) or two vfmaq_f64 (NEON) instructions. Now the "loop" has zero iterations of overhead.

The 4×4 register tile processes one k-step as:

// One step of the k-loop (AVX-256 pseudocode):
b = load B[k, jr..jr+3] // 1 vmovupd — 4 doubles
c0 += broadcast(A[ir+0, k]) * b // 1 vbroadcastsd + 1 vfmadd
c1 += broadcast(A[ir+1, k]) * b // 1 vbroadcastsd + 1 vfmadd
c2 += broadcast(A[ir+2, k]) * b // 1 vbroadcastsd + 1 vfmadd
c3 += broadcast(A[ir+3, k]) * b // 1 vbroadcastsd + 1 vfmadd

4 independent FMAs per k-step. With two FMA execution ports on modern x86 CPUs (Skylake and later), the CPU can issue 2 FMAs per cycle → theoretical throughput of 2 × 4 = 8 doubles/cycle.

For NEON, the 4×4 tile is split into two halves (lo = j..j+1, hi = j+2..j+3):

// One step of the k-loop (NEON pseudocode):
blo = load B[k, jr..jr+1] // 2 doubles
bhi = load B[k, jr+2..jr+3] // 2 doubles
c0lo += vdup(A[ir+0,k]) * blo // vfmaq_f64
c0hi += vdup(A[ir+0,k]) * bhi // vfmaq_f64
// ... ×4 rows = 8 vfmaq_f64 per k-step

8 independent NEON FMAs per k-step, each processing 2 doubles.

AVX-256 tile (<tt>src/core/matrix_simd.cpp</tt>)

static inline void avx_tile_4x4(const Matrix& A, const Matrix& B, Matrix& C,
idx ir, idx jr, idx kk, idx k_lim)
{
const idx N = B.cols();
real* Crow = C.data() + ir * N;
// Load 4×4 C tile — 4 YMM registers, no loop overhead
__m256d c0 = _mm256_loadu_pd(Crow + 0 * N + jr);
__m256d c1 = _mm256_loadu_pd(Crow + 1 * N + jr);
__m256d c2 = _mm256_loadu_pd(Crow + 2 * N + jr);
__m256d c3 = _mm256_loadu_pd(Crow + 3 * N + jr);
for (idx k = kk; k < k_lim; ++k) {
__m256d b = _mm256_loadu_pd(B.data() + k * N + jr); // 1 load
c0 = _mm256_fmadd_pd(_mm256_set1_pd(A(ir+0, k)), b, c0); // independent
c1 = _mm256_fmadd_pd(_mm256_set1_pd(A(ir+1, k)), b, c1); // independent
c2 = _mm256_fmadd_pd(_mm256_set1_pd(A(ir+2, k)), b, c2); // independent
c3 = _mm256_fmadd_pd(_mm256_set1_pd(A(ir+3, k)), b, c3); // independent
}
_mm256_storeu_pd(Crow + 0 * N + jr, c0);
// ...
}

c0..c3 are four independent YMM accumulators — the CPU's OOO engine can pipeline all four FMAs simultaneously.

ARM NEON tile (<tt>src/core/matrix_simd.cpp</tt>)

static inline void neon_tile_4x4(const Matrix& A, const Matrix& B, Matrix& C,
idx ir, idx jr, idx kk, idx k_lim)
{
// 4 rows × 2 NEON regs per row = 8 Q-registers for C
float64x2_t c0lo = vld1q_f64(Crow + 0*N + jr);
float64x2_t c0hi = vld1q_f64(Crow + 0*N + jr + 2);
// ... ×4 rows
for (idx k = kk; k < k_lim; ++k) {
float64x2_t blo = vld1q_f64(B.data() + k*N + jr);
float64x2_t bhi = vld1q_f64(B.data() + k*N + jr + 2);
float64x2_t a0 = vdupq_n_f64(A(ir+0, k)); // broadcast
// ...
c0lo = vfmaq_f64(c0lo, a0, blo); // 8 independent FMAs per k step
c0hi = vfmaq_f64(c0hi, a0, bhi);
// ...
}
}

AArch64 has 32 128-bit NEON registers. The 4×4 tile uses 8 for C, 2 for B, 4 for A broadcasts = 14 total. Plenty of headroom.


3. matvec_simd: Dot Product with Horizontal Reduction

Matrix-vector multiply is simpler: each row is a dot product.

AVX-256 version

for (idx i = 0; i < M; ++i) {
__m256d acc = _mm256_setzero_pd();
idx j = 0;
for (; j + 4 <= N; j += 4) {
__m256d a = _mm256_loadu_pd(A.data() + i * N + j);
__m256d xv = _mm256_loadu_pd(x.data() + j);
acc = _mm256_fmadd_pd(a, xv, acc); // acc[0..3] += a[0..3] * x[0..3]
}
// Horizontal sum: acc[0]+acc[1]+acc[2]+acc[3]
__m128d lo = _mm256_castpd256_pd128(acc);
__m128d hi = _mm256_extractf128_pd(acc, 1);
__m128d sum = _mm_add_pd(lo, hi);
sum = _mm_hadd_pd(sum, sum);
y[i] = _mm_cvtsd_f64(sum) + scalar_tail;
}

The horizontal reduction collapses 4 partial sums into 1 at the end of each row — one reduction per row, not per element.

NEON version

float64x2_t acc = vdupq_n_f64(0.0);
for (; j + 2 <= N; j += 2) {
acc = vfmaq_f64(acc, vld1q_f64(A.data() + i*N + j),
vld1q_f64(x.data() + j));
}
// Horizontal sum: acc[0] + acc[1]
real result = vgetq_lane_f64(acc, 0) + vgetq_lane_f64(acc, 1);

vgetq_lane_f64 extracts each lane — two scalar adds instead of a full shuffle sequence.

matvec benchmark results

N Scalar matvec (CPU) SIMD matvec Speedup
64 1469 ns (21.4 GB/s) 829 ns (38.0 GB/s) 1.8×
128 8288 ns (15.0 GB/s) 4641 ns (26.7 GB/s) 1.8×
256 34741 ns (14.2 GB/s) 24035 ns (20.5 GB/s) 1.4×
512 191685 ns (10.2 GB/s) 114969 ns (17.1 GB/s) 1.7×
2048 3744813 ns (8.4 GB/s) 2445977 ns (12.8 GB/s) 1.5×

matvec is memory-bandwidth-bound (every element of A is read once), so the SIMD speedup is modest (~1.5–1.8×) — the bottleneck shifts from compute to memory bus, not from SIMD efficiency.


4. Compile-Time Dispatch

Architecture detection (<tt>CMakeLists.txt</tt>)

include(CheckCXXCompilerFlag)
if(CMAKE_SYSTEM_PROCESSOR MATCHES "x86_64|AMD64|i686")
check_cxx_compiler_flag("-mavx2 -mfma" COMPILER_SUPPORTS_AVX2)
if(COMPILER_SUPPORTS_AVX2)
target_compile_options(numerics PUBLIC -mavx2 -mfma)
target_compile_definitions(numerics PUBLIC NUMERICS_HAS_AVX2)
endif()
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64|aarch64|ARM64|AARCH64")
target_compile_definitions(numerics PUBLIC NUMERICS_HAS_NEON)
endif()

-mavx2 -mfma tells the compiler to emit YMM instructions and enables FMA3 instruction encoding. Without -mfma, _mm256_fmadd_pd compiles to separate MUL + ADD instructions — two instructions instead of one.

On AArch64, NEON is mandatory and always enabled; no flags needed.

Dispatch function (<tt>src/core/matrix_simd.cpp</tt>)

void matmul_simd(const Matrix& A, const Matrix& B, Matrix& C, idx block_size)
{
#if defined(NUMERICS_HAS_AVX2)
matmul_avx(A, B, C, block_size);
#elif defined(NUMERICS_HAS_NEON)
matmul_neon(A, B, C, block_size);
#else
matmul_blocked(A, B, C, block_size); // scalar fallback
#endif
}

The fallback ensures the library compiles and runs correctly on any architecture — important for CI/CD on heterogeneous build farms.


5. Boundary Handling

The tile functions process complete 4×4 blocks. When M or N is not a multiple of 4, boundary rows and columns fall through to scalar loops:

idx ir = ii;
for (; ir + 4 <= i_lim; ir += 4) { // ← full rows
idx jr = jj;
for (; jr + 4 <= j_lim; jr += 4) // ← full 4-wide tiles
simd_tile(A, B, C, ir, jr, kk, k_lim);
for (; jr < j_lim; ++jr) // ← remainder columns: scalar
...
}
for (; ir < i_lim; ++ir) // ← remainder rows: scalar
...

For benchmark sizes that are powers of 2 (64, 128, 256, 512) and a tile width of 4, the boundary path is never reached — all elements fall into full tiles. For production use, the scalar tail handles all cases correctly with no padding required.


6. Complete Benchmark Results

Measured on Apple M (AArch64, NEON), -O2, single-threaded.

Benchmark Time Complexity
------------------------------------------------------------
BM_Matmul_CPU/64 161 µs
BM_Matmul_CPU/512 159553 µs
BM_Matmul_CPU_BigO 1.19 N³ (baseline)
BM_MatmulBlocked_CPU/64 32 µs
BM_MatmulBlocked_CPU/512 24793 µs
BM_MatmulBlocked_CPU_BigO 0.18 N³ (6.6× vs naive)
BM_MatmulSIMD_CPU/64 19 µs
BM_MatmulSIMD_CPU/128 154 µs
BM_MatmulSIMD_CPU/256 1343 µs
BM_MatmulSIMD_CPU/512 11498 µs
BM_MatmulSIMD_CPU_BigO 0.09 N³ (13.9× vs naive, 2.2× vs blocked)

Progression table

Naive i-j-k 1.19 N³ (baseline)
+ cache blocking 0.18 N³ (6.6× gain)
+ SIMD (NEON) 0.09 N³ (2.2× gain on top — 13.9× total)
─────────────────────────────────────────────────────────────────────
OpenBLAS equivalent ~0.005 N³ (remaining gap: ~18×)

The remaining gap to OpenBLAS is closed by:

  1. Wider tiles — NEON's 2-doubles-wide register is half of AVX-256's 4; on x86 the gains would be proportionally larger.
  2. B panel packing — packing the B tile into a contiguous buffer before the k-loop eliminates stride penalties and enables prefetch.
  3. Software prefetch__builtin_prefetch hints load the next B tile into L1 while the CPU is computing the current one.
  4. Multithreading — OpenMP parallel for on the ii loop.

7. Key Takeaways

  1. SIMD turns a 4-iteration loop into one instruction. The register tile from week 7 was the right structure; it just needed vector width to pay off.
  2. AVX-256 and NEON have the same tile structure, different widths. AVX processes 4 doubles per FMA; NEON processes 2. The code is identical except for the intrinsic names and register counts.
  3. Compile-time dispatch keeps the API clean. Users call matmul_simd and get the best available backend without #ifdef scattered through application code.
  4. matvec is bandwidth-bound, not compute-bound. SIMD gives ~1.6× here because the bottleneck is memory bandwidth, not FLOPs. No amount of vectorisation can exceed memory bandwidth.
  5. FMA requires an explicit compiler flag on x86. -mfma enables the fused instruction encoding. Omitting it causes the compiler to emit separate MUL + ADD — same FLOP count, 2× instruction count.

8. Exercises

  1. On x86, compile matmul_simd with and without -mfma. Disassemble the inner loop with objdump -d. Count vfmadd instructions vs vmul + vadd pairs.
  2. Try a 2×8 register tile for AVX (2 rows, 8 columns = 2 YMM regs per row). Does it beat the 4×4 tile? What limits the tile height?
  3. Add a matmul_simd benchmark with block_size = 32. Does L1 vs L2 residency change the result now that the inner loop is SIMD-accelerated?
  4. Implement a SIMD dot for Vector using the same horizontal-reduce pattern from matvec_simd. Benchmark against the scalar dot.
  5. *(Advanced)* Add #pragma omp parallel for to the outer ii loop of matmul_simd. What speedup do you observe on 4 cores? Is it linear? What limits scaling?

References