numerics 0.1.0
Register Blocking Implementation Note

Register blocking is the scalar micro-kernel layer that sits between cache blocking and explicit SIMD: it runs inside each cache tile and accumulates a small output tile in scalar temporaries.

Public Diagnostic Entry Point

void matmul_register_blocked(const Matrix &A, const Matrix &B, Matrix &C, idx block_size=64, idx reg_size=4)
C = A * B (register-blocked)
Definition matrix.cpp:98

The parameters are:

  • block_size: edge length of the cache tile (default 64).
  • reg_size: edge length of the small scalar tile used inside each cache tile (default 4).
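To make the two tile sizes concrete, the sketch below counts how many register tiles cover one dimension of length n when the range is first split into cache tiles of block_size and each cache tile is then split into register tiles of reg_size. The helper count_register_tiles and the idx alias are hypothetical names introduced for illustration; they are not part of the library, and the real routine may handle edge tiles differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

using idx = std::size_t;  // assumption: the library's idx is an unsigned index type

// Hypothetical helper: number of register tiles covering [0, n) along one
// dimension. Edge tiles are clipped to the cache tile, so they may be
// narrower than reg_size.
idx count_register_tiles(idx n, idx block_size = 64, idx reg_size = 4) {
    assert(block_size > 0 && reg_size > 0);
    idx tiles = 0;
    for (idx ii = 0; ii < n; ii += block_size) {          // cache tiles
        const idx i_lim = std::min(ii + block_size, n);   // clip last cache tile
        for (idx ir = ii; ir < i_lim; ir += reg_size) {   // register tiles
            ++tiles;
        }
    }
    return tiles;
}
```

With the default sizes, n = 100 gives 25 register tiles per dimension: 16 from the full 64-wide cache tile and 9 from the 36-wide remainder.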

Micro-Kernel Shape

The implementation accumulates a small \(r\times r\) block, with \(r\) = reg_size (4 by default), in scalar temporaries:

real c[4][4] = {};
for (idx k = kk; k < k_lim; ++k) {
    for (idx i = ir; i < ri; ++i) {
        const real a_ik = A(i, k);
        for (idx j = jr; j < rj; ++j) {
            c[i - ir][j - jr] += a_ik * B(k, j);
        }
    }
}

After the local accumulation, the temporary tile is stored back to C.
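For context, the micro-kernel can be placed inside a full cache-blocked loop nest roughly as follows. This is a minimal, self-contained sketch and not the code in matrix.cpp: DenseMatrix, kReg and matmul_register_blocked_sketch are hypothetical stand-ins, the real and idx aliases are assumed, and the register tile edge is fixed at 4 to match c[4][4]. Because the k range is tiled as well, the sketch adds each temporary tile into C rather than overwriting it; whether the library does the same is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using real = double;      // assumption: stand-in for the library's real
using idx = std::size_t;  // assumption: stand-in for the library's idx

// Hypothetical row-major matrix; the library's Matrix class is not shown here.
struct DenseMatrix {
    idx rows, cols;
    std::vector<real> data;
    DenseMatrix(idx r, idx c) : rows(r), cols(c), data(r * c, real(0)) {}
    real &operator()(idx i, idx j) { return data[i * cols + j]; }
    const real &operator()(idx i, idx j) const { return data[i * cols + j]; }
};

constexpr idx kReg = 4;  // register tile edge, fixed to match c[4][4]

// C = A * B with cache tiles of edge block_size and kReg x kReg register tiles.
void matmul_register_blocked_sketch(const DenseMatrix &A, const DenseMatrix &B,
                                    DenseMatrix &C, idx block_size = 64) {
    assert(A.cols == B.rows && C.rows == A.rows && C.cols == B.cols);
    assert(block_size > 0);
    std::fill(C.data.begin(), C.data.end(), real(0));
    const idx M = A.rows, N = B.cols, K = A.cols;
    for (idx ii = 0; ii < M; ii += block_size) {
        const idx i_lim = std::min(ii + block_size, M);
        for (idx jj = 0; jj < N; jj += block_size) {
            const idx j_lim = std::min(jj + block_size, N);
            for (idx kk = 0; kk < K; kk += block_size) {
                const idx k_lim = std::min(kk + block_size, K);
                // Register tiles inside the current cache tile.
                for (idx ir = ii; ir < i_lim; ir += kReg) {
                    const idx ri = std::min(ir + kReg, i_lim);
                    for (idx jr = jj; jr < j_lim; jr += kReg) {
                        const idx rj = std::min(jr + kReg, j_lim);
                        real c[kReg][kReg] = {};  // local accumulators
                        for (idx k = kk; k < k_lim; ++k) {
                            for (idx i = ir; i < ri; ++i) {
                                const real a_ik = A(i, k);
                                for (idx j = jr; j < rj; ++j) {
                                    c[i - ir][j - jr] += a_ik * B(k, j);
                                }
                            }
                        }
                        // Each k tile contributes a partial sum, so add into C.
                        for (idx i = ir; i < ri; ++i) {
                            for (idx j = jr; j < rj; ++j) {
                                C(i, j) += c[i - ir][j - jr];
                            }
                        }
                    }
                }
            }
        }
    }
}
```

Clipping ri and rj to the cache tile bounds lets the same loops handle matrix dimensions that are not multiples of 4, at the cost of partially filled edge tiles.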

Implementation Location

src/core/backends/seq/matrix.cpp

This routine is kept as an implementation diagnostic. Normal use still goes through the public entry point, which selects the backend:

void matmul(const Matrix &A, const Matrix &B, Matrix &C, Backend b=default_backend)
C = A * B.
Definition matrix.cpp:20

Benchmark

./build/benchmarks/numerics_bench --benchmark_filter=BM_Matmul

Register blocking is useful for explaining the transition from cache blocking to SIMD micro-kernels. Production code should usually select Backend::blas when a tuned BLAS is available.