Register blocking is the scalar micro-kernel layer that sits between cache blocking and explicit SIMD: it runs inside each cache tile and keeps a small output tile in scalar temporaries.

Public Diagnostic Entry Point

num::matmul_register_blocked(A, B, C, 64, 4);

void matmul_register_blocked(const Matrix &A, const Matrix &B, Matrix &C, idx block_size = 64, idx reg_size = 4)

Computes C = A * B using register blocking.

Definition: matrix.cpp:98

The parameters are:

block_size: cache tile size.
reg_size: small scalar tile size used inside each cache tile.
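
To illustrate how the two parameters nest, the following is a minimal sketch of splitting one index range first into cache tiles and then into register tiles. The helper name register_tiles and the idx alias are assumptions made for this example; they are not part of the library, and the actual loop structure in matrix.cpp may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using idx = std::size_t;  // assumption: stand-in for the library's idx type

// Hypothetical helper: split [0, n) into cache tiles of block_size, then
// split each cache tile into register tiles of at most reg_size.
// Returns the register tiles as half-open ranges [begin, end).
std::vector<std::pair<idx, idx>> register_tiles(idx n, idx block_size, idx reg_size) {
    assert(block_size > 0 && reg_size > 0);  // zero would never advance the loops
    std::vector<std::pair<idx, idx>> tiles;
    for (idx ii = 0; ii < n; ii += block_size) {
        const idx i_lim = std::min(ii + block_size, n);      // clamp cache tile to n
        for (idx ir = ii; ir < i_lim; ir += reg_size) {
            const idx ri = std::min(ir + reg_size, i_lim);   // clamp register tile to cache tile
            tiles.emplace_back(ir, ri);
        }
    }
    return tiles;
}
```

With n = 10, block_size = 8 and reg_size = 4, this yields the register tiles [0, 4), [4, 8) and [8, 10). The last tile is shorter because both levels are clamped at the edge.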

Micro-Kernel Shape

The implementation accumulates a small \(r\times r\) block, with \(r\) = reg_size, in scalar temporaries (shown here for the default \(r = 4\)):

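real c[4][4] = {};  // local accumulator tile, zero-initialized

// [kk, k_lim) is the current k range; [ir, ri) x [jr, rj) is the register tile.
for (idx k = kk; k < k_lim; ++k) {
    for (idx i = ir; i < ri; ++i) {
        const real a_ik = A(i, k);  // loaded once, reused across the j loop
        for (idx j = jr; j < rj; ++j) {
            c[i - ir][j - jr] += a_ik * B(k, j);
        }
    }
}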

After the local accumulation, the temporary tile is stored back to C.
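
For context, the accumulate-then-store pattern can be placed inside a complete loop nest. The following is a self-contained sketch under stated assumptions, not the code in matrix.cpp: the Mat struct, the idx and real aliases, the kMaxReg constant, the function name, and the loop order are all hypothetical stand-ins chosen for illustration. Because the k dimension is also tiled, the store-back step here adds each partial tile into C rather than overwriting it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using idx  = std::size_t;  // assumption: stand-in for the library's idx
using real = double;       // assumption: stand-in for the library's real

// Hypothetical minimal row-major matrix; not the library's Matrix class.
struct Mat {
    idx rows, cols;
    std::vector<real> data;
    Mat(idx r, idx c) : rows(r), cols(c), data(r * c, real{0}) {}
    real& operator()(idx i, idx j)       { return data[i * cols + j]; }
    real  operator()(idx i, idx j) const { return data[i * cols + j]; }
};

constexpr idx kMaxReg = 4;  // capacity of the local accumulator tile

// Sketch: C = A * B with cache tiles of block_size and register tiles of reg_size.
void matmul_register_blocked_sketch(const Mat& A, const Mat& B, Mat& C,
                                    idx block_size = 64, idx reg_size = 4) {
    assert(A.cols == B.rows && C.rows == A.rows && C.cols == B.cols);
    assert(block_size > 0 && reg_size > 0 && reg_size <= kMaxReg);
    std::fill(C.data.begin(), C.data.end(), real{0});  // C is overwritten

    const idx M = A.rows, N = B.cols, K = A.cols;
    for (idx ii = 0; ii < M; ii += block_size) {
        const idx i_lim = std::min(ii + block_size, M);
        for (idx jj = 0; jj < N; jj += block_size) {
            const idx j_lim = std::min(jj + block_size, N);
            for (idx kk = 0; kk < K; kk += block_size) {
                const idx k_lim = std::min(kk + block_size, K);
                // Walk the register tiles inside the current cache tile.
                for (idx ir = ii; ir < i_lim; ir += reg_size) {
                    const idx ri = std::min(ir + reg_size, i_lim);
                    for (idx jr = jj; jr < j_lim; jr += reg_size) {
                        const idx rj = std::min(jr + reg_size, j_lim);
                        real c[kMaxReg][kMaxReg] = {};  // local accumulators
                        for (idx k = kk; k < k_lim; ++k) {
                            for (idx i = ir; i < ri; ++i) {
                                const real a_ik = A(i, k);
                                for (idx j = jr; j < rj; ++j) {
                                    c[i - ir][j - jr] += a_ik * B(k, j);
                                }
                            }
                        }
                        // Store back: add this k tile's partial sums into C.
                        for (idx i = ir; i < ri; ++i) {
                            for (idx j = jr; j < rj; ++j) {
                                C(i, j) += c[i - ir][j - jr];
                            }
                        }
                    }
                }
            }
        }
    }
}
```

The fixed-size local array gives the compiler a chance to keep the accumulators in registers, which is the point of the technique. The sketch rejects reg_size values larger than the array capacity rather than guessing how the library handles them.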

Implementation Location

src/core/backends/seq/matrix.cpp

This routine is kept as an implementation diagnostic. The normal public backend selection is still:

num::matmul(A, B, C, num::Backend::blocked);
num::matmul(A, B, C, num::Backend::simd);
num::matmul(A, B, C, num::Backend::blas);

Benchmark

./build/benchmarks/numerics_bench --benchmark_filter=BM_Matmul

Register blocking is useful for explaining the transition from cache blocking to SIMD micro-kernels. Production code should usually select Backend::blas when a tuned BLAS is available.