numerics 0.1.0
Loading...
Searching...
No Matches
Cache Blocking Implementation Note

This page documents the cache-blocked dense matrix kernel used by Backend::blocked.

Public Entry Point

num::Matrix C(A.rows(), B.cols(), 0.0);
void matmul(const Matrix &A, const Matrix &B, Matrix &C, Backend b=default_backend)
C = A * B.
Definition matrix.cpp:20

The dispatch path is:

num::matmul
-> num::backends::seq::matmul_blocked

Kernel Shape

The blocked multiplication computes

\[ C_{ij} \leftarrow C_{ij} + A_{ik}B_{kj} \]

over cache-sized panels:

for (idx ii = 0; ii < M; ii += block_size)
for (idx jj = 0; jj < N; jj += block_size)
for (idx kk = 0; kk < K; kk += block_size)
for (idx i = ii; i < i_end; ++i)
for (idx k = kk; k < k_end; ++k)
for (idx j = jj; j < j_end; ++j)
C(i, j) += A(i, k) * B(k, j);

The innermost loop walks contiguous entries of the row-major B and C tiles. This is the main difference from the naive i,j,k loop, where the access B(k,j) is strided in the innermost loop.

Implementation Location

src/core/backends/seq/matrix.cpp

The implementation is intentionally in the sequential backend. BLAS dispatch is separate:

src/core/backends/blas/matrix.cpp

This keeps the raw/cache-blocked implementation available as a portable baseline even when BLAS is configured.

Benchmark

./build/benchmarks/numerics_bench --benchmark_filter=BM_Matmul

Compare:

When To Use It

Use Backend::blocked when BLAS is unavailable or when validating the custom dense kernel against the BLAS backend.