numerics 0.1.0
This page documents the cache-blocked dense matrix kernel used by Backend::blocked.
The dispatch path is:
The blocked multiplication computes
\[ C_{ij} \leftarrow C_{ij} + A_{ik}B_{kj} \]
over cache-sized panels:
The innermost loop walks contiguous entries of the row-major B and C tiles. This is the main difference from the naive i,j,k loop, whose innermost k loop reads B(k,j) down a column, so consecutive accesses are a full row apart in memory.
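The loop ordering described above can be sketched as follows. This is a minimal illustration, not the library's actual kernel: the function name `blocked_gemm_sketch`, the tile size `kTile`, and the use of `std::vector<double>` for densely packed row-major storage are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile edge; a real kernel would tune this to the target cache.
constexpr std::size_t kTile = 64;

// Accumulates C (M x N) += A (M x K) * B (K x N).
// All three matrices are assumed row-major and densely packed.
void blocked_gemm_sketch(const std::vector<double>& A,
                         const std::vector<double>& B,
                         std::vector<double>& C,
                         std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t ii = 0; ii < M; ii += kTile) {
        const std::size_t i_end = std::min(ii + kTile, M);
        for (std::size_t kk = 0; kk < K; kk += kTile) {
            const std::size_t k_end = std::min(kk + kTile, K);
            for (std::size_t jj = 0; jj < N; jj += kTile) {
                const std::size_t j_end = std::min(jj + kTile, N);
                // Work on one (i, k, j) tile triple at a time.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a_ik = A[i * K + k];  // reused across the j loop
                        // Innermost loop: unit-stride walk over row k of B and row i of C.
                        for (std::size_t j = jj; j < j_end; ++j) {
                            C[i * N + j] += a_ik * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

Because the kernel accumulates into C, callers that want a plain product C = AB should zero C first. The `std::min` bounds handle matrix dimensions that are not multiples of the tile size.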
The implementation intentionally lives in the sequential backend. BLAS dispatch is separate:
This keeps the custom cache-blocked implementation available as a portable baseline even when BLAS is configured.
Compare:
Use Backend::blocked when BLAS is unavailable or when validating the custom dense kernel against the BLAS backend.
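When validating one backend against another, bitwise equality is generally too strict, because the blocked kernel and a BLAS routine may sum the products in different orders. A tolerance-based comparison is a common alternative. The sketch below is one way to do that, under stated assumptions: the helper names `max_relative_error` and `results_agree` and the default tolerance are hypothetical and not part of the library. It takes two already-computed result matrices as flat arrays and does not call either backend itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest elementwise |candidate - reference|, scaled by the largest
// |reference| entry. Falls back to the absolute difference if the
// reference is all zeros.
double max_relative_error(const std::vector<double>& candidate,
                          const std::vector<double>& reference) {
    assert(candidate.size() == reference.size());
    double max_diff = 0.0;
    double max_ref = 0.0;
    for (std::size_t idx = 0; idx < reference.size(); ++idx) {
        max_diff = std::max(max_diff, std::abs(candidate[idx] - reference[idx]));
        max_ref = std::max(max_ref, std::abs(reference[idx]));
    }
    return max_ref > 0.0 ? max_diff / max_ref : max_diff;
}

// Hypothetical acceptance check. The default tolerance is an assumption
// chosen for illustration, not a project setting.
bool results_agree(const std::vector<double>& candidate,
                   const std::vector<double>& reference,
                   double tolerance = 1e-12) {
    return max_relative_error(candidate, reference) <= tolerance;
}
```

In practice the tolerance would likely need to grow with the inner dimension of the product, since rounding error accumulates over more terms.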