The 2D Navier-Stokes stress test (ns_demo) solves the incompressible Navier-Stokes equations on a periodic MAC grid using Chorin's projection method, visualised in real-time with raylib. The initial implementation ran too slowly to be interactive on a 256x256 grid. Three focused fixes – each traceable to a single hardware-software mismatch – recovered real-time performance without changing the algorithm.
Each timestep executes three phases: advection of the velocity field to produce a tentative (non-divergence-free) velocity, a pressure Poisson solve whose right-hand side is the divergence of that tentative field, and a projection step that subtracts the pressure gradient to restore a divergence-free velocity.
The Poisson solve is the inner loop: it calls num::cg_matfree with a matrix-free stencil operator as the matvec, issuing hundreds of dot products and AXPY operations per frame. Everything else is \(O(N^2)\) per timestep; the pressure solve is where time is spent.
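A rough sketch of that wiring is shown below: the stencil is wrapped in a closure and handed to the matrix-free CG as its matvec. The cg_matfree call shape in the comment is an assumption for illustration, not the library's literal interface.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the pressure-solve wiring.  apply_neg_laplacian is the periodic
// five-point stencil described next; the cg_matfree call shape in the comment
// is a guess for illustration only.
void apply_neg_laplacian(const std::vector<double>& x, std::vector<double>& y,
                         std::size_t N, double h);   // defined below

void pressure_solve(std::vector<double>& p, const std::vector<double>& rhs,
                    std::size_t N, double h)
{
    auto matvec = [N, h](const std::vector<double>& x, std::vector<double>& y) {
        apply_neg_laplacian(x, y, N, h);             // one application per CG iteration
    };
    // Something along the lines of:
    //   num::cg_matfree(matvec, rhs, p, /*tol=*/1e-3, /*max_iter=*/100);
    (void)matvec; (void)rhs; (void)p;                // silence unused warnings in this sketch
}
```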
The matvec for the matrix-free CG applies the negative discrete Laplacian on a periodic \(N \times N\) grid:
\[ (-\Delta p)_{i,j} = \frac{1}{h^2}\bigl( 4\,p_{i,j} - p_{i+1,j} - p_{i-1,j} - p_{i,j+1} - p_{i,j-1} \bigr) \]
All indices are taken modulo \(N\) (periodic boundary conditions). The operator is symmetric positive semi-definite, and positive definite on the zero-mean subspace – which is exactly what CG requires.
The naive inner loop inside the stencil operator looked essentially like the sketch below (reconstructed from the description that follows), where wp1 and wm1 are the periodic neighbour-index helpers:
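```cpp
#include <cstddef>
#include <vector>

// Naive periodic five-point stencil (reconstruction): every neighbour index
// goes through a modulo wrap, even though the wrap only matters at the edges.
void apply_neg_laplacian(const std::vector<double>& x, std::vector<double>& y,
                         std::size_t N, double h)
{
    auto wp1 = [N](std::size_t j) { return (j + 1) % N; };      // wrap +1
    auto wm1 = [N](std::size_t j) { return (j + N - 1) % N; };  // wrap -1
    const double inv_h2 = 1.0 / (h * h);

    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            y[i * N + j] = inv_h2 * (4.0 * x[i * N + j]
                                     - x[wp1(i) * N + j] - x[wm1(i) * N + j]
                                     - x[i * N + wp1(j)] - x[i * N + wm1(j)]);
}
```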
wp1(j) and wm1(j) compute (j+1) % N and (j+N-1) % N respectively. Even though the modulo only matters at j = 0 and j = N-1, the compiler cannot prove that at the vectorisation stage. On Apple Silicon the inner loop must lower to ARM NEON vfmaq_f64 (fused multiply-add) instructions that operate on two doubles at a time – but that requires the neighbour accesses wp1(j) and wm1(j) to be predictable, unit-stride offsets. The modulo calls break that guarantee.
Result: the compiler falls back to scalar code, and the stencil runs at roughly half the arithmetic throughput available from the hardware.
Peel the two boundary iterations (j = 0 and j = N-1) out of the inner loop, leaving a bulk loop over j = 1..N-2 with no modulo calls; a sketch of the peeled kernel:
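```cpp
#include <cstddef>
#include <vector>

// Peeled periodic five-point stencil (sketch): the bulk of each row touches
// only fixed offsets -1, 0, +1, so the compiler can auto-vectorise it; the two
// wrapped columns are handled explicitly, and the row wrap is hoisted per row.
void apply_neg_laplacian(const std::vector<double>& x, std::vector<double>& y,
                         std::size_t N, double h)
{
    const double inv_h2 = 1.0 / (h * h);

    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < N; ++i) {
        const std::size_t ip = (i + 1) % N, im = (i + N - 1) % N;  // once per row
        const double* row   = &x[i  * N];
        const double* north = &x[ip * N];
        const double* south = &x[im * N];
        double*       out   = &y[i  * N];

        // Bulk: unit-stride, no modulo – vectorises to NEON ld1 + fmla at -O3.
        for (std::size_t j = 1; j + 1 < N; ++j)
            out[j] = inv_h2 * (4.0 * row[j] - north[j] - south[j]
                               - row[j + 1] - row[j - 1]);

        // Peeled boundary columns with periodic wrap.
        out[0]     = inv_h2 * (4.0 * row[0]     - north[0]     - south[0]
                               - row[1]         - row[N - 1]);
        out[N - 1] = inv_h2 * (4.0 * row[N - 1] - north[N - 1] - south[N - 1]
                               - row[0]         - row[N - 2]);
    }
}
```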
The inner loop now accesses five contiguous arrays with fixed offsets \(-1, 0, +1\) – a classic 1-D stencil pattern. Compiled with -O3 -march=native, clang emits ld1 loads and fmla (fused multiply-accumulate) instructions over pairs of doubles.
The five-point stencil reads five doubles per output element and writes one:
\[ I = \frac{6 \text{ FLOPs}}{6 \times 8 \text{ bytes}} = \frac{1}{8} \text{ FLOP/byte} \]
This is solidly memory-bandwidth bound – the same category as a dot product. Vectorisation does not change the FLOP count; it halves the instruction count (two doubles per 128-bit NEON register), which reduces loop overhead and frees execution units. The practical gain is approximately 1.5-2x on Apple M-series, whose performance cores have enough load bandwidth to keep the NEON units fed.
The outer i loop is parallelised with #pragma omp parallel for schedule(static), distributing the \(N\) rows evenly across the performance cores.
The CG inner loop consists of six vector operations per iteration – two dot products, three AXPY, one norm – all over vectors of length \(N^2\). The original code routed them through num::omp (Backend::omp). The sketch below shows the shape of one iteration and marks every backend-dispatched vector operation; it is illustrative, not the num internals.
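```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Shape of one CG iteration (illustrative).  Every loop marked (*) is a
// whole-vector operation over N^2 doubles; in the library each one is a
// separate backend call, and with Backend::omp each forks/joins a thread team.
void cg_iteration(std::vector<double>& x, std::vector<double>& r,
                  std::vector<double>& p, std::vector<double>& q, double& rs_old,
                  const std::function<void(const std::vector<double>&,
                                           std::vector<double>&)>& matvec)
{
    const std::size_t n = x.size();
    matvec(p, q);                                                   // stencil matvec

    double pq = 0.0;                                                // (*) dot(p, q)
    for (std::size_t i = 0; i < n; ++i) pq += p[i] * q[i];
    const double alpha = rs_old / pq;

    for (std::size_t i = 0; i < n; ++i) x[i] += alpha * p[i];       // (*) axpy
    for (std::size_t i = 0; i < n; ++i) r[i] -= alpha * q[i];       // (*) axpy

    double rs_new = 0.0;                                            // (*) dot(r, r), doubles as ||r||^2
    for (std::size_t i = 0; i < n; ++i) rs_new += r[i] * r[i];

    const double beta = rs_new / rs_old;
    for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];  // (*) scaled vector update
    rs_old = rs_new;
}
```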
For \(N = 256\) the vectors are \(256^2 = 65\,536\) doubles, occupying 512 KB each. The M-series L2 cache is 12 MB per performance-core cluster; the three or four active CG vectors fit comfortably.
The problem is thread fork/join overhead. num::omp spawns a thread team (or wakes a parked team) for each operation. OpenMP barrier synchronisation on macOS/Apple Silicon costs roughly 5-15 µs per fork – the same order of magnitude as the vector operation itself at this size.
Switch the CG kernel to num::blas (Backend::blas, Accelerate framework). A minimal sketch of what that backend reduces to for the dot and AXPY kernels (the wrapper names are illustrative, not the num API):
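```cpp
#include <Accelerate/Accelerate.h>  // provides cblas_ddot / cblas_daxpy on macOS
#include <vector>

// Illustrative wrappers: a single library call per vector operation, with no
// user-space thread fork or OpenMP barrier on the critical path.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    return cblas_ddot(static_cast<int>(a.size()), a.data(), 1, b.data(), 1);
}

void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
    cblas_daxpy(static_cast<int>(x.size()), alpha, x.data(), 1, y.data(), 1);
}
```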
Apple's Accelerate cblas_ddot and cblas_daxpy are single library calls – any threading happens inside the framework, with no per-operation fork/join in user code. For cache-resident vectors they execute NEON code with software pipelining tuned for the Apple Silicon memory subsystem.
| Vector size | Working set | Recommended backend |
|---|---|---|
| \(n \leq 2^{16}\) | ≤ 512 KB – fits in L2 | Backend::blas (Accelerate/MKL) |
| \(2^{16} < n \leq 2^{20}\) | 512 KB – 8 MB | either; benchmark on target |
| \(n > 2^{20}\) | > 8 MB – spills to L3/DRAM | Backend::omp amortises fork |
The crossover is hardware-dependent. On Linux with glibc pthreads and a small thread pool, num::omp fork overhead is typically lower and the crossover shifts toward smaller vectors.
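One way to encode the table as a runtime default is a simple size-based switch; the function and the \(2^{20}\) crossover below are illustrative placeholders, not part of num, and should be re-measured on the target machine.

```cpp
#include <cstddef>

// Standalone illustration of a size-based backend heuristic for level-1 kernels.
enum class Backend { blas, omp };

Backend pick_backend(std::size_t n) {
    constexpr std::size_t crossover = std::size_t{1} << 20;  // ~8 MB of doubles
    return n <= crossover ? Backend::blas : Backend::omp;    // re-benchmark per machine
}
```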
The discrete negative Laplacian on an \(N \times N\) periodic grid has eigenvalues
\[ \lambda_{k,l} = \frac{4}{h^2}\Bigl( \sin^2\!\frac{\pi k}{N} + \sin^2\!\frac{\pi l}{N} \Bigr), \quad k,l = 0,\ldots,N-1 \]
restricted to the zero-mean subspace (the \(k = l = 0\) DC mode is removed). The minimum non-zero eigenvalue is
\[ \lambda_{\min} = \frac{4}{h^2}\sin^2\!\frac{\pi}{N} \approx \frac{4\pi^2}{N^2 h^2} = 4\pi^2 \quad (h = 1/N) \]
and the maximum is \(\lambda_{\max} = 8/h^2 = 8N^2\), giving spectral condition number
\[ \kappa = \frac{\lambda_{\max}}{\lambda_{\min}} \approx \frac{8N^2}{4\pi^2} = \frac{2N^2}{\pi^2} \]
For \(N = 256\): \(\kappa \approx 13\,300\).
For an SPD system with condition number \(\kappa\), CG satisfies
\[ \frac{\|\mathbf{e}_k\|_A}{\|\mathbf{e}_0\|_A} \leq 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \]
To reduce the A-norm error to \(10^{-4}\) of its initial value, the number of iterations required is approximately
\[ k \geq \frac{\ln(2 \times 10^4)}{\ln\!\left(\dfrac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right)} \approx \frac{\sqrt{\kappa}}{2} \ln(2 \times 10^4) \approx \frac{115}{2} \times 9.9 \approx 570 \]
With tolerance 1e-3 the target drops to \(10^{-3}\), requiring roughly \(\sqrt{\kappa}/2 \times \ln(2000) \approx 430\) iterations – still large.
The pressure field changes very little between consecutive timesteps. Initialising each CG solve with \(p^{n-1}\) rather than zero places the starting guess close to the solution, so the initial residual is already small: in practice the initial relative residual is typically below \(10^{-2}\) after the first few timesteps, cutting the iteration count to below \(\sim 100\).
The solver caps iterations at max_iter = 100. For a well-started warm solve at \(N = 256\), convergence to tol = 1e-3 typically requires 60-90 iterations.
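A sketch of how the warm start can be carried across frames is shown below; the cg_matfree call shape in the comment is again an assumption for illustration.

```cpp
#include <cstddef>
#include <vector>

// Warm-start sketch: the pressure array persists across timesteps and is
// handed to CG as the initial guess instead of being zeroed each frame.
struct PressureSolver {
    std::size_t N;
    double h;
    std::vector<double> p;                        // holds p^{n-1} between frames

    PressureSolver(std::size_t n, double spacing)
        : N(n), h(spacing), p(n * n, 0.0) {}

    void solve(const std::vector<double>& rhs) {
        // p still contains the previous frame's solution, so CG starts close to
        // the answer and the 100-iteration cap is rarely the binding limit.
        //   num::cg_matfree(matvec, rhs, p, /*tol=*/1e-3, /*max_iter=*/100);  // hypothetical signature
        (void)rhs;                                // silence unused warning in this sketch
    }
};
```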
The constant function is in the null space of the Laplacian, so the right-hand side must be zero-mean for a solution to exist. Before the solve, the mean of the RHS is removed:
\[ b \leftarrow b - \bar{b}, \quad \bar{b} = \frac{1}{N^2}\sum_{i,j} b_{i,j} \]
After convergence the mean of the solution is similarly removed. This keeps the pressure in the correct (zero-mean) invariant subspace and prevents CG from chasing the null-space component, which would stall convergence.
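A minimal sketch of the mean removal applied to the RHS before the solve and to the pressure after it:

```cpp
#include <cstddef>
#include <vector>

// Subtract the mean so the vector lies in the zero-mean subspace on which the
// periodic Laplacian is positive definite.
void remove_mean(std::vector<double>& v) {
    double mean = 0.0;
    for (double value : v) mean += value;
    mean /= static_cast<double>(v.size());
    for (double& value : v) value -= mean;
}
```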
The three fixes act on independent bottlenecks:
| Fix | Root cause | Mechanism | Observed gain |
|---|---|---|---|
| Boundary peeling | Modulo in inner loop blocks NEON | Enables auto-vectorisation | ~1.8x stencil throughput |
| Backend::blas | OMP fork overhead >> vector op time | Eliminates thread sync | ~2x per CG vector op |
| Warm-start + loose tolerance | CG initialised far from solution | Reduces iteration count | ~3-5x fewer CG iters/frame |
No single fix dominates; the product of all three converts a slideshow into a simulation that runs comfortably above 30 FPS at \(N = 256\) on Apple Silicon, with the pressure solve consuming 5-15 ms per frame instead of 150+ ms.
The lesson is a recurring one in high-performance computing: before adding algorithmic complexity, confirm that the baseline is not simply leaving hardware throughput on the floor through avoidable penalties – modulo calls in hot loops, mismatched parallelism granularity, and unnecessary solver restarts are all examples of exactly that.
Controls – SPACE pause/resume, R reset to shear layer IC, +/- adjust substeps per frame, [/] scale vorticity colormap, ESC quit.
The HUD reports CG iteration count, final residual, and per-phase wall time each frame so the solver behaviour can be observed in real time.