Modern scientific computing demands computational power that single processors cannot deliver. Climate simulations, molecular dynamics, machine learning, and computational fluid dynamics all require billions of operations per second.
The end of frequency scaling: CPU clock speeds plateaued around 2005 at ~3-4 GHz due to power dissipation limits. The power consumed by a processor scales as:
\[P \propto C V^2 f\]
where \(C\) is capacitance, \(V\) is voltage, and \(f\) is frequency. Higher frequencies require higher voltages, causing power to scale roughly as \(f^3\).
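A quick back-of-the-envelope calculation makes the scaling concrete (a minimal sketch; the cubic model assumes \(V \propto f\), as described above):

```python
# Relative power draw under the assumption V ∝ f, so P ∝ C V^2 f ∝ f^3.
def relative_power(freq_ratio):
    """Power multiplier when clock frequency is scaled by freq_ratio."""
    return freq_ratio ** 3

# Doubling the clock (e.g. 3 GHz → 6 GHz) would multiply power by 8:
print(relative_power(2.0))  # → 8.0
```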
The solution: parallelism. Instead of faster processors, we use more processors.
| Type | Description | Example |
|---|---|---|
| SISD | Single Instruction, Single Data | Classic uniprocessor |
| SIMD | Single Instruction, Multiple Data | GPU cores, vector units |
| MIMD | Multiple Instruction, Multiple Data | Multi-core CPUs, clusters |
Shared Memory: All processors access a common address space.
Distributed Memory: Each processor has private memory; data is exchanged via messages.
Amdahl's law: If a fraction \(s\) of a program is inherently serial, the maximum speedup with \(p\) processors is:
\[S(p) = \frac{1}{s + \frac{1-s}{p}}\]
As \(p \to \infty\):
\[S_{\max} = \frac{1}{s}\]
Example: If 5% of your code is serial (\(s = 0.05\)), the maximum speedup is \(1/0.05 = 20\times\), regardless of processor count.
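The bound is easy to tabulate; a small sketch (function name is illustrative):

```python
def amdahl_speedup(s, p):
    """Amdahl's law: speedup with serial fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

# With s = 0.05, speedup saturates well below the processor count:
for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(0.05, p), 2))
# → 6.9, 16.81, 19.63 — approaching the limit 1/s = 20 as p grows.
```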
Gustafson's law: Amdahl's law assumes a fixed problem size. In practice, we often scale the problem size with the processor count. If we keep total runtime constant:
\[S(p) = s + p(1-s) = p - s(p-1)\]
This is more optimistic: speedup scales linearly with \(p\) for fixed time.
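The scaled (Gustafson) speedup can be computed the same way; again the function name is illustrative:

```python
def gustafson_speedup(s, p):
    """Scaled speedup: serial fraction s, problem grown to fill p processors."""
    return s + p * (1.0 - s)

# Same 5% serial fraction, but now the speedup grows linearly with p:
print(gustafson_speedup(0.05, 100))  # → 95.05
```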
Performance is bounded by either compute or memory bandwidth:
\[\text{Attainable FLOP/s} = \min\left( \text{Peak FLOP/s},\ \text{Bandwidth} \times \text{Arithmetic Intensity} \right)\]
Arithmetic Intensity \(= \frac{\text{FLOPs}}{\text{Bytes transferred}}\)
Operations with low arithmetic intensity are memory-bound; those with high intensity are compute-bound.
| Operation | Arithmetic Intensity | Bound |
|---|---|---|
| Vector addition | \(\frac{1}{24}\) FLOP/byte | Memory |
| Dot product | \(\frac{1}{8}\) FLOP/byte | Memory |
| Matrix-vector | \(\frac{2n}{8(n+1)} \approx \frac{1}{4}\) | Memory |
| Matrix-matrix | \(\frac{2n^3}{8 \cdot 3n^2} = \frac{n}{12}\) | Compute (for large \(n\)) |
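The roofline bound can be evaluated directly; the peak and bandwidth figures below are hypothetical, not from any particular machine:

```python
def attainable_flops(peak_flops, bandwidth, intensity):
    """Roofline model: min of compute peak and bandwidth * arithmetic intensity."""
    return min(peak_flops, bandwidth * intensity)

# Hypothetical machine: 1 TFLOP/s peak, 100 GB/s memory bandwidth.
peak, bw = 1e12, 100e9
# Dot product (1/8 FLOP/byte) is memory-bound:
print(attainable_flops(peak, bw, 1 / 8))      # 1.25e10 FLOP/s, far below peak
# Large matrix-matrix multiply (n/12 FLOP/byte, n = 1000) is compute-bound:
print(attainable_flops(peak, bw, 1000 / 12))  # capped at the 1e12 peak
```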
A vector \(\mathbf{x} \in \mathbb{R}^n\) is an ordered collection of \(n\) real numbers:
\[\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}\]
Vector Operations:
| Operation | Definition | FLOPs | Memory |
|---|---|---|---|
| Scale | \(\mathbf{y} = \alpha \mathbf{x}\) | \(n\) | \(2n\) |
| Addition | \(\mathbf{z} = \mathbf{x} + \mathbf{y}\) | \(n\) | \(3n\) |
| AXPY | \(\mathbf{y} = \alpha\mathbf{x} + \mathbf{y}\) | \(2n\) | \(3n\) |
| Dot product | \(\alpha = \mathbf{x}^T \mathbf{y} = \sum_{i=1}^n x_i y_i\) | \(2n\) | \(2n\) |
| Norm | \(\|\mathbf{x}\|_2 = \sqrt{\mathbf{x}^T \mathbf{x}}\) | \(2n\) | \(n\) |
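For reference, the kernels in the table can be written out in plain Python (NumPy or BLAS would be used in practice; this sketch just mirrors the FLOP and memory counts):

```python
import math

def axpy(alpha, x, y):
    """y ← αx + y: 2n FLOPs, 3n memory accesses (read x, read y, write y)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    """xᵀy: 2n FLOPs, 2n memory accesses."""
    return sum(xi * yi for xi, yi in zip(x, y))

def norm2(x):
    """‖x‖₂ = √(xᵀx): 2n FLOPs, n memory accesses."""
    return math.sqrt(dot(x, x))

x, y = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]
print(axpy(2.0, x, y))  # → [3.0, 5.0, 7.0]
print(dot(x, y))        # → 6.0
print(norm2(x))         # → √14 ≈ 3.742
```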
A matrix \(A \in \mathbb{R}^{m \times n}\) is a 2D array with \(m\) rows and \(n\) columns:
\[A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}\]
Storage Layouts:
Row-major (C/C++): Elements in each row are contiguous.
Column-major (Fortran): Elements in each column are contiguous.
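The two layouts differ only in how an index pair \((i, j)\) maps to a flat offset; a minimal sketch (0-based indices):

```python
def rowmajor_index(i, j, n_cols):
    """Offset of a[i][j] in row-major (C) order: rows are contiguous."""
    return i * n_cols + j

def colmajor_index(i, j, n_rows):
    """Offset of a[i][j] in column-major (Fortran) order: columns are contiguous."""
    return j * n_rows + i

# For a 3x4 matrix, element (1, 2):
print(rowmajor_index(1, 2, n_cols=4))  # → 6
print(colmajor_index(1, 2, n_rows=3))  # → 7
```

For cache efficiency, the innermost loop of a kernel should traverse memory with stride 1, i.e. along rows in row-major storage and along columns in column-major storage.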
Given \(A \in \mathbb{R}^{m \times n}\) and \(\mathbf{x} \in \mathbb{R}^n\), compute \(\mathbf{y} = A\mathbf{x}\) where \(\mathbf{y} \in \mathbb{R}^m\):
\[y_i = \sum_{j=1}^{n} a_{ij} x_j \quad \text{for } i = 1, \ldots, m\]
Complexity: \(2mn\) FLOPs, \(O(mn)\) memory accesses.
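A direct transcription of the sum above (unoptimized; a tuned BLAS `gemv` would be used in practice):

```python
def matvec(A, x):
    """y = A x for an m x n matrix stored as a list of rows: 2mn FLOPs."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
print(matvec(A, [1.0, 1.0]))  # → [3.0, 7.0, 11.0]
```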
Given \(A \in \mathbb{R}^{m \times k}\) and \(B \in \mathbb{R}^{k \times n}\), compute \(C = AB\) where \(C \in \mathbb{R}^{m \times n}\):
\[c_{ij} = \sum_{\ell=1}^{k} a_{i\ell} b_{\ell j}\]
Complexity: \(2mnk\) FLOPs.
For square matrices (\(m = k = n\)), this is \(O(n^3)\) FLOPs on \(O(n^2)\) data, hence compute-bound for large \(n\).
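A direct triple-loop transcription (a tuned BLAS `gemm` is far faster in practice, but the FLOP count is the same):

```python
def matmul(A, B):
    """C = A B via the triple loop: 2mnk FLOPs on O(mk + kn + mn) data."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for l in range(k):
            a_il = A[i][l]
            for j in range(n):  # innermost loop is stride-1 in row-major storage
                C[i][j] += a_il * B[l][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul(A, B))  # → [[19.0, 22.0], [43.0, 50.0]]
```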
1D Block Distribution: Partition vector/matrix rows among processors.
For a vector \(\mathbf{x}\) of length \(n\) across \(p\) processors, each processor owns a contiguous block of roughly \(n/p\) elements.
2D Block Distribution: Partition matrix into rectangular tiles.
For a matrix \(A \in \mathbb{R}^{m \times n}\) on a \(p_r \times p_c\) processor grid, each processor owns a tile of roughly \((m/p_r) \times (n/p_c)\) entries.
Parallel dot product: each processor computes a local partial sum, then a global reduction combines the results:
\[\mathbf{x}^T \mathbf{y} = \sum_{i=0}^{p-1} \left( \sum_{j \in \text{local}_i} x_j y_j \right)\]
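On a single machine the pattern can be simulated by partitioning the index range; in a real code each inner sum would run on its own processor and the outer sum would be a collective reduction (e.g. an MPI allreduce). A sketch:

```python
def distributed_dot(x, y, p):
    """Dot product as p local partial sums followed by a global reduction."""
    n = len(x)
    partials = []
    for i in range(p):
        # Block distribution: "processor" i owns indices [i*n//p, (i+1)*n//p).
        lo, hi = i * n // p, (i + 1) * n // p
        partials.append(sum(x[j] * y[j] for j in range(lo, hi)))
    return sum(partials)  # stands in for the global reduction

x = [1.0, 2.0, 3.0, 4.0]
print(distributed_dot(x, x, p=2))  # → 30.0
```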
For \(\mathbf{y} = A\mathbf{x}\) with row-distributed \(A\): each processor holds a block of rows and the full vector \(\mathbf{x}\), computes its block of \(\mathbf{y}\) locally, and the pieces are assembled with a gather.
Communication cost: \(O(n)\) per processor for the gather.
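The same simulation idea applies to the row-distributed matrix-vector product (each "processor" is again just a loop iteration here, and the final `extend` stands in for the gather):

```python
def rowblock_matvec(A, x, p):
    """y = A x with rows of A block-distributed over p (simulated) processors."""
    m = len(A)
    y = []
    for i in range(p):
        lo, hi = i * m // p, (i + 1) * m // p
        # Each processor has all of x and computes only its local rows of y.
        local_y = [sum(a * xj for a, xj in zip(row, x)) for row in A[lo:hi]]
        y.extend(local_y)  # stands in for the gather that assembles y
    return y

A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, 2.0]]
print(rowblock_matvec(A, [1.0, 2.0], p=2))  # → [1.0, 4.0, 3.0, 6.0]
```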
For \(C = AB\) on a 2D processor grid, use algorithms such as Cannon's algorithm or SUMMA, which cycle blocks of \(A\) and \(B\) through the processor grid so that each processor accumulates its own tile of \(C\).