1. Why Parallel Computing?

Modern scientific computing demands computational power that single processors cannot deliver. Climate simulations, molecular dynamics, machine learning, and computational fluid dynamics all require billions of operations per second.

The end of frequency scaling: CPU clock speeds plateaued around 2005 at ~3-4 GHz due to power dissipation limits. The power consumed by a processor scales as:

\[P \propto C V^2 f\]

where \(C\) is capacitance, \(V\) is voltage, and \(f\) is frequency. Higher frequencies require higher voltages, causing power to scale roughly as \(f^3\).

The solution: parallelism. Instead of faster processors, we use more processors.

2. Taxonomy of Parallel Architectures

Flynn's Taxonomy

Type	Description	Example
SISD	Single Instruction, Single Data	Classic uniprocessor
SIMD	Single Instruction, Multiple Data	GPU cores, vector units
MIMD	Multiple Instruction, Multiple Data	Multi-core CPUs, clusters

Memory Models

Shared Memory: All processors access a common address space.

┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ P0  │ │ P1  │ │ P2  │ │ P3  │
└──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘
   │       │       │       │
   └───────┴───┬───┴───────┘
               │
        ┌──────┴──────┐
        │   Memory    │
        └─────────────┘

Distributed Memory: Each processor has private memory; data is exchanged via messages.

┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ P0 │ M0 │   │ P1 │ M1 │   │ P2 │ M2 │   │ P3 │ M3 │
└────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘
     │             │             │             │
     └─────────────┴──────┬──────┴─────────────┘
                          │
                    Network/Interconnect

3. Performance Limits

Amdahl's Law

If a fraction \(s\) of a program is inherently serial, the maximum speedup with \(p\) processors is:

\[S(p) = \frac{1}{s + \frac{1-s}{p}}\]

As \(p \to \infty\):

\[S_{\max} = \frac{1}{s}\]

Example: If 5% of your code is serial ( \(s = 0.05\)), maximum speedup is \(1/0.05 = 20\times\), regardless of processor count.

Gustafson's Law

Amdahl assumes fixed problem size. In practice, we scale problem size with processor count. If we keep total runtime constant:

\[S(p) = s + p(1-s) = p - s(p-1)\]

This is more optimistic: speedup scales linearly with \(p\) for fixed time.

Roofline Model

Performance is bounded by either compute or memory bandwidth:

\[\text{Attainable FLOP/s} = \min\left( \text{Peak FLOP/s},\ \text{Bandwidth} \times \text{Arithmetic Intensity} \right)\]

Arithmetic Intensity \(= \frac{\text{FLOPs}}{\text{Bytes transferred}}\)

Operations with low arithmetic intensity are memory-bound; those with high intensity are compute-bound.

Operation	Arithmetic Intensity	Bound
Vector addition	\(\frac{1}{12}\) FLOP/byte	Memory
Dot product	\(\frac{1}{8}\) FLOP/byte	Memory
Matrix-vector	\(\frac{2n}{8(n+1)} \approx \frac{1}{4}\)	Memory
Matrix-matrix	\(\frac{2n^3}{8 \cdot 3n^2} = \frac{n}{12}\)	Compute (for large \(n\))

4. Linear Algebra Fundamentals

Vectors

A vector \(\mathbf{x} \in \mathbb{R}^n\) is an ordered collection of \(n\) real numbers:

\[\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}\]

Vector Operations:

Operation	Definition	FLOPs	Memory
Scale	\(\mathbf{y} = \alpha \mathbf{x}\)	\(n\)	\(2n\)
Addition	\(\mathbf{z} = \mathbf{x} + \mathbf{y}\)	\(n\)	\(3n\)
AXPY	\(\mathbf{y} = \alpha\mathbf{x} + \mathbf{y}\)	\(2n\)	\(3n\)
Dot product	\(\alpha = \mathbf{x}^T \mathbf{y} = \sum_{i=1}^n x_i y_i\)	\(2n\)	\(2n\)
Norm	\(\\|\mathbf{x}\\|_2 = \sqrt{\mathbf{x}^T \mathbf{x}}\)	\(2n\)	\(n\)

Matrices

A matrix \(A \in \mathbb{R}^{m \times n}\) is a 2D array with \(m\) rows and \(n\) columns:

\[A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}\]

Storage Layouts:

Row-major (C/C++): Elements in each row are contiguous.

A[i][j] → memory[i * n + j]

Column-major (Fortran): Elements in each column are contiguous.

A[i][j] → memory[j * m + i]

Matrix-Vector Multiplication

Given \(A \in \mathbb{R}^{m \times n}\) and \(\mathbf{x} \in \mathbb{R}^n\), compute \(\mathbf{y} = A\mathbf{x}\) where \(\mathbf{y} \in \mathbb{R}^m\):

\[y_i = \sum_{j=1}^{n} a_{ij} x_j \quad \text{for } i = 1, \ldots, m\]

Complexity: \(2mn\) FLOPs, \(O(mn)\) memory access.

for (int i = 0; i < m; ++i) {
    y[i] = 0;
    for (int j = 0; j < n; ++j) {
        y[i] += A[i * n + j] * x[j];
    }
}

Matrix-Matrix Multiplication

Given \(A \in \mathbb{R}^{m \times k}\) and \(B \in \mathbb{R}^{k \times n}\), compute \(C = AB\) where \(C \in \mathbb{R}^{m \times n}\):

\[c_{ij} = \sum_{\ell=1}^{k} a_{i\ell} b_{\ell j}\]

Complexity: \(2mnk\) FLOPs.

For square matrices ( \(m = k = n\)), this is \(O(n^3)\) FLOPs with \(O(n^2)\) data—hence compute-bound for large \(n\).

for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
        C[i * n + j] = 0;
        for (int l = 0; l < k; ++l) {
            C[i * n + j] += A[i * k + l] * B[l * n + j];
        }
    }
}

5. Parallelizing Linear Algebra

Data Decomposition

1D Block Distribution: Partition vector/matrix rows among processors.

For vector \(\mathbf{x}\) of length \(n\) across \(p\) processors:

Processor \(i\) owns elements \([i \cdot n/p,\ (i+1) \cdot n/p)\)

2D Block Distribution: Partition matrix into rectangular tiles.

For matrix \(A\) on a \(p_r \times p_c\) processor grid:

Processor \((i, j)\) owns a block of size \((m/p_r) \times (n/p_c)\)

Parallel Dot Product

Each processor computes a local partial sum, then a global reduction combines results:

\[\mathbf{x}^T \mathbf{y} = \sum_{i=0}^{p-1} \left( \sum_{j \in \text{local}_i} x_j y_j \right)\]

Processor 0:  x[0:249]·y[0:249]   = 125.0  ─┐
Processor 1:  x[250:499]·y[250:499] = 130.0  ├─→ Reduce(+) → 510.0
Processor 2:  x[500:749]·y[500:749] = 128.0  │
Processor 3:  x[750:999]·y[750:999] = 127.0  ─┘

Parallel Matrix-Vector Product

For \(\mathbf{y} = A\mathbf{x}\) with row-distributed \(A\):

Each processor needs the entire vector \(\mathbf{x}\)
Broadcast or all-gather \(\mathbf{x}\)
Each processor computes its local rows of \(\mathbf{y}\)

Communication cost: \(O(n)\) per processor for the gather.

Parallel Matrix-Matrix Product

For \(C = AB\) on a 2D processor grid, use algorithms like:

Cannon's Algorithm: Shift-based, memory-efficient
SUMMA: Broadcast-based, simpler to implement
2.5D Algorithms: Trade memory for reduced communication

6. Key Takeaways

Parallelism is essential: Clock speeds won't save us; we must use multiple processors.
Know your bounds: Use Amdahl/Gustafson to estimate potential speedup; use Roofline to identify bottlenecks.
Arithmetic intensity matters: Matrix-matrix multiplication parallelizes well; dot products are memory-bound.
Communication is expensive: Minimizing data movement is often more important than minimizing computation.
Data layout affects performance: Row-major vs. column-major, cache utilization, and memory access patterns are critical.

Exercises

A program spends 10% of its time in serial code. What is the maximum speedup with (a) 10 processors? (b) 100 processors? (c) infinite processors?
Compute the arithmetic intensity of the AXPY operation \(\mathbf{y} = \alpha \mathbf{x} + \mathbf{y}\).
For a 1000×1000 matrix-matrix multiplication, how many FLOPs are performed? If your machine achieves 100 GFLOP/s, what is the minimum execution time?
You have a vector of length \(10^6\) distributed across 4 processors. Describe the communication pattern needed to compute the Euclidean norm.

References

Hennessy & Patterson, Computer Architecture: A Quantitative Approach
Golub & Van Loan, Matrix Computations
Williams, Waterman & Patterson, "Roofline: An Insightful Visual Performance Model"