1. GPU Architecture
GPUs are massively parallel processors designed for throughput, not latency. While a CPU has a few powerful cores optimized for sequential execution, a GPU has thousands of simpler cores optimized for parallel execution.
CPU vs. GPU Philosophy
| Aspect | CPU | GPU |
| Cores | 4-64 complex cores | 1000s of simple cores |
| Clock speed | 3-5 GHz | 1-2 GHz |
| Cache | Large (MB) | Small (KB per SM) |
| Control | Sophisticated branch prediction | Simple in-order |
| Strength | Latency | Throughput |
CPU: Few powerful cores GPU: Many simple cores
┌─────────────────────┐ ┌─────────────────────────────────┐
│ ┌─────┐ ┌─────┐ │ │ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ │
│ │Core │ │Core │ │ │ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ │
│ │ 0 │ │ 1 │ │ │ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ │
│ └─────┘ └─────┘ │ │ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ ▪▪▪▪▪▪▪▪ │
│ ┌─────┐ ┌─────┐ │ │ SM 0 SM 1 SM 2 ... │
│ │Core │ │Core │ │ └─────────────────────────────────┘
│ │ 2 │ │ 3 │ │
│ └─────┘ └─────┘ │
└─────────────────────┘
NVIDIA GPU Hierarchy
Streaming Multiprocessors (SMs): Independent processing units, each containing:
- CUDA cores (32-128 per SM)
- Shared memory / L1 cache
- Registers
- Warp schedulers
Warps: Groups of 32 threads that execute in lockstep (SIMT: Single Instruction, Multiple Threads).
2. CUDA Programming Model
Kernels and Threads
A kernel is a function that runs on the GPU. When launched, it executes across many parallel threads.
__global__ void add(float* a, float* b, float* c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i];
}
}
int threads = 256;
int blocks = (n + threads - 1) / threads;
add<<<blocks, threads>>>(d_a, d_b, d_c, n);
cudaDeviceSynchronize();
}
Thread Hierarchy
Threads are organized in a two-level hierarchy:
Grid (all threads for one kernel launch)
├── Block 0
│ ├── Thread 0
│ ├── Thread 1
│ └── ...
├── Block 1
│ ├── Thread 0
│ ├── Thread 1
│ └── ...
└── ...
Built-in variables:
| Variable | Meaning |
threadIdx.x | Thread index within block (0 to blockDim.x-1) |
blockIdx.x | Block index within grid |
blockDim.x | Threads per block |
gridDim.x | Blocks in grid |
Global thread ID: i = blockIdx.x * blockDim.x + threadIdx.x
Launch Configuration
kernel<<<numBlocks, threadsPerBlock>>>(args...);
Rules of thumb:
threadsPerBlock: Multiple of 32 (warp size), typically 128-512
numBlocks: Enough to cover your data: (n + threadsPerBlock - 1) / threadsPerBlock
For 2D problems:
dim3 block(16, 16);
dim3 grid((n+15)/16, (m+15)/16);
kernel<<<grid, block>>>(...);
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
3. Memory Hierarchy
GPU memory is hierarchical; choosing the right memory type is critical for performance.
┌─────────────────────────────────────────────────────────┐
│ Global Memory │
│ (Large, slow: ~500 GB/s) │
└─────────────────────────────────────────────────────────┘
↑ ↑
│ │
┌───────┴───────┐ ┌─────────┴─────────┐
│ SM 0 │ │ SM 1 │
│ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Shared │ │ │ │ Shared │ │
│ │ Memory │ │ (~100 TB/s) │ │ Memory │ │
│ │ (48 KB) │ │ │ │ (48 KB) │ │
│ └───────────┘ │ │ └───────────┘ │
│ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Registers │ │ (fastest) │ │ Registers │ │
│ └───────────┘ │ │ └───────────┘ │
└───────────────┘ └───────────────────┘
Memory Types
| Memory | Scope | Speed | Size | Lifetime |
| Registers | Thread | Fastest | ~255 per thread | Thread |
| Shared | Block | ~100 TB/s | 48-164 KB per SM | Block |
| Global | Grid | ~500-900 GB/s | GBs | Application |
| Constant | Grid | Cached | 64 KB | Application |
Global Memory
Allocated on host, accessible by all threads:
float* d_data;
cudaMalloc(&d_data, n * sizeof(float));
cudaMemcpy(d_data, h_data, n * sizeof(float),
cudaMemcpyHostToDevice);
cudaMemcpy(h_data, d_data, n * sizeof(float),
cudaMemcpyDeviceToHost);
cudaFree(d_data);
Shared Memory
Fast, on-chip memory shared by threads in a block:
__global__ void reduce(float* input, float* output, int n) {
__shared__ float sdata[256];
int tid = threadIdx.x;
int i = blockIdx.x * blockDim.x + tid;
sdata[tid] = (i < n) ? input[i] : 0;
__syncthreads();
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
if (tid == 0) output[blockIdx.x] = sdata[0];
}
Key points:
- Declare with
__shared__
- Use
__syncthreads() to synchronize threads within a block
- Bank conflicts can reduce performance (32 banks, consecutive addresses)
4. Common Patterns
Map (Element-wise Operations)
Each thread processes one element:
__global__ void scale(float* v, int n, float alpha) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
v[i] *= alpha;
}
}
Reduction
Combine all elements (sum, max, etc.):
__global__ void dot_product(const float* x, const float* y, float* result, int n) {
__shared__ float sdata[256];
int tid = threadIdx.x;
int i = blockIdx.x * blockDim.x + tid;
sdata[tid] = (i < n) ? x[i] * y[i] : 0;
__syncthreads();
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) sdata[tid] += sdata[tid + s];
__syncthreads();
}
if (tid == 0) atomicAdd(result, sdata[0]);
}
Note: atomicAdd for doubles requires compute capability 6.0+ or a CAS-based implementation:
__device__ double atomicAddDouble(double* addr, double val) {
unsigned long long* addr_ull = (unsigned long long*)addr;
unsigned long long old = *addr_ull, assumed;
do {
assumed = old;
old = atomicCAS(addr_ull, assumed,
__double_as_longlong(val + __longlong_as_double(assumed)));
} while (assumed != old);
return __longlong_as_double(old);
}
Stencil (Neighbors)
Each element depends on neighbors (e.g., finite differences):
__global__ void stencil_1d(float* in, float* out, int n) {
__shared__ float s[BLOCK + 2];
int gid = blockIdx.x * blockDim.x + threadIdx.x;
int lid = threadIdx.x + 1;
s[lid] = (gid < n) ? in[gid] : 0;
if (threadIdx.x == 0 && gid > 0)
s[0] = in[gid - 1];
if (threadIdx.x == blockDim.x - 1 && gid < n - 1)
s[lid + 1] = in[gid + 1];
__syncthreads();
if (gid > 0 && gid < n - 1) {
out[gid] = 0.25f * (s[lid-1] + 2*s[lid] + s[lid+1]);
}
}
5. Matrix Operations on GPU
Matrix-Vector Multiplication
Each thread computes one element of the output:
__global__ void matvec(const double* A, const double* x, double* y,
int rows, int cols) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < rows) {
double sum = 0;
for (int j = 0; j < cols; ++j) {
sum += A[i * cols + j] * x[j];
}
y[i] = sum;
}
}
Matrix-Matrix Multiplication
Naive version (each thread computes one element of C):
__global__ void matmul_naive(const double* A, const double* B, double* C,
int M, int K, int N) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < M && col < N) {
double sum = 0;
for (int k = 0; k < K; ++k) {
sum += A[row * K + k] * B[k * N + col];
}
C[row * N + col] = sum;
}
}
Tiled version using shared memory (much faster):
#define TILE 16
__global__ void matmul_tiled(const double* A, const double* B, double* C,
int M, int K, int N) {
__shared__ double As[TILE][TILE];
__shared__ double Bs[TILE][TILE];
int row = blockIdx.y * TILE + threadIdx.y;
int col = blockIdx.x * TILE + threadIdx.x;
double sum = 0;
for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
int a_col = t * TILE + threadIdx.x;
int b_row = t * TILE + threadIdx.y;
As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ?
A[row * K + a_col] : 0;
Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ?
B[b_row * N + col] : 0;
__syncthreads();
for (int k = 0; k < TILE; ++k) {
sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
}
__syncthreads();
}
if (row < M && col < N) {
C[row * N + col] = sum;
}
}
Why tiling helps:
- Without tiling: Each element of A and B is read \(N\) times from global memory
- With tiling: Each element is read once per tile into shared memory, then reused
6. Performance Optimization
Coalesced Memory Access
Threads in a warp should access consecutive memory addresses:
int i = blockIdx.x * blockDim.x + threadIdx.x;
float val = data[i];
float val = data[i * stride];
Occupancy
Occupancy = active warps / maximum warps per SM
Higher occupancy helps hide memory latency. Limited by:
- Registers per thread
- Shared memory per block
- Threads per block
Use cudaOccupancyMaxPotentialBlockSize to find optimal configuration.
Avoiding Warp Divergence
Threads in a warp execute the same instruction. If threads take different branches, both paths execute serially:
if (threadIdx.x % 2 == 0) {
} else {
}
if (threadIdx.x < 16) {
} else {
}
Use the Profiler
NVIDIA tools help identify bottlenecks:
nvprof / nsys — Timeline profiler
ncu (Nsight Compute) — Kernel analysis
cuda-memcheck — Memory error detection
nvprof ./my_program
nsys profile ./my_program
ncu --set full ./my_program
7. Our Library's CUDA Interface
The num::cuda namespace provides low-level operations:
}
void to_device(real *dst, const real *src, idx n)
Copy host to device.
void scale(real *v, idx n, real alpha)
v = alpha * v
void matmul(const real *A, const real *B, real *C, idx m, idx k, idx n)
C = A * B.
void free(real *ptr)
Free device memory.
real * alloc(idx n)
Allocate device memory.
void to_host(real *dst, const real *src, idx n)
Copy device to host.
void add(const real *x, const real *y, real *z, idx n)
z = x + y
void axpy(real alpha, const real *x, real *y, idx n)
y = alpha*x + y
real dot(const real *x, const real *y, idx n)
dot product
void matvec(const real *A, const real *x, real *y, idx rows, idx cols)
y = A * x (row-major A)
constexpr T ipow(T x) noexcept
Compute x^N at compile time via repeated squaring.
High-level interface with automatic memory management:
x.to_gpu();
y.to_gpu();
y.to_cpu();
void axpy(real alpha, const Vector &x, Vector &y, Backend b=default_backend)
y += alpha * x
8. CPU vs GPU: When to Use What
| Use CPU | Use GPU |
| Small data (< 10K elements) | Large data (> 100K elements) |
| Complex branching logic | Regular, data-parallel ops |
| Sequential algorithms | Embarrassingly parallel |
| Quick prototyping | Production performance |
Kernel launch overhead: ~5-20 μs. For small operations, this dominates.
From our benchmarks:
| Operation | Crossover Point |
| Dot product | GPU never wins (memory-bound, launch overhead dominates) |
| AXPY | ~64K elements |
| Matvec | ~256×256 |
| Matmul | GPU always wins (compute-bound) |
Exercises
- Write a CUDA kernel that computes \(\mathbf{z} = \mathbf{x} \cdot \mathbf{y}\) (element-wise product).
- Explain why the tiled matrix multiplication is faster. Calculate the reduction in global memory accesses for a 1024×1024 multiplication with TILE=32.
- A kernel uses 64 registers per thread and 48 KB shared memory. If the GPU has 65536 registers and 96 KB shared memory per SM, what is the maximum occupancy with 256 threads per block?
- Identify the warp divergence in this code and fix it:
if (threadIdx.x > data[threadIdx.x]) {
output[threadIdx.x] = 1;
} else {
output[threadIdx.x] = 0;
}
- Profile the library's
matmul on GPU using nvprof. What is the achieved memory bandwidth? How does it compare to the theoretical peak?
References
- Kirk & Hwu, Programming Massively Parallel Processors
- NVIDIA, CUDA C++ Programming Guide
- NVIDIA, CUDA C++ Best Practices Guide
- Harris, "Optimizing Parallel Reduction in CUDA"