1. Introduction
Banded matrices arise naturally in the discretization of differential equations, particularly in:
- Radiative transfer: Two-stream and multi-stream approximations (TUVX photolysis)
- Finite differences: Higher-order stencils produce wider bands
- Finite elements: Local basis functions create banded mass/stiffness matrices
- Spline interpolation: B-spline collocation matrices
A banded matrix \(A \in \mathbb{R}^{n \times n}\) has bandwidth parameters \((k_l, k_u)\) where:
- \(k_l\) = number of lower (sub-) diagonals
- \(k_u\) = number of upper (super-) diagonals
The element \(a_{ij} = 0\) whenever \(j < i - k_l\) or \(j > i + k_u\).
Structure visualization (pentadiagonal, \(k_l = k_u = 2\)):
\[A = \begin{pmatrix}
\times & \times & \times & 0 & 0 & \cdots \\
\times & \times & \times & \times & 0 & \cdots \\
\times & \times & \times & \times & \times & \cdots \\
0 & \times & \times & \times & \times & \cdots \\
\vdots & & \ddots & \ddots & \ddots & \ddots
\end{pmatrix}\]
Key advantage: Exploiting band structure reduces:
- Storage: \(O(n \cdot (k_l + k_u + 1))\) instead of \(O(n^2)\)
- Solve complexity: \(O(n \cdot k_l \cdot (k_l + k_u))\) instead of \(O(n^3)\)
2. Band Storage Formats
2.1 LAPACK Band Storage (Column-Major)
The standard format for band storage, used in LAPACK's DGBTRF/DGBTRS, stores diagonals in rows of a 2D array with leading dimension \(\text{ldab} = 2k_l + k_u + 1\).
Layout: For matrix element \(a_{ij}\) within the band:
\[\text{AB}(k_l + k_u + i - j, \, j) = a_{ij}\]
where AB is the band storage array with 0-based indexing.
Visual representation (for \(n=5\), \(k_l=2\), \(k_u=1\), so \(\text{ldab}=6\)):
Original matrix A:            Band storage AB (ldab=6, n=5):
| a00 a01  0   0   0  |   row 0: [  *   *   *   *   *  ] (extra for fill-in)
| a10 a11 a12  0   0  |   row 1: [  *   *   *   *   *  ] (extra for fill-in)
| a20 a21 a22 a23  0  |   row 2: [  *  a01 a12 a23 a34 ] (upper diagonal)
|  0  a31 a32 a33 a34 |   row 3: [ a00 a11 a22 a33 a44 ] (main diagonal)
|  0   0  a42 a43 a44 |   row 4: [ a10 a21 a32 a43  *  ] (lower diagonal 1)
                          row 5: [ a20 a31 a42  *   *  ] (lower diagonal 2)
**Why extra \(k_l\) rows?** During LU factorization with row pivoting, fill-in can occur in the upper triangle up to \(k_l\) positions above the original upper bandwidth.
2.2 Memory Layout Benefits
Column-major storage (as in the layout above) ensures that:
- LU factorization accesses memory sequentially when eliminating within a column
- Forward/back substitution has good cache behavior
- Direct LAPACK compatibility for fallback to optimized libraries
**Memory access pattern during elimination of column \(j\)** (scaling the column of multipliers by the reciprocal of the pivot):
real inv_pivot = 1.0 / AB[kv + j*ldab];   // pivot sits at band row kv = kl + ku
for (idx i = 1; i <= min(kl, n-j-1); ++i) {
    AB[kv + i + j*ldab] *= inv_pivot;     // consecutive addresses within column j
}
3. LU Factorization for Banded Matrices
3.1 Mathematical Foundation
For a banded matrix \(A\) with bandwidths \((k_l, k_u)\), we compute \(PA = LU\) where:
- \(P\) is a permutation matrix (from partial pivoting)
- \(L\) is unit lower triangular with at most \(k_l\) sub-diagonals
- \(U\) is upper triangular with at most \(k_l + k_u\) super-diagonals (due to fill-in)
Theorem (Band Preservation): If \(A\) has lower bandwidth \(k_l\) and upper bandwidth \(k_u\), then:
- \(L\) has lower bandwidth \(k_l\)
- \(U\) has upper bandwidth \(k_l + k_u\) (worst case with pivoting)
Proof: During Gaussian elimination of column \(j\):
- Without pivoting: row \(i > j\) is modified by subtracting a multiple of row \(j\)
- The rightmost nonzero in row \(j\) is at column \(j + k_u\)
- The rightmost nonzero in row \(i\) (for \(i \leq j + k_l\)) is at column \(i + k_u\)
- After elimination, row \(i\) can have nonzeros up to column \(\max(j + k_u, i + k_u) = i + k_u\) (since \(i > j\)), so without pivoting no row extends past its original upper bandwidth
With pivoting, row \(j\) may come from row \(j + k_l\) at worst, creating fill-in up to column \((j + k_l) + k_u\) for original row \(j\). But this is stored in the extra \(k_l\) rows we allocated. \(\square\)
3.2 Algorithm Derivation
Standard Gaussian elimination adapted for band structure:
For column \(j = 0, 1, \ldots, n-1\):
- Pivot selection (partial pivoting): Search rows \(j\) to \(\min(j + k_l, n-1)\) for largest \(|a_{ij}|\)
- Row interchange: If pivot row \(p \neq j\), swap rows \(p\) and \(j\) (only columns \(j\) through the rightmost column touched by either row, at most \(j + k_l + k_u\), need swapping; the already-computed multipliers to the left stay in place, and the permutation is applied to them implicitly via ipiv during the solve)
- Elimination: For rows \(i = j+1\) to \(\min(j + k_l, n-1)\):
- Compute multiplier: \(l_{ij} = a_{ij} / a_{jj}\)
- Update row \(i\): \(a_{ik} \leftarrow a_{ik} - l_{ij} \cdot a_{jk}\) for \(k = j+1\) up to the rightmost nonzero of the (possibly swapped) row \(j\), at most \(\min(j + k_l + k_u, n-1)\)
3.3 The Algorithm
Input: A in band storage (ldab × n), pivot array ipiv[n]
Output: LU factors in A, permutation in ipiv
kv = kl + ku  // Offset to main diagonal in band storage
for j = 0 to n-1:
    // 1. Find pivot in column j
    pivot_row = j
    max_val = |AB[kv, j]|
    for i = j+1 to min(j+kl, n-1):
        if |AB[kv + i - j, j]| > max_val:
            max_val = |AB[kv + i - j, j]|
            pivot_row = i
    ipiv[j] = pivot_row
    // 2. Check for singularity (test the column maximum, not AB[kv, j],
    //    so a zero diagonal with a nonzero entry below it is not misreported)
    if max_val == 0:
        return SINGULAR at column j
    // 3. Swap rows if needed. Only columns j..ju are swapped; the permutation
    //    acts on the L multipliers implicitly via ipiv during the solve.
    //    ju conservatively covers all possible fill-in (the fill rows of AB
    //    must be zero-initialized); LAPACK's DGBTF2 instead tracks the actual
    //    fill extent with a running maximum to skip known-zero columns.
    ju = min(j + kl + ku, n-1)
    if pivot_row ≠ j:
        for col = j to ju:
            swap AB[kv + j - col, col] and AB[kv + pivot_row - col, col]
    // 4. Compute multipliers and eliminate
    pivot_val = AB[kv, j]
    num_elim = min(kl, n - j - 1)
    // Scale column (compute L multipliers)
    for i = 1 to num_elim:
        AB[kv + i, j] /= pivot_val
    // Update trailing submatrix out to column ju, where the pivot row may
    // carry fill-in beyond the original upper bandwidth
    for k = j+1 to ju:
        a_jk = AB[kv + j - k, k]
        if a_jk ≠ 0:
            for i = 1 to num_elim:
                AB[kv + i + j - k, k] -= AB[kv + i, j] * a_jk
3.4 Solving After Factorization
Given \(PA = LU\), solve \(A\mathbf{x} = \mathbf{b}\) in two passes.
Forward substitution with interleaved row swaps (solve \(L\mathbf{y} = P\mathbf{b}\)). Because the L multipliers are not physically permuted in band storage, each swap is applied just before its column's multipliers are used, not as a single up-front permutation pass:
for j = 0 to n-1:
    if ipiv[j] ≠ j:
        swap b[j] and b[ipiv[j]]
    if b[j] ≠ 0:
        for i = j+1 to min(j + kl, n-1):
            b[i] -= AB[kv + i - j, j] * b[j]
Back substitution (solve \(U\mathbf{x} = \mathbf{y}\)); note that \(U\) has upper bandwidth \(k_l + k_u\) after fill-in, so the update reaches up to \(k_l + k_u\) rows above the diagonal:
for j = n-1 down to 0:
    b[j] /= AB[kv, j]
    if b[j] ≠ 0:
        for i = max(0, j - kl - ku) to j-1:
            b[i] -= AB[kv + i - j, j] * b[j]
4. Complexity Analysis
4.1 Operation Counts
LU Factorization:
For column \(j\), the number of operations is:
- Pivot search: \(O(k_l)\) comparisons
- Row swap: \(O(k_l + k_u)\) swaps (at most)
- Multiplier computation: \(k_l\) divisions (for rows \(j+1\) to \(j+k_l\))
- Submatrix update: up to \(k_l \times (k_l + k_u)\) multiply-adds (the update can extend into the fill-in columns)
Summing over all columns:
\[\text{FLOPs} = \sum_{j=0}^{n-1} O(k_l \cdot (k_l + k_u)) = O(n \cdot k_l \cdot (k_l + k_u))\]
More precisely, accounting for boundary effects (shorter updates in the final columns):
\[\text{FLOPs} \approx n \cdot k_l \cdot (k_l + k_u) - O\big(k_l \cdot (k_l + k_u)^2\big)\]
Forward/Back Substitution:
- Forward: \(\sum_{j=0}^{n-1} O(k_l) = O(n \cdot k_l)\)
- Back: \(\sum_{j=0}^{n-1} O(k_u) = O(n \cdot k_u)\)
- Total solve: \(O(n \cdot (k_l + k_u))\)
4.2 Comparison with Dense and Tridiagonal
| Method | Factor | Solve | Storage |
|---|---|---|---|
| Dense LU | \(\frac{2}{3}n^3\) | \(2n^2\) | \(n^2\) |
| Banded LU | \(n \cdot k_l \cdot (k_l + k_u)\) | \(2n(k_l + k_u)\) | \((2k_l + k_u + 1) \cdot n\) |
| Tridiagonal (Thomas) | \(3n\) | \(5n\) | \(4n\) |
Example: \(n = 10000\), pentadiagonal ( \(k_l = k_u = 2\))
- Dense: \(6.7 \times 10^{11}\) FLOPs for factor
- Banded: \(10000 \times 2 \times 4 = 80000\) FLOPs for factor
- Speedup: \(8.3 \times 10^6\) times faster!
4.3 Memory Bandwidth Analysis
For large \(n\), the algorithm is typically memory-bound rather than compute-bound.
Bytes per FLOP (banded factorization):
- Each element accessed roughly once during elimination
- Storage: \((2k_l + k_u + 1) \cdot n \times 8\) bytes (double precision)
- FLOPs: \(n \cdot k_l \cdot (k_l + k_u)\)
- Ratio: \(\frac{8(2k_l + k_u + 1)}{k_l(k_l + k_u)}\) bytes/FLOP
For pentadiagonal ( \(k_l = k_u = 2\)): \(\frac{8 \times 7}{2 \times 4} = 7\) bytes/FLOP
Modern CPUs achieve ~50 GFLOPS but only ~50 GB/s memory bandwidth. At 7 bytes/FLOP, we're limited to ~7 GFLOPS — memory bound.
5. Numerical Stability
5.1 Partial Pivoting Guarantees
Theorem: LU factorization with partial pivoting satisfies:
\[\|L\|_\infty \leq 1 + k_l\]
for banded matrices, where the bound comes from the unit diagonal plus at most \(k_l\) off-diagonal entries per row of \(L\), each with magnitude \(\leq 1\).
Growth Factor: The growth factor \(\rho = \max_{ij}|u_{ij}| / \max_{ij}|a_{ij}|\) is modest for banded elimination in practice; for tridiagonal matrices \(\rho \leq 2\), and for diagonally dominant matrices \(\rho \leq 2\) regardless of bandwidth.
This is much better than the \(2^{n-1}\) worst case for dense matrices.
5.2 Condition Number and Error
The computed solution \(\hat{\mathbf{x}}\) satisfies:
\[\frac{\|\hat{\mathbf{x}} - \mathbf{x}\|}{\|\mathbf{x}\|} \lesssim \kappa(A) \cdot \epsilon_{\text{mach}}\]
where \(\kappa(A) = \|A\| \cdot \|A^{-1}\|\) is the condition number.
Condition number estimation for banded matrices:
- Use 1-norm: \(\|A\|_1 = \max_j \sum_i |a_{ij}|\) (maximum absolute column sum)
- Estimate \(\|A^{-1}\|_1\) via Hager's algorithm (one forward/back solve)
5.3 Diagonal Dominance
Definition: \(A\) is diagonally dominant if:
\[|a_{ii}| \geq \sum_{j \neq i} |a_{ij}|, \quad \forall i\]
with strict inequality for at least one row.
Theorem: For strictly diagonally dominant banded matrices:
- Pivoting can be skipped: elimination without pivoting is stable, with growth factor \(\rho \leq 2\) (if \(A\) is diagonally dominant by columns, partial pivoting never swaps anyway, since the diagonal is the largest element in its column)
- Without row swaps, no fill-in occurs beyond the original bandwidth
- Diagonal dominance is preserved through elimination (Appendix B), so no zero pivot is ever encountered
Common diagonally dominant systems:
- Laplacian discretization (FD, FEM)
- Diffusion equations
- Many radiative transfer discretizations
6. Implementation Details
6.1 BandedMatrix Class Design
class BandedMatrix {
  public:
    BandedMatrix(idx n, idx kl, idx ku);
    real& operator()(idx i, idx j);      // element access a(i, j) within the band
    real& band(idx band_row, idx col);   // direct access to the storage array
    idx size() const { return n_; }
    idx kl() const { return kl_; }
    idx ku() const { return ku_; }
    idx ldab() const { return 2*kl_ + ku_ + 1; }
  private:
    idx n_, kl_, ku_, ldab_;             // dimensions cached at construction
    std::unique_ptr<real[]> data_;       // ldab_ * n_ entries, column-major
};
Element access mapping:
real& BandedMatrix::operator()(idx i, idx j) {
return data_[(kl_ + ku_ + i - j) + j * ldab_];
}
6.2 SIMD Optimization
The inner loops in elimination and solve are vectorizable. In the elimination of column j (with kv, ldab, num_rows, k, and a_jk as in the pseudocode of Section 3.3):
// Scale the column of L multipliers by the reciprocal pivot
real inv_pivot = 1.0 / ab[kv + j * ldab];
#pragma omp simd
for (idx i = 1; i <= num_rows; ++i) {
    ab[kv + i + j * ldab] *= inv_pivot;
}
// Rank-1 update of trailing column k (note the j in the index: row j+i of A)
#pragma omp simd
for (idx i = 1; i <= num_rows; ++i) {
    ab[kv + i + j - k + k * ldab] -= ab[kv + i + j * ldab] * a_jk;
}
The #pragma omp simd directive asserts that iterations are independent, allowing the compiler to use SIMD instructions (AVX2 or AVX-512, depending on the target).
6.3 Cache Optimization
Blocking for L2 cache: For very wide bands ( \(k_l, k_u > 100\)), block the elimination:
for (idx jb = 0; jb < n; jb += BLOCK_SIZE) {
    idx je = min(jb + BLOCK_SIZE, n);
    for (idx j = jb; j < je; ++j) {
        // factor column j: pivot search, row swap, scale, and updates
        // confined to columns within the current block [jb, je)
    }
    for (idx k = je; k < min(je + kl + ku, n); ++k) {
        // apply the block's accumulated eliminations to trailing column k
        // (fill-in can reach kl + ku columns past the block boundary)
    }
}
6.4 Multiple Right-Hand Sides
When solving with multiple RHS vectors (common in radiative transfer for multiple wavelengths):
void banded_lu_solve_multi(const BandedMatrix& A, const idx* ipiv,
                           real* B, idx nrhs) {
    const idx n = A.size();
    #pragma omp parallel for if(nrhs > 16)
    for (idx rhs = 0; rhs < nrhs; ++rhs) {
        real* x = B + rhs * n;        // each RHS is one column of B
        banded_lu_solve(A, ipiv, x);  // pivot swaps + forward/back substitution
    }
}
Parallelization over RHS is embarrassingly parallel — no synchronization needed.
7. GPU Implementation
7.1 Batched Banded Solver
For GPU efficiency, we solve many independent banded systems in parallel:
__global__ void banded_lu_batched(
real* AB, // Packed band matrices (ldab × n × batch_size)
idx* ipiv, // Pivot arrays (n × batch_size)
idx n, idx kl, idx ku, idx ldab, idx batch_size)
{
idx batch_idx = blockIdx.x * blockDim.x + threadIdx.x;
if (batch_idx >= batch_size) return;
// Each thread handles one complete system
real* ab = AB + batch_idx * ldab * n;
idx* piv = ipiv + batch_idx * n;
idx kv = kl + ku;
for (idx j = 0; j < n; ++j) {
// Find pivot (sequential within thread)
idx pivot = j;
real max_val = fabs(ab[kv + j * ldab]);
for (idx i = j + 1; i <= min(j + kl, n - 1); ++i) {
real val = fabs(ab[kv + i - j + j * ldab]);
if (val > max_val) {
max_val = val;
pivot = i;
}
}
piv[j] = pivot;
// Swap and eliminate (same as CPU, but within thread)
// ...
}
}
7.2 When GPU is Beneficial
| Scenario | CPU | GPU | Winner |
|---|---|---|---|
| Single small system ( \(n < 1000\)) | 10 μs | 50 μs (launch overhead) | CPU |
| Single large system ( \(n > 10000\)) | 1 ms | 0.5 ms | GPU |
| 1000 small systems | 10 ms | 0.1 ms | GPU |
| 1000 large systems | 1 s | 50 ms | GPU |
The key insight: GPU excels at batched solves, which is exactly the pattern in radiative transfer (one system per spectral band or grid column).
8. Application: Radiative Transfer
8.1 Two-Stream Approximation
The radiative transfer equation for diffuse radiation:
\[\mu \frac{dI}{d\tau} = I - S\]
In the two-stream approximation (up/down fluxes \(F^+\), \(F^-\)), discretized over \(N\) atmospheric layers:
\[\begin{pmatrix}
T_1 & R_1 & & \\
R_1 & T_1 & R_2 & \\
& \ddots & \ddots & \ddots \\
& & R_{N-1} & T_N
\end{pmatrix}
\begin{pmatrix} F^+_1 \\ F^-_1 \\ \vdots \\ F^-_N \end{pmatrix}
=
\begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_N \end{pmatrix}\]
where:
- \(T_i\) = transmission matrix for layer \(i\)
- \(R_i\) = reflection matrix coupling adjacent layers
- \(S_i\) = source terms (direct solar, thermal emission)
Matrix structure: Block tridiagonal with \(2 \times 2\) blocks is banded with \(k_l = k_u \leq 3\) after flattening to scalar form; exploiting the zero structure of the blocks (or a suitable ordering of the unknowns) typically reduces this to pentadiagonal (\(k_l = k_u = 2\)).
8.2 TUVX Photolysis Rates
TUVX (Tropospheric Ultraviolet-Visible) computes photolysis rate coefficients:
\[J_i = \int_\lambda \sigma_i(\lambda) \phi_i(\lambda) F(\lambda) \, d\lambda\]
where \(F(\lambda)\) is the actinic flux from the radiative transfer solve.
Computational pattern:
- For each wavelength band (~150 bands)
- For each atmospheric column (millions in global models)
- Solve tridiagonal/pentadiagonal system for fluxes
This is ideal for batched banded solvers!
8.3 Performance Considerations for NCAR Derecho
NCAR's Derecho supercomputer specifications:
- CPU: AMD EPYC Milan (128 cores/node)
- GPU: NVIDIA A100 (if using GPU partition)
- Memory bandwidth: ~400 GB/s per node
Optimization for Derecho:
# Build with AMD-specific optimizations
cmake .. -DCMAKE_CXX_FLAGS="-O3 -march=znver3 -fopenmp -ffast-math"
Key flags:
- -march=znver3: Zen 3 (Milan) specific instructions, including AVX2
- -fopenmp: Enable OpenMP for multi-RHS parallelization
- -ffast-math: Allow floating-point reassociation for SIMD; this changes rounding and relaxes NaN/Inf handling, so validate results against a strict-math build before enabling it
Expected performance:
- Tridiagonal ( \(n=100\) layers): ~0.5 μs per system
- Pentadiagonal ( \(n=100\) layers): ~1 μs per system
- Throughput: ~2M systems/second per core
9. Summary and Best Practices
9.1 Algorithm Selection Guide
| System Type | Recommended Solver |
|---|---|
| Tridiagonal, single | Thomas algorithm |
| Tridiagonal, batched | GPU batched Thomas |
| General banded, single | banded_solve() |
| General banded, reuse matrix | banded_lu() + banded_lu_solve() |
| General banded, multi-RHS | banded_lu_solve_multi() |
| Very large single system | Consider iterative (CG with banded preconditioner) |
9.2 Numerical Robustness Checklist
- Check for diagonal dominance — if satisfied, pivoting is unnecessary
- Estimate condition number — use banded_rcond() after factorization
- Monitor pivot size — small pivots indicate near-singularity
- Use iterative refinement for ill-conditioned systems
9.3 Performance Optimization Checklist
- Factor once, solve many — amortize factorization cost
- Batch independent systems — exploit parallelism
- Align memory — 64-byte alignment for AVX-512
- Profile memory bandwidth — banded solvers are often memory-bound
- Consider mixed precision — factor in double, solve in single if accuracy permits
9.4 Code References
| Function | Purpose | Complexity |
|---|---|---|
| BandedMatrix(n, kl, ku) | Construct band storage | \(O(n \cdot \text{ldab})\) |
| banded_lu(A, ipiv) | LU factorization | \(O(n \cdot k_l \cdot (k_l + k_u))\) |
| banded_lu_solve(A, ipiv, b) | Solve with factored matrix | \(O(n \cdot (k_l + k_u))\) |
| banded_lu_solve_multi(A, ipiv, B, nrhs) | Multiple RHS solve | \(O(n \cdot (k_l + k_u) \cdot \text{nrhs})\) |
| banded_solve(A, b, x) | One-shot solve | Factor + Solve |
| banded_matvec(A, x, y) | Matrix-vector product | \(O(n \cdot (k_l + k_u))\) |
| banded_norm1(A) | 1-norm | \(O(n \cdot (k_l + k_u))\) |
| banded_rcond(A, ipiv, anorm) | Condition estimate | \(O(n \cdot (k_l + k_u))\) |
Appendix A: Derivation of Fill-in Bound
Claim: With partial pivoting, \(U\) has at most \(k_l + k_u\) super-diagonals.
Proof:
Consider eliminating column \(j\). The pivot row \(p\) satisfies \(j \leq p \leq j + k_l\).
After swapping row \(p\) into row \(j\):
- Original row \(p\) had nonzeros in columns \(\max(0, p - k_l)\) to \(\min(n-1, p + k_u)\)
- The rightmost nonzero is at column \(p + k_u \leq (j + k_l) + k_u = j + k_l + k_u\)
Since we're in row \(j\) (the pivot row), this creates fill-in up to column \(j + k_l + k_u\).
Thus \(U\) has upper bandwidth \(k_l + k_u\). \(\square\)
Appendix B: Diagonal Dominance Preserved Under Elimination
Claim: If \(A\) is strictly diagonally dominant, so is the Schur complement during elimination.
Proof (first elimination step; induction over the successive Schur complements handles the rest):
After eliminating column 0, the updated entries of row 1 are \(\tilde{a}_{1k} = a_{1k} - \frac{a_{10}}{a_{00}} a_{0k}\) for \(k \geq 1\), and \(\tilde{a}_{10} = 0\).
Lower bound on the new diagonal (triangle inequality):
\[|\tilde{a}_{11}| \geq |a_{11}| - \frac{|a_{10}|}{|a_{00}|} |a_{01}|\]
Upper bound on the off-diagonals (\(k \geq 2\)):
\[|\tilde{a}_{1k}| \leq |a_{1k}| + \frac{|a_{10}|}{|a_{00}|} |a_{0k}|\]
Summing over \(k \geq 2\) and applying strict dominance of rows 1 and 0:
\[\sum_{k \geq 2} |\tilde{a}_{1k}| \leq \sum_{k \geq 2} |a_{1k}| + \frac{|a_{10}|}{|a_{00}|} \sum_{k \geq 2} |a_{0k}| < \big(|a_{11}| - |a_{10}|\big) + \frac{|a_{10}|}{|a_{00}|} \big(|a_{00}| - |a_{01}|\big) = |a_{11}| - \frac{|a_{10}|}{|a_{00}|} |a_{01}| \leq |\tilde{a}_{11}|\]
So row 1 of the Schur complement is again strictly diagonally dominant, and the same argument applies to every remaining row. \(\square\)
This explains why Thomas algorithm (no pivoting) is stable for diagonally dominant tridiagonal systems.