1. Introduction
Banded matrices arise naturally in the discretization of differential equations, particularly in:
- Radiative transfer: Two-stream and multi-stream approximations (TUVX photolysis)
- Finite differences: Higher-order stencils produce wider bands
- Finite elements: Local basis functions create banded mass/stiffness matrices
- Spline interpolation: B-spline collocation matrices
A banded matrix \(A \in \mathbb{R}^{n \times n}\) has bandwidth parameters \((k_l, k_u)\) where:
- \(k_l\) = number of lower (sub-) diagonals
- \(k_u\) = number of upper (super-) diagonals
The element \(a_{ij} = 0\) whenever \(j < i - k_l\) or \(j > i + k_u\).
Structure visualization (pentadiagonal, \(k_l = k_u = 2\)):
\[A = \begin{pmatrix}
\times & \times & \times & 0 & 0 & \cdots \\
\times & \times & \times & \times & 0 & \cdots \\
\times & \times & \times & \times & \times & \cdots \\
0 & \times & \times & \times & \times & \cdots \\
\vdots & & \ddots & \ddots & \ddots & \ddots
\end{pmatrix}\]
Key advantage: Exploiting band structure reduces:
- Storage: \(O(n \cdot (k_l + k_u + 1))\) instead of \(O(n^2)\)
- Solve complexity: \(O(n \cdot k_l \cdot (k_l + k_u))\) instead of \(O(n^3)\)
2. Band Storage Formats
2.1 LAPACK Band Storage (Column-Major)
The standard format for band storage, used in LAPACK's DGBTRF/DGBTRS, stores diagonals in rows of a 2D array with leading dimension \(\text{ldab} = 2k_l + k_u + 1\).
Layout: For matrix element \(a_{ij}\) within the band:
\[\text{AB}(k_l + k_u + i - j, \, j) = a_{ij}\]
where AB is the band storage array with 0-based indexing.
Visual representation (for \(n=5\), \(k_l=2\), \(k_u=1\), so \(\text{ldab}=6\)):
Original matrix A:            Band storage AB (ldab=6, n=5):
| a00 a01  0   0   0  |   row 0: [  *   *   *   *   *  ] (extra for fill-in)
| a10 a11 a12  0   0  |   row 1: [  *   *   *   *   *  ] (extra for fill-in)
| a20 a21 a22 a23  0  |   row 2: [  *  a01 a12 a23 a34 ] (upper diagonal)
|  0  a31 a32 a33 a34 |   row 3: [ a00 a11 a22 a33 a44 ] (main diagonal)
|  0   0  a42 a43 a44 |   row 4: [ a10 a21 a32 a43  *  ] (lower diagonal 1)
                          row 5: [ a20 a31 a42  *   *  ] (lower diagonal 2)
**Why extra \(k_l\) rows?** During LU factorization with row pivoting, fill-in can occur in the upper triangle up to \(k_l\) positions above the original upper bandwidth.
2.2 Memory Layout Benefits
Column-major storage (as in the layout above) ensures that:
- LU factorization accesses memory sequentially when eliminating within a column
- Forward/back substitution has good cache behavior
- Direct LAPACK compatibility for fallback to optimized libraries
**Memory access pattern during elimination of column \(j\)** (scaling the column of multipliers by the reciprocal of the pivot):
real inv_pivot = 1.0 / AB[kv + j*ldab];   // pivot sits at band row kv = kl + ku
for (idx i = 1; i <= min(kl, n-j-1); ++i) {
    AB[kv + i + j*ldab] *= inv_pivot;     // consecutive addresses within column j
}
3. LU Factorization for Banded Matrices
3.1 Mathematical Foundation
For a banded matrix \(A\) with bandwidths \((k_l, k_u)\), we compute \(PA = LU\) where:
- \(P\) is a permutation matrix (from partial pivoting)
- \(L\) is unit lower triangular with at most \(k_l\) sub-diagonals
- \(U\) is upper triangular with at most \(k_l + k_u\) super-diagonals (due to fill-in)
Theorem (Band Preservation): If \(A\) has lower bandwidth \(k_l\) and upper bandwidth \(k_u\), then:
- \(L\) has lower bandwidth \(k_l\)
- \(U\) has upper bandwidth \(k_l + k_u\) (worst case with pivoting)
Proof: During Gaussian elimination of column \(j\):
- Without pivoting: row \(i > j\) is modified by subtracting a multiple of row \(j\)
- The rightmost nonzero in row \(j\) is at column \(j + k_u\)
- The rightmost nonzero in row \(i\) (for \(i \leq j + k_l\)) is at column \(i + k_u\)
- After elimination, row \(i\) can have nonzeros up to column \(\max(j + k_u, i + k_u) = i + k_u\) (since \(i > j\)), so without pivoting no row extends past its original upper bandwidth
With pivoting, row \(j\) may come from row \(j + k_l\) at worst, creating fill-in up to column \((j + k_l) + k_u\) for original row \(j\). But this is stored in the extra \(k_l\) rows we allocated. \(\square\)
3.2 Algorithm Derivation
Standard Gaussian elimination adapted for band structure:
For column \(j = 0, 1, \ldots, n-1\):
- Pivot selection (partial pivoting): Search rows \(j\) to \(\min(j + k_l, n-1)\) for largest \(|a_{ij}|\)
- Row interchange: If pivot row \(p \neq j\), swap rows \(p\) and \(j\) (only columns \(j\) through the rightmost column touched by either row, at most \(j + k_l + k_u\), need swapping; the already-computed multipliers to the left stay in place, and the permutation is applied to them implicitly via ipiv during the solve)
- Elimination: For rows \(i = j+1\) to \(\min(j + k_l, n-1)\):
- Compute multiplier: \(l_{ij} = a_{ij} / a_{jj}\)
- Update row \(i\): \(a_{ik} \leftarrow a_{ik} - l_{ij} \cdot a_{jk}\) for \(k = j+1\) up to the rightmost nonzero of the (possibly swapped) row \(j\), at most \(\min(j + k_l + k_u, n-1)\)
3.3 The Algorithm
Input: A in band storage (ldab × n), pivot array ipiv[n]
Output: LU factors in A, permutation in ipiv
kv = kl + ku  // Offset to main diagonal in band storage
for j = 0 to n-1:
    // 1. Find pivot in column j
    pivot_row = j
    max_val = |AB[kv, j]|
    for i = j+1 to min(j+kl, n-1):
        if |AB[kv + i - j, j]| > max_val:
            max_val = |AB[kv + i - j, j]|
            pivot_row = i
    ipiv[j] = pivot_row
    // 2. Check for singularity (test the column maximum, not AB[kv, j],
    //    so a zero diagonal with a nonzero entry below it is not misreported)
    if max_val == 0:
        return SINGULAR at column j
    // 3. Swap rows if needed. Only columns j..ju are swapped; the permutation
    //    acts on the L multipliers implicitly via ipiv during the solve.
    //    ju conservatively covers all possible fill-in (the fill rows of AB
    //    must be zero-initialized); LAPACK's DGBTF2 instead tracks the actual
    //    fill extent with a running maximum to skip known-zero columns.
    ju = min(j + kl + ku, n-1)
    if pivot_row ≠ j:
        for col = j to ju:
            swap AB[kv + j - col, col] and AB[kv + pivot_row - col, col]
    // 4. Compute multipliers and eliminate
    pivot_val = AB[kv, j]
    num_elim = min(kl, n - j - 1)
    // Scale column (compute L multipliers)
    for i = 1 to num_elim:
        AB[kv + i, j] /= pivot_val
    // Update trailing submatrix out to column ju, where the pivot row may
    // carry fill-in beyond the original upper bandwidth
    for k = j+1 to ju:
        a_jk = AB[kv + j - k, k]
        if a_jk ≠ 0:
            for i = 1 to num_elim:
                AB[kv + i + j - k, k] -= AB[kv + i, j] * a_jk
3.4 Solving After Factorization
Given \(PA = LU\), solve \(A\mathbf{x} = \mathbf{b}\) in two passes.
Forward substitution with interleaved row swaps (solve \(L\mathbf{y} = P\mathbf{b}\)). Because the L multipliers are not physically permuted in band storage, each swap is applied just before its column's multipliers are used, not as a single up-front permutation pass:
for j = 0 to n-1:
    if ipiv[j] ≠ j:
        swap b[j] and b[ipiv[j]]
    if b[j] ≠ 0:
        for i = j+1 to min(j + kl, n-1):
            b[i] -= AB[kv + i - j, j] * b[j]
Back substitution (solve \(U\mathbf{x} = \mathbf{y}\)); note that \(U\) has upper bandwidth \(k_l + k_u\) after fill-in, so the update reaches up to \(k_l + k_u\) rows above the diagonal:
for j = n-1 down to 0:
    b[j] /= AB[kv, j]
    if b[j] ≠ 0:
        for i = max(0, j - kl - ku) to j-1:
            b[i] -= AB[kv + i - j, j] * b[j]
4. Complexity Analysis
4.1 Operation Counts
LU Factorization:
For column \(j\), the number of operations is:
- Pivot search: \(O(k_l)\) comparisons
- Row swap: \(O(k_l + k_u)\) swaps (at most)
- Multiplier computation: \(k_l\) divisions (for rows \(j+1\) to \(j+k_l\))
- Submatrix update: up to \(k_l \times (k_l + k_u)\) multiply-adds (the update can extend into the fill-in columns)
Summing over all columns:
\[\text{FLOPs} = \sum_{j=0}^{n-1} O(k_l \cdot (k_l + k_u)) = O(n \cdot k_l \cdot (k_l + k_u))\]
More precisely, accounting for boundary effects (shorter updates in the final columns):
\[\text{FLOPs} \approx n \cdot k_l \cdot (k_l + k_u) - O\big(k_l \cdot (k_l + k_u)^2\big)\]
Forward/Back Substitution:
- Forward: \(\sum_{j=0}^{n-1} O(k_l) = O(n \cdot k_l)\)
- Back: \(\sum_{j=0}^{n-1} O(k_u) = O(n \cdot k_u)\)
- Total solve: \(O(n \cdot (k_l + k_u))\)
4.2 Comparison with Dense and Tridiagonal
| Method | Factor | Solve | Storage |
|---|---|---|---|
| Dense LU | \(\frac{2}{3}n^3\) | \(2n^2\) | \(n^2\) |
| Banded LU | \(n \cdot k_l \cdot (k_l + k_u)\) | \(2n(k_l + k_u)\) | \((2k_l + k_u + 1) \cdot n\) |
| Tridiagonal (Thomas) | \(3n\) | \(5n\) | \(4n\) |
Example: \(n = 10000\), pentadiagonal ( \(k_l = k_u = 2\))
- Dense: \(6.7 \times 10^{11}\) FLOPs for factor
- Banded: \(10000 \times 2 \times 4 = 80000\) FLOPs for factor
- Speedup: \(8.3 \times 10^6\) times faster!
4.3 Memory Bandwidth Analysis
For large \(n\), the algorithm is typically memory-bound rather than compute-bound.
Bytes per FLOP (banded factorization):
- Each element accessed roughly once during elimination
- Storage: \((2k_l + k_u + 1) \cdot n \times 8\) bytes (double precision)
- FLOPs: \(n \cdot k_l \cdot (k_l + k_u)\)
- Ratio: \(\frac{8(2k_l + k_u + 1)}{k_l(k_l + k_u)}\) bytes/FLOP
For pentadiagonal ( \(k_l = k_u = 2\)): \(\frac{8 \times 7}{2 \times 4} = 7\) bytes/FLOP
Modern CPUs achieve ~50 GFLOPS but only ~50 GB/s memory bandwidth. At 7 bytes/FLOP, we're limited to ~7 GFLOPS — memory bound.
5. Numerical Stability
5.1 Partial Pivoting Guarantees
Theorem: LU factorization with partial pivoting satisfies:
\[\|L\|_\infty \leq 1 + k_l\]
for banded matrices, where the bound comes from the unit diagonal plus at most \(k_l\) off-diagonal entries per row of \(L\), each with magnitude \(\leq 1\).
Growth Factor: The growth factor \(\rho = \max_{ij}|u_{ij}| / \max_{ij}|a_{ij}|\) is modest for banded elimination in practice; for tridiagonal matrices \(\rho \leq 2\), and for diagonally dominant matrices \(\rho \leq 2\) regardless of bandwidth.
This is much better than the \(2^{n-1}\) worst case for dense matrices.
5.2 Condition Number and Error
The computed solution \(\hat{\mathbf{x}}\) satisfies:
\[\frac{\|\hat{\mathbf{x}} - \mathbf{x}\|}{\|\mathbf{x}\|} \lesssim \kappa(A) \cdot \epsilon_{\text{mach}}\]
where \(\kappa(A) = \|A\| \cdot \|A^{-1}\|\) is the condition number.
Condition number estimation for banded matrices:
- Use 1-norm: \(\|A\|_1 = \max_j \sum_i |a_{ij}|\) (maximum absolute column sum)
- Estimate \(\|A^{-1}\|_1\) via Hager's algorithm (one forward/back solve)
5.3 Diagonal Dominance
Definition: \(A\) is diagonally dominant if:
\[|a_{ii}| \geq \sum_{j \neq i} |a_{ij}|, \quad \forall i\]
with strict inequality for at least one row.
Theorem: For strictly diagonally dominant banded matrices:
- Pivoting can be skipped: elimination without pivoting is stable, with growth factor \(\rho \leq 2\) (if \(A\) is diagonally dominant by columns, partial pivoting never swaps anyway, since the diagonal is the largest element in its column)
- Without row swaps, no fill-in occurs beyond the original bandwidth
- Diagonal dominance is preserved through elimination (Appendix B), so no zero pivot is ever encountered
Common diagonally dominant systems:
- Laplacian discretization (FD, FEM)
- Diffusion equations
- Many radiative transfer discretizations
6. Implementation Details
6.1 BandedMatrix Class Design
class BandedMatrix {
  public:
    BandedMatrix(idx n, idx kl, idx ku);
    real& operator()(idx i, idx j);      // element access a(i, j) within the band
    real& band(idx band_row, idx col);   // direct access to the storage array
    idx size() const { return n_; }
    idx kl() const { return kl_; }
    idx ku() const { return ku_; }
    idx ldab() const { return 2*kl_ + ku_ + 1; }
  private:
    idx n_, kl_, ku_, ldab_;             // dimensions cached at construction
    std::unique_ptr<real[]> data_;       // ldab_ * n_ entries, column-major
};
Element access mapping:
real& BandedMatrix::operator()(idx i, idx j) {
return data_[(kl_ + ku_ + i - j) + j * ldab_];
}
6.2 SIMD Optimization
The inner loops in elimination and solve are vectorizable. In the elimination of column j (with kv, ldab, num_rows, k, and a_jk as in the pseudocode of Section 3.3):
// Scale the column of L multipliers by the reciprocal pivot
real inv_pivot = 1.0 / ab[kv + j * ldab];
#pragma omp simd
for (idx i = 1; i <= num_rows; ++i) {
    ab[kv + i + j * ldab] *= inv_pivot;
}
// Rank-1 update of trailing column k (note the j in the index: row j+i of A)
#pragma omp simd
for (idx i = 1; i <= num_rows; ++i) {
    ab[kv + i + j - k + k * ldab] -= ab[kv + i + j * ldab] * a_jk;
}
The #pragma omp simd directive asserts that iterations are independent, allowing the compiler to use SIMD instructions (AVX2 or AVX-512, depending on the target).
6.3 Cache Optimization
Blocking for L2 cache: For very wide bands ( \(k_l, k_u > 100\)), block the elimination:
for (idx jb = 0; jb < n; jb += BLOCK_SIZE) {
    idx je = min(jb + BLOCK_SIZE, n);
    for (idx j = jb; j < je; ++j) {
        // factor column j: pivot search, row swap, scale, and updates
        // confined to columns within the current block [jb, je)
    }
    for (idx k = je; k < min(je + kl + ku, n); ++k) {
        // apply the block's accumulated eliminations to trailing column k
        // (fill-in can reach kl + ku columns past the block boundary)
    }
}
6.4 Multiple Right-Hand Sides
When solving with multiple RHS vectors (common in radiative transfer for multiple wavelengths):
void banded_lu_solve_multi(const BandedMatrix& A, const idx* ipiv,
                           real* B, idx nrhs) {
    const idx n = A.size();
    #pragma omp parallel for if(nrhs > 16)
    for (idx rhs = 0; rhs < nrhs; ++rhs) {
        real* x = B + rhs * n;        // each RHS is one column of B
        banded_lu_solve(A, ipiv, x);  // pivot swaps + forward/back substitution
    }
}
Parallelization over RHS is embarrassingly parallel — no synchronization needed.
7. GPU Implementation
7.1 Batched Banded Solver
For GPU efficiency, we solve many independent banded systems in parallel:
__global__ void banded_lu_batched(
real* AB, // Packed band matrices (ldab × n × batch_size)
idx* ipiv, // Pivot arrays (n × batch_size)
idx n, idx kl, idx ku, idx ldab, idx batch_size)
{
idx batch_idx = blockIdx.x * blockDim.x + threadIdx.x;
if (batch_idx >= batch_size) return;
// Each thread handles one complete system
real* ab = AB + batch_idx * ldab * n;
idx* piv = ipiv + batch_idx * n;
idx kv = kl + ku;
for (idx j = 0; j < n; ++j) {
// Find pivot (sequential within thread)
idx pivot = j;
real max_val = fabs(ab[kv + j * ldab]);
for (idx i = j + 1; i <= min(j + kl, n - 1); ++i) {
real val = fabs(ab[kv + i - j + j * ldab]);
if (val > max_val) {
max_val = val;
pivot = i;
}
}
piv[j] = pivot;
// Swap and eliminate (same as CPU, but within thread)
// ...
}
}
7.2 When GPU is Beneficial
| Scenario | CPU | GPU | Winner |
|---|---|---|---|
| Single small system ( \(n < 1000\)) | 10 μs | 50 μs (launch overhead) | CPU |
| Single large system ( \(n > 10000\)) | 1 ms | 0.5 ms | GPU |
| 1000 small systems | 10 ms | 0.1 ms | GPU |
| 1000 large systems | 1 s | 50 ms | GPU |
The key insight: GPU excels at batched solves, which is exactly the pattern in radiative transfer (one system per spectral band or grid column).
8. Application: Radiative Transfer
8.1 Two-Stream Approximation
The radiative transfer equation for diffuse radiation:
\[\mu \frac{dI}{d\tau} = I - S\]
In the two-stream approximation (up/down fluxes \(F^+\), \(F^-\)), discretized over \(N\) atmospheric layers:
\[\begin{pmatrix}
T_1 & R_1 & & \\
R_1 & T_1 & R_2 & \\
& \ddots & \ddots & \ddots \\
& & R_{N-1} & T_N
\end{pmatrix}
\begin{pmatrix} F^+_1 \\ F^-_1 \\ \vdots \\ F^-_N \end{pmatrix}
=
\begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_N \end{pmatrix}\]
where:
- \(T_i\) = transmission matrix for layer \(i\)
- \(R_i\) = reflection matrix coupling adjacent layers
- \(S_i\) = source terms (direct solar, thermal emission)
Matrix structure: Block tridiagonal with \(2 \times 2\) blocks is banded with \(k_l = k_u \leq 3\) after flattening to scalar form; exploiting the zero structure of the blocks (or a suitable ordering of the unknowns) typically reduces this to pentadiagonal (\(k_l = k_u = 2\)).
8.2 TUVX Photolysis Rates
TUVX (Tropospheric Ultraviolet-Visible) computes photolysis rate coefficients:
\[J_i = \int_\lambda \sigma_i(\lambda) \phi_i(\lambda) F(\lambda) \, d\lambda\]
where \(F(\lambda)\) is the actinic flux from the radiative transfer solve.
Computational pattern:
- For each wavelength band (~150 bands)
- For each atmospheric column (millions in global models)
- Solve tridiagonal/pentadiagonal system for fluxes
This is ideal for batched banded solvers!
8.3 Performance Considerations for NCAR Derecho
NCAR's Derecho supercomputer specifications:
- CPU: AMD EPYC Milan (128 cores/node)
- GPU: NVIDIA A100 (if using GPU partition)
- Memory bandwidth: ~400 GB/s per node
Optimization for Derecho:
# Build with AMD-specific optimizations
cmake .. -DCMAKE_CXX_FLAGS="-O3 -march=znver3 -fopenmp -ffast-math"
Key flags:
- -march=znver3: Zen 3 (Milan) specific instructions, including AVX2
- -fopenmp: Enable OpenMP for multi-RHS parallelization
- -ffast-math: Allow floating-point reassociation for SIMD; this changes rounding and relaxes NaN/Inf handling, so validate results against a strict-math build before enabling it
Expected performance:
- Tridiagonal ( \(n=100\) layers): ~0.5 μs per system
- Pentadiagonal ( \(n=100\) layers): ~1 μs per system
- Throughput: ~2M systems/second per core
9. Summary and Best Practices
9.1 Algorithm Selection Guide
| System Type | Recommended Solver |
|---|---|
| Tridiagonal, single | Thomas algorithm |
| Tridiagonal, batched | GPU batched Thomas |
| General banded, single | banded_solve() |
| General banded, reuse matrix | banded_lu() + banded_lu_solve() |
| General banded, multi-RHS | banded_lu_solve_multi() |
| Very large single system | Consider iterative (CG with banded preconditioner) |
9.2 Numerical Robustness Checklist
- Check for diagonal dominance — if satisfied, pivoting is unnecessary
- Estimate condition number — use banded_rcond() after factorization
- Monitor pivot size — small pivots indicate near-singularity
- Use iterative refinement for ill-conditioned systems
9.3 Performance Optimization Checklist
- Factor once, solve many — amortize factorization cost
- Batch independent systems — exploit parallelism
- Align memory — 64-byte alignment for AVX-512
- Profile memory bandwidth — banded solvers are often memory-bound
- Consider mixed precision — factor in double, solve in single if accuracy permits
9.4 Code References
| Function | Purpose | Complexity |
|---|---|---|
| BandedMatrix(n, kl, ku) | Construct band storage | \(O(n \cdot \text{ldab})\) |
| banded_lu(A, ipiv) | LU factorization | \(O(n \cdot k_l \cdot (k_l + k_u))\) |
| banded_lu_solve(A, ipiv, b) | Solve with factored matrix | \(O(n \cdot (k_l + k_u))\) |
| banded_lu_solve_multi(A, ipiv, B, nrhs) | Multiple RHS solve | \(O(n \cdot (k_l + k_u) \cdot \text{nrhs})\) |
| banded_solve(A, b, x) | One-shot solve | Factor + Solve |
| banded_matvec(A, x, y) | Matrix-vector product | \(O(n \cdot (k_l + k_u))\) |
| banded_norm1(A) | 1-norm | \(O(n \cdot (k_l + k_u))\) |
| banded_rcond(A, ipiv, anorm) | Condition estimate | \(O(n \cdot (k_l + k_u))\) |
Appendix A: Derivation of Fill-in Bound
Claim: With partial pivoting, \(U\) has at most \(k_l + k_u\) super-diagonals.
Proof:
Consider eliminating column \(j\). The pivot row \(p\) satisfies \(j \leq p \leq j + k_l\).
After swapping row \(p\) into row \(j\):
- Original row \(p\) had nonzeros in columns \(\max(0, p - k_l)\) to \(\min(n-1, p + k_u)\)
- The rightmost nonzero is at column \(p + k_u \leq (j + k_l) + k_u = j + k_l + k_u\)
Since we're in row \(j\) (the pivot row), this creates fill-in up to column \(j + k_l + k_u\).
Thus \(U\) has upper bandwidth \(k_l + k_u\). \(\square\)
Appendix B: Diagonal Dominance Preserved Under Elimination
Claim: If \(A\) is strictly diagonally dominant, so is the Schur complement during elimination.
Proof (first elimination step; induction over the successive Schur complements handles the rest):
After eliminating column 0, the updated entries of row 1 are \(\tilde{a}_{1k} = a_{1k} - \frac{a_{10}}{a_{00}} a_{0k}\) for \(k \geq 1\), and \(\tilde{a}_{10} = 0\).
Lower bound on the new diagonal (triangle inequality):
\[|\tilde{a}_{11}| \geq |a_{11}| - \frac{|a_{10}|}{|a_{00}|} |a_{01}|\]
Upper bound on the off-diagonals (\(k \geq 2\)):
\[|\tilde{a}_{1k}| \leq |a_{1k}| + \frac{|a_{10}|}{|a_{00}|} |a_{0k}|\]
Summing over \(k \geq 2\) and applying strict dominance of rows 1 and 0:
\[\sum_{k \geq 2} |\tilde{a}_{1k}| \leq \sum_{k \geq 2} |a_{1k}| + \frac{|a_{10}|}{|a_{00}|} \sum_{k \geq 2} |a_{0k}| < \big(|a_{11}| - |a_{10}|\big) + \frac{|a_{10}|}{|a_{00}|} \big(|a_{00}| - |a_{01}|\big) = |a_{11}| - \frac{|a_{10}|}{|a_{00}|} |a_{01}| \leq |\tilde{a}_{11}|\]
So row 1 of the Schur complement is again strictly diagonally dominant, and the same argument applies to every remaining row. \(\square\)
This explains why Thomas algorithm (no pivoting) is stable for diagonally dominant tridiagonal systems.