Digital computers represent real numbers using a finite number of bits, leading to a discrete and non-uniform sampling of the real line. Understanding the limits of this representation is critical for designing robust numerical algorithms.
Floating Point Representation
Most modern systems adhere to the IEEE 754 standard, which represents numbers in a scientific-like notation. Normalized numbers take the form \(x = (-1)^s \times (1.f) \times 2^e\).
s: Sign bit, determining whether the number is positive or negative.
f: Fractional significand (mantissa); the leading 1 is implicit to maximize precision.
e: Biased exponent, allowing both very large and very small numbers to be represented.
IEEE 754 Standards.
Single (32-bit): 1 sign, 8 exponent, 23 fraction bits.
Double (64-bit): 1 sign, 11 exponent, 52 fraction bits.
16-bit float. 1 sign, 5 exp (bias 15), 10 frac. Represent \(-3.25\):
\(s=1\).
\(3.25_{10} = 11.01_2 = 1.101_2 \times 2^1\).
\(e = 1+15 = 16 = 10000_2\).
\(f = 1010000000_2\).
Result: 1 10000 1010000000.
Encode \(+5.5\) in 16-bit format.
How many distinct normalized numbers for 5 exp bits, 10 frac bits?
Machine epsilon is the smallest positive number that, when added to 1, yields a result different from 1. It characterizes the “gap” between representable numbers and is given by \(\varepsilon_{\text{mach}} = 2^{1-p}\), where \(p\) is the number of bits in the significand.
- For double precision (64-bit), \(\varepsilon_{\text{mach}} \approx 2.22 \times 10^{-16}\).
Python: Compare 1.0 + 1e-16 and 1.0 + 2e-16.
Find \(\varepsilon_{\text{mach}}\) empirically via loop.
Max relative rounding error: \(u = \varepsilon_{\text{mach}}/2 = 2^{-p}\). \[
\begin{align}
\frac{|\text{fl}(x) - x|}{|x|} \leq u.
\end{align}
\]
IEEE 754 guarantees \(\text{fl}(a \circ b) = (a \circ b)(1 + \delta)\) for \(\circ \in \{+, -, \times, \div\}\) and \(|\delta| \leq u\). Individual operations are as accurate as the representation allows.
The real line near \(x\) is partitioned into intervals, each containing exactly one floating point number. In round-to-nearest mode, \(\text{fl}(x)\) is the closer endpoint, so: \[
\begin{align}
|\text{fl}(x) - x| \leq \frac{1}{2} \text{gap} = \frac{1}{2} (\varepsilon_{\text{mach}} |x|) = u|x|.
\end{align}
\] Dividing by \(|x|\) gives the relative error bound \(|\delta| \leq u\). For any operation \(\circ\), IEEE 754 mandates computing the exact result first, then rounding, ensuring: \[
\begin{align}
\text{fl}(a \circ b) = \text{round}(a \circ b) = (a \circ b)(1 + \delta).
\end{align}
\]
Warning: After \(n\) operations, error grows as \(O(nu)\). Operation order matters because each intermediate result is rounded.
Let \(a=1, b=-0.999...\). Is error in \(\text{fl}(a+b)\) bounded by \(u\)?
Estimate worst-case relative error for \(10^6\) sequential additions.
IEEE 754 Special Values
Signed Zero: \(+0, -0\). \(1/(+0) = \infty\), \(1/(-0) = -\infty\).
Subnormals: Fill the gap between 0 and the smallest normalized number. Reduced precision, allows gradual underflow.
Infinities: \(\pm\infty\). Result of overflow or \(x/0\).
NaN: Not a Number (\(0/0\), \(\infty-\infty\)). Unequal to everything, including itself.
Python: 1.0/0.0 vs 1.0/-0.0.
Result of float('inf') - float('inf') and float('nan') == float('nan').
Arithmetic Pitfalls
Because floating-point numbers are discrete, the standard laws of algebra do not always hold. In particular, operations are neither associative nor distributive in general.
Non-Associativity: The order of operations can change the result, i.e., \((a+b)+c \neq a+(b+c)\).
Non-Distributivity: \(a(b+c) \neq ab + ac\).
Absorption: If \(|a|\) is much larger than \(|b|\), then \(a+b\) may equal \(a\) exactly because \(b\) is smaller than the gap between representable numbers near \(a\).
a=1e16, b=1. Find (a+b)-a in Python. Explain result.
Find \(a, b, c\) where a*(b+c) != a*b + a*c.
Quantifying Approximation Error
Absolute and Relative Errors
\(\sqrt{2} \approx 1.414\). Errors?
Relative error of product \(\hat{x}\hat{y}\)? (Sum of individual relative errors).
The most dangerous error in numerical computing is catastrophic cancellation. This occurs when two nearly equal numbers are subtracted, causing the most significant digits to cancel out and leaving only the rounding errors as the “leading” digits.
Relative error bound: \(\varepsilon_{\text{rel}} \leq u \cdot \frac{x+y}{x-y}\).
As \(x \to y\), the amplification factor \(\frac{x+y}{x-y}\) tends to infinity.
Let \(\hat{x} = x(1+\varepsilon_x)\) and \(\hat{y} = y(1+\varepsilon_y)\) with \(|\varepsilon_x|, |\varepsilon_y| \leq u\). The error in the subtraction is: \[
\begin{align}
(\hat{x}-\hat{y})-(x-y) &= x(1+\varepsilon_x) - y(1+\varepsilon_y) - x + y \\
&= x\varepsilon_x - y\varepsilon_y.
\end{align}
\] The relative error is bounded by: \[
\begin{align}
\frac{|x\varepsilon_x - y\varepsilon_y|}{x-y} &\leq \frac{x|\varepsilon_x| + y|\varepsilon_y|}{x-y} \\
&\leq \frac{xu + yu}{x-y} \\
&= u \cdot \frac{x+y}{x-y}.
\end{align}
\] As \(x \to y\), the denominator \(x-y \to 0\), causing the relative error to explode.
Numerical Stability Principle: Rounding error is an inevitable consequence of finite precision. However, catastrophic cancellation is an algorithmic flaw that can often be avoided by mathematically rearranging the problem.
Stabilizing Cancellation. \(f(x) = \sqrt{x+1} - \sqrt{x}\) for \(x = 10^8\).
Stabilize:
\((1-\cos x)/x^2\) for small \(x\).
Smaller root of \(x^2 - 10^8 x + 1 = 0\) via quadratic formula.
\(e^x - 1\) for small \(x\) (Why use np.expm1?).
Error Analysis Frameworks
Forward Error Analysis: Tracks the accumulation of error from inputs to outputs.
Backward Error Analysis: Determines the input perturbation that would make the computed result exact.
Backward Stability: An algorithm is backward stable if the backward error is \(O(u)\), where \(u\) is the unit roundoff.
Inherent sensitivity of the problem. \(\text{Forward Error} \leq \kappa \cdot \text{Backward Error}\).
\(\kappa(A) = 10^8, \text{Backward Error} = 10^{-15}\). Max forward error?
Prove BE for linear systems: \((\bA + \bE)\hat{\bx} = \bb\) for \(\bE = -\br\hat{\bx}^T/\|\hat{\bx}\|^2\).
(1): Substituting \(\bE\) into \((\bA + \bE)\hat{\bx}\): \[
\begin{align}
(\bA + \bE)\hat{\bx} &= \bA\hat{\bx} + \bE\hat{\bx} \\
&= \bA\hat{\bx} - \frac{\br\hat{\bx}^T}{\|\hat{\bx}\|^2} \hat{\bx} \\
&= \bA\hat{\bx} - \br \frac{\hat{\bx}^T\hat{\bx}}{\|\hat{\bx}\|^2} \\
&= \bA\hat{\bx} - \br.
\end{align}
\] Since \(\br = \bb - \bA\hat{\bx}\), we have \(\bA\hat{\bx} - (\bb - \bA\hat{\bx}) = \bb\).
(2): For the relative error, using the rank-1 norm property \(\|\br\bv^T\|_2 = \|\br\|_2\|\bv\|_2\): \[
\begin{align}
\|\bE\|_2 &= \left\| \frac{\br\hat{\bx}^T}{\|\hat{\bx}\|^2} \right\|_2 \\
&= \frac{\|\br\|_2 \|\hat{\bx}\|_2}{\|\hat{\bx}\|^2} = \frac{\|\br\|_2}{\|\hat{\bx}\|_2}.
\end{align}
\] Dividing by \(\|\bA\|_2\) gives the result: \[
\begin{align}
\frac{\|\bE\|_2}{\|\bA\|_2} = \frac{\|\br\|_2}{\|\bA\|_2 \|\hat{\bx}\|_2}.
\end{align}
\]