7 Floating Point Arithmetic

Digital computers represent real numbers using a finite number of bits. The result is not the real line, but a finite, non-uniform grid of representable numbers. Numerical analysis begins with this mismatch: the mathematical algorithm acts on real numbers, while the computer executes a nearby algorithm on rounded numbers.

Floating-point arithmetic is usually accurate locally: one elementary operation is rounded to a nearby representable number. Instability appears when an algorithm repeatedly amplifies these tiny local errors, or when it subtracts nearly equal quantities and exposes digits that were already contaminated by rounding.

7.1 Floating Point Representation

Most modern systems adhere to the IEEE 754 standard, which represents numbers in a binary analogue of scientific notation. Normalized numbers take the form \(x = (-1)^s \times (1.f)_2 \times 2^{e-\text{bias}}\).

s: Sign bit, determining whether the number is positive or negative.
f: Fractional significand (mantissa); the leading 1 is implicit, which gains one bit of precision.
e: Biased exponent, stored as an unsigned integer; subtracting the bias gives the true exponent, so both very large and very small numbers can be represented.

The exponent controls scale and the significand controls precision. This means floating-point numbers have roughly fixed relative spacing, not fixed absolute spacing. There are many representable numbers near zero and far fewer, in absolute terms, near \(10^{100}\). The gap between adjacent numbers near \(x\) is proportional to \(|x|\).

Remark

(IEEE 754 Standards)

Single (32-bit): 1 sign, 8 exponent, 23 fraction bits.
Double (64-bit): 1 sign, 11 exponent, 52 fraction bits.

Example

(16-bit float) 1 sign, 5 exp (bias 15), 10 frac. Represent \(-3.25\):

\(s=1\).
\(3.25_{10} = 11.01_2 = 1.101_2 \times 2^1\).
\(e = 1+15 = 16 = 10000_2\).
\(f = 1010000000_2\).

Result: 1 10000 1010000000.
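The worked encoding above can be checked mechanically. The following is a minimal sketch that relies on the half-precision format code `e` of Python's `struct` module; the helper name `half_precision_bits` is hypothetical and chosen only for this illustration.

```python
import struct

def half_precision_bits(x: float) -> str:
    """Return the IEEE 754 binary16 encoding of x as 'sign exponent fraction'."""
    # Pack x as a big-endian half-precision float, then reread the 2 bytes
    # as an unsigned 16-bit integer so the raw bit pattern can be printed.
    (raw,) = struct.unpack(">H", struct.pack(">e", x))
    bits = format(raw, "016b")
    # Split into 1 sign bit, 5 exponent bits, 10 fraction bits.
    return f"{bits[0]} {bits[1:6]} {bits[6:]}"

print(half_precision_bits(-3.25))  # 1 10000 1010000000
```

The same helper can be used to check hand encodings of other values.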

Exercise

Encode \(+5.5\) in 16-bit format.
How many distinct normalized numbers for 5 exp bits, 10 frac bits?

Definition: Machine Epsilon (\(\varepsilon_{\text{mach}}\))

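Machine epsilon is the distance from \(1\) to the next larger floating-point number. It characterizes the spacing of representable numbers near \(1\) and is given by \(\varepsilon_{\text{mach}} = 2^{1-p}\), where \(p\) is the number of bits in the significand, counting the implicit leading bit (\(p = 53\) in double precision). Some texts instead define it as the smallest positive floating-point number \(\varepsilon\) such that \(\text{fl}(1+\varepsilon) > 1\). Under round-to-nearest that quantity is only slightly larger than \(\varepsilon_{\text{mach}}/2\), so the two definitions are not equivalent.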

For double precision (64-bit), \(\varepsilon_{\text{mach}} \approx 2.22 \times 10^{-16}\).

Remark

(Spacing rule) Near a normalized number \(x\), the distance to the next representable number is approximately \(\varepsilon_{\text{mach}}|x|\). Near \(10^{16}\), the gap is on the order of \(1\); near \(1\), the gap is on the order of \(10^{-16}\). This is why adding a small number to a huge number may do nothing.

Theorem: Absorption Criterion

In round-to-nearest arithmetic, if \(|y|\) is less than half the spacing between adjacent floating-point numbers near \(x\), then \[ \begin{align} \operatorname{fl}(x+y)=x. \end{align} \] Equivalently, small increments disappear when they are below the local resolution of the floating-point grid.
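The spacing rule and the absorption criterion can be observed directly. This is a small sketch, assuming Python 3.9 or later for `math.ulp`; the helper `is_absorbed` and the sample values are illustrative choices, and only positive increments are tested.

```python
import math

def is_absorbed(x: float, y: float) -> bool:
    """True if adding y to x leaves x unchanged in floating-point arithmetic."""
    return x + y == x

for x in (3.0, 1e8, 1e20):
    gap = math.ulp(x)      # distance from x to the next larger double
    small = 0.25 * gap     # below half the gap: predicted to vanish
    large = 0.75 * gap     # above half the gap: predicted to survive
    print(f"x={x:g}  gap={gap:.3e}  "
          f"absorbs 0.25*gap: {is_absorbed(x, small)}  "
          f"absorbs 0.75*gap: {is_absorbed(x, large)}")
```

The printed gaps grow in proportion to \(|x|\), and in each row only the increment below half the gap is absorbed.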

Exercise

Use the result above to predict whether 1.0 + 1e-16 and 1.0 + 2e-16 differ from 1.0; then check in Python.
Find \(\varepsilon_{\text{mach}}\) empirically via loop.
Use the result above to predict the smallest power of \(10\) for which 1e16 + 10**k changes 1e16.
Explain why the answer changes if 1e16 is replaced by 1.0.

Definition: Unit Roundoff (\(u\))

Unit roundoff is the maximum relative error caused by rounding one real number to the nearest floating-point number: \[ \begin{align} \frac{|\text{fl}(x) - x|}{|x|} \leq u. \end{align} \] For round-to-nearest arithmetic, \(u = \varepsilon_{\text{mach}}/2 = 2^{-p}\).
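The bound can be checked exactly with rational arithmetic. The sketch below assumes double precision (\(p = 53\)) and uses `fractions.Fraction`, which converts a float to the exact rational number it stores; the helper name and the sample fractions are illustrative.

```python
from fractions import Fraction

P = 53                   # significand bits in double precision (assumed)
U = Fraction(1, 2**P)    # unit roundoff u = 2^-p

def rounding_relative_error(numerator: int, denominator: int) -> Fraction:
    """Exact relative error of storing numerator/denominator as a double."""
    exact = Fraction(numerator, denominator)
    stored = Fraction(numerator / denominator)  # exact value of the rounded double
    return abs(stored - exact) / abs(exact)

for n, d in [(1, 10), (1, 3), (2, 7)]:
    err = rounding_relative_error(n, d)
    print(f"{n}/{d}: relative error = {float(err / U):.3f} u")
```

Each printed ratio is at most \(1\), that is, the relative rounding error does not exceed \(u\).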

Remark

(Machine epsilon vs. unit roundoff) \(\varepsilon_{\text{mach}}\) is a spacing near \(1\). The unit roundoff \(u\) is an error bound for rounding to nearest. Many texts use “machine epsilon” for both quantities, so always check the convention. In these notes, elementary rounding errors are bounded by \(u\).

Theorem: Fundamental Axiom

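For floating-point numbers \(a\) and \(b\), and provided the result neither overflows nor underflows, IEEE 754 guarantees \(\text{fl}(a \circ b) = (a \circ b)(1 + \delta)\) for \(\circ \in \{+, -, \times, \div\}\) and \(|\delta| \leq u\). Individual operations are as accurate as the representation allows.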

Proof

Every real number \(x\) in the normalized range lies between two adjacent floating-point numbers, whose gap is at most \(\varepsilon_{\text{mach}}|x|\). In round-to-nearest mode, \(\text{fl}(x)\) is the closer of the two, so: \[ \begin{align} |\text{fl}(x) - x| \leq \frac{1}{2} \text{gap} \leq \frac{1}{2} (\varepsilon_{\text{mach}} |x|) = u|x|. \end{align} \] Dividing by \(|x|\) gives the relative error bound \(|\delta| \leq u\). For any operation \(\circ\), IEEE 754 requires the result to be the same as if the exact value were computed first and then rounded, ensuring: \[ \begin{align} \text{fl}(a \circ b) = \text{round}(a \circ b) = (a \circ b)(1 + \delta). \end{align} \]
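The bound \(|\delta| \leq u\) can be verified for individual operations by comparing the floating-point result with the exact rational result. This is a sketch for double precision; the operands `0.1` and `0.7` are arbitrary, and `relative_rounding_error` is a name introduced here.

```python
from fractions import Fraction
import operator

U = Fraction(1, 2**53)   # unit roundoff for double precision

def relative_rounding_error(op, a: float, b: float) -> Fraction:
    """Exact |delta| in fl(a op b) = (a op b)(1 + delta) for doubles a, b."""
    exact = op(Fraction(a), Fraction(b))   # exact rational result
    computed = Fraction(op(a, b))          # exact value of the rounded result
    return abs(computed - exact) / abs(exact)

a, b = 0.1, 0.7
for name, op in [("+", operator.add), ("-", operator.sub),
                 ("*", operator.mul), ("/", operator.truediv)]:
    delta = relative_rounding_error(op, a, b)
    print(f"{name}: |delta| = {float(delta / U):.3f} u, within bound: {delta <= U}")
```

Note that the inputs here are the stored doubles, not the decimal numbers 0.1 and 0.7; the axiom concerns only the rounding of the operation itself.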

Remark

Warning: After \(n\) operations, error often grows like \(O(nu)\) when errors do not systematically amplify. This is a rule of thumb, not a guarantee. Operation order matters because each intermediate result is rounded.
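As a rough illustration of error accumulation, the sketch below adds the same term repeatedly in an explicit loop and measures the exact relative error of the result in units of \(u\). The term `0.1` and the count `n` are arbitrary choices for the example.

```python
from fractions import Fraction

U = Fraction(1, 2**53)   # unit roundoff for double precision

def accumulated_relative_error(term: float, n: int) -> Fraction:
    """Exact relative error after adding `term` to a running total n times."""
    total = 0.0
    for _ in range(n):
        total += term                 # each addition is rounded
    exact = n * Fraction(term)        # exact sum of the stored double
    return abs(Fraction(total) - exact) / exact

n = 100_000
err = accumulated_relative_error(0.1, n)
print(f"observed error = {float(err / U):.1f} u, compared with n*u = {n} u")
```

The observed error is typically far below \(nu\), because individual rounding errors partly cancel rather than all pushing in the same direction.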

Exercise

Let \(a=1, b=-0.999...\). Is error in \(\text{fl}(a+b)\) bounded by \(u\)?
Estimate worst-case relative error for \(10^6\) sequential additions.

7.2 IEEE 754 Special Values

Signed Zero: \(+0, -0\). \(1/(+0) = \infty\), \(1/(-0) = -\infty\).
Subnormals: Fill the gap between 0 and the smallest normalized number. Reduced precision, allows gradual underflow.
Infinities: \(\pm\infty\). Result of overflow or \(x/0\) with \(x \neq 0\).
NaN: Not a Number (\(0/0\), \(\infty-\infty\)). Unequal to everything, including itself.

Remark

(NaN propagation) A NaN usually contaminates all later arithmetic. This is useful for debugging: it marks an invalid operation that happened earlier, even if the failure is observed much later.
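A small sketch of this behaviour: one NaN in the input contaminates every output of a later computation, and `math.isnan` locates the original culprit. The function `normalize` is a hypothetical pipeline step used only for illustration.

```python
import math

def normalize(values):
    """Divide each value by the total (a hypothetical downstream step)."""
    total = sum(values)
    return [v / total for v in values]

# The third entry stands for the result of an earlier invalid operation.
data = [1.0, 2.0, float("nan"), 4.0]

result = normalize(data)
print(result)                                  # every entry is nan
print([i for i, v in enumerate(data) if math.isnan(v)])   # [2]
```

Checking inputs with `math.isnan` at the point where data enter a computation usually localizes the failure faster than inspecting the final output.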

Exercise

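Python: 1.0/0.0 vs 1.0/-0.0. Built-in Python floats raise ZeroDivisionError here instead of returning an infinity, so repeat the experiment with NumPy floats, e.g. np.float64(1.0)/0.0.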
Result of float('inf') - float('inf') and float('nan') == float('nan').

7.3 Arithmetic Pitfalls

Because floating-point numbers are discrete, the standard laws of algebra do not always hold. In particular, operations are neither associative nor distributive in general.

Non-Associativity: The order of operations can change the result, i.e., \((a+b)+c\) and \(a+(b+c)\) may differ.
Non-Distributivity: \(a(b+c)\) and \(ab + ac\) may differ.
Absorption: If \(|a|\) is much larger than \(|b|\), then \(a+b\) may equal \(a\) exactly because \(b\) is smaller than the gap between representable numbers near \(a\).

Example

(Absorption) In double precision, 1e16 + 1.0 == 1e16 is true. The number \(1\) is smaller than the spacing between adjacent representable numbers near \(10^{16}\), so it is rounded away. Consequently (1e16 + 1.0) - 1e16 returns 0.0, not 1.0.

Example

(Non-associativity) Let \(a=10^{16}\), \(b=-10^{16}\), and \(c=1\). Then \[ \begin{align} (a+b)+c = 1, \qquad a+(b+c) = 0 \end{align} \] in double precision. The second expression first rounds \(b+c\) back to \(-10^{16}\), so the \(1\) is lost before the final addition.

Remark

(Summation rule) Add numbers of similar magnitude when possible. Summing from smallest to largest, pairwise summation, or compensated summation reduces the loss of small terms. This is why library reductions may give different answers than a hand-written loop while still being more accurate.
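Compensated (Kahan) summation, mentioned above, can be sketched in a few lines. This is one common formulation, not a library implementation; the naive version uses an explicit loop because newer Python versions may already compensate inside the built-in `sum`, and `math.fsum` serves as the reference.

```python
import math

def naive_sum(values):
    """Left-to-right summation with one rounding per addition."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Compensated summation: carry the rounding error of each addition."""
    total = 0.0
    compensation = 0.0                    # running estimate of lost low-order bits
    for v in values:
        y = v - compensation              # correct the next term
        t = total + y                     # low-order bits of y may be lost here
        compensation = (t - total) - y    # recover what was lost
        total = t
    return total

values = [0.1] * 1_000_000
reference = math.fsum(values)             # correctly rounded sum of the stored doubles
print("naive error:", abs(naive_sum(values) - reference))
print("kahan error:", abs(kahan_sum(values) - reference))
```

The compensation variable costs a few extra operations per term and makes the error essentially independent of the number of terms.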

Exercise

Before running Python, use the result above to predict (1e16 + 1.0) - 1e16.
a=1e16, b=1. Find (a+b)-a in Python. Explain result.
Find \(a, b, c\) where a*(b+c) != a*b + a*c.
Before running Python, predict the results of summing \([10^{16}, 1, -10^{16}]\) in the listed order and in the order \([10^{16}, -10^{16}, 1]\).
Compare naive summation, pairwise summation, and Kahan summation on a list containing one large number and many small numbers. Which method best preserves the small terms?

8 Quantifying Approximation Error

8.1 Absolute and Relative Errors

Absolute Error: \(|\hat{x} - x|\).
Relative Error: \(\varepsilon = \frac{|\hat{x} - x|}{|x|}\). This is the standard measure for precision in numerical computing.

Absolute error measures the size of the mistake in the original units. Relative error measures the size of the mistake compared to the quantity being computed. In numerical computing, relative error is usually more informative: an error of \(10^{-6}\) is excellent for a number of size \(10^3\) and terrible for a number of size \(10^{-9}\).

Remark

(Comparing floating-point numbers) Avoid testing computed real numbers with exact equality. Use a tolerance such as \[ \begin{align} |\hat{x}-x| \leq \text{atol} + \text{rtol}|x|. \end{align} \] This is the logic behind np.isclose. Absolute tolerance handles values near zero; relative tolerance handles scale.
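The tolerance test above translates directly into code. This sketch implements the displayed inequality; the function name `is_close` and the default tolerances are illustrative choices, not NumPy's.

```python
def is_close(x_hat: float, x: float, rtol: float = 1e-9, atol: float = 0.0) -> bool:
    """Return True if |x_hat - x| <= atol + rtol * |x|."""
    return abs(x_hat - x) <= atol + rtol * abs(x)

print(0.1 + 0.2 == 0.3)                    # False: the two sides round differently
print(is_close(0.1 + 0.2, 0.3))            # True: equal up to relative tolerance
print(is_close(1e-12, 0.0))                # False: rtol * |x| is zero when x = 0
print(is_close(1e-12, 0.0, atol=1e-9))     # True: atol handles values near zero
```

The last two lines show why a purely relative test fails when the reference value is zero.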

Exercise

\(\sqrt{2} \approx 1.414\). Errors?
Relative error of product \(\hat{x}\hat{y}\)? (To first order, the sum of the individual relative errors.)

Definition: Catastrophic Cancellation

The most dangerous error in numerical computing is catastrophic cancellation. This occurs when two nearly equal numbers are subtracted, causing the most significant digits to cancel out and leaving only the rounding errors as the “leading” digits.

Relative error bound: for \(x \neq y\), \(\varepsilon_{\text{rel}} \leq u \cdot \frac{|x|+|y|}{|x-y|}\).
As \(x \to y\) with \(y \neq 0\), the amplification factor \(\frac{|x|+|y|}{|x-y|}\) tends to infinity.

Proof

Let \(\hat{x} = x(1+\varepsilon_x)\) and \(\hat{y} = y(1+\varepsilon_y)\) with \(|\varepsilon_x|, |\varepsilon_y| \leq u\). The error in the subtraction is: \[ \begin{align} (\hat{x}-\hat{y})-(x-y) &= x(1+\varepsilon_x) - y(1+\varepsilon_y) - x + y \\ &= x\varepsilon_x - y\varepsilon_y. \end{align} \] The relative error is bounded by: \[ \begin{align} \frac{|x\varepsilon_x - y\varepsilon_y|}{|x-y|} &\leq \frac{|x||\varepsilon_x| + |y||\varepsilon_y|}{|x-y|} \\ &\leq \frac{|x|u + |y|u}{|x-y|} \\ &= u \cdot \frac{|x|+|y|}{|x-y|}. \end{align} \] As \(x \to y\), the denominator \(|x-y| \to 0\) while the numerator stays near \(2|x|\), so the bound grows without limit. The subtraction itself introduces little new error; it exposes the errors already present in \(\hat{x}\) and \(\hat{y}\).

Remark

Numerical Stability Principle: Rounding error is an inevitable consequence of finite precision. However, catastrophic cancellation is an algorithmic flaw that can often be avoided by mathematically rearranging the problem.

Example

(Stabilizing Cancellation) \(f(x) = \sqrt{x+1} - \sqrt{x}\) for \(x = 10^8\).

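Direct: \(10000.00005 - 10000\). The leading digits cancel, so in double precision roughly half of the significant digits are lost; in single precision \(x+1\) rounds to \(x\) and the result is exactly \(0\) (Catastrophic).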
Stable: \(\frac{1}{\sqrt{x+1} + \sqrt{x}} \approx 5 \times 10^{-5}\).
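The two formulas can be compared against a high-precision reference. The sketch below uses the standard `decimal` module with 50 digits as the reference; the function names are introduced here for illustration.

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50   # working precision of the reference computation

def f_direct(x: float) -> float:
    return math.sqrt(x + 1.0) - math.sqrt(x)            # subtracts nearly equal numbers

def f_stable(x: float) -> float:
    return 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))    # algebraically identical, no cancellation

def f_reference(x: float) -> Decimal:
    d = Decimal(x)
    return (d + 1).sqrt() - d.sqrt()

def relative_error(value: float, x: float) -> float:
    ref = f_reference(x)
    return float(abs(Decimal(value) - ref) / ref)

x = 1e8
for name, f in [("direct", f_direct), ("stable", f_stable)]:
    value = f(x)
    print(f"{name}: {value!r}  relative error = {relative_error(value, x):.1e}")
```

In double precision the direct formula keeps only about half of its digits, while the rearranged formula is accurate to within a few units of roundoff.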

Example

(Quadratic formula) The roots of \(ax^2+bx+c=0\) are \[ \begin{align} x_{\pm} = \frac{-b \pm \sqrt{b^2-4ac}}{2a}. \end{align} \] If \(b^2 \gg 4ac\), one numerator subtracts nearly equal numbers. A stable strategy computes the root with no cancellation first, then uses \(x_1x_2=c/a\) to recover the other root.
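The stable strategy can be sketched as follows. This is a minimal version that assumes real coefficients with \(a \neq 0\) and \(b^2 > 4ac\); the function names and the sample coefficients are chosen for illustration.

```python
import math

def quadratic_roots_stable(a: float, b: float, c: float):
    """Real roots of a*x^2 + b*x + c = 0, assuming a != 0 and b*b > 4*a*c."""
    sqrt_disc = math.sqrt(b * b - 4.0 * a * c)
    # b and copysign(sqrt_disc, b) have the same sign, so this sum never cancels.
    q = -0.5 * (b + math.copysign(sqrt_disc, b))
    # First root from the cancellation-free numerator, second from x1 * x2 = c / a.
    return q / a, c / q

def quadratic_roots_naive(a: float, b: float, c: float):
    """Textbook formula; one of the two numerators may cancel."""
    sqrt_disc = math.sqrt(b * b - 4.0 * a * c)
    return (-b + sqrt_disc) / (2.0 * a), (-b - sqrt_disc) / (2.0 * a)

a, b, c = 2.0, -1e7, 3.0
print("stable:", quadratic_roots_stable(a, b, c))
print("naive: ", quadratic_roots_naive(a, b, c))
```

For these coefficients the small root is close to \(3 \times 10^{-7}\); the naive formula gets only its first few digits right, while the stable version agrees to nearly full precision.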

Remark

(Use stable library functions) Functions such as log1p(x), expm1(x), and hypot(x, y) implement stable rearrangements for common cancellation and overflow patterns. Prefer them over hand-expanded formulas when the inputs may be small, large, or nearly cancelling.
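Two of these functions are illustrated below using Python's standard `math` module; the input values are arbitrary examples of a tiny argument and of an overflow-prone pair.

```python
import math

# Cancellation pattern: 1.0 + x rounds to 1.0, so log(1 + x) loses x entirely.
x = 1e-20
print(math.log(1.0 + x))    # hand-expanded formula
print(math.log1p(x))        # stable library function

# Overflow pattern: a*a exceeds the largest double although the result does not.
a, b = 3e200, 4e200
print(math.sqrt(a * a + b * b))   # hand-expanded formula
print(math.hypot(a, b))           # stable library function
```

In both cases the hand-expanded formula returns a useless value (zero or infinity), while the library function returns a result close to the true one.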

Exercise

Stabilize:

Use the result above to explain why \((1-\cos x)/x^2\) is unstable for small \(x\), then derive a stable form.
Use the result above to identify which root of \(x^2 - 10^8 x + 1 = 0\) is unstable in the quadratic formula, then compute it using \(x_1x_2=c/a\).
Use the Taylor series of \(e^x\) to explain why \(e^x - 1\) loses relative accuracy for small \(x\). Compare with np.expm1.

8.2 Error Analysis Frameworks

Forward Error Analysis: Tracks the accumulation of error from inputs to outputs.
Backward Error Analysis: Determines the input perturbation that would make the computed result exact.
Backward Stability: An algorithm is backward stable if the backward error is \(O(u)\), where \(u\) is the unit roundoff.

Forward error asks whether the computed answer is close to the exact answer. Backward error asks whether the computed answer is the exact answer to a nearby problem. Backward error is often easier to prove and more useful in practice: if the data are uncertain at the level of the backward error, the algorithm has not introduced a meaningful additional error.

Definition: Condition Number (\(\kappa\))

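The condition number measures the inherent sensitivity of the problem: how much a small relative perturbation of the input can change the output. To first order, \(\text{Forward Error} \leq \kappa \cdot \text{Backward Error}\), with both errors measured in the relative sense.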
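For a differentiable scalar function, a commonly used formula for the relative condition number is \(\kappa = |x f'(x)/f(x)|\). This formula is not stated in these notes and is assumed here. The sketch perturbs the input of \(f(x) = x - 1\) near \(x = 1\) and compares the observed relative change of the output with \(\kappa\) times the relative input perturbation.

```python
def relative_condition_number(f, dfdx, x: float) -> float:
    """kappa = |x * f'(x) / f(x)| for a differentiable scalar function f."""
    return abs(x * dfdx(x) / f(x))

def f(x: float) -> float:
    return x - 1.0            # ill-conditioned near x = 1

def dfdx(x: float) -> float:
    return 1.0

x = 1.0 + 2.0**-20            # exactly representable input close to 1
delta = 2.0**-40              # relative input perturbation (plays the role of a backward error)
x_perturbed = x * (1.0 + delta)

kappa = relative_condition_number(f, dfdx, x)
observed = abs(f(x_perturbed) - f(x)) / abs(f(x))   # relative change of the output
predicted = kappa * delta

print(f"kappa     = {kappa:.3e}")
print(f"observed  = {observed:.3e}")
print(f"predicted = {predicted:.3e}")
```

A relative input perturbation of about \(10^{-12}\) becomes a relative output change of about \(10^{-6}\): the amplification is a property of the problem, not of the arithmetic.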

Remark

(Stability vs. conditioning) Conditioning is a property of the mathematical problem. Stability is a property of the algorithm. A stable algorithm can still return a poor answer on an ill-conditioned problem, because the problem itself amplifies small perturbations.

Exercise

\(\kappa(A) = 10^8, \text{Backward Error} = 10^{-15}\). Max forward error?
Prove BE for linear systems: with residual \(\br = \bb - \bA\hat{\bx}\), show that \((\bA + \bE)\hat{\bx} = \bb\) for \(\bE = \br\hat{\bx}^T/\|\hat{\bx}\|^2\).

Proof

(1): Substituting \(\bE\) into \((\bA + \bE)\hat{\bx}\): \[ \begin{align} (\bA + \bE)\hat{\bx} &= \bA\hat{\bx} + \bE\hat{\bx} \\ &= \bA\hat{\bx} + \frac{\br\hat{\bx}^T}{\|\hat{\bx}\|^2} \hat{\bx} \\ &= \bA\hat{\bx} + \br \frac{\hat{\bx}^T\hat{\bx}}{\|\hat{\bx}\|^2} \\ &= \bA\hat{\bx} + \br. \end{align} \] Since \(\br = \bb - \bA\hat{\bx}\), we have \(\bA\hat{\bx} + (\bb - \bA\hat{\bx}) = \bb\).

(2): For the relative error, using the rank-1 norm property \(\|\br\bv^T\|_2 = \|\br\|_2\|\bv\|_2\): \[ \begin{align} \|\bE\|_2 &= \left\| \frac{\br\hat{\bx}^T}{\|\hat{\bx}\|^2} \right\|_2 \\ &= \frac{\|\br\|_2 \|\hat{\bx}\|_2}{\|\hat{\bx}\|^2} = \frac{\|\br\|_2}{\|\hat{\bx}\|_2}. \end{align} \] Dividing by \(\|\bA\|_2\) gives the result: \[ \begin{align} \frac{\|\bE\|_2}{\|\bA\|_2} = \frac{\|\br\|_2}{\|\bA\|_2 \|\hat{\bx}\|_2}. \end{align} \]
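The construction can be checked on a small example with exact rational arithmetic. In this sketch the perturbation is written out as \(\bE = (\bb - \bA\hat{\bx})\hat{\bx}^T/\|\hat{\bx}\|^2\); the \(2 \times 2\) system and the approximate solution `x_hat` are made up for illustration.

```python
from fractions import Fraction as F

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

A = [[F(2), F(1)],
     [F(1), F(3)]]
b = [F(3), F(5)]
x_hat = [F(81, 100), F(139, 100)]   # hypothetical approximate solution

# Defect b - A x_hat of the approximate solution.
defect = [b_i - ax_i for b_i, ax_i in zip(b, matvec(A, x_hat))]

# Rank-one perturbation E = defect * x_hat^T / ||x_hat||^2.
norm_sq = sum(x_i * x_i for x_i in x_hat)
E = [[d_i * x_j / norm_sq for x_j in x_hat] for d_i in defect]

# x_hat solves the perturbed system (A + E) x = b exactly.
A_plus_E = [[a_ij + e_ij for a_ij, e_ij in zip(row_a, row_e)]
            for row_a, row_e in zip(A, E)]
print(matvec(A_plus_E, x_hat) == b)   # True
```

Because the arithmetic is exact, the check confirms the algebraic identity itself rather than an approximation of it.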