Floating Point Arithmetic and Error Analysis

Read these sections, which explain error analysis when numbers are represented using a fixed number of digits. This issue mostly arises when using floating-point numbers. Real numbers are represented using a fixed number of bits. The number of bits is the precision of the representation. The accuracy of the representation is described in terms of the difference between the actual number and its representation using a fixed number of bits. This difference is the error of the representation. The accuracy becomes more significant because computations can cause the error to get so large that the result is meaningless and potentially even high risk depending on the application. These sections also explain how programming languages approach number representation.

Round-off error analysis

The fact that floating point numbers can only represent a small fraction of all real numbers, means that in practical circumstances a computation will hardly ever be exact. In this section we will study the phenomenon that most real numbers can not be represented, and what it means for the accuracy of computations. This is commonly called round-off error analysis.

Representation error

Numbers that are too large or too small to be represented are uncommon: usually computations can be arranged so that this situation will not occur. By contrast, the case that the result of a computation between computer numbers (even something as simple as a single addition) is not representable is very common. Thus, looking at the implementation of an algorithm, we need to analyze the effect of such small errors propagating through the computation. We start by analyzing the error between numbers that can be represented exactly, and those close by that can not be.
If x is a number and \tilde{x} its representation in the computer, we call x-\tilde{x} the representation error or absolute representation error, and \frac{x-\tilde{x}}{x}  the relative representation error. Often we are not interested in the sign of the error, so we may apply the terms error and relative error to \left | x-\tilde{x} \right | and \left
    | \frac{x-\tilde{x}}{x} \right |  respectively.

Often we are only interested in bounds on the error. If is a bound on the error, we will write

\tilde{x} = x\pm \epsilon \equiv_D \left | x-\tilde{x}\right| \leq \epsilon \Leftrightarrow \tilde{x} \in [x-\epsilon, x+\epsilon]

For the relative error we note that

\tilde{x}=x(1+\epsilon)\Leftrightarrow \left | \frac{\tilde{x}-x}{x} \right | \leq \epsilon

Let us consider an example in decimal arithmetic, that is \beta = 10, and with a 3-digit mantissa: t = 3. The number x=.1256 has a representation that depends on whether we round or truncate: \tilde{x}_{round} = .126, \tilde{x}_{truncate} = .125. The error is in the 4th digit: if \epsilon = x-\tilde{x} then \left
    | \epsilon \right | < \beta^t

Exercise. The number in this example had no exponent part. What are the error and relative error if there had been one?

Correct rounding

The IEEE 754 standard, mentioned in section 3.2.4, does not only declare the way a floating point number is stored, it also gives a standard for the accuracy of operations such as addition, subtraction, multiplication, division. The model for arithmetic in the standard is that of correct rounding: the result of an operation should be as if the following procedure is followed:

  • The exact result of the operation is computed, whether this is representable or not;
  • This result is then rounded to the nearest computer number.
In short: the representation of the result of an operation is the rounded exact result of that operation. (Of course, after two operations it no longer needs to hold that the computed result is the exact rounded version of the exact result.)

If this statement sounds trivial or self-evident, consider subtraction as an example. In a decimal number system with two digits in the mantissa, the computation .10 - .94 \cdot 10^{-1} = .10 - .094 = .006 = .06 \cdot 10^{-2}. Note that in an intermediate step the mantissa .094 appears, which has one more digit than the two we declared for our number system. The extra digit is called a guard digit.

Without a guard digit, this operation would have proceeded as .10 - .94 \cdot 10^{-1}, where .94 \cdot 10^{-1} would be rounded to .09, giving a final result of .01, which is almost double the correct result.

Exercise 3.5. Consider the computation .10 - .95 \cdot 10^{-1}, and assume again that numbers are rounded to fit the 2-digit mantissa. Why is this computation in a way a lot worse than the example?

One guard digit is not enough to guarantee correct rounding. An analysis that we will not reproduce here shows that three extra bits are needed.


Addition of two floating point numbers is done in a couple of steps. First the exponents are aligned: the smaller of the two numbers is written to have the same exponent as the larger number. Then the mantissas are added. Finally, the result is adjusted so that it again is a normalized number.

As an example, consider .100 + .200 \times 10^{-2}. Aligning the exponents, this becomes .100 + .002 = .102, and this result requires no final adjustment. We note that this computation was exact, but the sum .100 + .255 \times 10^{-2} has the same result, and here the computation is clearly not exact. The error is \left
    | .10255 - .102 \right | < 10^{-3}, and we note that the mantissa has 3 digits, so there clearly is a relation with the machine precision here. 

In the example .615 \times 10^1 +.398 \times 10^1 = 1.013 \times 10^2 = .101 \times 10^1 we see that after addition of the mantissas an adjustment of the exponent is needed. The error again comes from truncating or rounding the first digit of the result that does not fit in the mantissa: if x is the true sum and \tilde{x} the computed sum, then  \tilde{x} = x(1+\epsilon) where, with a 3-digit mantissa \left | \epsilon \right | < 10^{-3}.

Formally, let us consider the computation of s = x_1 + x_2, and we assume that the numbers x_i are represented as \tilde{x}_i(1+\epsilon_i). Then the sum s is represented as

\tilde{s} = (\tilde{x}_1+\tilde{x}_2)(1+\epsilon_3 \\ =x_1(1+\epsilon_1)(1+\epsilon_3) + x_2(1+\epsilon_2)(1+\epsilon_3) \\ \approx x_1 (1+\epsilon_1+\epsilon_3) + x_2(1+\epsilon_1+\epsilon_3) \\ \approx s(1+2\epsilon)

sign under the assumptions that all \epsilon_i are small and of roughly equal size, and that both x_i > 0. We see that the relative errors are added under addition.


Floating point multiplication, like addition, involves several steps. In order to multiply two numbers .m_1 \times \beta^{e_1} and .m_2 \times \beta^{e_2}, the following steps are needed.

  • The exponents are added: e \leftarrow e_1 + e_2.
  • The mantissas are multiplied: m \leftarrow m_1 \times m_2.
  • The mantissa is normalized, and the exponent adjusted accordingly.

For example: .123 \cdot 10^0 \times .567 \cdot 10^1 = .069741 \cdot 10^1 \rightarrow .69741 \cdot 10^0 \rightarrow .697 \cdot 10^0.

What happens with relative errors?


Subtraction behaves very differently from addition. Whereas in addition errors are added, giving only a gradual increase of overall roundoff error, subtraction has the potential for greatly increased error in a single operation.

For example, consider subtraction with 3 digits to the mantissa: .124 - .123 = .001 \rightarrow .100 \cdot 10^{-2}. While the resultis exact, it has only one significant digit4. To see this, consider the case where the first operand .124 is actually the  rounded result of a computation that should have resulted in .1235. In that case, the result of the subtraction should have been .050 \cdot 10^{-2}, that is, there is a 100% error, even though the relative error of the inputs was as small as could be expected. Clearly, subsequent operations involving the result of this subtraction will also be inaccurate. We conclude that subtracting almost equal numbers is a likely cause of numerical roundoff.

There are some subtleties about this example. Subtraction of almost equal numbers is exact, and we have the correct rounding behaviour of IEEE arithmetic. Still, the correctness of a single operation does not imply that a sequence of operations containing it will be accurate. While the addition example showed only modest decrease of numerical accuracy, the cancellation in this example can have disastrous effects.


From the above, the reader may got the impression that roundoff errors only lead to serious problems in exceptional circumstances. In this section we will discuss some very practical examples where the inexactness of computer arithmetic becomes visible in the result of a computation. These will be fairly simple examples; more complicated examples exist that are outside the scope of this book, such as the instability of matrix inversion.

The 'abc-formula'

As a practical example, consider the quadratic equation ax^2+bx+c=0 which has solutions x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}Suppose b > 0 and b^2\gg
    4ac  and the '+' solution will be inaccurate. In this case it is better to compute x_-=\frac{-b-\sqrt{b^2-4ac}}{2a} and use x_+ \cdot x_- = -c/a.

Summing series

The previous example was about preventing a large roundoff error in a single operation. This example shows that even gradual buildup of roundoff error can be handled in different ways.

Consider the sum \sum_{n=1}^{10000} \frac{1}{n^2} = 1.644834 and assume we are working with single precision, which on most comput-ers means a machine precision of 10^{-7}. The problem with this example is that both the ratio between terms, and the ratio of terms to partial sums, is ever increasing. In section 3.2.3 we observed that a too large ratio can lead to one operand of an addition in effect being ignored.

If we sum the series in the sequence it is given, we observe that the first term is 1, so all partial sums \sum_{n}^{N} where N < 10000 are at least 1. This means that any term where 1/n^2
    < 10^{-7} gets ignored since it is less than the machine precision. Specifically, the last 7000 terms are ignored, and the computed sum is 1.644725. The first 4 digits are correct.

However, if we evaluate the sum in reverse order we obtain the exact result in single precision. We are still adding small quantities to larger ones, but now the ratio will never be as bad as one-to- , so the smaller number is never ignored. To see this, consider the ratio of two terms subsequent terms:

\frac{n^2}{(n-1)^2} = \frac{n^2}{n^2-2n+1} = \frac{1}{1-2/n +1/n^2} \approx 1+ \frac{2}{n}

Since we only sum 10^5 terms and the machine precision is 10^{-7}, in the addition 1/n^2 + 1/(n-1)^2 the second term will not be wholly ignored as it is when we sum from large to small.

Exercise. There is still a step missing in our reasoning. We have shown that in adding two subsequent terms, the smaller one is not ignored. However, during the calculation we add partial sums to the next term in the sequence. Show that this does not worsen the situation.

The lesson here is that series that are monotone (or close to monotone) should be summed from small to large, since the error is minimized if the quantities to be added are closer in magnitude. Note that this is the opposite strategy from the case of subtraction, where operations involving similar quantities lead to larger errors. This implies that if an application asks for adding and subtracting series of numbers, and we know a priori which terms are positive and negative, it may pay off to rearrange the algorithm accordingly.

Unstable algorithms

We will now consider an example where we can give a direct argument that the algorithm can not cope with problems due to inexactly represented real numbers.

Consider the recurrence y_n = \int_{0}^{1} \frac{x^n}{x-5}dx=\frac{1}{n}-5y_{n-1}. This is easily seen to be monotonically decreasing; the first 0 x 5 n term can be computed as y_0 = \ln6-\ln5 . Performing the computation in 3 decimal digits we get:

We see that the computed results are quickly not just inaccurate, but actually nonsensical. We can analyze why this is the case.

If we define the error \epsilon_n in the n-th step as

\tilde{y}_n - y_n = \epsilon_n


\tilde{y}_n = 1/n - 5\tilde{y}_{n-1} = 1/n + 5n_{n-1} + 5\epsilon_{n-1} = y_n + 5\epsilon_{n-1}

so \epsilon_n \geq 5\epsilon_{n-1}. The error made by this computation shows exponential growth.

Linear system solving

Sometimes we can make statements about the numerical precision of a problem even without specifying what algorithm we use. Suppose we want to solve a linear system, that is, we have an n \times n matrix A and a vector b of size n, and we want to compute the vector x such that Ax = b. (We will actually considering algorithms for this in chapter 5.) Since the vector \tilde{b} will the result of some computation or measurement, we are actually dealing with a vector \tilde{b}, which is some perturbation of the ideal b:

\tilde{b} = b +\Delta b

The perturbation vector \Delta b can be of the order of the machine precision if it only arises from representation error,or it can be larger, depending on the calculations that produced \tilde{b}.

We now ask what the relation is between the exact value of x, which we would have obtained from doing an exact calculation with A and b, which is clearly impossible, and the computed value \tilde{b}, which we get from computing with A and \tilde{b}. (In this discussion we will assume that A itself is exact, but this is a simplification.)

Writing \tilde{x} = x +\Delta x, the result of our computation is now

A \tilde{x} = \tilde{b}


A(x + \Delta x) = b + \Delta b

Since Ax = b, we get A \Delta x = \Delta b. From this, we get

The quantity \left \| A \right \|\left \| A^{-1} \right \| is called the condition number of a matrix. The bound then says that any perturbation in the right hand side can lead to a perturbation in the solution that is at most larger by the condition number of the matrix A. Note that it does not say that the perturbation in x needs to be anywhere close to that size, but we can not rule it out, and in some cases it indeed happens that this bound is attained.

Suppose that b is exact up to machine precision, and the condition number of A is 10^4. The bound is often interpreted as saying that the last 4 digits of x are unreliable, or that the computation 'loses 4 digits of accuracy'.

Roundoff error in parallel computations

From the above example of summing a series we saw that addition in computer arithmetic is not associative. A similar fact holds for multiplication. This has an interesting consequence for parallel computations: the way a computation is spread over parallel processors influences the result. For instance, consider computing the sum of a large number N of terms. With P processors at our disposition, we can let each compute N/P terms, and combine the partial results. We immediately see that for no two values of P will the results be identical. This means that reproducibility of results in a parallel context is elusive.

More about floating point arithmetic

Programming languages

Different languages have different approaches to storing integers and floating point numbers. In Fortran it is possible to specify the number of bytes that a number takes up: INTEGER*2, REAL*8. Often it is possible to write a code using only INTEGER, REAL, and use compiler flags to indicate the size of an integer and real number.

In C, the type identifiers have no standard length. For integers there is short int, int, long int, and for floating point float, double. The sizeof() operator gives the number of bytes used to store a datatype.C99, Fortran2003 Recent standards of the C and Fortran languages incorporate the C/Fortran interoperability stan-dard, which can be used to declare a type in one language so that it is compatible with a certain type in the other language.

Other computer arithmetic systems

Other systems have been proposed to dealing with the problems of inexact arithmetic on computers. One solution is extended precision arithmetic, where numbers are stored in more bits than usual. A common use of this is in the calculation of inner products of vectors: the accumulation is internally performed in extended precision, but returned as a regular floating point number. Alternatively, there are libraries such as GMPlib that allow for any calculation to be performed in higher precision.

Another solution to the imprecisions of computer arithmetic is 'interval arithmetic', where for each calculation interval bounds are maintained. While this has been researched for considerable time, it is not practically used other than through specialized libraries.

Fixed-point arithmetic

A fixed-point number can be represented as  (N, F} where  N \geq \beta^0 is the integer part and F < 1 is the fractional part.

Another way of looking at this, is that a fixed-point number is an integer stored in N + F digits, with an implied decimal point after the first N digits.

Fixed-point calculations can overflow, with no possibility to adjust an exponent. Consider the multiplication\left \langle N_1,F_1 \right \rangle \times \left \langle N_2,F_2 \right \rangle, where N_1 \geq \beta^{n_1} and N_2 \geq \beta^{n_2} . This overflows if n_1 + n_2 is more than the number of positions available for the integer part. (Informally, the number of digits of the product is the sum of the digits of the operands.) This means that, in a program that uses fixed-point, numbers will need to have a number of zero digits, if you are ever going to multiply them, which lowers the numerical accuracy. It also means that the programmer has to think harder about calculations, arranging them in such a way that overflow will not occur, and that numerical accuracy is still preserved to a reasonable extent.

So why would people use fixed-point numbers? One important application is in embedded low-power devices, think a battery-powered digital thermometer. Since fixed-point calculations are essentially identical to integer cal-culations, they do not require a floating-point unit, thereby lowering chip size and lessening power demands. Also, many early video game systems had a processor that either had no floating-point unit, or where the integer unit was considerably faster than the floating-point unit. In both cases, implementing non-integer calculations as fixed-point, using the integer unit, was the key to high throughput. 

Another area where fixed point arithmetic is still used, is in signal processing. In modern CPUs, integer and floating point operations are of essentially the same speed, but converting between them is relatively slow. Now, if the sine function is implemented through table lookup, this means that in sin(sin x) the output of a function is used to index the next function application. Obviously, outputting the sine function in fixed point obviates the need for conversion between real and integer quantities, which simplifies the chip logic needed, and speeds up calculations.

Complex numbers

Some programming languages have complex numbers as a native data type, others not, and others are in between. For instance, in Fortran you can declare

COMPLEX z1,z2, z(32)

COMPLEX*16 zz1, zz2, zz(36)

A complex number is a pair of real numbers, the real and imaginary part, allocated adjacent in memory. The first declaration then uses 8 bytes to store to REAL*4 numbers, the second one has REAL*8s for the real and imaginary part. (Alternatively, use DOUBLE COMPLEX or in Fortran90 COMPLEX(KIND=2) for the second line.)

By contrast, the C language does not natively have complex numbers, but both C99 and C++ have a complex.h header file5. This defines as complex number as in Fortran, as two real numbers.

Storing a complex number like this is easy, but sometimes it is computationally not the best solution. This becomes apparent when we look at arrays of complex numbers. If a computation often relies on access to the real (or imagi-nary) parts of complex numbers exclusively, striding through an array of complex numbers, has a stride two, which is disadvantageous (see section In this case, it is better to allocate one array for the real parts, and another for the imaginary parts.

Exercise. Suppose arrays of complex numbers are stored the Fortran way. Analyze the memory access pattern of pairwise multiplying the arrays, that is, \forall_i: c_i \leftarrow a_i \cdot b_i where a(), b(), c() are arrays of complex numbers.

Exercise. Show that an n \times n linear system Ax = b over the complex numbers can be written as a  2n \times 2n system over the real numbers. Hint: split the matrix and the vectors in their real and imaginary parts. Argue for the efficiency of storing arrays of complex numbers as separate arrays for the real and imaginary parts.

Source: Victor Eijkhout, Edmond Chow, and Robert van de Geijn, https://s3.amazonaws.com/saylordotorg-resources/wwwresources/site/textbookuploads/5345_scicompbook.pdf
Creative Commons License This work is licensed under a Creative Commons Attribution 3.0 License.

Last modified: Tuesday, July 28, 2020, 5:08 PM