## Integers and the Representation of Real Numbers

Read these sections on the representation of integers and real numbers. Earlier, you read about number systems and the representation of numbers used for computing. This will give you a chance to review that material. Computer architecture comprises components which perform the functions of storage of data, transfer of data from one component to another, computations, and interfacing to devices external to the computer. Data is stored in terms of units, called words. A word is made up of a number of bits, typically, depending on the computer, 32 bits or 64 bits. Words keep getting longer, with larger numbers of bits. Instructions are also stored in words. Before, you saw examples of how instructions are stored in a word or words. Now, you will see how numbers are stored in words.

#### Integers

In scientific computing, most operations are on real numbers. Computations on integers rarely add up to any serious computation load. It is mostly for completeness that we start with a short discussion of integers.

Integers are commonly stored in 16, 32, or 64 bits, with 16 becoming less common and 64 becoming more and more so. The main reason for this increase is not the changing nature of computations, but the fact that integers are used to index arrays. As the size of data sets grows (in particular in parallel computations), larger indices are needed. For instance, in 32 bits one can store the numbers zero through $2^{32}-1\approx 4 \cdot 10^{9}$. In other words, a 32 bit index can address 4 gigabytes of memory. Until recently this was enough for most purposes; these days the need for larger data sets has made 64 bit indexing necessary.

When we are indexing an array, only positive integers are needed. In general integer computations, of course, we need to accommodate the negative integers too. There are several ways of implementing negative integers. The simplest solution is to reserve one bit as a sign bit, and use the remaining 31 (or 15 or 63; from now on we will consider 32 bits the standard) bits to store the absolute magnitude.

This scheme has some disadvantages, one being that there is both a positive and negative number zero. This means that a test for equality becomes more complicated than simply testing for equality as a bitstring.

The scheme that is used most commonly is called 2’s complement, where integers are represented as follows.

• If  $0\leq m\leq 2^{31} - 1$, the normal bit pattern for m is used.
• If $1\leq n\leq 2^{31}$, then $-n$ is represented by the bit pattern for $2^{32}-n$.

Some observations:

• There is no overlap between the bit patterns for positive and negative integers, in particular, there is only one pattern for zero.
• The positive numbers have a leading bit zero, the negative numbers have the leading bit set.

Exercise 3.1. For the ‘naive’ scheme and the 2’s complement scheme for negative numbers, give pseudocode for the comparison test m < n, where m and n are integers. Be careful to distinguish between all cases of m; n positive, zero, or negative.

Adding two numbers with the same sign, or multiplying two numbers of any sign, may lead to a result that is too large or too small to represent. This is called overflow.

Exercise 3.2. Investigate what happens when you perform such a calculation. What does your compiler say if you try to write down a nonrepresentible number explicitly, for instance in an assignment statement?

In exercise 1 above you explored comparing two integers. Let us know explore how subtracting numbers in two’s complement is implemented. Consider $0\leq m \leq 2^{31} -1$ and $1\leq n \leq 2^{31}$ a nd let us see what happens in the computation of $m-n$.

Suppose we have an algorithm for adding and subtracting unsigned 32-bit numbers. Can we use that to subtract two’s complement integers? We start by observing that the integer subtraction $m-n$ becomes the unsigned addition $m + (2^{32}-n)$.

• Case: $m < n$. Since $m + (2^{32} - n) = 2^{32} - (n - m)$ and $1 \leq n - m \leq 2^{31}, we conclude that \(2^{32} - (n - m)$ is a valid bit pattern. Moreover, it is the bit pattern representation of the negative number $m - n$, so we cnan indeed compute $m - n$ as an unsigned operation on the bitstring representations of m and n.
• Case: $m < n$. Here we observe that $m + (2^{32} - n) = 2^{32} + m - n$. Since $m - n > 0$, this is a number $> 2^{32}$ and therefore not a legitimate expression of a negative number. However, if we store this number in 33 bits, we see that it is the correct result $m - n$, plus a single bit in the 33-rd position. Thus, by performing the unsigned addition, and ignoring the overflow bit, we again get the correct result.

In both cases we conclude that we can perform subtraction by adding the bitstrings that represent the positive and negative number as unsigned integers, and ignoring overflow if it occurs.

#### Representation of real numbers

In this section we will look at how various kinds of numbers are represented in a computer, and the limitations of various schemes. The next section will then explore the ramifications of this on arithmetic involving computer numbers.

Real numbers are stored using a scheme that is analogous to what is known as ‘scientific notation’, where a number is represented as a significant and an exponent, for instance 6.022 x 1023, which has a significant 6.022 with a radix point after the first digit, and an exponent 23. This number stands for

$6.022 \cdot 20^{23} = [6 \times 10^{0} + 0 \times 10^{-2} + 2 \times 10^{-2} + 2 \times 10^{-3}] \cdot 10^{24}$

We introduce a base, a small integer number, 10 in the preceding example, and 2 in computer numbers, and write numbers in terms of it as a sum of t terms:

$x = \pm 1 \times [d_1 \beta^{0} + d_2 \beta^{-1} + d_3 \beta^{2} + ... ] \times \beta^{e} = \pm \sum_{i=1}^{t}d_i \beta^{1-i} \times \beta^{e}$

where the components are

• the sign bit: a single bit storing whether the number is positive or negative;
• $\beta$ is the base of the number system;
• $0 \leq d_i \leq \beta - 1$ the digits of the mantissa or significant – the location of the radix point (decimal point in decimal numbers) is implicitly assumed to the immediately following the first digit;
• t is the length of the mantissa;
• $e \in [L,U]$ exponent; typically $L < 0 < U$ and $L \approx - U$

Note that there is an explicit sign bit for the whole number; the sign of the exponent is handled differently. For reasons of efficiency, e is not a signed number; instead it is considered as an unsigned number in excess of a certain minimum value. For instance, the bit pattern for the number zero is interpreted as e = L.

##### Some examples

Let us look at some specific examples of floating point representations. Base 10 is the most logical choice for human consumption, but computers are binary, so base 2 predominates there. Old IBM mainframes grouped bits to make for a base 16 representation.

Of these, the single and double precision formats are by far the most common. We will discuss these in section 3.2.4 and further.

Binary coded decimal

Decimal numbers are not relevant in scientific computing, but they are useful in financial calculations, where com-putations involving money absolutely have to be exact. Binary arithmetic is at a disadvantage here, since numbers such as 1/10 are repeating fractions in binary. With a finite number of bits in the mantissa, this means that the number 1/10 can not be represented exactly in binary. For this reason, binary-coded-decimal schemes were used in old IBM mainframes, and are in fact being standardized in revisions of IEEE754 [4]; see also section 3.2.4. Few processors these days have hardware support for BCD; one example is the IBM Power6.

In BCD schemes, one or more decimal digits are encoded in a number of bits. The simplest scheme would encode the digits 0 . . . 9 in four bits. This has the advantage that in a BCD number each digit is readily identified; it has the disadvantage that about 1/3 of all bits are wasted, since 4 bits can encode the numbers 0 . . . 15. More efficient encodings would encode 0 . . . 999 in ten bits, which could in principle store the numbers 0 . . . 1023. While this is efficient in the sense that few bits are wasted, identifying individual digits in such a number takes some decoding.

Ternary computers

There have been some experiments with ternary arithmetic [2, 8, 9].

##### Limitations

Since we use only a finite number of bits to store floating point numbers, not all numbers can be represented. The ones that can not be represented fall into two categories: those that are too large or too small (in some sense), and those that fall in the gaps. Numbers can be too large or too small in the following ways.

Overflow: The largest number we can store is $(1-\beta^{-t-1})\beta^{U}$, and the smallest number (in an absolute sense) is $-(1-\beta^{-t-1})\beta^{U}$; anything larger than the former or smaller than the latter causes a condition called overflow.

Underflow: The number closest to zero is $\beta^{-t-1} \cdot \beta^{L}$. A computation that has a result less than that (in absolute value) causes a condition called underflow. In fact, most computers use normalized floating point numbers: the first digit d1 is taken to be nonzero; see section 3.2.3 for more about this. In this case, any number less than  $\beta^{-1} \cdot \beta^{L}$ causes underflow. Trying to compute a number less than that is sometimes handled by using unnormalized floating point numbers (a process known as gradual underflow), but this is typically tens or hundreds of times slower than computing with regular floating point numbers. At the time of this writing, only the IBM Power6 has hardware support for gradual underflow.

The fact that only a small number of real numbers can be represented exactly is the basis of the field of round-off error analysis. We will study this in some detail in the following sections.

##### Normalized numbers and machine precision

The general definition of floating point numbers, equation (3.1), leaves us with the problem that numbers have more than one representation. For instance, $.5 \times 10^{2} = .05 \times 10^{3}$. Since this would make computer arithmetic needlessly complicated, for instance in testing equality of numbers, we use normalized floating point numbers. A number is normalized if its first digit is nonzero. The implies that the mantissa part is $\beta > x_m \geq 1$.

A practical implication in the case of binary numbers is that the first digit is always 1, so we do not need to store it explicitly. In the IEEE 754 standard, this means that every floating point number is of the form

$1.d_1d_2...d_t \times 2^{exp}$

We can now be a bit more precise about the representation error. A machine number $\bar{x}$ is the representation for all x in an interval around it. With t digits in the mantissa, this is the interval of numbers that differ from x in the t + 1st digit. For the mantissa part we get:

$\left\{\begin{matrix} x \in [\tilde{x}, \tilde{x} + \beta^{-2}] \; \; \mathrm{truncation} \\ x \in [\tilde{x}-\frac{1}{2}\beta^{-t}, \tilde{x}+\frac{1}{2}\beta^{-t}] \: \: \mathrm{rounding} \end{matrix}\right.$

Often we are only interested in the order of magnitude of the error, and we will write $\tilde{x} = x(1+\epsilon )$, where $\left | \epsilon \right | \leq \beta^{-t}$This maximum relative error is called the machine precision, or sometimes machine epsilon. Typical values are:

$\left\{\begin{matrix} \epsilon \approx 10^{-7} \: \: \: \mathrm{32-bit\: single \, precision} \\ \epsilon \approx 10^{-16} \: \: \: \mathrm{64-bit \, double\, precision} \end{matrix}\right.$

Figure 3.1: Single precision arithmetic

Machine precision can be defined another way: is the smallest number that can be added to 1 so that 1 + has a different representation than 1. A small example shows how aligning exponents can shift a too small operand so  that it is effectively ignored in the addition operation:

Yet another way of looking at this is to observe that, in the addition $x + y$ , if the ratio of x and y is too large, the result will be identical to x.

The machine precision is the maximum attainable accuracy of computations: it does not make sense to ask for more than 6-or-so digits accuracy in single precision, or 15 in double.

Exercise 3.3. Write a small program that computes the machine epsilon. Does it make any difference if you set the compiler optimization levels low or high? Can you find other ways in which this computation goes wrong?

##### The IEEE 754 standard for floating point numbers

Some decades ago, issues like the length of the mantissa and the rounding behaviour of operations could differ between computer manufacturers, and even between models from one manufacturer. This was obviously a bad situation from a point of portability of codes and reproducibility of results. The IEE standard 7542 codified all this, for instance stipulating 24 and 53 bits for the mantissa in single and double precision arithmetic, using a storage sequence of sign bit, exponent, mantissa. This for instance facilitates comparison of numbers.

The standard also declared the rounding behaviour to be ‘exact rounding’: the result of an operation should be the rounded version of the exact result.

Above (section 3.2.2), we have seen the phenomena of overflow and underflow, that is, operations leading to un-representible numbers. There is a further exceptional situation that needs to be dealt with: what result should be returned if the program asks for illegal operations such as$\sqrt{-4}$? The IEEE 754 standard has two special quantities for this: Inf and NaN for ‘infinity’ and ‘not a number’. If NaN appears in an expression, the whole expression will evaluate to that value. The rule for computing with Inf is a bit more complicated [43].

An inventory of the meaning of all bit patterns in IEEE 754 double precision is given in figure 3.1. Note that for normalized numbers the first nonzero digit is a 1, which is not stored, so the bit pattern $d_1d_2...d_t$ is interpreted as

$1.d_1d_2...d_t$

These days, almost all processors adhere to the IEEE 754 standard, with only occasional exceptions. For instance, Nvidia Tesla GPU s are not standard-conforming in single precision. The justification for this is that double precision is the ‘scientific’ mode, while single precision is mostly likely used for graphics, where exact compliance matters less.