Floating Points
Read this article on the representation of real numbers.
In computing, floating-point arithmetic (FP) is arithmetic using a formulaic representation of real numbers as an approximation to support a trade-off between range and precision. For this reason, floating-point computation is often found in systems which include very small and very large real numbers, which require fast processing times. A number is, in general, represented approximately to a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:
significand × base^{exponent}
where significand is an integer, base is an integer greater than or equal to two, and exponent is also an integer. For example:
1.2345 = 12345 × 10^{−4}
The term floating point refers to the fact that a number's radix point (decimal point, or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated as the exponent component, and thus the floating-point representation can be thought of as a kind of scientific notation.
An early electromechanical programmable computer, the Z3, included floating-point arithmetic (replica on display at Deutsches Museum in Munich).
A floating-point system can be used to represent, with a fixed number of digits, numbers of different orders of magnitude: e.g. the distance between galaxies or the diameter of an atomic nucleus can be expressed with the same unit of length. The result of this dynamic range is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers varies with the chosen scale.
Single-precision floating-point numbers: the green lines mark representable values.
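The non-uniform spacing can be observed directly. The following is a minimal sketch in Python, assuming the interpreter's float is an IEEE 754 double and that math.ulp is available (Python 3.9 or later); the variable names are illustrative only.

```python
import math

# math.ulp(x) returns the gap between x and the next representable
# double of larger magnitude.
for x in (1.0, 1000.0, 1e16):
    print(f"spacing near {x:g}: {math.ulp(x):g}")

gap_near_one = math.ulp(1.0)     # 2**-52
gap_near_1e16 = math.ulp(1e16)   # 2.0, so odd integers near 1e16 are skipped
```

The gap grows with the magnitude of the number, which is why adding 1 to 1e16 leaves the value unchanged.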
Over the years, a variety of floating-point representations have been used in computers. In 1985, the IEEE 754 Standard for Floating-Point Arithmetic was established, and since the 1990s, the most commonly encountered representations are those defined by the IEEE.
The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations.
A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers.
Overview
Floating-point numbers
A number representation specifies some way of encoding a number, usually as a string of digits.
There are several mechanisms by which strings of digits can represent numbers. In common mathematical notation, the digit string can be of any length, and the location of the radix point is indicated by placing an explicit "point" character (dot or comma) there. If the radix point is not specified, then the string implicitly represents an integer and the unstated radix point would be off the right-hand end of the string, next to the least significant digit. In fixed-point systems, a position in the string is specified for the radix point. So a fixed-point scheme might be to use a string of 8 decimal digits with the decimal point in the middle, whereby "00012345" would represent 0001.2345.
In scientific notation, the given number is scaled by a power of 10, so that it lies within a certain range, typically between 1 and 10, with the radix point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the orbital period of Jupiter's moon Io is 152,853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047×10^{5} seconds.
Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:
 A signed (meaning positive or negative) digit string of a given length in a given base (or radix). This digit string is referred to as the significand, mantissa, or coefficient. The length of the significand determines the precision to which numbers can be represented. The radix point position is assumed always to be somewhere within the significand: often just after or just before the most significant digit, or to the right of the rightmost (least significant) digit. This article generally follows the convention that the radix point is set just after the most significant (leftmost) digit.
 A signed integer exponent (also referred to as the characteristic, or scale), which modifies the magnitude of the number.
To derive the value of the floating-point number, the significand is multiplied by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent: to the right if the exponent is positive or to the left if the exponent is negative.
Using base-10 (the familiar decimal notation) as an example, the number 152,853.5047, which has ten decimal digits of precision, is represented as the significand 1,528,535,047 together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by 10^{5} to give 1.528535047×10^{5}, or 152,853.5047. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.
Symbolically, this final value is:
(s / b^{p−1}) × b^{e}
where s is the significand (ignoring any implied decimal point), p is the precision (the number of digits in the significand), b is the base (in our example, this is the number ten), and e is the exponent.
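The value of a floating-point number with integer significand s, precision p, base b and exponent e is (s / b^{p−1}) × b^{e}. As a rough illustration, the sketch below evaluates that expression exactly with Python's fractions module; the function name fp_value is an assumption made for this example, not part of any standard.

```python
from fractions import Fraction

def fp_value(s: int, p: int, b: int, e: int) -> Fraction:
    """Exact value of integer significand s (p digits, base b) with exponent e."""
    return Fraction(s, b ** (p - 1)) * Fraction(b) ** e

# Orbital period of Io from the text: significand 1528535047, exponent 5
io_period = fp_value(s=1528535047, p=10, b=10, e=5)
print(io_period, float(io_period))
```

Using exact rationals here avoids mixing the illustration with the rounding behaviour of the host's binary floats.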
Historically, several number bases have been used for representing floating-point numbers, with base two (binary) being the most common, followed by base ten (decimal floating point), and other less common varieties, such as base sixteen (hexadecimal floating point), base eight (octal floating point), base four (quaternary floating point), base three (balanced ternary floating point) and even base 256 and base 65,536.
A floating-point number is a rational number, because it can be represented as one integer divided by another; for example 1.45×10^{3} is (145/100)×1000 or 145,000/100. The base determines the fractions that can be represented; for instance, 1/5 cannot be represented exactly as a floating-point number using a binary base, but 1/5 can be represented exactly using a decimal base (0.2, or 2×10^{−1}). However, 1/3 cannot be represented exactly by either binary (0.010101...) or decimal (0.333...), but in base 3, it is trivial (0.1 or 1×3^{−1}). The occasions on which infinite expansions occur depend on the base and its prime factors.
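The dependence on prime factors can be made concrete: a fraction in lowest terms has a finite expansion in a base exactly when every prime factor of its denominator also divides the base. The following is a small sketch of that test; the helper name terminates is hypothetical.

```python
from fractions import Fraction
from math import gcd

def terminates(x: Fraction, base: int) -> bool:
    """True if x has a finite digit expansion in the given base."""
    d = x.denominator          # Fraction is always stored in lowest terms
    g = gcd(d, base)
    while g > 1:
        # strip every prime factor that d shares with the base
        while d % g == 0:
            d //= g
        g = gcd(d, base)
    return d == 1

for frac, base in [(Fraction(1, 5), 2), (Fraction(1, 5), 10),
                   (Fraction(1, 3), 10), (Fraction(1, 3), 3)]:
    print(frac, "in base", base, "->", terminates(frac, base))
```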
The way in which the significand (including its sign) and exponent are stored in a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision (32-bit) floating-point representation, p = 24, and so the significand is a string of 24 bits. For instance, the number π's first 33 bits are:
11001001 00001111 11011010 10100010 0
In this binary expansion, let us denote the positions from 0 (leftmost bit, or most significant bit) to 32 (rightmost bit). The 24-bit significand will stop at position 23, which is the final bit 0 of the third group above. The next bit, at position 24, is called the round bit or rounding bit. It is used to round the 33-bit approximation to the nearest 24-bit number (there are specific rules for halfway values, which is not the case here). This bit, which is 1 in this example, is added to the integer formed by the leftmost 24 bits, yielding:
11001001 00001111 11011011
When this is stored in memory using the IEEE 754 encoding, this becomes the significand s. The significand is assumed to have a binary point to the right of the leftmost bit. So, the binary representation of π is calculated from left to right as follows:
(Σ_{n=0}^{p−1} bit_n × 2^{−n}) × 2^{e} = (1 × 2^{0} + 1 × 2^{−1} + 0 × 2^{−2} + 0 × 2^{−3} + 1 × 2^{−4} + ... + 1 × 2^{−23}) × 2^{1} ≈ 1.57079637 × 2 ≈ 3.1415927
where p is the precision (24 in this example), n is the position of the bit of the significand from the left (starting at 0 and finishing at 23 here) and e is the exponent (1 in this example).
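The rounding step for π can be reproduced in a few lines. This sketch assumes the first 33 bits of π are 11001001 00001111 11011010 10100010 0 and applies only the simplified rule described in the text (add the round bit), which is enough here because the value is not a halfway case; a full implementation would also need the tie-breaking rules.

```python
import math
import struct

# First 33 bits of pi, written in groups for readability
PI_BITS = "11001001" "00001111" "11011010" "10100010" "0"

significand = int(PI_BITS[:24], 2) + int(PI_BITS[24])  # add the round bit
e = 1
# The binary point sits after the leftmost bit, so scale by 2**-(p - 1)
pi_single = significand * 2.0 ** -23 * 2.0 ** e

# Cross-check against the platform's own conversion to binary32
pi_float32 = struct.unpack(">f", struct.pack(">f", math.pi))[0]
print(f"{significand:024b}", pi_single, pi_single == pi_float32)
```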
It can be required that the most significant digit of the significand of a nonzero number be nonzero (except when the corresponding exponent would be smaller than the minimum one). This process is called normalization. For binary formats (which use only the digits 0 and 1), this nonzero digit is necessarily 1. Therefore, it does not need to be represented in memory, allowing the format to have one more bit of precision. This rule is variously called the leading bit convention, the implicit bit convention, the hidden bit convention, or the assumed bit convention.
Alternatives to floating-point numbers
The floating-point representation is by far the most common way of representing in computers an approximation to real numbers. However, there are alternatives:
 Fixed-point representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating point, and it can be used to perform normal integer operations, too. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications.
 Logarithmic number systems (LNSs) represent a real number by the logarithm of its absolute value and a sign bit. The value distribution is similar to floating point, but the value-to-representation curve (i.e., the graph of the logarithm function) is smooth (except at 0). Conversely to floating-point arithmetic, in a logarithmic number system multiplication, division and exponentiation are simple to implement, but addition and subtraction are complex. The (symmetric) level-index arithmetic (LI and SLI) of Charles Clenshaw, Frank Olver and Peter Turner is a scheme based on a generalized logarithm representation.
 Tapered floating-point representation, which does not appear to be used in practice.
 Where greater precision is desired, floating-point arithmetic can be implemented (typically in software) with variable-length significands (and sometimes exponents) that are sized depending on actual need and depending on how the calculation proceeds. This is called arbitrary-precision floating-point arithmetic.
 Floating-point expansions are another way to get a greater precision, benefiting from the floating-point hardware: a number is represented as an unevaluated sum of several floating-point numbers. An example is double-double arithmetic, sometimes used for the C type long double.
 Some simple rational numbers (e.g., 1/3 and 1/10) cannot be represented exactly in binary floating point, no matter what the precision is. Using a different radix allows one to represent some of them (e.g., 1/10 in decimal floating point), but the possibilities remain limited. Software packages that perform rational arithmetic represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use "bignum" arithmetic for the individual integers.
 Interval arithmetic allows one to represent numbers as intervals and obtain guaranteed bounds on results. It is generally based on other arithmetics, in particular floating point.
 Computer algebra systems such as Mathematica, Maxima, and Maple can often handle irrational numbers like π or √3 in a completely "formal" way, without dealing with a specific encoding of the significand. Such a program can evaluate expressions like "sin(3π)" exactly, because it is programmed to process the underlying mathematics directly, instead of using approximate values for each intermediate calculation.
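Of the alternatives listed, rational arithmetic is the easiest to try out. The sketch below uses Python's standard fractions module as one example of such a package (it relies on Python's built-in arbitrary-size integers for the "bignum" part) and contrasts it with binary floating point.

```python
from fractions import Fraction

# Binary floating point: 0.1, 0.2 and 0.3 are all rounded on input
float_sum_exact = (0.1 + 0.2 == 0.3)

# Rational arithmetic: numerators and denominators are exact integers
rational_sum_exact = (Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))

third = Fraction(1, 3)
print(float_sum_exact, rational_sum_exact, third * 3 == 1)
```

The price of exactness is that numerators and denominators can grow without bound during a long calculation.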
History
In 1914, Leonardo Torres y Quevedo designed an electromechanical version of Charles Babbage's Analytical Engine, and included floating-point arithmetic. In 1938, Konrad Zuse of Berlin completed the Z1, the first binary, programmable mechanical computer; it uses a 24-bit binary floating-point number representation with a 7-bit signed exponent, a 17-bit significand (including one implicit bit), and a sign bit. The more reliable relay-based Z3, completed in 1941, has representations for both positive and negative infinities; in particular, it implements defined operations with infinity, such as 1/∞ = 0, and it stops on undefined operations, such as 0 × ∞.
Konrad Zuse, architect of the Z3 computer, which uses a 22-bit binary floating-point representation.
Zuse also proposed, but did not complete, carefully rounded floating-point arithmetic that includes ±∞ and NaN representations, anticipating features of the IEEE Standard by four decades. In contrast, von Neumann recommended against floating-point numbers for the 1951 IAS machine, arguing that fixed-point arithmetic is preferable.
The first commercial computer with floating-point hardware was Zuse's Z4 computer, designed in 1942–1945. In 1946, Bell Laboratories introduced the Model V, which implemented decimal floating-point numbers.
The Pilot ACE has binary floating-point arithmetic, and it became operational in 1950 at National Physical Laboratory, UK. Thirty-three were later sold commercially as the English Electric DEUCE. The arithmetic is actually implemented in software, but with a one megahertz clock rate, the speed of floating-point and fixed-point operations in this machine was initially faster than that of many competing computers.
The mass-produced IBM 704 followed in 1954; it introduced the use of a biased exponent. For many decades after that, floating-point hardware was typically an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computation" (SC) capability (see also Extensions for Scientific Computation (XSC)). It was not until the launch of the Intel i486 in 1989 that general-purpose personal computers had floating-point capability in hardware as a standard feature.
The UNIVAC 1100/2200 series, introduced in 1962, supported two floating-point representations:
 Single precision: 36 bits, organized as a 1-bit sign, an 8-bit exponent, and a 27-bit significand.
 Double precision: 72 bits, organized as a 1-bit sign, an 11-bit exponent, and a 60-bit significand.
The IBM 7094, also introduced in 1962, supports single-precision and double-precision representations, but with no relation to the UNIVAC's representations. Indeed, in 1964, IBM introduced hexadecimal floating-point representations in its System/360 mainframes; these same representations are still available for use in modern z/Architecture systems. However, in 1998, IBM added IEEE-compatible binary floating-point arithmetic to its mainframes; in 2005, IBM also added IEEE-compatible decimal floating-point arithmetic.
Initially, computers used many different representations for floating-point numbers. The lack of standardization at the mainframe level was an ongoing problem by the early 1970s for those writing and maintaining higher-level source code; these manufacturer floating-point standards differed in the word sizes, the representations, and the rounding behavior and general accuracy of operations. Floating-point compatibility across multiple computing systems was in desperate need of standardization by the early 1980s, leading to the creation of the IEEE 754 standard once the 32-bit (or 64-bit) word had become commonplace. This standard was significantly based on a proposal from Intel, which was designing the i8087 numerical coprocessor; Motorola, which was designing the 68000 around the same time, gave significant input as well.
In 1989, mathematician and computer scientist William Kahan was honored with the Turing Award for being the primary architect behind this proposal; he was aided by his student (Jerome Coonen) and a visiting professor (Harold Stone).
Among the x86 innovations are these:
 A precisely specified floating-point representation at the bit-string level, so that all compliant computers interpret bit patterns the same way. This makes it possible to accurately and efficiently transfer floating-point numbers from one computer to another (after accounting for endianness).
 A precisely specified behavior for the arithmetic operations: A result is required to be produced as if infinitely precise arithmetic were used to yield a value that is then rounded according to specific rules. This means that a compliant computer program would always produce the same result when given a particular input, thus mitigating the almost mystical reputation that floating-point computation had developed for its hitherto seemingly non-deterministic behavior.
 The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and then be handled by the software in a controlled fashion.
Range of floating-point numbers
A floating-point number consists of two fixed-point components, whose range depends exclusively on the number of bits or digits in their representation. The range of the floating-point number depends linearly on the significand range and exponentially on the range of the exponent component, which gives the number a far wider range than either component has on its own.
On a typical computer system, a double-precision (64-bit) binary floating-point number has a coefficient of 53 bits (including 1 implied bit), an exponent of 11 bits, and 1 sign bit. Since 2^{10} = 1024, the complete range of the positive normal floating-point numbers in this format is from 2^{−1022} ≈ 2 × 10^{−308} to approximately 2^{1024} ≈ 2 × 10^{308}.
The number of normalized floating-point numbers in a system (B, P, L, U), where
 B is the base of the system,
 P is the precision of the system (the number of digits in the significand),
 L is the smallest exponent representable in the system,
 and U is the largest exponent used in the system,
is 2 (B − 1) (B^{P−1}) (U − L + 1).
There is a smallest positive normalized floating-point number,
Underflow level = UFL = B^{L},
which has a 1 as the leading digit and 0 for the remaining digits of the significand, and the smallest possible value for the exponent.
There is a largest floating-point number,
Overflow level = OFL = (1 − B^{−P}) (B^{U+1}),
which has B − 1 as the value for each digit of the significand and the largest possible value for the exponent.
In addition, there are representable values strictly between −UFL and UFL, namely positive and negative zeros, as well as denormalized numbers.
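For a system (B, P, L, U), the count of normalized numbers is 2 (B − 1) (B^{P−1}) (U − L + 1), the underflow level is UFL = B^{L}, and the overflow level is OFL = (1 − B^{−P}) (B^{U+1}). The sketch below evaluates these expressions exactly and compares them with the limits Python reports for its own float type, assuming that type is IEEE 754 binary64 (B = 2, P = 53, L = −1022, U = 1023). The function names are illustrative.

```python
import sys
from fractions import Fraction

def normalized_count(B: int, P: int, L: int, U: int) -> int:
    # sign choices * leading-digit choices * remaining digits * exponents
    return 2 * (B - 1) * B ** (P - 1) * (U - L + 1)

def ufl(B: int, L: int) -> Fraction:
    return Fraction(B) ** L

def ofl(B: int, P: int, U: int) -> Fraction:
    return (1 - Fraction(1, B ** P)) * Fraction(B) ** (U + 1)

print(normalized_count(2, 24, -126, 127))   # binary32
print(ufl(2, -1022) == Fraction(sys.float_info.min))
print(ofl(2, 53, 1023) == Fraction(sys.float_info.max))
```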
IEEE 754: floating point in modern computers
The IEEE standardized the computer representation for binary floating-point numbers in IEEE 754 (a.k.a. IEC 60559) in 1985. This first standard is followed by almost all modern machines. It was revised in 2008. IBM mainframes support IBM's own hexadecimal floating-point format and IEEE 754-2008 decimal floating point in addition to the IEEE 754 binary format. The Cray T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.
The standard provides for many closely related formats, differing in only a few details. Five of these formats are called basic formats and others are termed extended formats; three of these are especially widely used in computer hardware and languages:
 Single precision, usually used to represent the "float" type in the C language family (though this is not guaranteed). This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
 Double precision, usually used to represent the "double" type in the C language family (though this is not guaranteed). This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).
 Double extended, also called "extended precision" format. This is a binary format that occupies at least 79 bits (80 if the hidden/implicit bit rule is not used) and its significand has a precision of at least 64 bits (about 19 decimal digits). A format satisfying the minimal requirements (64-bit significand precision, 15-bit exponent, thus fitting on 80 bits) is provided by the x86 architecture. Often on such processors, this format can be used with "long double" in the C language family (the C99 and C11 standards "IEC 60559 floating-point arithmetic extension Annex F" recommend the 80-bit extended format to be provided as "long double" when available), though extended precision is not available with MSVC. For alignment purposes, many tools store this 80-bit value in a 96-bit or 128-bit space. On other processors, "long double" may stand for a larger format, such as quadruple precision, or just double precision, if any form of extended precision is not available.
Increasing the precision of the floating-point representation generally reduces the amount of accumulated round-off error caused by intermediate calculations. Less common IEEE formats include:
 Quadruple precision (binary128). This is a binary format that occupies 128 bits (16 bytes) and its significand has a precision of 113 bits (about 34 decimal digits).
 Decimal double precision (decimal64) and decimal quadruple precision (decimal128) decimal floating-point formats. These formats, along with the decimal single precision (decimal32) format, are intended for performing decimal rounding correctly.
 Half, also called binary16, a 16-bit floating-point value. It is being used in the NVIDIA Cg graphics language, and in the OpenEXR standard.
Any integer with absolute value less than 2^{24} can be exactly represented in the single precision format, and any integer with absolute value less than 2^{53} can be exactly represented in the double precision format. Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double precision floats but only 32-bit integers.
The standard specifies some special values, and their representation: positive infinity (+∞), negative infinity (−∞), a negative zero (−0) distinct from ordinary ("positive") zero, and "not a number" values (NaNs).
Comparison of floating-point numbers, as defined by the IEEE standard, is a bit different from usual integer comparison. Negative and positive zero compare equal, and every NaN compares unequal to every value, including itself. All values except NaN are strictly smaller than +∞ and strictly greater than −∞. Finite floating-point numbers are ordered in the same way as their values (in the set of real numbers).
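These comparison rules can be checked interactively. The following sketch assumes Python's float follows IEEE 754 comparison semantics, which holds on common platforms; the dictionary name checks is just a convenience for this example.

```python
import math

nan = math.nan
checks = {
    "0.0 == -0.0": 0.0 == -0.0,          # signed zeros compare equal
    "nan == nan": nan == nan,            # NaN is unequal even to itself
    "nan != nan": nan != nan,
    "nan < inf": nan < math.inf,         # ordered comparisons with NaN fail
    "1e308 < inf": 1e308 < math.inf,
    "-inf < -1e308": -math.inf < -1e308,
}
for expression, result in checks.items():
    print(f"{expression}: {result}")
```

A practical consequence is that x != x is a portable test for NaN, although math.isnan states the intent more clearly.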
Internal representation
Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand or mantissa, from left to right. For the IEEE 754 binary formats (basic and extended) which have extant hardware implementations, they are apportioned as follows:
| Type | Sign | Exponent | Significand field | Total bits | Exponent bias | Bits precision | Number of decimal digits |
|---|---|---|---|---|---|---|---|
| Half (IEEE 754-2008) | 1 | 5 | 10 | 16 | 15 | 11 | ~3.3 |
| Single | 1 | 8 | 23 | 32 | 127 | 24 | ~7.2 |
| Double | 1 | 11 | 52 | 64 | 1023 | 53 | ~15.9 |
| x86 extended precision | 1 | 15 | 64 | 80 | 16383 | 64 | ~19.2 |
| Quad | 1 | 15 | 112 | 128 | 16383 | 113 | ~34.0 |
While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it. Values of all 0s in this field are reserved for the zeros and subnormal numbers; values of all 1s are reserved for the infinities and NaNs. The exponent range for normalized numbers is [−126, 127] for single precision, [−1022, 1023] for double, or [−16382, 16383] for quad. Normalized numbers exclude subnormal values, zeros, infinities, and NaNs.
In the IEEE binary interchange formats the leading 1 bit of a normalized significand is not actually stored in the computer datum. It is called the "hidden" or "implicit" bit. Because of this, single precision format actually has a significand with 24 bits of precision, double precision format has 53, and quad has 113.
For example, it was shown above that π, rounded to 24 bits of precision, has:
 sign = 0 ; e = 1 ; s = 110010010000111111011011 (including the hidden bit)
The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in single precision format as
 0 10000000 10010010000111111011011 (excluding the hidden bit) = 40490FDB as a hexadecimal number.
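The encoding of π worked out above can be verified by unpacking the three fields from the 32-bit pattern. This is a minimal sketch using Python's struct module; float32_fields is a hypothetical helper name, and the field widths (1, 8 and 23 bits) are those of the single-precision format.

```python
import math
import struct

def float32_fields(x: float):
    """Return (sign, biased exponent, fraction field) of x as binary32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

sign, biased, fraction = float32_fields(math.pi)
# Unbiased exponent, then the 23 stored bits (hidden bit excluded)
print(sign, biased - 127, f"{fraction:023b}")
print(struct.pack(">f", math.pi).hex().upper())
```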
An example of a layout for 32-bit floating point is, from left to right, 1 sign bit, 8 exponent bits, and 23 significand bits (figure omitted), and the 64-bit layout is similar.
Special values
Signed zero
In the IEEE 754 standard, zero is signed, meaning that there exist both a "positive zero" (+0) and a "negative zero" (−0). In most run-time environments, positive zero is usually printed as "0" and the negative zero as "-0". The two values behave as equal in numerical comparisons, but some operations return different results for +0 and −0. For instance, 1/(−0) returns negative infinity, while 1/+0 returns positive infinity (so that the identity 1/(1/±∞) = ±∞ is maintained). Other common functions with a discontinuity at x=0 which might treat +0 and −0 differently include log(x), signum(x), and the principal square root of y + xi for any negative number y. As with any approximation scheme, operations involving "negative zero" can occasionally cause confusion. For example, in IEEE 754, x = y does not always imply 1/x = 1/y, as 0 = −0 but 1/0 ≠ 1/−0.
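The behaviour of the two zeros can be observed with a short sketch. Note one assumption about the host language: Python raises ZeroDivisionError for 1/0.0 instead of returning an infinity, so this example exposes the sign through math.copysign and math.atan2 rather than through division.

```python
import math

pos, neg = 0.0, -0.0
print(pos == neg)                    # the two zeros compare equal
print(str(pos), str(neg))            # but they print differently
print(math.copysign(1.0, neg))       # the sign bit is observable
# atan2 has a branch cut along the negative x axis and so
# distinguishes the two zeros in its second argument
print(math.atan2(0.0, pos), math.atan2(0.0, neg))
```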
Subnormal numbers
Subnormal values fill the underflow gap with values where the absolute distance between them is the same as for adjacent values just outside the underflow gap. This is an improvement over the older practice to just have zero in the underflow gap, and where underflowing results were replaced by zero (flush to zero).
Modern floating-point hardware usually handles subnormal values (as well as normal values), and does not require software emulation for subnormals.
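Gradual underflow can be seen by subtracting two nearby values at the bottom of the normal range. The sketch assumes Python's float is IEEE 754 binary64 and that the platform does not run in a flush-to-zero mode; the variable names are illustrative.

```python
import sys

smallest_normal = sys.float_info.min    # 2**-1022
smallest_subnormal = 5e-324             # 2**-1074

x = 1.5 * smallest_normal
y = smallest_normal
difference = x - y                      # 2**-1023, a subnormal value

print(difference, difference != 0.0)
print(smallest_subnormal / 2)           # underflows to zero
```

Without subnormals the difference would be flushed to zero even though x and y are distinct.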
Infinities
The infinities of the extended real number line can be represented in IEEE floating-point datatypes, just like ordinary floating-point values like 1, 1.5, etc. They are not error values in any way, though they are often (but not always, as it depends on the rounding) used as replacement values when there is an overflow. Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. An infinity can also be introduced as a numeral (like C's "INFINITY" macro, or "∞" if the programming language allows that syntax).
IEEE 754 requires infinities to be handled in a reasonable way, such as
 (+∞) + (+7) = (+∞)
 (+∞) × (−2) = (−∞)
 (+∞) × 0 = NaN – there is no meaningful thing to do
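The three rules listed above can be reproduced with Python floats, assuming they follow IEEE 754 arithmetic for these operations (they do on common platforms):

```python
import math

inf = math.inf
print(inf + 7.0)     # inf
print(inf * -2.0)    # -inf
print(inf * 0.0)     # nan
print(1.0 / inf)     # 0.0
```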
NaNs
IEEE 754 specifies a special value called "Not a Number" (NaN) to be returned as the result of certain "invalid" operations, such as 0/0, ∞×0, or sqrt(−1). In general, NaNs will be propagated, i.e. most operations involving a NaN will result in a NaN, although functions that would give some defined result for any given floating-point value will do so for NaNs as well, e.g. NaN ^ 0 = 1. There are two kinds of NaNs: the default quiet NaNs and, optionally, signaling NaNs. A signaling NaN in any arithmetic operation (including numerical comparisons) will cause an "invalid operation" exception to be signaled.
The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type or source of error; but there is no standard for that encoding. In theory, signaling NaNs could be used by a runtime system to flag uninitialized variables, or extend the floating-point numbers with other special values without slowing down the computations with ordinary values, although such extensions are not common.
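NaN propagation is easy to demonstrate. One caveat about the host language: Python's math module raises an exception for some invalid operations (math.sqrt(-1.0) raises ValueError), so this sketch produces its NaN from ∞ − ∞, which Python evaluates without raising.

```python
import math

nan = math.inf - math.inf        # an invalid operation yields a quiet NaN
propagated = (nan + 1.0) * 2.0   # later arithmetic keeps the NaN

print(math.isnan(nan), math.isnan(propagated))
print(nan ** 0)                  # 1.0: the result is the same for any base
```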
IEEE 754 design rationale
William Kahan. A primary architect of the Intel 80x87 floating-point coprocessor and IEEE 754 floating-point standard.
It is a common misconception that the more esoteric features of the IEEE 754 standard discussed here, such as extended formats, NaN, infinities, subnormals etc., are only of interest to numerical analysts, or for advanced numerical applications; in fact the opposite is true: these features are designed to give safe robust defaults for numerically unsophisticated programmers, in addition to supporting sophisticated numerical libraries by experts. The key designer of IEEE 754, William Kahan, notes that it is incorrect to "... [deem] features of IEEE Standard 754 for Binary Floating-Point Arithmetic that ...[are] not appreciated to be features usable by none but numerical experts. The facts are quite the opposite. In 1977 those features were designed into the Intel 8087 to serve the widest possible market... Error-analysis tells us how to design floating-point arithmetic, like IEEE Standard 754, moderately tolerant of well-meaning ignorance among programmers".
 The special values such as infinity and NaN ensure that the floating-point arithmetic is algebraically completed, such that every floating-point operation produces a well-defined result and will not, by default, throw a machine interrupt or trap. Moreover, the choices of special values returned in exceptional cases were designed to give the correct answer in many cases, e.g. continued fractions such as R(z) := 7 − 3/[z − 2 − 1/(z − 7 + 10/[z − 2 − 2/(z − 3)])] will give the correct answer in all inputs under IEEE 754 arithmetic as the potential divide by zero in e.g. R(3) = 4.6 is correctly handled as +infinity and so can be safely ignored. As noted by Kahan, the unhandled trap consecutive to a floating-point to 16-bit integer conversion overflow that caused the loss of an Ariane 5 rocket would not have happened under the default IEEE 754 floating-point policy.
 Subnormal numbers ensure that for finite floating-point numbers x and y, x − y = 0 if and only if x = y, as expected, but which did not hold under earlier floating-point representations.
 On the design rationale of the x87 80-bit format, Kahan notes: "This Extended format is designed to be used, with negligible loss of speed, for all but the simplest arithmetic with float and double operands. For example, it should be used for scratch variables in loops that implement recurrences like polynomial evaluation, scalar products, partial and continued fractions. It often averts premature Over/Underflow or severe local cancellation that can spoil simple algorithms". Computing intermediate results in an extended format with high precision and extended exponent has precedents in the historical practice of scientific calculation and in the design of scientific calculators e.g. Hewlett-Packard's financial calculators performed arithmetic and financial functions to three more significant decimals than they stored or displayed. The implementation of extended precision enabled standard elementary function libraries to be readily developed that normally gave double precision results within one unit in the last place (ULP) at high speed.
 Correct rounding of values to the nearest representable value avoids systematic biases in calculations and slows the growth of errors. Rounding ties to even removes the statistical bias that can occur in adding similar figures.
 Directed rounding was intended as an aid with checking error bounds, for instance in interval arithmetic. It is also used in the implementation of some functions.
 The mathematical basis of the operations enabled high precision multiword arithmetic subroutines to be built relatively easily.
 The single and double precision formats were designed to be easy to sort without using floating-point hardware. Their bits as a two's-complement integer already sort the positives correctly, and the negatives reversed. If that integer is negative, xor with its maximum positive, and the floats are sorted as integers.
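The integer-sorting property in the last item can be sketched as follows. The code reinterprets a double's 64 bits as a signed integer and, for negative values, XORs with the maximum positive integer, as described; sort_key is a hypothetical name, and NaNs are deliberately left out of the sample data.

```python
import struct

def sort_key(x: float) -> int:
    """Map a double to an integer that orders the same way as the float."""
    (i,) = struct.unpack(">q", struct.pack(">d", x))
    if i < 0:
        i ^= 0x7FFFFFFFFFFFFFFF   # flip every bit except the sign
    return i

values = [3.5, -0.0, 1e-310, -2.0, 0.0, float("inf"), -1.0, float("-inf")]
by_bits = sorted(values, key=sort_key)
print(by_bits)
```

Unlike an ordinary floating-point comparison, this key places −0 strictly before +0.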
Other notable floatingpoint formats
In addition to the widely used IEEE 754 standard formats, other floatingpoint formats are used, or have been used, in certain domainspecific areas.
 The Bfloat16 format requires the same amount of memory (16 bits) as the IEEE 754 halfprecision format, but allocates 8 bits to the exponent instead of 5, thus providing the same range as a singleprecision IEEE 754 number. The tradeoff is a reduced precision, as the significand field is reduced from 10 to 7 bits. This format is mainly used in the training of machine learning models, where range is more valuable than precision. Many machine learning accelerators provide hardware support for this format.
 The TensorFloat-32 format combines the strengths of the Bfloat16 and half-precision formats, having 8 bits of exponent like the former and 10 bits of significand field like the latter. This format was introduced by Nvidia, which provides hardware support for it in the Tensor Cores of its GPUs based on the Nvidia Ampere architecture. The drawback of this format is its total size of 19 bits, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format.
Bfloat16 and TensorFloat-32 format specifications, compared with the IEEE 754 half-precision and single-precision standard formats
Type              Sign  Exponent  Significand field  Total bits
Half-precision    1     5         10                 16
Bfloat16          1     8         7                  16
TensorFloat-32    1     8         10                 19
Single-precision  1     8         23                 32
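Because Bfloat16 keeps the sign bit and all 8 exponent bits of single precision, a single-precision value can be narrowed by keeping only its top 16 bits. The sketch below illustrates this with hypothetical helper names; it assumes simple truncation of the low 16 bits, whereas an actual implementation may round to nearest instead.

```python
import math
import struct

def float32_bits(x: float) -> int:
    """Return the 32-bit pattern of x stored as IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def to_bfloat16_bits(x: float) -> int:
    """Keep sign (1), exponent (8) and the top 7 significand-field bits.
    Assumption: the low 16 bits are truncated, not rounded."""
    return float32_bits(x) >> 16

def from_bfloat16_bits(b: int) -> float:
    """Widen a 16-bit Bfloat16 pattern by padding the significand with zeros."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

approx_pi = from_bfloat16_bits(to_bfloat16_bits(math.pi))
huge = from_bfloat16_bits(to_bfloat16_bits(1e38))
print(hex(to_bfloat16_bits(math.pi)), approx_pi, huge)
```

The value 1e38 lies far outside the half-precision range but survives the round trip, with only about two to three significant decimal digits retained.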
Representable numbers, conversion and rounding
By their nature, all numbers expressed in floating-point format are rational numbers with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). Irrational numbers, such as π or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly. For example, the decimal number 123456789 cannot be exactly represented if only eight decimal digits of precision are available (it would be rounded to 123456790 or 123456780, where the rightmost digit 0 is not explicitly stored).
When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, then it will require a conversion before it can be used in that implementation. If the number can be represented exactly in the floating-point format then the conversion is exact. If there is not an exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value from the original, and the value thus adjusted is called the rounded value.
Whether or not a rational number has a terminating expansion depends on the base. For example, in base-10 the number 1/2 has a terminating expansion (0.5) while the number 1/3 does not (0.333...). In base-2 only rationals with denominators that are powers of 2 (such as 1/2 or 3/16) are terminating. Any rational with a denominator that has a prime factor other than 2 will have an infinite binary expansion. This means that numbers which appear to be short and exact when written in decimal format may need to be approximated when converted to binary floating-point. For example, the decimal number 0.1 is not representable in binary floating-point of any finite precision; the exact binary representation would have a "1100" sequence continuing endlessly:
 e = −4; s = 1100110011001100110011001100110011...,
where, as previously, s is the significand and e is the exponent.
When rounded to 24 bits this becomes
 e = −4; s = 110011001100110011001101,
which is actually 0.100000001490116119384765625 in decimal.
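The rounding of 0.1 to 24 bits can be reproduced with exact rational arithmetic. The following is a minimal sketch, assuming round to nearest with ties to even and a positive input; the function name `round_to_bits` is hypothetical and the code ignores exponent limits, subnormals and special values.

```python
from decimal import Decimal
from fractions import Fraction

def round_to_bits(x: Fraction, precision: int = 24):
    """Round a positive rational to `precision` significant bits.
    Returns (significand as integer, exponent e, exact rounded value)."""
    if x <= 0:
        raise ValueError("this sketch handles positive values only")
    # Find e such that 2**e <= x < 2**(e + 1).
    exponent = 0
    while Fraction(2) ** (exponent + 1) <= x:
        exponent += 1
    while Fraction(2) ** exponent > x:
        exponent -= 1
    # Scale so that the integer part holds exactly `precision` bits.
    scale = Fraction(2) ** (precision - 1 - exponent)
    significand = round(x * scale)  # round() on a Fraction rounds ties to even
    value = Fraction(significand) / scale
    if significand == 2 ** precision:  # rounding carried into a new bit
        significand //= 2
        exponent += 1
    return significand, exponent, value

significand, exponent, value = round_to_bits(Fraction(1, 10))
# A 24-bit value converts to a Python float exactly, so Decimal shows all digits.
print(exponent, bin(significand)[2:], Decimal(float(value)))
```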
As a further example, the real number π, represented in binary as an infinite sequence of bits is
 11.0010010000111111011010101000100010000101101000110000100011010011...
but is
 11.0010010000111111011011
when approximated by rounding to a precision of 24 bits.
In binary single-precision floating-point, this is represented as s = 1.10010010000111111011011 with e = 1. This has a decimal value of
 3.1415927410125732421875,
whereas a more accurate approximation of the true value of π is
 3.14159265358979323846264338327950...
The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the discretization error and is limited by the machine epsilon.
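The figures quoted for π can be checked with a short sketch. It assumes that Python's `struct` module rounds to the nearest single-precision value when packing with the `f` format, and it uses the double-precision `math.pi` as a stand-in for the true value; the helper name `to_float32` is hypothetical.

```python
import math
import struct
from decimal import Decimal

def to_float32(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

pi32 = to_float32(math.pi)
relative_error = abs(pi32 - math.pi) / math.pi
half_epsilon = 2.0 ** -24  # half the spacing of single-precision values near 1

print(Decimal(pi32))            # exact decimal value of the stored number
print(relative_error * 1e6)     # discretization error in parts per million
print(relative_error <= half_epsilon)
```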
The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called a unit in the last place (ULP). For example, if there is no representable number lying between the representable numbers 1.45a70c22_{hex} and 1.45a70c24_{hex}, the ULP is 2×16^{−8}, or 2^{−31}. For numbers with a base-2 exponent part of 0, i.e. numbers with an absolute value higher than or equal to 1 but lower than 2, an ULP is exactly 2^{−23} or about 10^{−7} in single precision, and exactly 2^{−52} or about 2×10^{−16} in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of a ULP.
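The spacing between neighbouring values can be measured directly by incrementing the bit pattern of a number by one. This is a minimal sketch for positive finite inputs; the function name `spacing` and its `fmt` parameter are hypothetical, with `"f"` and `"d"` selecting the single- and double-precision layouts of Python's `struct` module.

```python
import struct

def spacing(x: float, fmt: str = "d") -> float:
    """Gap between x (rounded to the chosen format) and the next larger
    representable value. Assumes x is positive and finite."""
    int_fmt = {"f": "<I", "d": "<Q"}[fmt]  # unsigned integer of the same width
    bits = struct.unpack(int_fmt, struct.pack("<" + fmt, x))[0]
    lower = struct.unpack("<" + fmt, struct.pack(int_fmt, bits))[0]
    upper = struct.unpack("<" + fmt, struct.pack(int_fmt, bits + 1))[0]
    return upper - lower

print(spacing(1.0, "f"), spacing(1.0, "d"))  # spacing in [1, 2)
print(spacing(1024.0, "f"))                  # larger magnitude, wider spacing
```

Running it for values of different magnitude shows that the spacing doubles each time the exponent increases by one.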
Rounding modes
Rounding is used when the exact result of a floating-point operation (or a conversion to floating-point format) would need more digits than there are digits in the significand. IEEE 754 requires correct rounding: that is, the rounded result is as if infinitely precise arithmetic were used to compute the value and then rounded (although in implementation only three extra bits are needed to ensure this). There are several different rounding schemes (or rounding modes). Historically, truncation was the typical approach. Since the introduction of IEEE 754, the default method (round to nearest, ties to even, sometimes called Banker's Rounding) is more commonly used. This method rounds the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and gives that representation as the result. In the case of a tie, the value that would make the significand end in an even digit is chosen. The IEEE 754 standard requires the same rounding to be applied to all fundamental algebraic operations, including square root and conversions, when there is a numeric (non-NaN) result. This means that the results of IEEE 754 operations are completely determined in all bits of the result, except for the representation of NaNs. ("Library" functions such as cosine and log are not mandated.)
Alternative rounding options are also available. IEEE 754 specifies the following rounding modes:
 round to nearest, where ties round to the nearest even digit in the required position (the default and by far the most common mode)
 round to nearest, where ties round away from zero (optional for binary floating-point and commonly used in decimal)
 round up (toward +∞; negative results thus round toward zero)
 round down (toward −∞; negative results thus round away from zero)
 round toward zero (truncation; it is similar to the common behavior of float-to-integer conversions, which convert −3.9 to −3 and 3.9 to 3)
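The five modes listed above can be compared side by side. The sketch below uses Python's `decimal` module, which implements decimal rather than binary floating-point, as a convenient stand-in; the mapping of each mode to a `decimal` rounding constant and the helper name `round_to_integer` are choices made for this illustration.

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP,
                     ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN)

# IEEE 754 rounding modes paired with the matching decimal-module constants.
MODES = {
    "nearest, ties to even": ROUND_HALF_EVEN,
    "nearest, ties away from zero": ROUND_HALF_UP,
    "up (toward +infinity)": ROUND_CEILING,
    "down (toward -infinity)": ROUND_FLOOR,
    "toward zero": ROUND_DOWN,
}

def round_to_integer(text: str, mode: str) -> int:
    """Round the decimal literal `text` to an integer using the given mode."""
    return int(Decimal(text).quantize(Decimal("1"), rounding=mode))

SAMPLES = ("2.5", "3.5", "-2.5", "3.9", "-3.9")
for name, mode in MODES.items():
    print(f"{name:30s}", [round_to_integer(v, mode) for v in SAMPLES])
```

The ties 2.5, 3.5 and −2.5 separate the two round-to-nearest variants, while 3.9 and −3.9 separate the three directed modes.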
Alternative modes are useful when the amount of error being introduced must be bounded. Applications that require a bounded error include multiprecision floating-point and interval arithmetic. The alternative rounding modes are also useful in diagnosing numerical instability: if the results of a subroutine vary substantially between rounding toward +∞ and rounding toward −∞, then it is likely numerically unstable and affected by round-off error.
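The idea of bounding a result by computing it twice with opposite directed roundings can be sketched as follows. This is an illustration only: it assumes Python's `decimal` module with a deliberately small precision of 6 digits as a stand-in for hardware rounding modes, and the function name `sum_of_thirds` is hypothetical.

```python
from decimal import Decimal, localcontext, ROUND_FLOOR, ROUND_CEILING

def sum_of_thirds(rounding: str, digits: int = 6) -> Decimal:
    """Compute 1/3 + 1/3 + 1/3 with every operation rounded in one direction."""
    with localcontext() as ctx:
        ctx.prec = digits        # assumed working precision for this sketch
        ctx.rounding = rounding  # applied to the division and both additions
        third = Decimal(1) / Decimal(3)
        return third + third + third

lower = sum_of_thirds(ROUND_FLOOR)    # every step rounded toward -infinity
upper = sum_of_thirds(ROUND_CEILING)  # every step rounded toward +infinity
print(lower, upper, upper - lower)
```

The exact answer, 1, is guaranteed to lie between the two results because all operands here are positive. A narrow gap between them suggests the computation is well behaved at this precision, whereas a wide gap would point to round-off sensitivity.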
Source: Wikipedia, https://en.wikipedia.org/wiki/Floating-point_arithmetic
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.