
Updated for March 2010 Math Library patch (April 2010)


Compilers and math libraries provide operations on floating-point numbers for modeling real-world engineering, scientific, and business problems. A floating-point number is the computer equivalent of scientific notation. For example, a millionth is expressed in scientific notation as 1.0 × 10⁻⁶; the binary floating-point representation is 1.0000110₂ × 2⁻²⁰, rounded to eight binary digits.

This guide describes floating-point arithmetic generally and its implementation in HP-UX on Itanium-based Integrity servers specifically. Programs that perform engineering, scientific, or business calculations depend on facilities covered in this guide. Also, many programs of a less numerical nature require some numerical computation illuminated in this guide. Simply reading in or printing out floating-point numbers may involve rounding to approximate values.

The HP floating-point model for Integrity servers is designed to enable the application programmer, who may not be a numerical expert, to code numerical calculations in a natural way and obtain useful answers and good runtime performance. It also offers expert features for fine control of numerical behavior and for achieving optimal performance.

» Detailed table of contents


»  1  Standards
»  2  Floating-point data
»  3  Floating-point arithmetic
»  4  Floating-point environment
»  5  Mathematical functions
»  6  Performance and behavior
»  7  Using wide types for robust code
»  8  Porting floating-point code
»  9  Examples

Additional information

»  Decimal floating point for
»  Links
»  References

The floating-point model represented in the HP-UX C, C++, and Fortran compilers and libraries enables straightforward porting from other platforms. It is based on the IEEE floating-point standard and supporting specification in the C99 and UNIX 03 standards, and provides other popular features that might be needed for porting.

Chapters 1-9 describe binary floating point. Annex A describes decimal floating point, a new feature introduced in the HP-UX 11i Version 3 September 2008 Update. Much of the general information in Chapters 1-9 pertains to decimal, as well as binary, floating point.

See Reference 27 regarding floating-point on Integrity OpenVMS servers. See Reference 31 regarding floating-point on Integrity NonStop servers.

1 Standards


Two important standards underlie the HP numerical model for Integrity servers:

  • The IEEE floating-point standard
  • The C99 language standard

The IEEE 754-1985 floating-point arithmetic standard (IEEE standard) guarantees the availability and reliability of a core set of arithmetic operations. ISO/IEC C99, which is incorporated into UNIX 03, standardizes programming language support for the features of the floating-point standard and provides an extensive library of numerical functions and complex arithmetic based on the principles of the floating-point standard.

The HP C floating-point arithmetic and math library, as represented by the following headers,

  • <complex.h>
  • <fenv.h>
  • <float.h>
  • <math.h>
  • <tgmath.h>

adheres to the C99 specification, including Annex F (IEEE standard floating-point arithmetic) and Annex G (IEEE standard-compatible complex arithmetic).

HP-UX 11i V3 Release B.11.31 is registered as conforming to the UNIX 03 Product Standard. This entails conformance of the C compiler and libraries to C99.

The ISO C++ standard, which pre-dates C99, is based on C89. The HP C++ compiler and libraries offer the functionality of C99 through extensions to the C++ standard libraries. The next revision of the C++ standard (C++0x) is expected to incorporate the C99 functionality.

The HP Fortran compiler supports the Fortran90/95 feature set. It shares options, optimizations, and math function implementation with the C/C++ system.

The HP compilers and libraries offer IEEE and C99 standards compatibility, even with high levels of optimization. They provide reliable implementation of features of the standard that historically have not been practical for serious program use (even though widely supported in hardware). Additionally, they offer expression evaluation, elementary functions, and complex arithmetic that follow the spirit of the floating-point standard. Notable features related to the floating-point standard include:

  • Well-specified methods for wide expression evaluation, with declarable evaluation types and C type-generic math functions
  • Math functions with special-case behavior similar to IEEE basic operations
  • C/C++ complex arithmetic and functions, with special-case behavior in the spirit of the IEEE standard, including imaginary types for C
  • Correctly rounded binary-decimal conversion between each of four floating-point formats and up to 36 decimal digits, sufficient to distinguish all values in the widest (quad) format
  • C/C++ functions and macros for manipulating rounding modes and exception flags
  • Pragmas and compile options to guarantee reliable rounding modes and exception flags with minimal performance impact
  • Pragmas and compile options to disallow contractions (including synthesis of a multiply and an add into a Fused Multiply Add), and C/C++ fma functions for controlled use of FMAs
  • I/O for IEEE standard special symbols: NaNs (signifying Not-a-Number) and infinities
  • C/C++ NAN and INFINITY macros that can be used for compile-time initialization
  • Maintenance of the sign of zero

2 Floating-point data

2.1 Types

HP C and C++ fully support four IEEE floating-point standard (IEEE standard) types, as shown in Table 1.

Each type also includes IEEE standard infinities (+∞ and -∞), signed zeros (+0 and -0), and NaNs (signifying Not-a-Number). Subnormal numbers are sometimes called denorms.

The C/C++ long double type is equivalent to quad on HP-UX (on both Integrity and PA-RISC systems).

Table 1: Floating-point types

                                Float      Double      Extended    Quad
  bits                          32         64          80          128
  bytes in memory               4          8           16          16
  Binary exponent range
    minimum (emin)              -126       -1022       -16382      -16382
    maximum (emax)              127        1023        16383       16383
  Significand precision
    bits (p)                    24         53          64          113
    decimal digits              7–8        15–16       19–20       34–35
  Decimal range (approx.)
    maximum positive            3.4e+38    1.7e+308    1.1e+4932   1.1e+4932
    minimum positive normal     1.2e-38    2.3e-308    3.4e-4932   3.4e-4932
    minimum positive subnormal  1.5e-45    5.0e-324    3.7e-4951   6.5e-4966
    maximum negative subnormal  -1.5e-45   -5.0e-324   -3.7e-4951  -6.5e-4966
    maximum negative normal     -1.2e-38   -2.3e-308   -3.4e-4932  -3.4e-4932
    minimum negative normal     -3.4e+38   -1.7e+308   -1.1e+4932  -1.1e+4932

2.2 Choosing a data type

Choosing a data type typically involves tradeoffs among
  • precision
  • range
  • speed
  • memory use

The float and double types are almost always adequate for input and output in real-world problems. The diameter of the universe has been estimated to be 1.56e11 light years, which is (only) about 1.5e39 picometers (1 picometer equals 1e-12 meters). At the other extreme, the volume of a carbon atom is approximately 1e-24 cubic centimeters.

Computing with the extra precision and range of a type wider than the type for the final result generally eliminates most accuracy problems and intermediate overflows and underflows. The use of wider types for computation is a relatively simple and effective way to improve program robustness.

On Itanium-based systems, most basic arithmetic operations on float, double, and extended values are performed in floating-point registers and are similarly fast for all types. An exception is division, which is implemented with multiple hardware instructions and is faster for narrower types. Likewise, quad arithmetic operations are implemented with multiple hardware instructions and therefore are not as fast as those of the narrow types. Functions in the libm math library are generally faster for the narrower types (see 5.4).

2.3 Definitions

A binary floating-point format is characterized by its precision, p, and exponent range, emin to emax. A floating-point number in this format is represented by its:
  • sign—0 or 1
  • exponent—an integer such that emin ≤ exponent ≤ emax
  • significand—a number with p bits such that 0 ≤ significand < 2

and has the value

(−1)^sign × significand × 2^exponent

Floating-point representations are ordinarily normalized, meaning that the significand and exponent are adjusted so that the high-order bit of the significand is 1, thus providing maximum significance. For normalized numbers:

1 ≤ significand < 2

For example, in float (single) format, the largest representable number is:

significand = 2 − 2⁻²³ = 1.11111111111111111111111₂
exponent = 127
value = (2 − 2⁻²³) × 2¹²⁷ ≈ 3.403 × 10³⁸

The smallest positive normalized number that can be represented in float format is:

significand = 1 = 1.00000000000000000000000₂
exponent = −126
value = 1 × 2⁻¹²⁶ ≈ 1.175 × 10⁻³⁸

The smallest positive subnormal (denormalized) number that can be represented in float format is:

significand = 2⁻²³ = 0.00000000000000000000001₂
exponent = −126
value = 2⁻²³ × 2⁻¹²⁶ = 2⁻¹⁴⁹ ≈ 1.401 × 10⁻⁴⁵

3 Floating-point arithmetic

The IEEE floating-point standard (IEEE standard) defines these operations for all operands in each format: add, subtract, multiply, divide, square root, remainder, convert to integer in floating format, convert between floating formats, convert between floating and integer formats, convert between floating formats and decimal representations, and compare.

3.1 Floating-point vs. real arithmetic

Computer floating-point numbers model the mathematical real numbers learned in math classes (such as algebra and calculus) and form the basis for our thinking about computational problems. We suppose we are computing with real number arithmetic even though our computer’s floating-point arithmetic is strikingly different. Fortunately, the model nearly always works.

The following sections discuss floating-point arithmetic according to the IEEE standard, which Intel® Itanium® and most other processors of the past two decades support. To help deal with the discrepancies between floating-point and real arithmetic, the standard specifies infinities (+∞ and -∞), signed zeros (+0 and -0), and NaNs (signifying Not-a-Number) to complete the arithmetic: every standard operation has a well-defined result for all input operands in the format.

The standard also specifies signals for exceptional conditions, such as 1/0 or ∞ - ∞. By default, these signals raise flags that the program can use to detect the occurrence of an exception.

3.2 Discrete vs. continuous

Real numbers are continuous, composing an unbroken real number line. Between any two real numbers are infinitely many more. In contrast, floating-point numbers are discrete, with density depending on the particular floating-point format and varying with the magnitude of the values. The distance from a small value to a nearest neighbor is small; the distance from a large value to a nearest neighbor is large. Sufficiently large floating-point numbers are all even integers. Because the floating-point representations are binary with fixed precision, the gaps between floating-point numbers double in width from one power of 2 to the next.

[Illustration: floating-point numbers with only 5 bits of precision]

The illustration shows floating-point numbers with only 5 bits of precision. Hence, there are 2⁴ = 16 numbers for each binade (interval between consecutive powers of 2). The values in the floating-point types are much denser. For example, the double type has 2⁵² = 4503599627370496 values per binade.

3.3 Inexact vs. exact

Arithmetic operations—add, subtract, multiply, and divide—on real operands have results that are real numbers (except for division by zero). One divided by three is precisely the real number denoted by 1/3 = 0.3333…. But most arithmetic operations on floating-point numbers have mathematical results that fall into the gaps between the computer's floating-point numbers, and therefore must suffer a roundoff error in order to be represented in the computer. In other words, results of computer arithmetic are generally an approximation of the mathematical result. Because most calculations involve a chain of operations, inexact intermediate results are reused with the possibility of errors accumulating. To minimize the effects of inevitable roundoff, the arithmetic is designed so that each operation incurs the smallest possible error. When roundoff does occur, the inexact exception is signaled.

3.4 Correct rounding

Correct rounding refers to the method described in the IEEE standard for its basic operations. The result of the operation is the format's best possible approximation to what the result would be if computed exactly (with unbounded range and precision). The standard's default rounding mode, round to nearest, chooses the nearest format value. However, if the mathematical value is halfway between two format values, it chooses the one whose low-order significand bit is 0, to avoid biasing the rounding effects.

The standard also specifies three user-selectable, alternate rounding modes: round upward (toward +∞), round downward (toward −∞), and round toward zero (truncate). The following illustration shows a segment of the real number line that contains three consecutive float (single precision) values: 1 − 2⁻²⁴, 1, and 1 + 2⁻²³. The real number 1 − 2⁻²⁵ is halfway between 1 and the next smaller float value 1 − 2⁻²⁴. The real number 1 + 2⁻²⁴ is halfway between 1 and the next larger float value 1 + 2⁻²³.

[Illustration: a segment of the real number line that contains three consecutive float (single precision) values]

Table 2 shows the interval of real numbers that each rounding mode would round to 1 in the example shown above.

Table 2: Real number rounding

  Rounding mode   Real numbers x that round to 1
  to nearest      1 − 2⁻²⁵ ≤ x ≤ 1 + 2⁻²⁴
  upward          1 − 2⁻²⁴ < x ≤ 1
  downward        1 ≤ x < 1 + 2⁻²³
  toward zero     1 ≤ x < 1 + 2⁻²³

3.5 Decimal vs. binary

Most computers store numbers in binary format and use binary floating-point arithmetic, whereas we modern humans ordinarily represent numbers in decimal notation. Consequently, input and output of values represented with decimal notation entails roundoff errors. For example, the decimal value 0.1 (one tenth) represented in binary is a non-terminating, repeating fraction: 0.00011001100110011…₂, which rounds to 0.000110011001100110011001101₂ in the standard binary single-precision format. The differences between binary and decimal account for many user surprises, such as why the following program fails to compute the sum 1000.0.
      [130] % cat y.c
      #include <stdio.h>
      int main() {
         float sum = 0, tenth = 0.1;
         int i;
         for (i=0; i<10000; i++) sum += tenth;
         printf("sum = %.1f\n", sum);
         return 0;
      }
      [131] % cc y.c
      [132] % a.out
      sum = 999.9

Thinking in terms of decimal numbers, one would expect the program to compute sum to be 1000.0 exactly. However, each time through the loop, the addition of the representation of decimal 0.1 in tenth to the value in sum suffers another float-precision roundoff error; these roundoff errors accumulate so that the final binary value in sum is closer to 999.9 than to 1000.0. To minimize the effects of the discrepancy between the computer’s binary formats and external decimal representations, conversion between binary and decimal is correctly rounded.

See Annex A regarding decimal floating-point formats and arithmetic.

3.6 Finite vs. unbounded range

The real numbers are unbounded: there is no largest or smallest real number. The range of floating-point numbers is ultimately limited by the number of bits in the format allotted to represent the exponent. Table 1 shows the range limitations for the four floating-point formats.

3.7 Overflow

An operation whose finite mathematical result is too large in magnitude for the format is said to overflow. Although it is difficult to find examples of real-world data or result values beyond the range of the double type, overflows are more common in intermediate calculations. For example, computation of the vector norm sqrt(x*x + y*y) can overflow if the exponent of either x or y is merely half of the maximum exponent for the format—even though the value of the norm exceeds the format range only when both x and y are near the limit of the range.

To better handle overflow (and to better model the extended real number system), the IEEE standard augments the floating-point numbers with positive and negative infinities and specifies that an overflow exception is signaled when an overflow occurs.

Computer operations that overflow produce (in default rounding mode) an infinity result with the appropriate sign. Thus, infinities serve to represent values beyond the range that can be represented with floating point, including both extreme magnitude finite numbers and exact infinities. Operations such as 1.0/0.0 and INFINITY * 2.0 produce exact infinities. An overflow is not necessarily an error, because in some calculations an overflow may not matter. For example, 1 + 1/(x*x) will evaluate to 1 for all x whose magnitude is large enough that 1/(x*x) is too small to affect the result when added to 1.

3.8 Division-by-zero

Infinities provide natural results for divisions of non-zero numbers by zero: 1/0 = ∞, -3/0 = -∞, 5/-0 = -∞. Such operations signal the division-by-zero exception. In contrast, the indeterminate division of zero by zero (0/0) yields NaN and signals invalid (not division-by-zero).

Division-by-zero is an example of an infinite result being produced from finite operands. Math functions not covered in the IEEE standard signal division-by-zero to indicate this more general exceptional condition. For example, log(0) returns -∞ and signals division-by-zero.

3.9 Underflow

An operation whose mathematical result is too small in magnitude for the format is said to underflow. Underflow is of less consequence than overflow, though trickier to understand. Underflow signifies that the result may have less precision than is usual for the format or may be zero in lieu of a tiny mathematical value.

On the number line of normal floating-point numbers, notice the relatively enormous gap between 0 and 2^emin (the smallest normal number).

[Illustration: number line of normal floating-point numbers]

There are 2^(p−1) (where p is the format precision) points per binade, the interval between consecutive powers of 2. Thus, the gap between 0 and 2^emin is 2^(p−1) times larger than that between 2^emin and the next larger format value. The IEEE standard fills in the gap around zero with subnormal numbers:

[Illustration: number line filled in using non-normalized representations]

The extra values come from using the non-normalized representations in the format:

(−1)^sign × significand × 2^emin, where 0 < significand < 1

Thus, IEEE arithmetic can return a subnormal value, albeit with less-than-normal precision (as some high-order significand bits must be 0), instead of just ±2^emin or ±0 when the mathematical result is between −2^emin and 2^emin. This is called gradual underflow, in contrast to flush-to-zero or sudden underflow, whereby a zero is returned for all mathematical results between −2^emin and 2^emin. The underflow exception signals that an underflow condition may have produced a result of less-than-normal precision.

IEEE gradual underflow preserves the property

if x ≠ y then x − y ≠ 0

which sudden underflow does not. The following code depends on this property to avoid division by zero:

if (x != y) r = 1 / (x - y);

3.10 Signed zeros

The IEEE standard specifies both positive and negative zeros. These retain the sign after underflows and distinguish the reciprocals of positive and negative infinity. Thus, 1/(1/x) is equivalent to x if x is ±∞ or ±0.

In complex arithmetic computations, signed zeros can be used to distinguish the two sides of branch cuts and thereby enable more efficient code. For more information, see “Example: Complex Arithmetic Classes” in “How Java's Floating-Point Hurts Everyone Everywhere” by W. Kahan and J. Darcy.

The signed zeros compare equal to each other (+0.0 == –0.0). For most ordinary computation, the sign of zero can be ignored.

3.11 Invalid operations

Mathematics texts leave certain operations, such as 0/0, undefined. Sometimes this is because too many numbers would satisfy the defining property (for example, the defining property for 0/0 = q is 0 × q = 0, which is satisfied by any real number q). Other times no number satisfies the defining property (for example, a defining property for sqrt(−3) = y is y² = −3, which is not satisfied by any real number y). A mathematician encountering one of these operations in a calculation will handle the situation as a special case, as in
F(x) = sin(x) / x, if x ≠ 0
F(x) = 1, if x = 0

When a computer program encounters one of these operations, the operation must return something or else stop the program. Stopping is not a viable option for most libraries, including ones that might be used by real-time codes. Returning any ordinary number when many or none would be appropriate would make it difficult for the code to recognize the problem and do anything reasonable. The IEEE standard refers to such operations—where any numerical result would be misleading—as invalid operations and specifies that the result is a NaN, a special symbol indicating Not-a-Number. NaNs help make the floating-point number system algebraically complete: every operation has a well-defined result. Invalid operations signal the invalid exception.

Invalid operations in IEEE standard arithmetic include:

  • ∞ − ∞
  • 0 × ∞
  • 0 / 0
  • ∞ / ∞
  • sqrt(x) where x < 0
  • remainder(x,y) where y is zero or x is infinite
  • conversion of an infinity, NaN, or out-of-range finite floating-point value to an integer type
  • comparison (<, <=, >, >=) where one or both operands is NaN

Note that the comparison operations cannot return a NaN result value, so the invalid signal (flag) is the only indication that program control may have been subverted by a test involving an unexpected NaN.

3.12 Comparisons

The trichotomy law for real numbers states that if a and b are real numbers, then exactly one of the following is true: a<b, a=b, or a>b. This trichotomy law would hold for floating-point numbers, including +∞ and –∞, but not when NaNs are included. If a or b or both are a NaN, then a and b are said to be unordered and a<b, a=b, and a>b are all defined to be false. Note that even a=a is false if a is a NaN. Thus, for the set of values in an IEEE floating-point type, the trichotomy law is replaced by: if a and b are values of the type, then exactly one of the following is true: a<b, a=b, a>b, or a and b are unordered.

NaNs may break codes that do not account for them; for example,

     if (a < b) { /* do something */ }
     else { /* do something else, assuming a >= b */ }

If a or b is a NaN, a<b will be false, so the else block will execute, though the assumption a>=b is false. For this reason, the operators <, <=, >, and >= signal the invalid exception when one or both operands is a NaN.

3.12.1 Quiet comparisons

The “quiet” comparison macros in the <math.h> header produce the same return value as the corresponding built-in comparison operator, but do not signal invalid if an operand is a NaN. Use the corresponding quiet-comparison macro instead of a built-in operator — for example, isless(x,y) instead of x<y — if you know that the use of the comparison accounts for NaN operands, so that an invalid signal would be inappropriate.

3.13 Arithmetic properties

In real number arithmetic, operations have properties we take for granted when thinking about numbers and performing calculations. However, for floating-point, many of these properties do not hold true, or at least require modification.

Table 3: Real vs. floating-point arithmetic

  Closure
    Real: if a and b are real numbers, then so are a + b and a × b;
      a / b is not defined if b = 0.
    Floating-point: in IEEE standard floating-point (though not in earlier
      systems), every operation has a well-defined result, though typically
      not exactly the same as the mathematical result, and possibly a special
      infinity or NaN symbol; a / b is defined even if b = 0 (NaN or ∞ result).

  Commutative properties: a × b = b × a and a + b = b + a
    Valid in floating-point arithmetic.

  Associative and distributive properties: (a + b) + c = a + (b + c),
  (a × b) × c = a × (b × c), and a × (b + c) = a × b + a × c
    Not generally valid in floating-point arithmetic, because of roundoff.

  Inverse properties: a − b = a + (−b), a / b = a × (1 / b),
  and if a − b = 0 then a = b
    The last property is valid with gradual underflow as in IEEE standard
    floating-point, but not with sudden (flush-to-zero) underflow.

  Trichotomy: exactly one of a < b, a = b, a > b is true
    For floating-point, a fourth alternative is possible: a and b are
    unordered.

3.14 Complex arithmetic

C99 introduces complex arithmetic into the C language, specifying:
  • Complex types: float complex, double complex, long double complex (HP C also provides extended complex and quad complex)
  • Imaginary types (stored like corresponding real types): float imaginary, double imaginary, long double imaginary (HP C also provides extended imaginary and quad imaginary)
  • An imaginary unit (whose square equals –1): I (I * I == –1)
  • Efficient promotion rules for binary operations containing a mix of real, complex, and imaginary operands
  • Infinity properties for basic operations, and a related performance pragma
  • A header <complex.h> defining complex, imaginary, and I and declaring complex functions
  • Special cases in the style of the IEEE standard

These C99 features allow more natural and efficient code than is possible with more traditional complex arithmetic facilities, such as those available with C++ and Fortran.

HP C for Integrity servers fully supports the C99 specification, including the portion in Annex G (IEEE-compatible complex arithmetic).

3.14.1 Promotion rules

C99 promotion rules for binary operations do not require that real or imaginary operands be converted to complex. For example, the expression
a(x + y i) + b i

is evaluated as

(a x + a y i) + b i = ax + (ay + b) i

rather than as

(a + 0i)(x + yi) + (0 + bi) = ... = ((ax – 0y) + 0) + ((ay + 0x) + b)i.

An optimizer might remove operations involving the unwanted zeros introduced by more traditional promotion rules; however, such optimizations can yield unexpected result values. For example, the simplification 0·y → 0 is problematic because 0·y is NaN if y is infinite or NaN, and is −0 if y is negative and finite.

3.14.2 Infinity properties

The infinity properties are as follows:

For z nonzero and finite,

  • ∞ × z = ∞
  • ∞ × ∞ = ∞
  • ∞ / z = ∞
  • ∞ / 0 = ∞
  • z / ∞ = 0
  • 0 / ∞ = 0
  • z / 0 = ∞

A complex value is regarded as an infinity if either the real or the imaginary part or both are infinite (this includes a complex value for which one part is infinite and the other part is a NaN). This specification preserves the information that the magnitude is infinite, even when the direction is not meaningful. The complex absolute value function, cabs, is consistent with the definition of complex infinities in that cabs of any complex infinity, for example NaN + ∞ i, returns positive infinity.


4 Floating-point environment

The floating-point environment refers to the mode settings and exception flags described below.

The HP floating-point environment includes a rounding mode (round to nearest, round upward, round downward, or round toward zero); an underflow mode (gradual or flush-to-zero); and a mode for each floating-point exception to disable or enable trapping when the exception occurs. Additionally, the HP floating-point environment includes exception flags (invalid, division-by-zero, overflow, underflow, and inexact) specified by the IEEE floating-point standard and supported in the C99 <fenv.h> header.

The floating-point environment is dynamic with global scope. The program can change modes at runtime and the new mode settings will affect all subsequently-executed operations. Exceptional operations (e.g. 0.0/0.0) set exception flags (at runtime), and the flags remain set until the executing program clears them (thus the flags are “sticky”). Its dynamic, global nature means that care must be taken to correctly maintain and access the floating-point environment without overly constraining optimization; these issues are discussed in Example 9.5. The default mode settings serve most codes.

4.1 Rounding mode

The basic computational model for IEEE floating-point is to compute each result as if with unbounded precision and exponent range and then coerce that mathematical result to the destination format. The rounding mode determines how that conversion to a final result is done. For more information, see 3.4.

In all rounding modes, the sign of the final result retains the sign of the mathematical result. Thus, if the coercion to the destination format underflows to zero, the sign is preserved by the sign of zero.

By default (unless explicitly changed), rounding is to nearest. The final result is the value in the destination format that is closest to the mathematical result. If two format values are equidistant from the mathematical result, then the coercion chooses the one whose least significant bit is 0. If the magnitude of the mathematical result is at least 2^emax × (2 − 2^−p), which is the largest format value plus half a unit in the last place, then the result is an infinity with the sign of the mathematical result.

In round upward mode, the final result is the value in the destination format closest to, but not less than, the mathematical result. If the mathematical result is greater than the largest format value, then the final result is +∞.

In round downward mode, the final result is the value in the destination format closest to, but not greater than, the mathematical result. If the mathematical result is less than the least format value, then the final result is -∞.

In round toward-zero mode, the final result is the value in the destination format closest to, and not greater in magnitude than, the mathematical result.

4.2 Underflow mode

By default, underflow is gradual, as specified by the floating-point standard.

HP provides the alternate sudden (flush-to-zero) underflow mode. This mode significantly improves performance on Itanium-based systems for some codes, typically single-precision ones.

4.3 Trapping modes

By default, trapping is disabled for all floating-point exceptions.

HP provides the alternative of enabling trapping for each floating-point exception. Unless a trap handler is provided, a trapped floating-point exception will terminate the program.

4.4 Exception flags

Each floating-point exception (invalid, division-by-zero, overflow, underflow, inexact) has its own flag. The flag values are non-zero (set) or zero (clear), indicating that the exception has or has not occurred, respectively.

When an exception occurs and trapping on the exception is not enabled (the default), the operation sets the exception flag. Once set, the flag remains set unless or until the program explicitly clears it; thus, the flag indicates whether an exception has occurred since the flag was last cleared (not whether the most recent operation was exceptional).

Functions in the <fenv.h> header clear, test, save, and restore exception flags, individually or collectively.

4.5 Math functions and the floating-point environment

Most of the math library functions—including exponential, logarithm, and trigonometric functions—are not directly covered by the IEEE floating-point standard. The C99 standard specifies exceptions for these functions following the principles of the floating-point standard. (For more information, see 5.7). However, the C99 standard does not specify the effects of alternate rounding modes on these functions. HP math functions follow the C99 standard. Additionally, most HP math functions reflect the rounding mode as shown below:
f(x) rounded down ≤ f(x) rounded to nearest ≤ f(x) rounded up

5 Mathematical functions

The HP libm library for Integrity servers provides mathematical functions for C, C++, and Fortran90. In concert with the compilers, it provides a leading combination of functionality, quality, and performance. With inlining and software pipelining, commonly-used math functions can achieve throughput comparable to hand-tuned vector routines without requiring user code to be written for a vector interface, and with no loss of accuracy or edge-case behavior; for example, the single-precision exponential can exceed 400 million evaluations per second on a 1.5 GHz Itanium system. The math interface for application programming (“API”) encompasses C99, X/Open, and other popular functionality and fully supports four IEEE floating types.

Many of the techniques and algorithms used in the HP implementation of libm math functions for Integrity servers are described in IA-64 and Elementary Functions, Speed and Precision, by Peter Markstein.

For more information about implementing elementary functions, including correctly rounded ones, see Elementary Functions: Algorithms and Implementation, by Jean-Michel Muller.

HP also offers another high-performance math library, MLIB, which contains routines for common math algorithms (not in libm). MLIB has these six components:

  • VECLIB—basic linear algebra (BLAS) and discrete Fourier transforms
  • LAPACK—linear algebra package for the direct solutions of dense systems of equations
  • ScaLAPACK—redesigned LAPACK for distributed-memory parallel computers
  • SOLVERS—direct solvers for sparse linear systems, including graph-partitioning routines
  • VMATH—vector math routines corresponding to widely used libm functions
  • SuperLU-DIST—direct solvers for sparse linear systems for distributed-memory parallel computers

Many of the techniques and algorithms used in the implementation of MLIB are described in Software Optimization for High Performance Computing, by Kevin Wadleigh and Isom Crawford.

The remainder of this chapter covers the libm math library. For more information about MLIB, see http://www.hp.com/go/mlib.

5.1 Usage

In C and C++ programs, include the headers for the libm math library functionality that is used. The C language permits declaring standard functions directly without including the header; however, this practice may result in substantially reduced performance, because the headers contain macros and pragmas that enable special optimizations, such as inlining.

When linking with cc or ld, add the -lm option to link in libm. Linking with aCC or f90 will add the -lm option automatically. The libm library is available as both shared (dynamic) and archive (static) libraries. The default is to link with a shared library. To link with an archive libm, use the -l:libm.a option, instead of -lm.

A correct version of libm for the runtime model (ILP32 or LP64) will be chosen automatically, based on the +DD option or default (+DD32 on HP-UX).

The default versions of the math functions, as well as alternate versions that the compiler may invoke under certain options (for example, +Ofltacc=relaxed), are all in the same libm libraries. There is no need to link with alternate libm libraries.

Some math library patches introduce new functions. If an application uses new functions in a patch and is linked with the shared libm library (which is the default for -lm), then any system the application runs on must have the same or a later Math Library Core patch, or have a later OS version. If the application is linked with the archive libm library (for example, with -l:libm.a), then a system does not require such a patch or later OS version in order to run the application.

See A.3 regarding usage of decimal floating point.

5.2 Quality dimensions

The HP libm math library is designed to excel in all quality dimensions: functionality, speed, accuracy, standards conformance, and careful treatment of special cases.

The remainder of this chapter describes the libm library (including sequences that the compilers inline) in each of these dimensions. For more information about performance and behavior, see Chapter 6.

5.3 Functionality

The HP libm library provides the C functions shown in Table 4, for each of the float, double, extended, and quad floating-point types:

Table 4: HP libm library and the C functions
Function/macro family C double-precision functions
exponentials exp, exp2, exp10, expm1
logarithms log, log2, log10, log1p
trigonometrics cos, sin, tan, cot, sincos
inverse trigonometrics acos, asin, atan, atan2
trigs with degree arguments cosd, sind, tand, cotd, sincosd
inv trigs with degree outputs acosd, asind, atand, atan2d
hyperbolics cosh, sinh, tanh, sinhcosh
inverse hyperbolics acosh, asinh, atanh
square & cube roots, reciprocal square root, hypotenuse sqrt, cbrt, rsqrt, hypot, invsqrt
floating multiply-add fma
power pow/pown/powlln
financial/growth compound, annuity
exponent/significand logb, ilogb, frexp, significand
scaling scalbn, scalbln, scalb, ldexp
rounding to integral floating ceil, floor, trunc, round
IEEE rounding to integral floating rint/nearbyint
“Fortran” rounding to integer lround, llround
IEEE rounding to integer lrint, llrint
integral and fractional parts modf
remainder/mod remainder, remquo, fmod, drem
error functions erf/erfc
gamma functions lgamma/lgamma_r, tgamma, gamma_r
next value nextafter/nexttoward
absolute value, copy sign fabs, copysign
max, min, positive difference fmax, fmin, fdim
Bessel functions of 1st & 2nd kind j0/j1/jn, y0/y1/yn
not-a-number nan
classification isnan, isinf, isfinite, isnormal, signbit, fpclassify, finite
“quiet” comparison isless, islessequal, isgreater, isgreaterequal, isunordered, islessgreater
complex exponential cexp, cexp2, cexp10
complex log clog, clog2, clog10
complex trigs ccos, csin, ctan, cis, cisd
complex inverse trigs cacos, casin, catan
complex hyperbolics ccosh, csinh, ctanh
complex inverse hyperbolics cacosh, casinh, catanh
complex magnitude & modulus cabs, carg
complex square root and power csqrt, cpow
complex parts, conjugate, projection conj, creal, cimag, cproj
exception flags feclearexcept, fetestexcept, feraiseexcept
opaque exception flag state fegetexceptflag, fesetexceptflag
rounding direction modes fegetround, fesetround
FP environment as a whole fegetenv, fesetenv
hiding exceptions feholdexcept, feupdateenv
flush-to-zero mode fegetflushtozero, fesetflushtozero
trap enablement fegettrapenable, fesettrapenable

The extended and quad types are extensions to the C/C++ programming languages. They are defined by the inclusion of one or more of the C headers <math.h>, <float.h>, <stdlib.h>, and <complex.h>. To avoid conflicts in existing programs that use the names for other purposes, the definitions of the extended and quad types, as well as all the functions and macros involving the types, become visible only if the -fpwidetypes compile option is used. (The standard long double nomenclature is always visible.) Illustrating the nomenclature, the <math.h> header declares the following base-e exponential functions:

     float expf(float);
     double exp(double);
     long double expl(long double);
     #ifdef _FPWIDETYPES // defined by -fpwidetypes option
         extended expw(extended);
         quad expq(quad);
     #endif

The <cmath>, <complex>, and <limits> C++ headers provide almost all the functions available with the C headers, with overloaded functions for the extended and quad types. The libm library supplies the underlying implementation for the C++ math functions in these headers.

The Fortran90 compiler uses libm functions for its standard math intrinsics, but does not yet extend the set of intrinsics to all the libm functions and does not support the 80-bit extended type. Example 9.6 shows how to call C functions from Fortran.

The sinhcosh, cot, and cotd functions and float versions of the Bessel functions were added in 11.23 patch PHSS_29678 (and all subsequent Math patches and releases for Integrity servers).

The isnan, finite, drem, invsqrt, gamma_r, significand, cisd, clog2, clog10, cexp2, and cexp10 functions and long double, extended, and quad versions of the Bessel functions, j0, j1, jn, y0, y1, and yn, were added in 11.23 patch PHSS_36344 and 11.31 patch PHSS_36351.

Some of these new interfaces provide functionality already supported in HP-UX under a standard or preferred interface, but were added to facilitate importing legacy code developed on other platforms, particularly with GNU and Intel tool sets:

Table 5: Preferred Interfaces
Legacy Preferred
isnan (function) isnan (generic macro)
finite (function) isfinite (generic macro)
drem remainder
invsqrt rsqrt
gamma_r lgamma_r

The HP-UX <cmath> C++ header does not provide overloaded functions for the legacy interfaces listed in Table 5.

5.3.1 PA-RISC compatibility

The API for C math functions for Integrity servers is a major superset of that for PA-RISC.
  • It has a complete set of functions of types long double, quad, and extended, none of which are available for PA-RISC.
  • It has a complete set of float functions; about half of them are available for PA-RISC.
  • It has these additional functions, not available for PA-RISC, in all floating-point precisions:
    annuity, cis, compound, cot, cotd, exp10, fma, nexttoward, powlln,
    pown, rsqrt, scalbln, significand, sincos, sincosd, sinhcosh, tgamma
  • It has the legacy interfaces in Table 5, which are not available for PA-RISC.
  • It has complex and imaginary functionality, including that specified in C99. The C complex and imaginary types are not supported for PA-RISC.
  • It provides the C99 <tgmath.h> header for type-generic math functions, which is not available for PA-RISC.

For more information about math functions for PA-RISC, see Reference 18.

5.4 Speed

Two measurements are important for assessing math function performance: latency and throughput. Here latency refers to the time required for a library function, invoked by one external call, to deliver its result. Throughput refers to how many evaluations of a function can be completed in a given period of time (or, equivalently, the average number of cycles per evaluation) when the function is called many times in a loop. For throughput, function evaluations may execute concurrently, depending on the capabilities of the compiler and hardware.

HP compilers can inline most of the commonly used math functions (approximately 300 different sequences) and can software pipeline loops, exploiting key features of Itanium processors, to achieve speedups of 4×–10× versus sequential external calls. For details, see Reference 23.

Intel Itanium processors can issue two bundles of three instructions each per cycle, with up to two floating-point instructions per cycle. Latency is typically one cycle for integer instructions and four cycles for floating-point instructions. Full pipelining allows floating-point instructions started in consecutive cycles to execute in parallel (assuming no data dependencies). An inlined sequence of floating-point instructions typically leaves slots available for integer calculation, memory access, and so on. Also, latency between dependent floating-point instructions allows other floating-point calculations to be scheduled for concurrent execution.

Software pipelined loops showcase the benefits of inlining math functions. The Itanium features—predication, rotating registers, loop and epilog count registers, and loop branch types—enable the compiler to schedule multiple loop iterations to execute in parallel.

HP algorithms for math functions strike a balance between providing low latency and being amenable to software pipelining (see Reference 19). The measurements of latency in Table 6 and throughput in Table 8 are achieved with strictly equivalent algorithms; thus, results are identical whether a function is called conventionally or inlined and software pipelined by the compiler. Under +Ofltacc=relaxed, as in Table 7 and Table 9, results of library calls and inlined sequences may differ slightly. Measurements shown in this section are for ILP32 data mode; measurements for LP64 data mode are generally the same.

5.4.1 Latency

The performance of each function in the libm library depends in part on its algorithmic complexity, argument and result precision, and the compiler optimization level under which it is called. For optimization levels below +O2, most libm functions are invoked via call branches to the highly accurate and robust default implementations, which have been optimized for source arguments of usual values (not edge or special cases).

Table 6 displays the minimum latency of such calls for the various precisions of common elementary functions in B.11.23 patch PHSS_40539 and B.11.31 patch PHSS_40540. The data reflect the time from each function call to when the result is available to the caller, assuming that the called function has source operands immediately available for its use in the source registers. Latency units are machine cycles. The tabulated performance levels are reached when the implementations are cache-resident (both instructions and data).

Table 6: Latency cycles on Intel Itanium processors
Function float double extended quad
fabs 6 6 6 3
sqrt 30 41 45 111
exp 36 44 57 258
log 32 35 57 237
cos 49 53 74 288
sin 49 53 73 316
tan 70 81 114 329
acos 30 63 93 316
asin 30 63 93 319
atan 44 53 62 285
atan2 53 68 80 300
cosh 50 48 69 241
sinh 48 61 69 265
tanh 58 75 94 391
rsqrt 26 31 43 117
pow 62 74 105 702

Note that the latency of 6 shown in Table 6 for fabs includes a 2-cycle overhead for the call. The compiler normally replaces a call to fabs with an inlined fmerge instruction (provided the <math.h> header is included), so the latency for fabs is actually only 4.

With +Ofltacc=relaxed, the compiler may invoke faster and slightly less accurate implementations for math functions. Table 7 shows latency measurements.

Table 7: Latency cycles on Intel Itanium processors with +Ofltacc=relaxed
Function float double extended
sqrt 22 37 37
exp 36 39 44
log 32 32 35
cos, sin 49 53 57
tan 70 81 93
acos, asin 29 39 63
cosh 50 47 48
sinh 48 47 63
tanh 58 64 84
rsqrt 22 31 31
pow 59 67 92

5.4.2 Throughput with inlining

The compiler may inline the implementations of common math functions into the caller's code, provided the header that declares the function is included, thus enabling opportunities for performance optimization to exploit the parallelism of Itanium systems. At all optimization levels, the compilers inline very simple functions, such as fabs and copysign. At +O2, they may inline most common functions, including ones shown in Table 8 and Table 9, depending on the compiler's inlining heuristics. At +O3, they may inline several more functions and use more aggressive inlining heuristics. If the inlined function lies within a loop construct, the compiler may be able to software pipeline the loop body, thus affording much greater throughput of function results than that achievable by call branches to function implementations (see also 6.5.3). For example, an inlined expf function in a loop is capable of delivering a result every 3.6 cycles on the average, as opposed to every 36 cycles if the closed routine is called. Inlining and software pipelining in simple loops generally achieve throughput 4×–10× greater than with external calls to libm.

In Table 8 and Table 9, the throughput numbers (cycles per function output) are obtained from executing a loop that evaluates the function on an input vector of length 1024 and stores the outputs to another vector of the same length. The entries of the input vector are selected in such a way that any of the rare cases, e.g. invalid input arguments, or arguments at which the function overflows or underflows, or arguments of huge magnitudes for trigonometric functions, are excluded. The same loop is executed a few times and the minimum number of cycles of a particular loop execution is taken and divided by 1024, the vector length, to arrive at cycles per function output—the throughput number. This number may not be achieved when data are not cache resident. In some cases, the compiler unrolls the body of the loop 2, 4, or even 8 times.

Table 8: Throughput cycles on Intel Itanium processors
Function float double extended
sqrt 5.05 6.64 7.57
exp 3.57 5.58 6.72
log 5.08 6.60 14.12
cos 5.60 7.10 13.10
sin 5.60 7.10 14.09
tan 11.12 13.12 28.15
acos 11.58 11.10 27.12
asin 11.58 11.60 27.15
atan 14.08 18.14 23.15
rsqrt 4.08 6.11 8.09
rint/trunc/AINT 1.57 1.56 2.06
round/ANINT 2.54 2.55 2.56
ceil/floor 2.54 2.55 2.54
NINT 2.04 2.10 N/A

Table 9: Throughput cycles on Intel Itanium processors with +Ofltacc=relaxed
Function float double extended
sqrt 3.03 5.55 5.55
exp 3.57 4.18 5.58
log 4.90 6.08 6.61
cos 5.60 7.10 8.08
sin 5.60 7.10 8.08
tan 11.12 13.12 16.11
acos 9.06 11.10 11.10
asin 9.57 11.60 11.61
atan 8.38 11.13 23.15
rsqrt 3.04 5.05 5.56
pow 10.17 13.30 27.23

It is anticipated that future releases will perform even better due to ongoing performance tuning of implementation algorithms, advances in compiler technology, and faster versions of more functions for use under +Ofltacc=relaxed.

5.5 Accuracy

The HP libm library for Integrity servers has been carefully crafted to provide as much accuracy as possible commensurate with reasonably high performance. Accuracy for each elementary function is quantified by the largest rounding error (relative to the infinitely precise result), when rounding to nearest, measured in units in the last place (ulp). A one ulp rounding error for float precision signifies a relative rounding error of about 2^−24 up to 2^−23. Correct rounding in IEEE round-to-nearest mode implies a rounding error of at most 0.5 ulp.

Table 10 shows accuracy bounds for libm real functions under the default (IEEE round-to-nearest) rounding mode:

Table 10: Accuracy in ULPS
type default relaxed
float 0.502 1
double 0.502 2
extended 0.550 4
quad 0.502 N/A

All of the functions meet these error bounds throughout their domain, except for gamma, Bessel, and extended and quad error functions (erf and erfc).

Table 11 displays bounds for observed errors measured in ulps for various elementary functions, with round-to-nearest rounding mode. The measurements reflect the worst observed error for each function in millions of test cases distributed throughout the function domain.

Table 11: Bounds for observed errors in ULPS
Function family float double extended quad
exponentials .50005 .5004 .53 .5001
logarithms .5002 .501 .506 .5001
trigonometrics .50002 .502 .501 .5002
inverse trigonometrics .502 .502 .53 .5002
hyperbolics .5003 .502 .53 .502
inverse hyperbolics .5003 .502 .5004 .502
cube root, hypotenuse .50002 .501 .502 .5005
reciprocal square root .5005 .501 .500001 .50005
powers .50005 .502 .502 .50004
annuity, compound .50005 .502 .53 .5002

Other libm functions, such as square root and functions that convert to integer formats or integral values, round correctly as specified by either the IEEE standard or the function definitions.

With +Ofltacc=relaxed, the compiler may invoke faster and slightly less accurate implementations for math functions (for more information, see 5.4). Table 12 shows accuracy measurements with +Ofltacc=relaxed. Some entries—for example the float trigonometrics—are identical to the corresponding ones in Table 11 because the compiler does not select a special version for that precision of the function (though it might in the future).

Table 12: Bounds for observed errors in ULPS with +Ofltacc=relaxed
Function family float double extended
square root .9 .500001 .7
exponentials .50005 1.8 1.9
logarithms .57 .65 3.3
trigonometrics .50002 .502 3.4
inverse trigonometrics 1.0 1.0 3.7
hyperbolics .5003 1.8 3.5
inverse hyperbolics .75 1.4 2.5
cube root .50002 .501 2.5
reciprocal square root .7 1.5 2.9
powers .5001 1.5 3.5

5.6 Standards

The HP libm library was designed to support the IEEE floating-point standard and the C99 language standard, including Annexes F and G and Technical Corrigenda 1 and 2.

Additionally, the HP libm library includes all the math functions required by UNIX 98. The +Olibmerrno compile option (implied by -AC89 and -Aa and the default with c89) causes invocation of versions of the math functions that conform to the C89 and UNIX 98 standards. Under +Olibmerrno, <math.h> functions (and C++ <cmath> functions) set errno according to requirements of these standards and to C99's optional errno specification for <math.h> functions, and they return special-case results according to UNIX 98 instead of C99 (Annex F), in the few cases where the standards differ. UNIX 03 aligns with C99.

5.7 Special cases

The HP libm library for Integrity servers carefully adheres to the detailed specification of special cases for <math.h> and <complex.h> functions in Annexes F and G of the C99 standard, which incorporates and extends the IEEE standard treatment of special cases.

Cases are termed “special” because they do not have a clearly best result. Both the IEEE standard and C99 specify that very problematic cases, for example 0.0/0.0, for which any result would be generally misleading, are to return NaN and raise the invalid exception. For other special cases, C99 specifies returning a numerical result, instead of NaN, if that result is useful in some important applications, especially if the diagnostic value of returning a NaN would be dubious. An example is the specification 0^0 = 1, which is discussed in “Branch Cuts for Complex Elementary Functions, or Much Ado About Nothing's Sign Bit”, by W. Kahan.

Careful treatment of special cases by the libm library and the compiler makes program behavior more predictable, which facilitates writing robust, efficient code that handles special cases gracefully. It also makes debugging easier.


6 Floating-point performance and behavior

HP compilers and libm library are designed to provide the best performance possible, while delivering a very high level of overall quality, suitable for a broad range of programs. Overall quality entails predictability, accuracy, and careful treatment of special cases, all of which can affect the correctness and robustness of a program. Often, highest quality comes with at least some performance penalty. However, many important applications need all the speed they can get and are tolerant of less rigorous floating-point behavior, while others need even stricter behavior than is guaranteed by default. To accommodate different needs of different applications, HP compilers provide controls in the form of options and pragmas, allowing programmers who are not satisfied with the default settings to make informed tradeoffs between floating-point performance and behavior. The controls are of three sorts: general optimization controls (6.2), controls for special floating-point functionality (6.3), and controls for trading floating-point behavior for speed (6.4).

6.1 Quick-pick floating-point options

The flowchart in 6.1 leads you through a series of questions to quickly determine which floating-point options will suit your needs. The remaining sections in the chapter explain the options in more detail.



Some compile options have corresponding pragmas that have the same effect. Whereas an option affects the entire compilation unit, the pragma may be applied to particular blocks of code. The pragma may be used just on the code that needs it, without affecting performance or behavior of other code (see Table 13). Beginning with the AR0606 release, +Ocxlimitedrange is the C/C++ default under +Ofltacc=relaxed (and not under any other +Ofltacc setting). See 6.4.1.

Table 13: Options and corresponding pragmas
Option Corresponding pragma
+Ofenvaccess #pragma STDC FENV_ACCESS ON
+Ocxlimitedrange #pragma STDC CX_LIMITED_RANGE ON
+Ofltacc=strict #pragma STDC FP_CONTRACT OFF

6.2 General controls for optimization

General optimization controls include optimization level options (for example, +O3), profiling, link options (for example, -Bprotected), and user assertions (for example, +Onoptrs_to_globals), among others. Many of these options improve performance of floating-point code.

At the +O2 optimization level, the compiler optimizes most libm functions (ones that do not read or write global variables or objects pointed to by the function arguments) like built-in floating-point operators. Also, at this optimization level, the compiler may inline most of the commonly used libm functions, including logarithm, exponential, and trigonometric functions. Inlining can result in particularly dramatic speedup where the function call appears in a loop that the compiler can software pipeline (see 5.4.2).

At the +O3 optimization level, the compiler inlines even more libm functions (approximately 300 different sequences) and performs advanced loop optimization, which greatly improves performance of many floating-point codes.

The general optimization controls have no negative effect on documented floating-point behavior. Hence, these options can be chosen based on other considerations, such as compile time, without concern about sacrificing correctness or robustness of floating-point code.

6.3 Controls for special floating-point functionality

6.3.1 Using extended and quad types

The extended and quad types are provided as extensions to the C and C++ programming languages. They are defined by the inclusion of one or more of the headers <math.h>, <float.h>, <stdlib.h>, and <complex.h>. To avoid conflicts in existing programs that use the names for other purposes, the definitions of the extended and quad types, as well as all the functions and macros involving the types, become visible only if the -fpwidetypes compile option is used. (The standard long double nomenclature is always visible.)

6.3.2 Using floating-point modes and flags

Application code that runs in non-default floating-point modes (for example, a non-default rounding direction) or tests exception flags must either be compiled with the +Ofenvaccess option or else be under the effect of a

     #pragma STDC FENV_ACCESS ON

directive (specified in C99), for reliable behavior. Without one of these controls, the compiler might undermine expected behavior by moving or removing code, for example moving the multiply in

     #include <fenv.h>
     fesetround(FE_UPWARD);
     a = b*c;

to before the fesetround call, or even to compile time if the compiler can determine the values of b and c.

For best performance, #pragma STDC FENV_ACCESS ON should be placed in the smallest blocks that enclose the code needing protection.

6.3.3 Controlling contractions (including FMA synthesis)

Application code that is not tolerant of contractions (which the compiler may synthesize by default) should be compiled with the +Ofltacc=strict option or be under the effect of a

     #pragma STDC FP_CONTRACT OFF

directive (specified in C99). Contractions are optimizations that combine multiple language-level operations into just one machine instruction. For example, in

     float r;
     double x, y, z;
     r = x * y + z;

the multiply, add, and assignment to float can be contracted into one fma.s instruction (the .s completer indicates rounding to single). Contractions are acceptable in most applications, and generally even improve accuracy. However, they can break codes that depend on operations being rounded to specific range and precision, and rarely even ones that do not. In the example of computing log(ab + cd), use of FMA in the calculation of a*b + c*d can yield a negative value, even though ab + cd > 0 if evaluated exactly. This will cause an invalid exception and NaN result from the log function.

On the other hand, controlled use of FMA can enable the design of much more efficient codes for some problems. FMA offers the obvious advantage of computing two operations in the time of one and with just one rounding error. Also, it is particularly effective for capturing extra bits of a calculation.

The following examples illustrate the use of FMA:

» Computing multiplication error
» Computing e^xy accurately and efficiently

6.3.4 Using errno

Codes that depend on <math.h> functions setting errno must be compiled with the +Olibmerrno option. Use of the +Olibmerrno option degrades the performance of libm functions that, as a result of the option, may set errno or return results that differ from the C99 specification.

The IEEE exception flag mechanism functionally supersedes the errno mechanism for math functions. Four exception flags—invalid, divide-by-zero, overflow, and underflow—provide finer identification of the special cases that errno classifies as domain or range errors. Each exception has its own flag, which is not overwritten when a different exception occurs, whereas the errno value is overwritten on various “error” conditions, including ones unrelated to floating-point. Exceptions are raised consistently by both math functions and built-in floating-point operations, whereas only library functions set errno. The hardware exception flags can be accessed with less restriction on optimization than can the global errno variable. The +O[no]libmerrno option does not affect errno setting by non-math functions.

6.4 Controls for tradeoff between speed and floating-point behavior

Other controls, principally the +Ofltacc=strict|default|limited|relaxed compile option, are designed specifically to allow tradeoff between performance and floating-point behavior. Such options should be considered for performance-hungry applications that are known to be tolerant of a less rigorous floating-point model or can be thoroughly tested.

6.4.1 Value-changing optimization

+Ofltacc=default (the default) disallows optimizations that might change result values from those the programming language specifies, except that it allows contractions (including FMA synthesis). This option balances speed and behavior considerations and is suitable for most applications.

+Ofltacc=strict disallows all optimizations that might change result values, including contractions (notably FMA synthesis). This option provides the most predictable floating-point behavior, while admitting compatible optimizations, but loses the performance benefits of contractions. This option is equivalent to the legacy +Ofltacc option.

+Ofltacc=limited is equivalent to the default, except that infinities, NaNs, and the sign of zero may not be treated as specified. This option might provide some minor speedup for applications that do not depend on special values. Note that careful treatment of special values may improve robustness and facilitate debugging, even if the program is not specifically coded with special values in mind.

+Ofltacc=relaxed allows code transformations based on mathematical identities for real numbers, even if the transformations may change result values. With this option, the compiler may invoke slightly less accurate versions of math functions (see 5.4 and 5.5) and may inline math function sequences that yield slightly different result values than their counterparts in the libm library. This option delivers the best performance. Beginning with the AR0606 release, the +Ofltacc=relaxed option implies +Ocxlimitedrange, unless the +Onocxlimitedrange option or #pragma STDC CX_LIMITED_RANGE OFF directive is used. The legacy +Onofltacc option has the same effect as +Ofltacc=relaxed.

6.4.2 Flush-to-zero vs. gradual underflow

The +FPD option (for a link or compile command that produces an executable) causes program startup to install flush-to-zero underflow mode, instead of the default gradual underflow mode. In flush-to-zero mode, floating-point instructions return zero in lieu of non-zero results whose magnitude is less than 2^Emin, the smallest normal magnitude. Some codes are more robust when run with gradual underflow. If tiny values will not be encountered, or will not be significant in the calculation or logic, then flush-to-zero is acceptable.
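
A minimal standard-C sketch of what gradual underflow preserves (it assumes the program runs in the default gradual-underflow mode):

```c
#include <float.h>

/* DBL_MIN is the smallest normal double (2^-1022). Half of it is
   subnormal: nonzero under gradual underflow, but it would become
   0.0 in a flush-to-zero mode. */
double half_smallest_normal(void)
{
    return DBL_MIN * 0.5;
}
```

In a flush-to-zero mode the function would return 0.0, and properties such as x != y implying x - y != 0 would no longer hold for tiny operands.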

On current Itanium-based systems, use of flush-to-zero mode significantly improves performance of some single-precision calculations; it rarely affects the performance of calculations in wider precision.

6.4.3 Limited range for complex operations

Implementations of complex multiply, divide, and absolute value by the usual textbook formulas
(a+bi)(c+di) = (ac−bd) + (bc+ad)i
(a+bi)/(c+di) = ((ac+bd) + (bc−ad)i) / (c² + d²)
|a+bi| = sqrt(a² + b²)

are prone to intermediate over/underflow and tend to return NaN results rather than potentially useful values. The a² + b² in the formula for the complex absolute value overflows if the exponent of either a or b is more than half the maximum exponent for the evaluation type. Using the division formula, (1+0i) / (∞+∞i) yields NaN+NaN i, whereas a zero would be better. However, producing more reasonable results for these cases slows down the operations, even if the troublesome cases are not encountered.

The compile option +O[no]cxlimitedrange and the C99 directive

     #pragma STDC CX_LIMITED_RANGE on-off-switch

allow trading off careful behavior of the complex multiply and divide operations and the cabs functions for performance.

C99 provides the CX_LIMITED_RANGE ON pragma to inform the compiler that the application code does not depend on operands beyond those that could be handled by the textbook formulas. This allows the compiler to deploy faster code for these operations than would be possible if the full range of inputs had to be handled. With HP compilers and libm on Integrity servers, these performance controls are safe for float complex and double complex multiply, divide, and absolute value operations with extreme finite operands, because the implementation avoids intermediate overflow and underflow by using wide evaluation internally.

With the +Onocxlimitedrange compile option and the CX_LIMITED_RANGE OFF pragma, the complex multiply, divide, or absolute value operations avoid spurious overflow and underflow and honor the infinity properties specified in C99.

6.4.4 Fast and faster options

The +Ofast (or -fast) and +Ofaster options imply the +Ofltacc=relaxed and +FPD options, among others not specific to floating-point. The floating-point effects of these options can be overridden: for example, the combination +Ofast +FPd +Ofltacc=default specifies the +Ofast set of options, except with gradual (instead of flush-to-zero) underflow and with default (instead of relaxed) floating-point optimization.

6.5 Other techniques for performance

6.5.1 Inline assembly

HP C and C++ compilers support function-like intrinsics that can be used to inline Itanium instructions into the code stream. This feature is called inline assembly. Inline assembly offers the direct system accessibility of assembly language programming without the burdens of scheduling, register allocation, and predicate management, which are particularly daunting on Itanium. Itanium offers several features designed for floating-point efficiency that are not accessible through high-level language operations and libraries, but are accessible through inline assembly. For example, the FMA instructions take three operands, each in any of the three Itanium-supported floating-point types (float, double, extended), and another operand that determines the type for the rounded result. In the following code, the entire computation of res is effected with one fma instruction, without requiring additional instructions to convert the arguments or the result. The instruction reads the float, double, and extended arguments directly from the floating-point registers and delivers a result rounded to double (precision and exponent range) into the destination floating-point register.
     #include <machine/sys/inline.h>
     float f;
     double d, res;
     extended w;
     res = _Asm_fma(_PC_D, f, d, w);

The implementation of the libm library makes extensive use of inline assembly, and achieves performance comparable to hand-tuned assembly language code.

Programming with inline assembly can be tricky, because certain assumptions valid with standard programming languages do not apply. Also, of course, code containing inline assembly is not portable to other architectures. Programmers who are very familiar with the Itanium architecture might consider inline assembly for performance critical codes intended specifically for Integrity servers.

6.5.2 Estimated frequency pragmas

Numerical codes commonly test for cases that can reasonably be assumed to be rare, such as inputs with extreme or unintended values. The Estimated_Frequency pragma can be used to provide the compiler with an estimate of the relative execution frequency—as a decimal fraction between 0.0 and 1.0—of a conditional block of code, compared to the immediately surrounding block. The compiler uses this information to optimize for paths through the code that the pragma has indicated will be frequent. Opportune use of the pragma for code that is very frequent or very infrequent can improve performance significantly. In the following, the pragma estimates that the code for the usual cases (where x is finite) will be executed 9999 times out of 10000:
     if (isfinite(x)) {
          #pragma Estimated_Frequency 0.9999
          /* code for usual cases */
     }
     else /* x is infinite or NaN */ {
          /* code for rare cases */
     }

Equivalently, the code for the rare cases could have been estimated at 0.0001.

Also, the Estimated_Frequency pragma (with an integral estimate) can be placed in the body of a loop to estimate the trip count for the loop, for example

     for (...) {
         #pragma Estimated_Frequency 1000
         ...
     }

6.5.3 Avoiding C/C++ aliasing problems

The capability of HP compilers to inline math functions and to software pipeline loops makes it easy to implement fast vector functions. For example, the following code implements a vector version of the fmax function:

     #include <math.h>
     void vfmax(int n, const double * a, const double * b, double * y) {
          int i;
          for (i=0; i<n; i++) *y++ = fmax(*a++, *b++);
     }

Given no additional information (and without inter-procedural optimization enabled), the compiler is severely limited in its ability to software pipeline, because it must assume that a, b, and y might point to overlapping storage; hence, it must take care to assure that loads through a and b are not affected by any preceding stores through y. If you inform the compiler that a, b, and y will not point to overlapping storage, the compiler can produce much more efficient code.

With HP C/C++ compilers, you can use the +Onoparmsoverlap option to declare that function parameters do not point to overlapping storage.

Alternatively, you can use the C99 restrict type-qualifier for this purpose, as shown in the following version of code:

     #include <math.h>
     void vfmax(int n, const double * restrict a,
                const double * restrict b, double * restrict y) {
          int i;
          for (i=0; i<n; i++) *y++ = fmax(*a++, *b++);
     }

Compilers that do not yet support the C99 restrict keyword may provide similar functionality under another name. For example, the HP C compiler does support C99 restrict, but also provides __restrict, even in non-C99 modes, and can handle the code above if invoked with the -Drestrict=__restrict option.

6.5.4 Optimizing sum reductions

By default, the HP C/C++ compilers do not re-associate floating-point operations. However, the stakes for re-associating are particularly high in the case of loop summations, called “sum reductions”. For example:
     sum = 0;
     for (i=0; i<n; i++) sum = sum + x[i];

The source code specifies the following calculation order:

     ...(((x[0] + x[1]) + x[2]) + x[3]) + ...

Doing the calculation this way might be crucial, for example, if same-sign values of x had been ordered from least to greatest magnitude to minimize accumulation of rounding errors. However, more often the values of x are not specially ordered, nor does the programmer intend any particular order of summation. The compiler can greatly speed up the calculation of sum reductions by grouping the terms and summing the groups in parallel (tree fashion).
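
The grouping can be written out by hand to see the kind of reordering involved. This standard-C sketch accumulates four interleaved partial sums; the compiler's actual grouping, width, and unroll factor may differ:

```c
/* Left-to-right reduction, as the source code specifies. */
double sum_linear(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Tree-style reduction: four independent partial sums that a
   compiler can schedule in parallel, combined at the end. Results
   may differ from sum_linear in low-order bits. */
double sum_grouped(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```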

When AR0709 and later compilers re-associate terms in a float (single precision) or double sum reduction, they evaluate partial sums to the wider range and precision of the 80-bit extended format, before delivering the final sum in float or double. This greatly reduces variations due to reordering.

HP compilers offer the +Osumreduction option to allow re-association for computing sum reductions specifically, while not licensing the compiler to re-associate generally. The +Osumreduction option is in effect by default with the HP Fortran compiler. (The Fortran default can be overridden with +Onosumreduction.)

7 Using wide types for robust code

HP's floating-point facilities on Integrity servers enable programming techniques for simpler, more robust code. Describing code as more robust means that it delivers useful results for a broader range of inputs.

7.1 Why floating-point calculations fail

Results from floating-point calculations may not be useful because the input is too close to where the problem, or the method chosen to solve the problem, is mathematically ill-posed. A general rule applicable in various problem areas is

error ≈ (precision roundoff) / (distance from ill-posed)

(This rule is discussed in W. Kahan's paper Marketing versus Mathematics in the section entitled "How do Errors get Amplified? By Singularities Near the Data...".)

Proportionality to the precision roundoff implies that using greater precision in the calculation reduces the error.

Also, calculations may fail because of intermediate overflow or underflow, though the desired result would be within range. In these cases, overflow and underflow might be avoided if intermediate calculations use greater exponent range.

7.2 Wide types help

Using a wider type for internal calculations is often an easy and effective strategy for improving robustness. The alternative of careful error analysis might show problems in the calculation that could be circumvented without (or with more local use of) wide precision; however, error analysis for even modest calculations typically proves daunting, even for numerical analysts.

To help with wide evaluation, HP offers two wide types, extended and quad, each with more range and precision than the float (single) and double types typically used for input data and results (see 2.1). The extended type has 11 more bits of precision than double, which generally reduces the number of failures due to precision roundoff by a factor of roughly 2000. The quad type has another 49 bits of precision beyond extended. Both extended and quad provide 4 more exponent bits than double, increasing the exponent range by a factor of 16, which is usually sufficient to eliminate overflow and underflow in calculations on float or double inputs. The extended type is particularly attractive in this context because extended arithmetic on Itanium systems is as fast as float or double. Even quad might be considered for performance-sensitive applications that need more precision than extended for a small number of operations and functions. Elementary functions for extended are about 0.7 times as fast as corresponding ones for double when both are calls to libm, rather than inlined. Elementary functions for quad are about 0.25 times as fast as corresponding ones for extended.

Wide floating-point types are a notable advantage for Itanium-based systems over most RISC systems. RISC systems typically have no hardware support for types wider than double. They may provide the IEEE quad type implemented in software, or they may provide a non-IEEE, double-double type, using two double representations for the head and tail of a value. The double-double type is problematic because its exponent range is no wider than double, its precision depends on the value being represented, and regular treatment for edge cases is difficult. In contrast, Itanium-based systems have full hardware support for the extended type, and have features that have been used to implement substantially faster IEEE quad functions than RISC systems can offer.

7.3 Extended and quad nomenclatures

HP C/C++ on Integrity servers provide several features that facilitate using the wide types. First, they support nomenclature for the various programming constructs involving extended and quad. The following prints the extended and quad logarithms of 1/3 of the maximum extended and quad values:
     /* compiled with: cc -fpwidetypes */
     #include <stdio.h>
     #include <math.h>
     #include <float.h>
     int main ( ) {
         printf("%.20hLe\n", logw((1.0W/3.0W) * EXT_MAX));
         printf("%.35lLe\n", logq((1.0Q/3.0Q) * QUAD_MAX));
         return 0;
     }

The code illustrates C I/O specifiers, function suffixes, constant suffixes, and macros for the extended and quad types. Note that use of the nomenclature requires the -fpwidetypes option.

7.4 Option for automatic wide evaluation

The C/C++ compile option

     -fpeval=float|double|extended

causes evaluation of narrower precision binary operations and floating constants to the type specified in the option.

  • -fpeval=float—binary operations and floating constants are evaluated to their semantic type
  • -fpeval=double—binary operations and floating constants of type float are evaluated to double
  • -fpeval=extended—binary operations and floating constants of type float or double are evaluated to extended

The evaluation methods for -fpeval=float and -fpeval=double adhere to the C99 specification (corresponding to FLT_EVAL_METHOD = 0 and FLT_EVAL_METHOD = 1, respectively). The evaluation method for -fpeval=extended follows the C99 specification for evaluation to long double, except that evaluation is to extended instead of long double. HP’s default is -fpeval=float (consistent with PA-RISC compilers).

The irregular use of wide types by some compilers (for example, narrowing wide intermediate results when registers are “spilled”) has unfairly blemished the reputation of wide evaluation. HP compilers do wide evaluation based on the well-defined methods in C99. Results are just as predictable with wide evaluation as with any other method.

Note: HP C/aC++ Patches PHSS_39823 (for 11i v2) and PHSS_39824 (for 11i v3), June 2009, provided some important defect repairs for automatic wide evaluation. (See HP-UX Patch data base.)

7.5 Evaluation types: float_t and double_t

The <math.h> header defines the C99 types float_t and double_t, which are the types used for evaluating float and double operations and floating constants. For example, by default float_t is float and double_t is double, but under -fpeval=extended both are extended. The float_t and double_t types can be used for declaring intermediate variables that retain the extra range and precision in wide-evaluated expressions (see Example 9.2.2).
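
For example, an accumulator declared double_t automatically picks up whatever extra width the evaluation method provides, while remaining valid (and simply evaluating in double) under the default method. A sketch:

```c
#include <math.h>

/* Dot product with a double_t accumulator: under -fpeval=extended
   (FLT_EVAL_METHOD 2), partial sums carry extended range and
   precision; under the default method they are plain double. */
double dot(const double *a, const double *b, int n)
{
    double_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (double_t)a[i] * b[i];
    return (double)sum;   /* one final rounding to the result type */
}
```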

7.6 Type-generic math functions

When the C99 <tgmath.h> header is included, an unsuffixed math function takes on a type determined by its arguments. The mechanism supports all floating types. This feature is similar to C++ overloading and Fortran intrinsics, but restricted to system library functions of floating type. For example:
     /* compiled with: cc -fpwidetypes */
     #include <tgmath.h>
     extended x, y;
     float complex z, r;
     y = log(x); // equivalent to logw(x)
     r = log(z); // equivalent to clogf(z)

Example 9.2 illustrates the use of wide types and their supporting features to improve robustness.

8 Porting floating-point code

The floating-point programming model is not the same on all systems. Programs that compile and run correctly on one system may fail to compile or may produce different results on another. The most common reasons related to floating point are differences in expression evaluation, optimization, floating-point formats, functionality, and treatment of special cases.

The HP floating-point programming model for Integrity servers is designed to facilitate importing code—most code will port easily and will work at least as well as on other systems.

8.1 Expression evaluation

The “expression evaluation method” refers to the way the compiler determines the format to which floating-point operations are rounded.

Some compilers evaluate each floating-point operation based strictly on the types of the operands. This method is sometimes referred to as “to-type”. C99 implementations that define FLT_EVAL_METHOD to be 0 use this method.

Other compilers evaluate floating-point expressions in types that have more exponent range and/or precision than the operands. C99 specifies two, fully predictable, wide evaluation methods, which the implementation announces by defining FLT_EVAL_METHOD to be 1 or 2, according to whether its minimal evaluation type is double or long double. (Historically, use of wide types has been irregular and unpredictable from the program source, with narrowing conversions occurring simply because the compiler runs out of registers.)

The usual effect of differences in expression evaluation is that floating-point results differ in a few low-order bits. These minor deviations in results are usually not significant, though they commonly cause last-digit differences in printed output. When the method of expression evaluation does matter, wide evaluation is more forgiving. Its extra precision may avoid troublesome accumulation of roundoff errors, particularly in numerically unstable code. Its extra exponent range may avoid intermediate overflow and underflow—calculations with the single-precision type are most susceptible. Thus, programs compiled with wide evaluation and used successfully may fail when ported to a system whose compiler uses narrower evaluation. Porting from narrow to wider evaluation is less risky (ignoring historical irregular use of wide types).

Though uncommon, some codes depend on rounding to a particular precision and may fail totally if operations are rounded to a precision other than the one intended—even if rounded to more precision. Ideally, such code should contain explicit constructs to assure the requisite rounding regardless of the expression evaluation method.

HP C/C++ compilers provide three expression evaluation methods that help with porting risks. To minimize expression evaluation problems when porting code, choose an evaluation width on the port-to system that matches or exceeds the evaluation width used on the port-from system. For example, -fpeval=extended is a good choice for code ported to HP Integrity servers from IA32 systems where compilers commonly evaluate to the width, or at least the exponent range, of the IA32 floating-point registers (which use the 80-bit IEEE extended format).
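
The difference between narrow and wide evaluation can be written out explicitly in standard C. In this sketch (the function names are ours), the float product overflows under to-type evaluation, while carrying the intermediate in double, as FLT_EVAL_METHOD 1 or 2 or HP's -fpeval=double would do automatically, avoids the spurious overflow:

```c
#include <math.h>

/* To-type (narrow) evaluation: the float product can overflow to
   infinity even though the final quotient is representable. */
float quotient_narrow(float f, float g, float h)
{
    float p = f * g;
    return p / h;
}

/* Wide evaluation written out by hand: the intermediate product is
   carried in double, whose exponent range absorbs the overflow. */
float quotient_wide(float f, float g, float h)
{
    double p = (double)f * g;
    return (float)(p / h);
}
```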

8.2 Optimization

Some optimization of floating-point code may change results from what the source code and underlying computer arithmetic would imply. Examples include:
  •  contractions (including FMA synthesis)
  •  re-association
  •  replacing division by multiplication of the reciprocal
  •  distributing and factoring
  •  use of faster but less accurate sequences for built-in functions
  •  flush-to-zero underflow

Most compilers perform result-changing optimizations, at least under some options. When porting code, refrain from enabling (non-default) value-changing optimizations until the code is up and running correctly on the new system. A more cautious strategy is to disable all value-changing optimizations for this initial stage of the port.

8.2.1 Contractions

Contractions alter results by omitting rounding errors. They are the only value-changing optimizations that C99 permits and the only ones that HP C and C++ compilers do by default. They are suitable for most applications. However, contractions do alter results in ways not predictable from the source code and different compilers use different methods to decide what to contract. Therefore, contractions are not suitable where last-bit reproducibility or specified rounding is strictly required.

8.2.2 Re-association

If re-association is allowed, the compiler may replace (a + b) + c with a + (b + c) and (a * b) * c with a * (b * c) where more efficient object code might result. Note that C and C++ specify addition and multiplication to be “left associative”, so that a + b + c (without parentheses) is equivalent to (a + b) + c. It is easy to find examples where the associative properties fail for floating-point numbers, and re-association subverts some useful coding techniques. For example, if a exceeds b in magnitude, then the rounding error e in a + b can be computed as
     s = a + b;
     e = (s - a) - b;

provided the compiler does not re-associate the calculation of e and compute s - (a + b), which would evaluate to zero (instead of the rounding error).
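
This technique, often called Fast2Sum, can be packaged so the evaluation order is defended explicitly. In this sketch, volatile is used as one portable way to discourage re-association; the function name is ours:

```c
/* Fast2Sum: for |a| >= |b|, s receives the rounded sum and e the
   exact rounding error, so that e == s - (a + b) exactly. The
   volatile qualifiers discourage the compiler from re-associating
   the critical subtractions. */
void fast_two_sum(double a, double b, double *s, double *e)
{
    volatile double t = a + b;
    volatile double u = t - a;   /* exact when |a| >= |b| */
    *s = t;
    *e = u - b;
}
```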

Most codes are written without concern for the subtleties of floating-point rounding, and so (a + b) + c is as likely to be useful as a + (b + c), even though the two may well differ in their low-order bits.

HP compilers may re-associate under the (non-default) +Ofltacc=relaxed option. See 6.5.4 regarding re-association in the computation of sum reductions.

8.2.3 Replacing division by multiplication of the reciprocal

The expressions a/b and a*(1/b) are not quite equivalent for floating-point numbers. Multiplication is faster than division on Itanium (and many other) processors. Thus, replacing a/b with a*(1/b) can speed up code if b is loop-invariant so that 1/b can be computed outside the loop, or if b is a constant and 1/b can be computed at compile time.
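
The inequivalence is easy to exhibit: 1/b is rounded before the multiply, so a*(1/b) can land an ulp away from the correctly rounded a/b. A standard-C sketch that counts mismatches over a small grid of integer operands:

```c
/* Counts (a, b) pairs in a small integer grid for which a/b and
   a*(1/b) deliver different double results. The count is nonzero:
   the two forms are not equivalent. volatile keeps the compiler
   from folding 1.0/b back into a division. */
int count_mismatches(void)
{
    int diffs = 0;
    for (int b = 3; b < 100; b += 2) {
        volatile double rb = 1.0 / (double)b;
        for (int a = 1; a < 100; a++) {
            volatile double q1 = (double)a / (double)b;
            volatile double q2 = (double)a * rb;
            if (q1 != q2)
                diffs++;
        }
    }
    return diffs;
}
```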

HP compilers may replace division by multiplication of the reciprocal under the +Ofltacc=relaxed option. They can achieve much of the benefit of this optimization even under +Ofltacc=strict (which disallows all value-changing optimizations) because the Itanium instruction set facilitates fully correcting the optimized result with only a modest cost in performance.

8.2.4 Distributing and factoring

The distributive law a*(b + c) == a*b + a*c does not hold for floating-point arithmetic. Particularly bad is the case where b and c are of nearly equal magnitude but opposite sign. Then the replacement of a*(b + c) by a*b + a*c replaces a cancellation involving exact values b and c, which is exact, with cancellation of inexact values a*b and a*c, which may lose precision. For example, if double precision a = 3, b = 2⁵² + 1, and c = −2⁵², then a*(b + c) evaluates to 3 exactly; however, a × b is 3 × 2⁵² + 3, which rounds to 3 × 2⁵² + 4 in double precision, and so a*b + a*c evaluates to 4, a 33% error.
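
The example can be checked directly in C99; hexadecimal floating constants make the powers of two exact, and the products are kept in separate statements so no contraction interferes:

```c
/* a = 3, b = 2^52 + 1, c = -2^52: a*(b + c) is exactly 3, but a*b
   rounds up to 3*2^52 + 4, so a*b + a*c comes out as 4. */
double via_factored(void)
{
    double a = 3.0, b = 0x1p52 + 1.0, c = -0x1p52;
    return a * (b + c);
}

double via_distributed(void)
{
    double a = 3.0, b = 0x1p52 + 1.0, c = -0x1p52;
    double ab = a * b;   /* rounds: 3*2^52 + 3 -> 3*2^52 + 4 */
    double ac = a * c;   /* exact */
    return ab + ac;
}
```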

HP compilers may replace a*b + a*c by a*(b + c) under the +Ofltacc=relaxed option.

8.2.5 Use of faster but less accurate sequences for built-in functions

Most system math functions are not correctly rounded for all domain values; their results vary from system to system, a little or a lot depending on the quality of the implementations. HP's and other very high-quality implementations are nearly correctly rounded: they achieve the ideal result in most cases and miss only where the mathematical result is close to halfway between the ideal and delivered results. Two such implementations will produce mostly identical results, and otherwise results that differ only in the low-order bit.

HP compilers may invoke faster, slightly less accurate math functions under the +Ofltacc=relaxed option. Even the “relaxed” functions are accurate by the standards of common industry practice.

8.2.6 Flush-to-zero underflow

Intel Itanium processor systems, among others, require extra processing time to handle subnormal values in the manner prescribed by the IEEE floating-point standard. Therefore, they provide a mode that trades off the benefits of gradual underflow in order to improve performance by “flushing” subnormal result values to zero.

Occasionally, a program that runs correctly with gradual underflow fails under flush-to-zero. More rarely does a program’s correctness depend on having flush-to-zero behavior. Also, flush-to-zero modes do not behave the same on all systems—some flush both subnormal operands and subnormal results to zero and others flush only results.

HP-UX provides gradual underflow by default, a linker option +FPD to install flush-to-zero mode at program startup, and an <fenv.h> function fesetflushtozero to set flush-to-zero mode at execution time. Flush-to-zero mode on Itanium systems flushes subnormal results but not operands to zero. On PA-RISC systems, it flushes subnormal operands, as well as results, to zero.

8.3 Floating-point formats

Today, most implementations use the IEEE standard single and double formats for the standard float (real) and double (real*8) types. However, differences abound in the support of wider floating-point types. The C/C++ long double format is required only to be at least as wide as the double one. Formats in use for long double include:
  • 128-bit IEEE double extended
  • 80-bit IEEE double extended
  • 64-bit IEEE double
  • “double-double”—a head-tail representation using a pair of IEEE doubles

Faithful porting is easiest between systems using the same formats for the same types. Generally, having port-to formats that are at least as wide as the port-from formats helps preserve correctness. However, performance-sensitive code that uses the long double type where just a little extra precision or range is needed might be better served by faster 80-bit arithmetic than slower 128-bit arithmetic.

Of course, a port-to platform that uses the double format for the long double type will not be adequate for code that depends on long double having extra precision or range.

The (non-IEEE) double-double format provides more precision than double, but is problematic because it has no extra exponent range beyond double, it has no fixed precision, and it does not support regular treatment of special cases.

HP-UX uses a 128-bit IEEE double-extended format for long double, which allows importing long double code without loss of precision or exponent range. The HP C/C++ compilers for Integrity servers also fully support both 80- and 128-bit types under the names extended and quad.

8.4 Functionality

Not all systems offer the same functionality. Missing types or interfaces usually cause compiler diagnostics when programs are ported. Behavior differences between the port-from and port-to platforms may cause more subtle problems that can be detected only through extensive testing or use.

Standards help. Older standards such as C89, UNIX 95, and Fortran90 are widely implemented. The C99/UNIX 03 standard offers extensive functionality for numerical programming, including support for the IEEE floating-point standard, and is mostly—if not entirely—available for major systems.

HP-UX for Integrity servers provides the standard math-related interfaces as described in Chapter 1, as well as other frequently used (non-standard) math functions.

8.5 Treatment of special cases

Special cases (for example, division by zero, overflow, and so forth) are inherent in floating-point arithmetic and robust code must handle them in a reasonable way—deliver a useful result or an intelligible diagnostic. The IEEE standard provides special values (infinity and NaN) and exception flags (invalid, overflow, underflow, and division-by-zero) to help deal with special cases. However, historically, programming language and library support for these features has been irregular, incomplete, and inconsistent from one system to the next. Thus programs, particularly ones that need to be ported, have been unable to use the features.

The C99/UNIX 03 support for the IEEE standard, including the features for handling special cases, is now becoming widely available. HP C compilers and libraries fully adhere to the C99 specification.

With the +Olibmerrno option, HP math functions set errno according to the UNIX 95 standard and the optional errno specification in C99. Additionally, they return special-case results according to UNIX 95 instead of C99, in the few cases where the standards differ. Because of the performance costs, this option should be used only if truly required.

9 Examples

9.1 Example: Using wide accumulators

Wide accumulators are effective for improving accuracy and avoiding intermediate overflow and underflow. Following are three coding approaches to using wide accumulators that have different implications for compatibility, quality, and performance.

9.1.1 Using automatic wide evaluation

The following code computes the double precision dot product of vectors x and y:
     #include <math.h>
     double ddot(int n, const double * x, 
                  const double * y) {
          double_t sum = 0;
          int i;
          for (i=0; i<n; i++) 
               sum += (*x++) * (*y++);
          return sum;
     }

If expressions are automatically evaluated to the 80-bit extended format, as under HP’s “-fpeval=extended -fpwidetypes” options, then double_t is extended (per C99), and the extra 11 bits of significance in extended vs. double will reduce the accumulated roundoff error by a factor of about 2000—a factor of 2 for each extra bit.

The following computes the double-precision 2-norm of the vector x:

     #include <tgmath.h>
     double dnorm2(int n, const double * x) {
          double_t sum = 0;
          int i;
          for (i=0; i <n; i++, x++) 
               sum += (*x) * (*x);
          return sqrt(sum);
     }

If expressions are automatically evaluated to the 80-bit extended format, as under HP’s “-fpeval=extended -fpwidetypes” options, then double_t is extended (per C99), the type-generic sqrt is actually the extended sqrtw, and the extra exponent range of extended versus double essentially eliminates the possibility of intermediate overflow or underflow. Without wide evaluation, this code will fail if all x components are too small (so their squares underflow significantly) or at least one is too large (so its square overflows).

On Itanium systems, arithmetic with extended is essentially as fast as with double; therefore, with HP compilers for Integrity servers, any performance cost for the quality benefits of using extended is negligible in these examples. Moreover, without wide evaluation, substantially more complicated code is required for comparable quality. Robust dnorm2 code for a system that does not support a wider type typically involves scaling.

The codes above will compile and run on any C99 implementation. However, the described benefits depend on evaluation to a format with more precision and range than double, and C99 does not require that implementations provide such evaluation.

9.1.2 Using long double

The following codes use the long double type explicitly for internal computation; hence, their quality does not rely on the implementation to provide wide evaluation, but only on long double being wider than double. The performance cost is significant on HP-UX and other systems that implement long double as quad.
     double ddot(int n, const double * x, const double * y) {
          long double sum = 0;
          int i;
          for (i=0; i<n; i++) sum += (long double)(*x++) * (*y++);
          return sum;
     }

     #include <math.h>
     double dnorm2(int n, const double * x) {
          long double sum = 0;
          int i;
          for (i=0; i<n; i++, x++) sum += (long double)(*x) * (*x);
          return sqrtl(sum);
     }

9.1.3 Using extended

For HP and other systems that support the same nomenclature for fast extended types, the codes can be written explicitly for extended evaluation, thus assuring both quality and performance, without relying on implicit wide evaluation.
     #include <float.h>
     double ddot(int n, const double * x, const double * y) {
          extended sum = 0;
          int i;
          for (i=0; i<n; i++) sum += (extended)(*x++) * (*y++);
          return sum;
     }
     #include <math.h>
     double dnorm2(int n, const double * x) {
          extended sum = 0;
          int i;
          for (i=0; i<n; i++, x++) sum += (extended)(*x) * (*x);
          return sqrtw(sum);
     }

With HP compilers, these codes must be compiled with the +fpwidetypes option and ddot must include <float.h> or another header that defines the extended type. Table 14 summarizes the three approaches:

Table 14: Comparison of approaches for wide accumulation
Approach | Portability | Quality | High performance
Using automatic wide evaluation | Yes (assuming C99) | Yes for HP; requires support for evaluation wider than double | Yes for HP; requires fast wide evaluation type
Using long double | Yes | Yes for HP; requires long double wider than double | No for HP; requires fast long double
Using extended | Requires extended nomenclature | Yes | Yes for HP; requires fast extended

9.1.4 Sum reduction

The +Osumreduction option is appropriate for the examples above (unless the data has been ordered to control roundoff or last-bit reproducibility is required) and improves performance substantially by allowing the compiler to evaluate parts of the sum concurrently. Wide evaluation hides most of the result variability from summing in different orders.

9.2 Example: Using wide types for robust log(ab + cd)

The problem is to calculate log(ab + cd) where double is sufficient to represent the inputs a, b, c, and d, and ab + cd is known to be non-negative. Here is a straightforward solution using double:
     #include <math.h>
     double a, b, c, d, res;
     double s;
     s = a*b + c*d;
     res = log(s);

However, this code is problematic. First, a log function is necessarily unstable where a*b + c*d is inexact and near 1. For example, the calculation loses all significant bits if a*b = 1 and |c*d| < 2^-53, because then a*b + c*d yields 1 and log of 1 returns 0, whereas log(1 + x) is approximately x when x is small.

Also, a*b + c*d can lose all significant bits from cancellation if a*b and c*d are nearly negatives of each other and one or both are inexact.

Finally, the computation of a*b + c*d can overflow or underflow the double format, though the true value of log(ab + cd) is always safely in the double range.

9.2.1 Wide expression evaluation

Simply compiling with -fpeval=extended makes the code in 9.2 somewhat more robust. The effect is that a*b + c*d is computed in extended, retaining 11 extra bits in the products, which greatly reduces the number of possible inputs that would cause harmful cancellation. However, the assignment to s still yields a double value. The log arguments near 1 are not represented with any extra precision. And, the extended sum a*b + c*d can overflow or underflow when coerced to the range of double.

9.2.2 Wide local variables and type-generic math functions

The following code to calculate log(ab + cd) addresses the three robustness problems noted in 9.2:
     cc -fpwidetypes -fpeval=extended ...
     #include <tgmath.h>
     double a, b, c, d, res;
     double_t s;
     s = a*b + c*d;
     res = log(s);

Here the -fpeval=extended option causes double_t to be defined as extended and the inclusion of <tgmath.h> causes the reference to log to be type generic. Thus, s is computed fully in extended, the extended log function is used, and coercion to double occurs only in the final assignment to double. With this version, the likelihood of encountering each of the two precision problems has been reduced by a factor of roughly 2000, and the risk of intermediate overflow and underflow has been completely eliminated.

Note that writing

     res = log(a*b + c*d);

does not suffice for fully wide calculation, because the semantic type of the argument, not its evaluation format, determines the type of the generic log function. (This specification was chosen for C99 so that enabling wide evaluation would not automatically incur the extra performance cost of wider library functions.)

9.3 Example: Using FMA to simplify algorithms

9.3.1 Computing multiplication error

Fused Multiply-Add can be used to compute the exact roundoff error e in the product x * y:
     p = x * y;
     e = fma(x, y, -p);

Note that the error cannot be computed reliably with

     e = x * y - p;

Under +Ofltacc=default and more lenient options, the compiler is allowed, but not required, to synthesize FMAs. The compiler might evaluate x * y - p with a multiply and a subtract, instead of with an FMA, which would always produce zero instead of the error in p.

On HP systems, the fma function (for float, double, and extended) is implemented with an inlined Intel Itanium Floating-point Multiply Add instruction. The fma function, together with the facility to disable compiler synthesis of FMA, provides the control needed for algorithm development (see 6.3.3). The compiler may generate a Floating-point Multiply Subtract instruction for fma(x, y, -p). Likewise, the compiler may generate a Floating-point Negative Multiply Add instruction for a call such as fma(-x, y, z).

9.3.2 Computing e^xy accurately and efficiently

This example shows the analysis behind the use of FMA to compute e^xy to extended precision, accurately and efficiently, without use of a wider type. Consider this straightforward program:
     extended x, y, r;
     r = expw(x*y);

The relative error in the exponential function is proportional to the absolute error in the argument. The rounded result of x*y can be in error by as much as about 2^-50 (which is a unit in the last place of 2^14, the largest value for which expw does not overflow). An absolute error of 2^-50 in x*y induces a relative error of about 2^-50 in expw. Thus, the low order 14 bits of the 64-bit result significand may be corrupted.

The product xy can be written as the exact sum high + low, where high is the computed x*y and low is the error, which fmaw(x,y,-x*y) computes exactly. Because

e^low = 1 + low + low^2/2 + ..., where |low| < 2^-50

we have the approximation

e^xy = e^(high+low) = e^high × e^low ≈ e^high × (1 + low) = e^high + low × e^high

Computation of e^high produces, at most, just slightly more than .5 ulp error and the multiply-add produces, at most, .5 ulp error. Therefore, the following code calculates e^xy with, at most, just slightly more than 1 ulp error.

     #include <math.h>
     extended x, y, r, high, low;
     high = x*y;
     low = fmaw(x,y,-high);
     r = expw(high);
     r = fmaw(r,low,r);

9.4 Example: Double rounding error

The function

     float ssqf (float x, float y) { 
         return (double)x * x + (double)y * y;
     }

discussed in 9.5.3 does not always round correctly. The double format is enough wider than float that both products are exact, but the sum can produce a double precision rounding error and the conversion to float can produce a second rounding error. In most cases, the double precision rounding error is inconsequential to the float result, but not always. For example, if x = 1 + 2^-12 and y = 2^-27, then ssqf returns the value 1 + 2^-11, though the correctly rounded float sum of squares would be 1 + 2^-11 + 2^-23. An error incurred as a result of rounding to a wider format and then rounding again to a narrower format is sometimes referred to as a “double rounding error.” The following uses an HP inline assembly intrinsic to round the sum directly to single precision, thereby avoiding the double rounding error and producing a correctly rounded result:

     #include <machine/sys/inline.h> 
     float crssqf (float x, float y) {
          return _Asm_fadd(_PC_S, (double)x * x, (double)y * y);
     }

Either implementation could use FMA to save an operation, though this would have no effect on the numerical result, since the products are computed exactly with or without FMA. Here is an implementation equivalent to ssqf but using the double fma function:

     #include <math.h>
     float ssqf (float x, float y) { 
          return fma(x, x, (double)y * y);
     }

9.5 Example: Managing the floating-point environment

The dynamic, global nature of the control modes in the floating-point environment is potentially hazardous in that a function might be called with different modes in effect than the function was designed and tested for. The dynamic, global nature of the exception flags is potentially confusing (or worse) because exceptions that may be spurious byproducts of a function’s internal calculations are globally visible to the function’s callers. For HP and other systems that support trap enable modes, the management of modes and the management of flags are closely related, because exception signals are affected by the modes and affect the flags.

To help deal with the hazards of the floating-point environment, C99 specifies the following programming conventions:

  A. A function call does not alter its caller’s floating-point control modes, clear its caller’s floating-point status flags, nor depend on the state of its caller’s floating-point status flags, unless the function is so documented.
  B. A function call is assumed to require default floating-point control modes, unless its documentation promises otherwise.
  C. A function call is assumed to have the potential for raising floating-point exceptions, unless its documentation promises otherwise.

There are three basic approaches a function writer might take for dealing with the floating-point environment:

  1. Ignoring the floating-point environment (9.5.1)
  2. Guarding against alternate modes, signaling deserved exceptions (9.5.2)
  3. Honoring the floating-point environment (9.5.3)

All three approaches are consistent with the C99 programming conventions.

9.5.1 Ignoring the floating-point environment

Because of programming convention (B), most functions can be written and tested, assuming the default to-nearest rounding and disabled traps, without particular concern for how the function would behave if called when alternate modes are in effect. If called (contrary to the convention) with a different rounding mode, then such a function might produce errors that are somewhat different than expected; if called (contrary to the convention) with traps enabled, then whether traps occur might not be predictable from the documentation of the function.

As anticipated by programming convention (C), exceptions may be signaled simply as a natural byproduct of floating-point operations. This will automatically flag problematic conditions, though some of these exceptions may be spurious in the sense that they do not convey anything meaningful about the result values of the function (see the example in 9.5.3).

Runtime performance is not an issue with this approach (9.5.1).

9.5.2 Guarding against alternate modes, signaling deserved exceptions

Facilities in the <fenv.h> header can be used to assure that a function will not be affected by changes to the dynamic modes. The following shows a basic scheme:
     #include <fenv.h>
     ... func(...) {
          #pragma STDC FENV_ACCESS ON
          fenv_t caller_env;
          float result;
          fegetenv(&caller_env);  // save caller's environment
          fesetenv(FE_DFL_ENV);   // install default environment
          ... // calculate result
          fesetenv(&caller_env);  // restore caller's environment
          return result;
     }

Once installed, the default environment assures that default control modes are in effect for the function’s calculation. The save and restore of the caller’s environment ensures that the call does not perturb the caller’s control modes, in accord with programming convention (A).

One objection to this version of the scheme is that any floating-point exceptions signaled by the function are not detectable outside the function. Restoring the caller’s environment includes restoring the caller’s exception flag state, thus losing information about exceptions signaled by the called function. This objection can be addressed by replacing the last fesetenv (that restores the caller’s environment) with feupdateenv. This, too, restores the caller’s environment, but also then re-signals (in the caller’s environment) any exceptions whose flags were raised by the function’s calculation.

As intended, this scheme assures that the function will produce identical results—regardless of what modes are in effect when the function is called. However, this constancy might defeat an attempt to use the sudden underflow mode to improve performance. It might also undermine the technique of running and comparing results under each of the rounding modes to assess numerical stability, as in Example 9.9.

See 6.3.2 regarding the performance implications of using #pragma STDC FENV_ACCESS ON and functions that access the floating-point environment.

9.5.3 Honoring the floating-point environment

Ideally, functions would honor modes and signal exceptions just like the basic floating-point operations do. High-quality library functions approach this ideal. Such functions are implemented to behave in a manner consistent with whatever modes are in effect when the function is called.

Honoring trap enablement modes means taking enabled traps when and only when the function, considered as an atomic operation, encounters the corresponding exceptional condition. The following function computes the float sum of two squares:

     float ssqf (float x, float y) { 
          return (double)x * x + (double)y * y; 

Since the code evaluates to double precision, the calculation of the sum of squares cannot overflow or underflow. Any overflow or underflow in the conversion to float for the return value is appropriate to signal (because then the sum of squares really is too large or small for the float format). This code also correctly signals the other exceptions, even inexact.

For functions that produce correctly rounded results, the rounding modes have precisely defined effects. For other functions, the directed rounding modes (upward, downward, toward zero) have the general meaning that the delivered result is in the specified direction (upward, downward, toward zero) from the mathematical result. The ssqf function does not always produce a correctly rounded result, but it does honor the general meaning of the directed rounding modes. Most calculations do not naturally honor the directed rounding modes, and implementing them to do so can be a considerable challenge. Note that to round the expression f(x) - g(x) in one direction requires rounding g(x) in the opposite direction.

Coding to signal only those exceptions that are deserved is typically much trickier than the preceding example might suggest, especially if wide evaluation is not an option, as even the simple calculation of x*x + y*y below illustrates. Without wide evaluation, a product x*x or y*y might underflow even though the sum of the products is not small. The following computes a long double sum of squares and hides any underflow signals that occur in the calculation of a final result that is not subnormal or zero:

     #include <fenv.h>
     #include <float.h>
     long double ssql(long double x, long double y) { 
          #pragma STDC FENV_ACCESS ON
          fenv_t caller_env;
          long double res;
          feholdexcept(&caller_env);   // save caller environment,
                                       // disable all traps, and
                                       // clear all flags
          res = x*x + y*y;
          if (!isless(res, LDBL_MIN)) feclearexcept(FE_UNDERFLOW);
          feupdateenv(&caller_env);    // restore caller environment,
                                       // then re-signal exceptions
          return res;
     }

The feholdexcept and feupdateenv functions work together to control the exceptions that the function exposes to its callers. The pragma is required in order to restrict the optimizer from moving the arithmetic instructions across the instructions that effect the environment changes. The !isless construct is used to handle NaN inputs in a reasonable way: the alluring alternative

     if (res >= LDBL_MIN) feclearexcept(FE_UNDERFLOW);

would erroneously signal invalid if x or y is a (quiet) NaN and would erroneously fail to clear the underflow flag in the case that one of x or y is a NaN and the square of the other underflows.

See 6.3.2 regarding the performance implications of using #pragma STDC FENV_ACCESS ON and functions that access the floating-point environment.

9.6 Example: Calling C library functions from Fortran

To call a C math library function from a Fortran program, do the following:
  • Use an !$HP$ ALIAS directive to tell the compiler that the function’s arguments are passed by value.
  • Declare the function with the correct return type; the C return type is given with the function name in Table 4.
  • Link in the libm library. Linking with f90 does this automatically.

The following Fortran program calls j0, one of the Bessel functions in the C math library:

     C  bessel.f
     !$HP$ ALIAS J0 = 'j0'(%VAL)
             PROGRAM BESSEL
             DOUBLE PRECISION A, B, J0
             A = 1.0
             B = J0(A)
             WRITE(*,*)"Bessel of", A, "is", B
             END

The %VAL argument indicates that the argument is passed by value. For details on the !$HP$ ALIAS directive, see the HP Fortran Programmer's Reference or the HP Fortran Programmer's Guide.

Compiling and running the program shows the following:

     % f90 bessel.f
        program BESSEL
     10 Lines Compiled
     % a.out
      Bessel of 1.0 is 0.765197686557967

9.7 Example: Using C99 complex arithmetic

C99 features allow for more natural and efficient codes than are possible with more traditional complex arithmetic facilities. The following simple example illustrates:

Let z = x + y i, where x and y are real, and compute

w = (z - i) / (z + 2i) if (z - i) / (z + 2i) lies inside the unit circle;
w = 0 otherwise.

The code might contain

     #include <complex.h>
     double x, y;
     double complex z, w;
     z = x + y*I;
     w = (z - I) / (z + 2*I);
     if (cabs(w) > 1) w = 0;

A complex value is constructed from the real variables x and y by means of the natural notation x + y*I, which the compiler implements without the use of actual floating-point instructions. The quotient (z - I) / (z + 2*I) too is expressed in a mathematical style notation. Because I has imaginary type and the promotion rules do not require real and imaginary operands to be converted to complex, the numerator z - I and denominator z + 2*I each can be evaluated with just one real floating-point instruction. In contrast, more traditional promotion rules require that the product 2*I be promoted to the complex product (2 + 0i) × (0 + 1i), implying four real multiplies and two adds. Thus, C99 features for complex arithmetic allow natural, mathematical style notation and entail built-in execution-time efficiency.

C99 infinity properties guarantee that evaluation of the quotient (z - I) / (z + 2*I) at the pole z = -2i will yield a complex infinity. Without the infinity property, nonzero / zero = infinite, the quotient in this case might yield NaN + NaN i (and would, if computed with the ordinary division formula), and then cabs(w) > 1 would be false (and would signal the invalid exception), and the result from the code above would be incorrect. Without the infinity property, the programmer would have to add extra code to test for z = -2i and handle that case separately. Thus, C99 features for complex arithmetic facilitate simpler algorithms.

9.8 Example: Precise representation for floating-point constants and data

Some numerical codes depend on floating-point constants or data being precise to the last bit. A careful algorithm may require precise values to preserve mathematical properties, or to match assumptions used in error analysis. Precise values may be necessary for results that are reproducible with other programming or execution environments. Historically, creating such constants has been problematic, forcing programmers to resort to mechanisms like type casts and unions, which might not be portable. Two C99 features help solve this problem:
  • hexadecimal floating-point representation
  • correct rounding between binary floating types and decimal character sequences

9.8.1 Hexadecimal floating-point representation

A number in a floating type can be printed with hexadecimal representation to obtain a character sequence that can be incorporated into a source constant or ASCII input datum that will result in a value equivalent to the original number. For example, to produce a hexadecimal constant equal to the float value sqrtf(2), execute
     #include <stdio.h>
     #include <math.h>
     int main() { printf("%a\n", sqrtf(2)); }

to obtain

     0x1.6a09e6p+0

This character sequence can be incorporated into a float constant, 0x1.6a09e6p0F, or used as an input datum, to obtain the desired value. (Trailing zeros in the significand are optional.)

A hexadecimal floating constant produced as in the example above results in the same numerical value even if compiled using a wider evaluation method, which is not generally true for decimal floating constants (see 9.8.3). Likewise a hexadecimal floating-point ASCII datum will result in the same value if scanned into a wider floating-point object.

9.8.2 Correct rounding between binary floating types and decimal character sequences

C99 Annex F requires correct rounding for conversions between IEEE floating types and decimal character sequences with at most DECIMAL_DIG significant digits. DECIMAL_DIG is defined in <float.h> to be a number of decimal digits sufficient to distinguish all numbers in the widest supported IEEE binary floating type. On HP-UX, DECIMAL_DIG is 36, which is sufficient to distinguish all long double (quad) numbers.

Correct rounding implies a useful fact for devising decimal floating constants or ASCII input data: a number in a (binary) floating-point type can be converted to a decimal character sequence that can be converted back to recover the original number, provided the decimal character sequence contains enough significant digits. Table 15 shows the requisite number of significant decimal digits for each HP-UX floating type.

Table 15: Number of digits to distinguish floating point numbers
HP-UX floating type Significant decimal digits
float 9
double 17
extended 21
long double (quad) 36

To produce a decimal floating constant equal to the float value sqrtf(2), execute

     #include <stdio.h>
     #include <math.h>
     int main() { printf("%.8e\n", sqrtf(2)); } 
     // 9 significant digits

to obtain

     1.41421354e+00

This character sequence can be incorporated into a float constant, 1.41421354F, to obtain the desired result, provided a wide evaluation method is not in effect. Similarly, 1.41421354 will produce the desired result when converted into an object of type float, but not wider.

9.8.3 Wide evaluation effects on floating decimal constants

Wide evaluation methods widen the evaluation format for floating constants of a narrower type, which generally produces a different value. For example, if FP_EVAL_METHOD == 0, the HP-UX default, then constant 0.1F has the float value nearest to one tenth. But if FP_EVAL_METHOD == 1, as under the HP-UX option -fpeval=double, then 0.1F has the double value nearest to one tenth, which is a more accurate representation of one tenth. The more accurate wider value is generally more useful, though not if an exact narrower value is required.

Floating decimal constants can be chosen to produce the same numerical value even when wide evaluation renders them in a wider format. The floating constants in <float.h> are chosen this way: thus, for example, FLT_MAX will have the maximum finite value in the float type regardless of the evaluation method. Alternatively, floating decimal constants can be chosen to take advantage of wide evaluation to produce a more accurate representation of a mathematical value. Examples below show both approaches.

To obtain a floating decimal constant that would not be affected by wider evaluation methods, use enough decimal digits to distinguish the desired value in the widest evaluation format. On HP-UX, the widest evaluation format is extended. Table 15 above shows that 21 decimal digits distinguish extended numbers. The following examples use 36 decimal digits, which are enough to distinguish even quad numbers.

To produce a float constant equal to the float value sqrtf(2), for any evaluation format precision up to quad, execute

     #include <stdio.h>
     #include <math.h>
     int main() { printf("%.35e\n", sqrtf(2)); } // 36 significant digits

to obtain

     1.41421353816986083984375000000000000e+00

The float constant 1.41421353816986083984375F serves the purpose. As the trailing zeros might suggest, the float value sqrtf(2) is exactly 1.41421353816986083984375.

Typically, a more accurate representation of a mathematical value is more helpful than a less accurate one, and having a value that is independent of the evaluation method is not necessary in most cases. In the following code, sqrtl(2) produces the best possible long double (quad) approximation of √2 (sqrtl is a correctly rounded function). Printing with 36 significant decimal digits produces a decimal character sequence that will convert back to this same best long double approximation.

     #include <stdio.h>
     #include <math.h>
     int main() { printf("%.35Le\n", sqrtl(2)); }

     1.41421356237309504880168872420969798e+00

1.41421356237309504880168872420969798F is a floating constant, of semantic type float, that will evaluate to an approximation of √2 accurate to the full precision of the evaluation format.

An L instead of F suffix would give the constant the semantic type long double. Then the constant would produce the best long double approximation of √2 regardless of the evaluation method. However, a long double constant might cause operations and functions to be evaluated to long double (even with narrow evaluation methods), which might have undesirable performance implications.

In rare cases, use of extra digits to express a floating decimal constant results in a very slightly less accurate value. This effect is similar to the double rounding errors that can occur with wide evaluation. See 9.4.

9.8.4 Constants for mathematical expressions

In the examples above, the square root functions were used to calculate correctly-rounded approximations of √2 as values in floating types. These values were printed to obtain floating decimal constants or data that would result in precise approximations of √2. A program might require floating constants representing more complicated expressions involving several operations and function values. The calculation of such an expression may suffer multiple round-off errors. Also, most math library functions are not correctly rounded for all inputs (see 5.5). The accuracy of the calculated expression, hence of the printed decimal character sequence, may be exceedingly difficult to estimate. In many cases, calculating to wider precision will produce a value that can be printed to obtain a constant or datum, for a narrower type, that is correctly rounded, or very nearly. See 7.

9.9 Example: Technique for assessing rounding-off

As noted in 3.3, round-off error is inherent in floating-point arithmetic. In Reference 9, W. Kahan discusses various schemes for assessing the effects of round-off. He argues that only thorough error analysis can be foolproof, but notes that error analysis for real-life problems is generally not practical. Among all approaches, he deems the following the most cost effective: “Repeat the computation in arithmetic of the same precision but rounded differently, say Down, and then Up, and maybe Towards Zero too, and compare three or four results.” Inordinately different results suggest that the problem, or the method of computation, is susceptible to serious roundoff error.

To support this technique for assessing the effects of roundoff, HP provides a compile/link option to install any of the IEEE rounding modes at program startup, as shown in Table 16:

Table 16: Pre-setting the rounding mode
Rounding Option
to nearest +FP RN
upward +FP RU
downward +FP RD
toward zero +FP RZ

Note that the +FP option affects the link and has no effect on -c compilations that produce only an object file. (The +FP option can also be used to pre-enable trapping or sudden underflow mode.)

To demonstrate the technique, consider the following program which computes the roots of the quadratic equation 1.2x^2 - 1.1x - 10^-7 = 0 using the familiar quadratic formula:

     #include <math.h>
     #include <stdio.h>
     int main() {
             float a = 1.2, b = -1.1, c = -1e-7, r1, r2;
             r1 = (-b + sqrtf(b*b - 4*a*c)) / (2*a);
             r2 = (-b - sqrtf(b*b - 4*a*c)) / (2*a);
             printf("r1       = %g\n", r1);
             printf("r2       = %g\n", r2);
     }

Building and running the program, pre-enabling each of the rounding modes in turn, produces

     % cc unstable.c -c
     % cc unstable.o -lm +FP RN ; a.out 
     r1       = 0.916667
     r2       = -9.93411e-08
     % cc unstable.o -lm +FP RU ; a.out 
     r1       = 0.916667
     r2       = -1.49011e-07
     % cc unstable.o -lm +FP RD ; a.out 
     r1       = 0.916666
     r2       = -4.96706e-08
     % cc unstable.o -lm +FP RZ ; a.out 
     r1       = 0.916666
     r2       = -4.96705e-08

The extreme inconsistency of r2 suggests a serious round-off problem. The next program uses a method that is known to be stable for computing the roots:

     #include <math.h>
     #include <stdio.h>
     int main() {
             float a = 1.2, b = -1.1, c = -1e-7, r1, r2;
             r1 = (-b - copysignf(sqrtf(b*b - 4*a*c), b)) / (2*a);
             r2 = c/(a*r1);
             printf("r1       = %g\n", r1);
             printf("r2       = %g\n", r2);
             return 0;
     }

With this program, varying the rounding modes has little effect on the results:

     % cc stable.c -c
     % cc stable.o -lm +FP RN ; a.out 
     r1       = 0.916667
     r2       = -9.09091e-08
     % cc stable.o -lm +FP RU ; a.out 
     r1       = 0.916667
     r2       = -9.0909e-08
     % cc stable.o -lm +FP RD ; a.out 
     r1       = 0.916666
     r2       = -9.09092e-08
     % cc stable.o -lm +FP RZ ; a.out 
     r1       = 0.916666
     r2       = -9.0909e-08

(Note that the numerically unstable quadratic formula computed an erroneous r2 in all rounding modes.)

Examples 9.1 and 9.2 show how use of wider types for intermediate computation can reduce the effects of roundoff errors. The next program employs the unstable quadratic formula, but attempts to avoid the roundoff problems by using wider precision:

     #include <tgmath.h>
     #include <stdio.h>
     int main() {
             float a = 1.2, b = -1.1, c = -1e-7, r1, r2;
             r1 = (-b + sqrt((float_t)(b*b - 4*a*c))) / (2*a);
             r2 = (-b - sqrt((float_t)(b*b - 4*a*c))) / (2*a);
             printf("r1       = %g\n", r1);
             printf("r2       = %g\n", r2);
             return 0;
     }

The (float_t) casts are necessary so that the type-generic sqrt will invoke a square root routine of the evaluation type. In this example, evaluation to double (using HP's -fpeval=double option) successfully reduces roundoff effects:

     % cc stable2.c -c
     % cc stable2.o -lm -fpeval=double +FP RN ; a.out 
     r1       = 0.916667
     r2       = -9.09091e-08
     % cc stable2.o -lm -fpeval=double +FP RU ; a.out 
     r1       = 0.916667
     r2       = -9.0909e-08
     % cc stable2.o -lm -fpeval=double +FP RD ; a.out 
     r1       = 0.916666
     r2       = -9.09091e-08
     % cc stable2.o -lm -fpeval=double +FP RZ ; a.out 
     r1       = 0.916666
     r2       = -9.0909e-08

As shown in the examples above, to change the start-up rounding mode, it suffices to relink with the appropriate +FP option; recompiling is not required.

»  Return to top


Links

1. HP's Mathematical Software Library (MLIB)
2. Intel® Itanium® Processor 9000 sequence
3. ISO/IEC JTC1/SC22/WG14 - C (homepage for the C standard)
4. Jean-Michel Muller's homepage
5. Prof. William Kahan's homepage
6. The UNIX® System Home Page


References

1. Accuracy and Stability of Numerical Algorithms, by N. Higham, Society for Industrial and Applied Mathematics (SIAM), 1996
2. ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic
3. ANSI/ISO/IEC 1539-1, 1997, Information Technology - Programming Languages - Fortran (Fortran 90)
4. ANSI/ISO/IEC 14882:1998, Programming languages - C++
5. ANSI X3.159-1989, Programming Language C (C89)
6. “Branch Cuts for Complex Elementary Functions, or Much Ado About Nothing's Sign Bit,” by W. Kahan, in The State of the Art in Numerical Analysis (eds. Iserles and Powell), Clarendon Press, Oxford, 1987
7. “The Computation of Transcendental Functions on the IA-64 Architecture,” by J. Harrison, T. Kubaska, S. Story, and P. Tang, Intel Technology Journal, Q4, 1999
8. Elementary Functions: Algorithms and Implementation, by J-M. Muller, 2nd ed., Birkhäuser, Boston, 2006
9. “How Futile are Mindless Assessments of Roundoff in Floating-Point Computation?” by W. Kahan
10. “How JAVA's Floating-Point Hurts Everyone Everywhere,” by W. Kahan
11. HP C/ANSI C Developer's Bundle, Hewlett-Packard Company
12. HP aC++, Hewlett-Packard Company
13. “HP Compilers for Itanium-based HP-UX,” Hewlett-Packard Company
14. HP Fortran compilers, Hewlett-Packard Company
15. HP Fortran Programmer's Guide, Hewlett-Packard Company
16. HP Fortran Programmer's Reference, Hewlett-Packard Company
17. HP MLIB User's Guide, Hewlett-Packard Company
18. HP-UX floating-point guide (HP-UX 11.0), Hewlett-Packard Company
19. IA-64 and Elementary Functions: Speed and Precision, by P. Markstein, Prentice Hall PTR, 2000
20. “IA-64 Floating-Point Operations and the IEEE Standard for Binary Floating-Point Arithmetic,” by M. Cornea-Hasegan and B. Norin, Intel Technology Journal, Q4, 1999
21. IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems (previously designated IEC 559:1989)
22. “Inline assembly for Itanium-based HP-UX,” Hewlett-Packard Company
23. “Inlining of Mathematical Functions in HP-UX for Itanium® 2,” by J. Thomas, Proceedings of the 2003 CGO, The International Symposium on Code Generation and Optimization, IEEE Computer Society, 2003
24. Intel® Itanium® Architecture, Software Developer's Manual, Intel Corporation
25. ISO/IEC C - Approved standards (C99)
26. “Marketing versus Mathematics,” by W. Kahan
27. “OpenVMS floating-point arithmetic on the Intel Itanium architecture,” Hewlett-Packard Company
28. The Single UNIX® Specification, Version 2 (UNIX 95), The Open Group, 1997
29. The Single UNIX® Specification, Version 3 (UNIX 03), The Open Group, 2002
30. Software Optimization for High Performance Computing, by K. Wadleigh and I. Crawford, Prentice Hall PTR, 2000
31. C/C++ Programmer's Guide and Common Run-Time Environment (CRE) Programmer's Guide, NonStop Technical Library, Hewlett-Packard Company
32. Decimal floating point for HP-UX, Hewlett-Packard Company
