



Updated for March 2010 Math Library patch (April 2010) 



Overview 


Compilers and math libraries provide operations on floating-point numbers for modeling real-world engineering, scientific, and business problems. A floating-point number is the computer equivalent of scientific notation. For example, a millionth is expressed in scientific notation as 1.0 × 10^{−6}; the binary floating-point representation is 1.0000110_{2} × 2^{−20}, rounded to eight binary digits.
This guide describes floating-point arithmetic generally and its implementation in HP-UX on Itanium-based Integrity servers specifically. Programs that perform engineering, scientific, or business calculations depend on facilities covered in this guide. Also, many programs of a less numerical nature require some of the numerical computation illuminated in this guide: simply reading in or printing out floating-point numbers may involve rounding to approximate values.
The HP floating-point model for Integrity servers is designed to enable the application programmer, who may not be a numerical expert, to code numerical calculations in a natural way and obtain useful answers and good run-time performance. It also offers expert features for fine control of numerical behavior and for achieving optimal performance.



The floating-point model represented in the HP-UX C, C++, and Fortran compilers and libraries enables straightforward porting from other platforms. It is based on the IEEE floating-point standard and the supporting specification in the C99 and UNIX 03 standards, and provides other popular features that might be needed for porting.
Chapters 1–9 describe binary floating point. Annex A describes decimal floating point, a new feature introduced in the HP-UX 11i Version 3 September 2008 Update. Much of the general information in Chapters 1–9 pertains to decimal, as well as binary, floating point.
See Reference 27 regarding floating point on Integrity OpenVMS servers. See Reference 31 regarding floating point on Integrity NonStop servers.


Two important standards underlie the HP numerical model for Integrity servers:
 The IEEE floating-point standard
 The C99 language standard
The IEEE 754-1985 floating-point arithmetic standard (IEEE standard) guarantees the availability and reliability of a core set of arithmetic operations. ISO/IEC C99, which is incorporated into UNIX 03, standardizes programming language support for the features of the floating-point standard and provides an extensive library of numerical functions and complex arithmetic based on the principles of the floating-point standard.
The HP C floating-point arithmetic and math library, as represented by the following headers,
<complex.h>
<fenv.h>
<float.h>
<math.h>
<tgmath.h>
adheres to the C99 specification, including Annex F (IEEE standard floating-point arithmetic) and Annex G (IEEE standard-compatible complex arithmetic).
HP-UX 11i V3 Release B.11.31 is registered as conforming to the UNIX 03 Product Standard. This entails conformance of the C compiler and libraries to C99.
The ISO C++ standard, which predates C99, is based on C89. The HP C++ compiler and libraries offer the functionality of C99 through extensions to the C++ standard libraries. The next revision of the C++ standard (C++0x) is expected to incorporate the C99 functionality.
The HP Fortran compiler supports the Fortran 90/95 feature set. It shares options, optimizations, and math function implementation with the C/C++ system.
The HP compilers and libraries offer IEEE and C99 standards compatibility, even with high levels of optimization. They provide reliable implementation of features of the standard that historically have not been practical for serious program use (even though widely supported in hardware). Additionally, they offer expression evaluation, elementary functions, and complex arithmetic that follow the spirit of the floating-point standard. Notable features related to the floating-point standard include:
 Well-specified methods for wide expression evaluation, with declarable evaluation types and C type-generic math functions
 Math functions with special-case behavior similar to IEEE basic operations
 C/C++ complex arithmetic and functions, with special-case behavior in the spirit of the IEEE standard, including imaginary types for C
 Correctly rounded binary-decimal conversion between each of four floating-point formats and up to 36 decimal digits, sufficient to distinguish all values in the widest (quad) format
 C/C++ functions and macros for manipulating rounding modes and exception flags
 Pragmas and compile options to guarantee reliable rounding modes and exception flags with minimal performance impact
 Pragmas and compile options to disallow contractions (including synthesis of a multiply and an add into a Fused Multiply-Add), and C/C++ fma functions for controlled use of FMAs
 I/O for IEEE standard special symbols: NaNs (signifying Not-a-Number) and infinities
 C/C++ NAN and INFINITY macros that can be used for compile-time initialization
 Maintenance of the sign of zero



2 Floating-point data 


2.1 Types
HP C and C++ fully support four IEEE floating-point standard (IEEE standard) types, as shown in Table 1.




Each type also includes IEEE standard infinities (+∞ and −∞), signed zeros (+0 and −0), and NaNs (signifying Not-a-Number). Subnormal numbers are sometimes called denorms.
The C/C++ long double type is equivalent to quad on HP-UX (on both Integrity and PA-RISC systems).
Table 1: Floating-point types

                              Float     Double     Extended    Quad
Size
  bits                        32        64         80          128
  bytes in memory             4         8          16          16
Binary exponent range
  minimum (emin)              −126      −1022      −16382      −16382
  maximum (emax)              127       1023       16383       16383
Significand precision
  bits (p)                    24        53         64          113
  decimal digits              7–8       15–16      19–20       34–35
Decimal range (approx.)
  maximum positive            3.4e+38   1.7e+308   1.1e+4932   1.1e+4932
  minimum positive normal     1.2e−38   2.3e−308   3.4e−4932   3.4e−4932
  minimum positive subnormal  1.5e−45   5.0e−324   3.7e−4951   6.5e−4966
  maximum negative subnormal  −1.5e−45  −5.0e−324  −3.7e−4951  −6.5e−4966
  maximum negative normal     −1.2e−38  −2.3e−308  −3.4e−4932  −3.4e−4932
  minimum negative normal     −3.4e+38  −1.7e+308  −1.1e+4932  −1.1e+4932

2.2 Choosing a data type
Choosing a data type typically involves tradeoffs among
 precision
 range
 speed
 memory use
The float and double types are almost always adequate for input and output in real-world problems. The diameter of the universe has been estimated to be 1.56e11 light years, which is (only) about 1.5e39 picometers (1 picometer equals 1e−12 meters). At the other extreme, the volume of a carbon atom is approximately 1e−24 cubic centimeters.
Computing with the extra precision and range of a type wider than the type for the final result generally eliminates most accuracy problems and intermediate overflows and underflows. The use of wider types for computation is a relatively simple and effective way to improve program robustness.
On Itanium-based systems, most basic arithmetic operations on float, double, and extended values are performed in floating-point registers and are similarly fast for all types. An exception is division, which is implemented with multiple hardware instructions and is faster for narrower types. Likewise, quad arithmetic operations are implemented with multiple hardware instructions and therefore are not as fast as those of the narrow types. Functions in the libm math library are generally faster for the narrower types (see 5.4).
2.3 Definitions
A binary floating-point format is characterized by its precision, p, and exponent range, emin to emax. A floating-point number in this format is represented by its:
 sign—0 or 1
 exponent—an integer such that emin ≤ exponent ≤ emax
 significand—with p bits such that 0 ≤ significand < 2
and has the value
(−1)^{sign} × significand × 2^{exponent}
Floating-point representations are ordinarily normalized, meaning that the significand and exponent are adjusted so that the high-order bit of the significand is 1, thus providing maximum significance. For normalized numbers:
1 ≤ significand < 2
For example, in float (single) format, the largest representable number is:
  significand = 2 − 2^{−23} = 1.11111111111111111111111_{2}
  exponent    = 127
  value       = (2 − 2^{−23}) × 2^{127} ≈ 3.403 × 10^{38}
The smallest positive normalized number that can be represented in float format is:
  significand = 1 = 1.00000000000000000000000_{2}
  exponent    = −126
  value       = 1 × 2^{−126} ≈ 1.175 × 10^{−38}
The smallest positive subnormal (denormalized) number that can be represented in float format is:
  significand = 2^{−23} = 0.00000000000000000000001_{2}
  exponent    = −126
  value       = 2^{−23} × 2^{−126} ≈ 1.401 × 10^{−45}



3 Floating-point arithmetic 


The IEEE floating-point standard (IEEE standard) defines these operations for all operands in each format: add, subtract, multiply, divide, square root, remainder, convert to integer in floating format, convert between floating formats, convert between floating and integer formats, convert between floating formats and decimal representations, and compare.
3.1 Floatingpoint vs. real arithmetic
Computer floating-point numbers model the mathematical real numbers learned in math classes (such as algebra and calculus), which form the basis for our thinking about computational problems. We suppose we are computing with real number arithmetic even though our computer’s floating-point arithmetic is strikingly different. Fortunately, the model nearly always works.
The following sections discuss floating-point arithmetic according to the IEEE standard, which Intel® Itanium® and most other processors of the past two decades support. To help deal with the discrepancies between floating-point and real arithmetic, the standard specifies infinities (+∞ and −∞), signed zeros (+0 and −0), and NaNs (signifying Not-a-Number) to complete the arithmetic: every standard operation has a well-defined result for all input operands in the format.




The standard also specifies signals for exceptional conditions, such as 1/0 or ∞ − ∞. By default, these signals raise flags that the program can use to detect the occurrence of an exception.
3.2 Discrete vs. continuous
Real numbers are continuous, composing an unbroken real number line. Between any two real numbers are infinitely many more. In contrast, floating-point numbers are discrete, with density depending on the particular floating-point format and varying with the magnitude of the values. The distance from a small value to a nearest neighbor is small; the distance from a large value to a nearest neighbor is large. Sufficiently large floating-point numbers are all even integers. Because the floating-point representations are binary with fixed precision, the gaps between floating-point numbers double in width from one power of 2 to the next.
The illustration shows floating-point numbers with only 5 bits of precision. Hence, there are 2^{4} = 16 numbers for each binade (interval between consecutive powers of 2). The values in the floating-point types are much denser. For example, the double type has 2^{52} = 4503599627370496 values per binade.
3.3 Inexact vs. exact
Arithmetic operations—add, subtract, multiply, and divide—on real operands have results that are real numbers (except for division by zero). One divided by three is precisely the real number denoted by 1/3 = 0.3333…. But most arithmetic operations on floating-point numbers have mathematical results that fall into the gaps between the computer’s floating-point numbers, and therefore must suffer a roundoff error in order to be represented in the computer. In other words, results of computer arithmetic are generally an approximation of the mathematical result. Because most calculations involve a chain of operations, inexact intermediate results are reused, with the possibility of errors accumulating. To minimize the effects of inevitable roundoff, the arithmetic is designed so that each operation incurs the smallest possible error. When roundoff does occur, the inexact exception is signaled.
3.4 Correct rounding
Correct rounding refers to the method described in the IEEE standard for its basic operations. The result of the operation is the format’s best possible approximation to what the result would be if computed exactly (with unbounded range and precision). The standard’s default rounding mode, round to nearest, chooses the nearest format value. However, if the mathematical value is halfway between two format values, it chooses the one whose low-order significand bit is 0, to avoid biasing the rounding effects.
The standard also specifies three user-selectable, alternate rounding modes: round upward (toward +∞), round downward (toward −∞), and round toward zero (truncate). The following illustration shows a segment of the real number line that contains three consecutive float (single precision) values: 1−2^{−24}, 1, and 1+2^{−23}. The real number 1−2^{−25} is halfway between 1 and the next smaller float value 1−2^{−24}. The real number 1+2^{−24} is halfway between 1 and the next larger float value 1+2^{−23}.
Table 2 shows the interval of real numbers that each rounding mode would round to 1 in the example shown above.
Table 2: Real number rounding

  Rounding mode   Real numbers x that round to 1
  to nearest      1−2^{−25} ≤ x ≤ 1+2^{−24}
  upward          1−2^{−24} < x ≤ 1
  downward        1 ≤ x < 1+2^{−23}
  toward zero     1 ≤ x < 1+2^{−23}
3.5 Decimal vs. binary
Most computers store numbers in binary format and use binary floating-point arithmetic, whereas we modern humans ordinarily represent numbers in decimal notation. Consequently, input and output of values represented with decimal notation entail roundoff errors. For example, the decimal value 0.1 (one tenth) represented in binary is a non-terminating, repeating fraction: 0.00011001100110011..._{2}, which rounds to 0.000110011001100110011001101_{2} in the standard binary single-precision format. The differences between binary and decimal account for many user surprises, such as why the following program fails to compute the sum 1000.0.
[130] % cat y.c
#include <stdio.h>
int main() {
    float sum = 0, tenth = 0.1;
    int i;
    for (i = 0; i < 10000; i++) sum += tenth;
    printf("sum = %.1f\n", sum);
    return 0;
}
[131] % cc y.c
[132] % a.out
sum = 999.9
Thinking in terms of decimal numbers, one would expect the program to compute sum to be 1000.0 exactly. However, each time through the loop, the addition of the representation of decimal 0.1 in tenth to the value in sum suffers another float-precision roundoff error; these roundoff errors accumulate so that the final binary value in sum is closer to 999.9 than to 1000.0. To minimize the effects of the discrepancy between the computer’s binary formats and external decimal representations, conversion between binary and decimal is correctly rounded.
See Annex A regarding decimal floatingpoint formats and arithmetic.
3.6 Finite vs. unbounded range
The real numbers are unbounded: there is no largest or smallest real number. The range of floating-point numbers is ultimately limited by the number of bits in the format allotted to represent the exponent. Table 1 shows the range limitations for the four floating-point formats.
3.7 Overflow
An operation whose finite mathematical result is too large in magnitude for the format is said to overflow. Although it is difficult to find examples of real-world data or result values beyond the range of the double type, overflows are more common in intermediate calculations. For example, computation of the vector norm sqrt(x*x + y*y) can overflow if the exponent of either x or y is merely half of the maximum exponent for the format—even though the value of the norm exceeds the format range only when both x and y are near the limit of the range.
To better handle overflow (and to better model the extended real number system), the IEEE standard augments the floating-point numbers with positive and negative infinities and specifies that an overflow exception is signaled when an overflow occurs.
Computer operations that overflow produce (in default rounding mode) an infinity result with the appropriate sign. Thus, infinities serve to represent values beyond the range that can be represented with floating point, including both extreme magnitude finite numbers and exact infinities. Operations such as 1.0/0.0 and INFINITY * 2.0 produce exact infinities. An overflow is not necessarily an error, because in some calculations an overflow may not matter. For example, 1 + 1/(x*x) will evaluate to 1 for all x whose magnitude is large enough that 1/(x*x) is too small to affect the result when added to 1.
3.8 Divisionbyzero
Infinities provide natural results for divisions of nonzero numbers by zero: 1/0 = ∞, −3/0 = −∞, 5/0 = ∞. Such operations signal the division-by-zero exception. In contrast, the indeterminate division of zero by zero (0/0) yields NaN and signals invalid (not division-by-zero).
Division-by-zero is an example of an infinite result being produced from finite operands. Math functions not covered in the IEEE standard signal division-by-zero to indicate this more general exceptional condition. For example, log(0) returns −∞ and signals division-by-zero.
3.9 Underflow
An operation whose mathematical result is too small in magnitude for the format is said to underflow. Underflow is of less consequence than overflow, though it is trickier to understand. Underflow signifies that the result may have less precision than is usual for the format, or may be zero in lieu of a tiny mathematical value.
On the number line of normal floating-point numbers, notice the relatively enormous gap between 0 and 2^{emin} (the smallest normal number).
There are 2^{p−1} (where p is the format precision) points per binade, the interval between consecutive powers of 2. Thus, the gap between 0 and 2^{emin} is 2^{p−1} times larger than that between 2^{emin} and the next larger format value. The IEEE standard fills in the gap around zero with subnormal numbers:
The extra values come from using the non-normalized representations in the format:
(−1)^{sign} × significand × 2^{emin}, where 0 < significand < 1
Thus, IEEE arithmetic can return a subnormal value, albeit with less-than-normal precision (as some high-order significand bits must be 0), instead of just ±2^{emin} or ±0 when the mathematical result is between −2^{emin} and 2^{emin}. This is called gradual underflow, in contrast to flush-to-zero or sudden underflow, whereby a zero is returned for all mathematical results between −2^{emin} and 2^{emin}. The underflow exception signals that an underflow condition may have produced a result of less-than-normal precision.
IEEE gradual underflow preserves the property
if x ≠ y then x − y ≠ 0
which sudden underflow does not. The following code depends on this property to avoid division by zero:
if (x != y) r = 1 / (x - y);
3.10 Signed zeros
The IEEE standard specifies both positive and negative zeros. These retain the sign after underflows and distinguish the reciprocals of positive and negative infinity. Thus, 1/(1/x) is equivalent to x if x is ±∞ or ±0.
In complex arithmetic computations, signed zeros can be used to distinguish the two sides of branch cuts and thereby enable more efficient code. For more information, see “Example: Complex Arithmetic Classes” in “How Java's FloatingPoint Hurts Everyone Everywhere” by W. Kahan and J. Darcy.
The signed zeros compare equal to each other (+0.0 == -0.0). For most ordinary computation, the sign of zero can be ignored.
3.11 Invalid operations
Mathematics texts leave certain operations, such as 0/0, undefined. Sometimes this is because too many numbers would satisfy the defining property (for example, the defining property for 0/0=q is 0×q=0, which is satisfied by any real number q). Other times no numbers would satisfy the defining property (for example, a defining property for sqrt(3)=y is y^{2}=3, which is not satisfied by any real number y). A mathematician encountering one of these operations in a calculation will handle the situation as a special case, as in
F(x) = sin(x) / x, if x ≠ 0
F(x) = 1, if x = 0
When a computer program encounters one of these operations, the operation must return something or else stop the program. Stopping is not a viable option for most libraries, including ones that might be used by realtime codes. Returning any ordinary number when many or none would be appropriate would make it difficult for the code to recognize the problem and do anything reasonable. The IEEE standard refers to such operations—where any numerical result would be misleading—as invalid operations and specifies that the result is a NaN, a special symbol indicating NotaNumber. NaNs help make the floatingpoint number system algebraically complete: every operation has a welldefined result. Invalid operations signal the invalid exception.
Invalid operations in IEEE standard arithmetic include:
 ∞ – ∞
 0×∞
 0/0
 ∞/∞
 sqrt(x) where x < 0
 remainder(x,y) where y is zero or x is infinite
 conversion of an infinity, NaN, or outofrange finite floatingpoint value to an integer type
 comparison (<, <=, >, >=) where one or both operands is NaN
Note that the comparison operations cannot return a NaN result value, so the invalid signal (flag) is the only indication that program control may have been subverted by a test involving an unexpected NaN.
3.12 Comparisons
The trichotomy law for real numbers states that if a and b are real numbers, then exactly one of the following is true: a<b, a=b, or a>b. This trichotomy law would hold for floatingpoint numbers, including +∞ and –∞, but not when NaNs are included. If a or b or both are a NaN, then a and b are said to be unordered and a<b, a=b, and a>b are all defined to be false. Note that even a=a is false if a is a NaN. Thus, for the set of values in an IEEE floatingpoint type, the trichotomy law is replaced by: if a and b are values of the type, then exactly one of the following is true: a<b, a=b, a>b, or a and b are unordered.
NaNs may break codes that do not account for them; for example,
if (a < b) { /* do something */ }
else { /* do something else, assuming a >= b */ }
If a or b is a NaN, a<b will be false, so the else block will execute, even though the assumption a >= b is false. For this reason, the operators <, <=, >, and >= signal the invalid exception when one or both operands is a NaN.
3.12.1 Quiet comparisons
The “quiet” comparison macros in the <math.h> header produce the same return value as the corresponding built-in comparison operator, but do not signal invalid if an operand is a NaN. Use the corresponding quiet-comparison macro instead of a built-in operator — for example, isless(x,y) instead of x<y — if you know that the use of the comparison accounts for NaN operands, so that an invalid signal would be inappropriate.
3.13 Arithmetic properties
In real number arithmetic, operations have properties we take for granted when thinking about numbers and performing calculations. However, for floating-point arithmetic, many of these properties do not hold true, or at least require modification.
Table 3: Real vs. floating-point arithmetic

Closure: If a and b are real numbers then so are a + b and a × b; a / b is not defined if b = 0.
  In floating point: In IEEE standard floating point (though not in earlier systems) every operation has a well-defined result, though typically not exactly the same as the mathematical result and maybe a special infinity or NaN symbol. a / b is defined even if b = 0 (NaN or ∞ result).

Commutativity: a × b = b × a and a + b = b + a.
  In floating point: Yes, both hold.

Associativity: (a + b) + c = a + (b + c) and (a × b) × c = a × (b × c).
  In floating point: No, neither holds.

Distributivity: a × (b + c) = a × b + a × c.
  In floating point: No.

Inverse properties: a − b = a + (−b); a / b = a × (1 / b); if a − b = 0 then a = b.
  In floating point: The first holds; the second does not; the third is valid with gradual underflow as in IEEE standard floating point, but not with sudden (flush-to-zero) underflow.

Trichotomy: exactly one of a < b, a = b, a > b is true.
  In floating point: A fourth alternative is possible: a and b are unordered.
3.14 Complex arithmetic
C99 introduces complex arithmetic into the C language, specifying:
 Complex types: float complex, double complex, long double complex (HP C also provides extended complex and quad complex)
 Imaginary types (stored like corresponding real types): float imaginary, double imaginary, long double imaginary (HP C also provides extended imaginary and quad imaginary)
 An imaginary unit (whose square equals −1): I (I * I == −1)
 Efficient promotion rules for binary operations containing a mix of real, complex, and imaginary operands
 Infinity properties for basic operations, and a related performance pragma
 A header <complex.h> defining complex, imaginary, and I and declaring complex functions
 Special cases in the style of the IEEE standard
C99 features allow for more natural and efficient code than is possible with more traditional complex arithmetic facilities, such as those available with C++ and Fortran.
HP C for Integrity servers fully supports the C99 specification, including the portion in Annex G (IEEE-compatible complex arithmetic).
3.14.1 Promotion rules
C99 promotion rules for binary operations do not require that real or imaginary operands be converted to complex. For example, the expression
a(x + yi) + bi
is evaluated as
(ax + ayi) + bi = ax + (ay + b)i
rather than as
(a + 0i)(x + yi) + (0 + bi) = ... = ((ax − 0y) + 0) + ((ay + 0x) + b)i.
An optimizer might remove operations involving unwanted zeros introduced by more traditional promotion rules; however, such optimizations can yield unexpected result values. For example, removing the term 0y is problematic because 0y is NaN if y is infinite or NaN, and is −0 if y is negative and finite.
3.14.2 Infinity properties
The infinity properties are as follows:
For z nonzero and finite,
 ∞ × z = ∞,
 ∞ × ∞ = ∞,
 ∞ / z = ∞,
 ∞ / 0 = ∞,
 z / ∞ = 0,
 0 / ∞ = 0,
 z / 0 = ∞.
A complex value is regarded as an infinity if either the real or the imaginary part or both are infinite (this includes a complex value for which one part is infinite and the other part is a NaN). This specification preserves the information that the magnitude is infinite, even when the direction is not meaningful. The complex absolute value function, cabs , is consistent with the definition of complex infinities in that cabs of any complex infinity, for example NaN + ∞ i, returns positive infinity.



4 Floating-point environment 


The floating-point environment refers to:
 modes that control the behavior of floating-point operations
 flags that indicate the floating-point exceptions that have been signaled



The HP floating-point environment includes a rounding mode (round to nearest, round upward, round downward, or round toward zero); an underflow mode (gradual or flush-to-zero); and a mode for each floating-point exception to disable or enable trapping when the exception occurs. Additionally, the HP floating-point environment includes the exception flags (invalid, division-by-zero, overflow, underflow, and inexact) specified by the IEEE floating-point standard and supported in the C99 <fenv.h> header.
The floating-point environment is dynamic, with global scope. The program can change modes at run time, and the new mode settings will affect all subsequently executed operations. Exceptional operations (e.g. 0.0/0.0) set exception flags (at run time), and the flags remain set until the executing program clears them (thus the flags are “sticky”). This dynamic, global nature means that care must be taken to correctly maintain and access the floating-point environment without overly constraining optimization; these issues are discussed in Example 9.5. The default mode settings serve most codes.
4.1 Rounding mode
The basic computational model for IEEE floating point is to compute each result as if with unbounded precision and exponent range and then coerce that mathematical result to the destination format. The rounding mode determines how that conversion to a final result is done. For more information, see 3.4.
In all rounding modes, the final result retains the sign of the mathematical result. Thus, if the coercion to the destination format underflows to zero, the sign is preserved in the sign of zero.
By default (unless explicitly changed), rounding is to nearest. The final result is the value in the destination format that is closest to the mathematical result. If two format values are equidistant from the mathematical result, then the coercion chooses the one whose least significant bit is 0. If the magnitude of the mathematical result is at least 2^{emax} × (2 − 2^{−p}), which is the largest format value plus half a unit in the least significant bit, then the result is an infinity with the sign of the mathematical result.
In round-upward mode, the final result is the value in the destination format closest to, but not less than, the mathematical result. If the mathematical result is greater than the largest format value, then the final result is +∞.
In round-downward mode, the final result is the value in the destination format closest to, but not greater than, the mathematical result. If the mathematical result is less than the least format value, then the final result is −∞.
In round-toward-zero mode, the final result is the value in the destination format closest to, and not greater in magnitude than, the mathematical result.
4.2 Underflow mode
By default, underflow is gradual, as specified by the floating-point standard.
HP provides the alternate sudden (flush-to-zero) underflow mode. This mode significantly improves performance on Itanium-based systems for some codes, typically single-precision ones.
4.3 Trapping modes
By default, trapping is disabled for all floating-point exceptions.
HP provides the alternative of enabling trapping for each floating-point exception. Unless a trap handler is provided, a trapped floating-point exception will terminate the program.
4.4 Exception flags
Each floating-point exception (invalid, division-by-zero, overflow, underflow, inexact) has its own flag. The flag values are nonzero (set) or zero (clear), indicating that the exception has or has not occurred, respectively.
When an exception occurs and trapping on the exception is not enabled (the default), the operation sets the exception flag. Once set, the flag remains set unless or until the program explicitly clears it; thus, the flag indicates whether an exception has occurred since the flag was last cleared (not whether the most recent operation was exceptional).
Functions in the <fenv.h> header clear, test, save, and restore exception flags, individually or collectively.
4.5 Math functions and the floatingpoint environment
Most of the math library functions—including exponential, logarithm, and trigonometric functions—are not directly covered by the IEEE floating-point standard. The C99 standard specifies exceptions for these functions following the principles of the floating-point standard. (For more information, see 5.7.) However, the C99 standard does not specify the effects of alternate rounding modes on these functions. HP math functions follow the C99 standard. Additionally, most HP math functions reflect the rounding mode as shown below:
f(x) rounded down ≤ f(x) rounded to nearest ≤ f(x) rounded up



5 Mathematical functions 


The HP libm library for Integrity servers provides mathematical functions for C, C++, and Fortran 90. In concert with the compilers, it provides a leading combination of functionality, quality, and performance. With inlining and software pipelining, commonly used math functions can achieve throughput comparable to hand-tuned vector routines without requiring user code to be written for a vector interface, and with no loss of accuracy or edge-case behavior; for example, the single-precision exponential can exceed 400 million evaluations per second on a 1.5 GHz Itanium system. The math interface for application programming (“API”) encompasses C99, X/Open, and other popular functionality and fully supports four IEEE floating types.
Many of the techniques and algorithms used in the HP implementation of libm math functions for Integrity servers are described in IA-64 and Elementary Functions: Speed and Precision, by Peter Markstein.




For more information about implementing elementary functions, including correctly rounded ones, see Elementary Functions: Algorithms and Implementation, by Jean-Michel Muller.
HP also offers another high-performance math library, MLIB, which contains routines for common math algorithms (not in libm). MLIB has these six components:
 VECLIB—basic linear algebra (BLAS) and discrete Fourier transforms
 LAPACK—linear algebra package for the direct solution of dense systems of equations
 ScaLAPACK—redesigned LAPACK for distributed-memory parallel computers
 SOLVERS—direct solvers for sparse linear systems, including graph-partitioning routines
 VMATH—vector math routines corresponding to widely used libm functions
 SuperLU_DIST—direct solvers for sparse linear systems on distributed-memory parallel computers
Many of the techniques and algorithms used in the implementation of MLIB are described in Software Optimization for High Performance Computing, by Kevin Wadleigh and Isom Crawford.
The remainder of this chapter covers the libm math library. For more information about MLIB, see http://www.hp.com/go/mlib.
5.1 Usage
In C and C++ programs, include the headers for the libm math library functionality that is used. The C language permits declaring standard functions directly without including the header; however, this practice may result in substantially reduced performance, because the headers contain macros and pragmas that enable special optimizations, such as inlining.
When linking with cc or ld, add the -lm option to link in libm. Linking with aCC or f90 adds the -lm option automatically. The libm library is available as both shared (dynamic) and archive (static) libraries; the default is to link with the shared library. To link with an archive libm, use the -l:libm.a option instead of -lm.
A correct version of libm for the runtime model (ILP32 or LP64) is chosen automatically, based on the +DD option or default (+DD32 on HP-UX).
The default versions of the math functions, as well as alternate versions that the compiler may invoke under certain options (for example, +Ofltacc=relaxed ), are all in the same libm libraries. There is no need to link with alternate libm libraries.
Some math library patches introduce new functions. If an application uses new functions in a patch and is linked with the shared libm library (the default with -lm), then any system the application runs on must have the same or a later Math Library Core patch, or a later OS version. If the application is linked with the archive libm library (for example, with -l:libm.a), then a system does not require such a patch or later OS version in order to run the application.
See A.3 regarding usage of decimal floating point.
5.2 Quality dimensions
The HP libm math library is designed to excel in all quality dimensions: functionality, speed, accuracy, standards conformance, and treatment of special cases.
The remainder of this section describes the libm library (including sequences that the compilers inline) in each of these dimensions. For more information about performance and behavior, see Chapter 6.
5.3 Functionality
The HP libm library provides the C functions—for each of the float, double, extended, and quad floating-point types—as shown in Table 4:
Table 4: HP libm library and the C functions (the double-precision names are shown; each family is also provided for the float, extended, and quad types)

 exponentials: exp, exp2, exp10, expm1
 logarithms: log, log2, log10, log1p
 trigonometrics: cos, sin, tan, cot, sincos
 inverse trigonometrics: acos, asin, atan, atan2
 trigs with degree arguments: cosd, sind, tand, cotd, sincosd
 inverse trigs with degree outputs: acosd, asind, atand, atan2d
 hyperbolics: cosh, sinh, tanh, sinhcosh
 inverse hyperbolics: acosh, asinh, atanh
 square & cube roots, reciprocal square root, hypotenuse: sqrt, cbrt, rsqrt, hypot, invsqrt
 floating multiply-add: fma
 power: pow/pown/powlln
 financial/growth: compound, annuity
 exponent/significand: logb, ilogb, frexp, significand
 scaling: scalbn, scalbln, scalb, ldexp
 rounding to integral floating: ceil, floor, trunc, round
 IEEE rounding to integral floating: rint/nearbyint
 "Fortran" rounding to integer: lround, llround
 IEEE rounding to integer: lrint, llrint
 integral and fractional parts: modf
 remainder/mod: remainder, remquo, fmod, drem
 error functions: erf/erfc
 gamma functions: lgamma/lgamma_r, tgamma, gamma_r
 next value: nextafter/nexttoward
 absolute value, copy sign: fabs, copysign
 max, min, positive difference: fmax, fmin, fdim
 Bessel functions of 1st & 2nd kind: j0/j1/jn, y0/y1/yn
 not-a-number: nan
 classification: isnan, isinf, isfinite, isnormal, signbit, fpclassify, finite
 "quiet" comparison: isless, islessequal, isgreater, isgreaterequal, isunordered, islessgreater
 complex exponentials: cexp, cexp2, cexp10
 complex logarithms: clog, clog2, clog10
 complex trigs: ccos, csin, ctan, cis, cisd
 complex inverse trigs: cacos, casin, catan
 complex hyperbolics: ccosh, csinh, ctanh
 complex inverse hyperbolics: cacosh, casinh, catanh
 complex magnitude & argument: cabs, carg
 complex square root and power: csqrt, cpow
 complex parts, conjugate, projection: conj, creal, cimag, cproj
 exception flags: feclearexcept, fetestexcept, feraiseexcept
 opaque exception flag state: fegetexceptflag, fesetexceptflag
 rounding direction modes: fegetround, fesetround
 FP environment as a whole: fegetenv, fesetenv
 hiding exceptions: feholdexcept, feupdateenv
 flush-to-zero mode: fegetflushtozero, fesetflushtozero
 trap enablement: fegettrapenable, fesettrapenable
The extended and quad types are extensions to the C/C++ programming languages. They are defined by the inclusion of one or more of the C headers <math.h>, <float.h>, <stdlib.h>, and <complex.h>. To avoid conflicts in existing programs that use the names for other purposes, the definitions of the extended and quad types, as well as all the functions and macros involving the types, become visible only if the -fpwidetypes compile option is used. (The standard long double nomenclature is always visible.) Illustrating the nomenclature, the <math.h> header declares the following base-e exponential functions:
float expf(float);
double exp(double);
long double expl(long double);
#ifdef _FPWIDETYPES // defined by -fpwidetypes option
extended expw(extended);
quad expq(quad);
#endif
The <cmath> , <complex> , and <limits> C++ headers provide almost all the functions available with the C headers, with overloaded functions for the extended and quad types. The libm library supplies the underlying implementation for the C++ math functions in these headers.
The Fortran 90 compiler uses libm functions for its standard math intrinsics, but does not yet extend the set of intrinsics to all the libm functions and does not support the 80-bit extended type. Example 9.6 shows how to call C functions from Fortran.
The sinhcosh , cot , and cotd functions and float versions of the Bessel functions were added in 11.23 patch PHSS_29678 (and all subsequent Math patches and releases for Integrity servers).
The isnan , finite , drem , invsqrt , gamma_r , significand , cisd , clog2 , clog10 , cexp2 , and cexp10 functions and long double , extended , and quad versions of the Bessel functions, j0 , j1 , jn , y0 , y1 , and yn , were added in 11.23 patch PHSS_36344 and 11.31 patch PHSS_36351.
Some of these new interfaces provide functionality already supported in HPUX under a standard or preferred interface, but were added to facilitate importing legacy code developed on other platforms, particularly with GNU and Intel tool sets:
Table 5: Preferred Interfaces
 Legacy             Preferred
 isnan (function)   isnan (generic macro)
 finite (function)  isfinite (generic macro)
 drem               remainder
 invsqrt            rsqrt
 gamma_r            lgamma_r
The HP-UX <cmath> C++ header does not provide overloaded functions for the legacy interfaces listed in Table 5.
5.3.1 PA-RISC compatibility
The API for C math functions for Integrity servers is a major superset of that for PA-RISC.
 It has a complete set of functions of types long double, quad, and extended, none of which are available for PA-RISC.
 It has a complete set of float functions; about half of them are available for PA-RISC.
 It has additional functions, not available for PA-RISC, in all floating-point precisions.
 It has the legacy interfaces in Table 5, which are not available for PA-RISC.
 It has complex and imaginary functionality, including that specified in C99. The C complex and imaginary types are not supported for PA-RISC.
 It provides the C99 <tgmath.h> header for type-generic math functions, which is not available for PA-RISC.
For more information about math functions for PA-RISC, see Reference 18.
5.4 Speed
Two measurements are important for assessing math function performance: latency and throughput. Here latency refers to the time required for a library function, invoked by one external call, to deliver its result. Throughput refers to how many evaluations of a function can be completed in a given period of time (or, equivalently, the average number of cycles per evaluation) when the function is called many times in a loop. For throughput, function evaluations may execute concurrently, depending on the capabilities of the compiler and hardware.
HP compilers can inline most of the commonly used math functions (approximately 300 different sequences) and can software pipeline loops, exploiting key features of Itanium processors, to achieve speedups of 4×–10× versus sequential external calls. For details, see Reference 23.
Intel Itanium processors can issue two bundles of three instructions each per cycle, with up to two floatingpoint instructions per cycle. Latency is typically one cycle for integer instructions and four cycles for floatingpoint instructions. Full pipelining allows floatingpoint instructions started in consecutive cycles to execute in parallel (assuming no data dependencies). An inlined sequence of floatingpoint instructions typically leaves slots available for integer calculation, memory access, and so on. Also, latency between dependent floatingpoint instructions allows other floatingpoint calculations to be scheduled for concurrent execution.
Software pipelined loops showcase the benefits of inlining math functions. The Itanium features—predication, rotating registers, loop and epilog count registers, and loop branch types—enable the compiler to schedule multiple loop iterations to execute in parallel.
HP algorithms for math functions strike a balance between providing low latency and being amenable to software pipelining (see Reference 19). The measurements of latency in Table 6 and throughput in Table 8 are achieved with strictly equivalent algorithms; thus, results are identical whether a function is called conventionally or inlined and software pipelined by the compiler. Under +Ofltacc=relaxed , as in Table 7 and Table 9, results of library calls and inlined sequences may differ slightly. Measurements shown in this section are for ILP32 data mode; measurements for LP64 data mode are generally the same.
5.4.1 Latency
The performance of each function in the libm library depends in part on its algorithmic complexity, argument and result precision, and the compiler optimization level under which it is called. For optimization levels below +O2 , most libm functions are invoked via call branches to the highly accurate and robust default implementations, which have been optimized for source arguments of usual values (not edge or special cases).
Table 6 displays the minimum latency of such calls for the various precisions of common elementary functions in B.11.23 patch PHSS_40539 and B.11.31 patch PHSS_40540. The data reflect the time from each function call to when the result is available to the caller, assuming that the called function has its source operands immediately available in the source registers. Latency units are machine cycles. The tabulated performance levels are reached when the implementations are cache-resident (both instructions and data).
Table 6: Latency cycles on Intel Itanium processors

 Function   float   double   extended   quad
 fabs           6        6          6      3
 sqrt          30       41         45    111
 exp           36       44         57    258
 log           32       35         57    237
 cos           49       53         74    288
 sin           49       53         73    316
 tan           70       81        114    329
 acos          30       63         93    316
 asin          30       63         93    319
 atan          44       53         62    285
 atan2         53       68         80    300
 cosh          50       48         69    241
 sinh          48       61         69    265
 tanh          58       75         94    391
 rsqrt         26       31         43    117
 pow           62       74        105    702
Note that the latency of 6 shown in Table 6 for fabs includes a 2-cycle overhead for the call. The compiler normally replaces a call to fabs with an inlined fmerge instruction (provided the <math.h> header is included), so the latency of fabs is actually only 4.
With +Ofltacc=relaxed , the compiler may invoke faster and slightly less accurate implementations for math functions. Table 7 shows latency measurements.
Table 7: Latency cycles on Intel Itanium processors with +Ofltacc=relaxed

 Function     float   double   extended
 sqrt            22       37         37
 exp             36       39         44
 log             32       32         35
 cos, sin        49       53         57
 tan             70       81         93
 acos, asin      29       39         63
 cosh            50       47         48
 sinh            48       47         63
 tanh            58       64         84
 rsqrt           22       31         31
 pow             59       67         92
5.4.2 Throughput with inlining
The compiler may inline the implementations of common math functions into the caller's code, provided the header that declares the function is included, thus enabling opportunities for performance optimization to exploit the parallelism of Itanium systems. At all optimization levels, the compilers inline very simple functions, such as fabs and copysign . At +O2 , they may inline most common functions, including ones shown in Table 8 and Table 9, depending on the compiler's inlining heuristics. At +O3 , they may inline several more functions and use more aggressive inlining heuristics. If the inlined function lies within a loop construct, the compiler may be able to software pipeline the loop body, thus affording much greater throughput of function results than that achievable by call branches to function implementations (see also 6.5.3). For example, an inlined expf function in a loop is capable of delivering a result every 3.6 cycles on the average, as opposed to every 36 cycles if the closed routine is called. Inlining and software pipelining in simple loops generally achieve throughput 4×–10× greater than with external calls to libm.
In Table 8 and Table 9, the throughput numbers (cycles per function output) are obtained by executing a loop that evaluates the function on an input vector of length 1024 and stores the outputs to another vector of the same length. The entries of the input vector are chosen so that rare cases (for example, invalid arguments, arguments at which the function overflows or underflows, or huge-magnitude arguments for trigonometric functions) are excluded. The same loop is executed a few times; the minimum cycle count over these runs is divided by the vector length, 1024, to arrive at cycles per function output, the throughput number. This number may not be achieved when data are not cache-resident. In some cases, the compiler unrolls the body of the loop 2, 4, or even 8 times.
Table 8: Throughput cycles on Intel Itanium processors

 Function          float   double   extended
 sqrt               5.05     6.64       7.57
 exp                3.57     5.58       6.72
 log                5.08     6.60      14.12
 cos                5.60     7.10      13.10
 sin                5.60     7.10      14.09
 tan               11.12    13.12      28.15
 acos              11.58    11.10      27.12
 asin              11.58    11.60      27.15
 atan              14.08    18.14      23.15
 rsqrt              4.08     6.11       8.09
 rint/trunc/AINT    1.57     1.56       2.06
 round/ANINT        2.54     2.55       2.56
 ceil/floor         2.54     2.55       2.54
 NINT               2.04     2.10        N/A
Table 9: Throughput cycles on Intel Itanium processors with +Ofltacc=relaxed

 Function   float   double   extended
 sqrt        3.03     5.55       5.55
 exp         3.57     4.18       5.58
 log         4.90     6.08       6.61
 cos         5.60     7.10       8.08
 sin         5.60     7.10       8.08
 tan        11.12    13.12      16.11
 acos        9.06    11.10      11.10
 asin        9.57    11.60      11.61
 atan        8.38    11.13      23.15
 rsqrt       3.04     5.05       5.56
 pow        10.17    13.30      27.23
It is anticipated that future releases will perform even better due to ongoing performance tuning of implementation algorithms, advances in compiler technology, and faster versions of more functions for use under +Ofltacc=relaxed .
5.5 Accuracy
The HP libm library for Integrity servers has been carefully crafted to provide as much accuracy as possible commensurate with reasonably high performance. Accuracy for each elementary function is quantified by the largest rounding error (relative to the infinitely precise result), when rounding to nearest, measured in units in the last place (ulp). A one-ulp rounding error in float precision corresponds to a relative error between about 2^{-24} and 2^{-23}. Correct rounding in IEEE round-to-nearest mode implies a rounding error of at most 0.5 ulp.
Table 10 shows accuracy bounds for libm real functions under the default (IEEE roundtonearest) rounding mode:
Table 10: Accuracy in ULPS

 type       default   relaxed
 float        0.502         1
 double       0.502         2
 extended     0.550         4
 quad         0.502       N/A
All of the functions meet these error bounds throughout their domains, except for the gamma and Bessel functions and the extended and quad error functions (erf and erfc).
Table 11 displays bounds for observed errors, measured in ulps, for various elementary functions in round-to-nearest mode. The measurements reflect the worst observed error for each function over millions of test cases distributed throughout the function's domain.
Table 11: Bounds for observed errors in ULPS

 Function family           float    double   extended   quad
 exponentials              .50005   .5004    .53        .5001
 logarithms                .5002    .501     .506       .5001
 trigonometrics            .50002   .502     .501       .5002
 inverse trigonometrics    .502     .502     .53        .5002
 hyperbolics               .5003    .502     .53        .502
 inverse hyperbolics       .5003    .502     .5004      .502
 cube root, hypotenuse     .50002   .501     .502       .5005
 reciprocal square root    .5005    .501     .500001    .50005
 powers                    .50005   .502     .502       .50004
 annuity, compound         .50005   .502     .53        .5002
Other libm functions, such as square root and functions that convert to integer formats or integral values, round correctly as specified by either the IEEE standard or the function definitions.
With +Ofltacc=relaxed , the compiler may invoke faster and slightly less accurate implementations for math functions (for more information, see 5.4). Table 12 shows accuracy measurements with +Ofltacc=relaxed . Some entries—for example the float trigonometrics—are identical to the corresponding ones in Table 11 because the compiler does not select a special version for that precision of the function (though it might in the future).
Table 12: Bounds for observed errors in ULPS with +Ofltacc=relaxed

 Function family           float    double    extended
 square root               .9       .500001   .7
 exponentials              .50005   1.8       1.9
 logarithms                .57      .65       3.3
 trigonometrics            .50002   .502      3.4
 inverse trigonometrics    1.0      1.0       3.7
 hyperbolics               .5003    1.8       3.5
 inverse hyperbolics       .75      1.4       2.5
 cube root                 .50002   .501      2.5
 reciprocal square root    .7       1.5       2.9
 powers                    .5001    1.5       3.5
5.6 Standards
The HP libm library was designed to support the IEEE floatingpoint standard and the C99 language standard, including Annexes F and G and Technical Corrigenda 1 and 2.
Additionally, the HP libm library includes all the math functions required by UNIX 98. The +Olibmerrno compile option (implied by -AC89 and -Aa, and the default with c89) causes invocation of versions of the math functions that conform to the C89 and UNIX 98 standards. Under +Olibmerrno, <math.h> functions (and C++ <cmath> functions) set errno according to the requirements of these standards and to C99's optional errno specification for <math.h> functions, and they return special-case results according to UNIX 98 instead of C99 (Annex F) in the few cases where the standards differ. UNIX 03 aligns with C99.
5.7 Special cases
The HP libm library for Integrity servers carefully adheres to the detailed specification of special cases for <math.h> and <complex.h> functions in Annexes F and G of the C99 standard, which incorporates and extends the IEEE standard treatment of special cases.
Cases are termed “special” because they do not have a clearly best result. Both the IEEE standard and C99 specify that very problematic cases, for example 0.0/0.0, for which any result would be generally misleading, are to return NaN and raise the invalid exception. For other special cases, C99 specifies returning a numerical result, instead of NaN, if that result is useful in some important applications, especially if the diagnostic value of returning a NaN would be dubious. An example is the specification for 0^{0} = 1, which is discussed in “Branch Cuts for Complex Elementary Functions, or Much Ado About Nothing's Sign Bit”, by W. Kahan.
Careful treatment of special cases by the libm library and the compiler make program behavior more predictable, which facilitates writing robust, efficient code that handles special cases gracefully. It also makes debugging easier.



6 Floatingpoint performance and behavior 


HP compilers and the libm library are designed to provide the best performance possible while delivering a very high level of overall quality, suitable for a broad range of programs. Overall quality entails predictability, accuracy, and careful treatment of special cases, all of which can affect the correctness and robustness of a program. Often, the highest quality comes with at least some performance penalty. However, many important applications need all the speed they can get and are tolerant of less rigorous floating-point behavior, while others need even stricter behavior than is guaranteed by default. To accommodate the different needs of different applications, HP compilers provide controls in the form of options and pragmas, allowing programmers who are not satisfied with the default settings to make informed tradeoffs between floating-point performance and behavior. The controls are of three sorts: general optimization controls (6.2), controls for special floating-point functionality (6.3), and controls that trade off speed against floating-point behavior (6.4).
6.1 Quick-pick floating-point options
The flowchart in 6.1 leads you through a series of questions to quickly determine which floatingpoint options will suit your needs. The remaining sections in the chapter explain the options in more detail.



*Some compile options have corresponding pragmas that have the same effect. Whereas an option affects the entire compilation unit, the pragma may be applied to particular blocks of code. The pragma may be used just on the code that needs it, without affecting the performance or behavior of other code (see Table 13). Beginning with the AR0606 release, +Ocxlimitedrange is the C/C++ default under +Ofltacc=relaxed (and not under any other +Ofltacc setting). See 6.4.1.
Table 13: Options and corresponding pragmas
6.2 General controls for optimization
General optimization controls include optimization level options (for example, +O3), profiling, link options (for example, -Bprotected), and user assertions (for example, +Onoptrs_to_globals), among others. Many of these options improve the performance of floating-point code.
At the +O2 optimization level, the compiler optimizes most libm functions (ones that do not read or write global variables or objects pointed to by the function arguments) like builtin floatingpoint operators. Also, at this optimization level, the compiler may inline most of the commonly used libm functions, including logarithm, exponential, and trigonometric functions. Inlining can result in particularly dramatic speedup where the function call appears in a loop that the compiler can software pipeline (see 5.4.2).
At the +O3 optimization level, the compiler inlines even more libm functions (approximately 300 different sequences) and performs advanced loop optimization, which greatly improves performance of many floatingpoint codes.
The general optimization controls have no negative effect on documented floatingpoint behavior. Hence, these options can be chosen based on other considerations, such as compile time, without concern about sacrificing correctness or robustness of floatingpoint code.
6.3 Controls for special floatingpoint functionality
6.3.1 Using extended and quad types
The extended and quad types are provided as extensions to the C and C++ programming languages. They are defined by the inclusion of one or more of the headers <math.h>, <float.h>, <stdlib.h>, and <complex.h>. To avoid conflicts in existing programs that use the names for other purposes, the definitions of the extended and quad types, as well as all the functions and macros involving the types, become visible only if the -fpwidetypes compile option is used. (The standard long double nomenclature is always visible.)
6.3.2 Using floatingpoint modes and flags
Application code that runs in non-default floating-point modes (for example, a non-default rounding direction) or tests exception flags must either be compiled with the +Ofenvaccess option or else be under the effect of a
#pragma STDC FENV_ACCESS ON
directive (specified in C99), for reliable behavior. Without one of these controls, the compiler might undermine expected behavior by moving or removing code, for example moving the multiply in
#include <fenv.h>
...
fesetround(FE_UPWARD);
a = b*c;
to before the fesetround call, or even to compile time if the compiler can determine the values of b and c .
For best performance, #pragma STDC FENV_ACCESS ON should be placed in the smallest blocks that enclose the code needing protection.
6.3.3 Controlling contractions (including FMA synthesis)
Application code that is not tolerant of contractions (which the compiler may synthesize by default) should be compiled with the +Ofltacc=strict option or be under the effect of a
#pragma STDC FP_CONTRACT OFF
directive (specified in C99). Contractions are optimizations that combine multiple languagelevel operations into just one machine instruction. For example, in
float r;
double x, y, z;
...
r = x * y + z;
the multiply, add, and assignment to float can be contracted into one fma.s instruction (the .s completer indicates rounding to single). Contractions are acceptable in most applications and generally even improve accuracy. However, they can break codes that depend on operations being rounded to a specific range and precision, and, rarely, even codes that do not. In the example of computing log(ab + cd), use of FMA in the calculation of a*b + c*d can yield a negative value even though ab + cd > 0 when evaluated exactly, causing an invalid exception and a NaN result from the log function.
On the other hand, controlled use of FMA can enable the design of much more efficient codes for some problems. FMA offers the obvious advantage of computing two operations in the time of one and with just one rounding error. Also, it is particularly effective for capturing extra bits of a calculation.
The following examples illustrate the use of FMA:
» Computing multiplication error
» Computing e^{xy} accurately and efficiently
6.3.4 Using errno
Codes that depend on <math.h> functions setting errno must be compiled with the +Olibmerrno option. Use of +Olibmerrno degrades the performance of libm functions that, as a result of the option, may set errno or return results that differ from the C99 specification.
The IEEE exception flag mechanism functionally supersedes the errno mechanism for math functions. Four exception flags—invalid, divide-by-zero, overflow, and underflow—provide finer identification of the special cases that errno classifies as domain or range errors. Each exception has its own flag, which is not overwritten when a different exception occurs, whereas the errno value is overwritten on various "error" conditions, including ones unrelated to floating point. Exceptions are raised consistently by both math functions and built-in floating-point operations, whereas only library functions set errno. The hardware exception flags can be accessed with less restriction on optimization than can the global errno variable. The +O[no]libmerrno option does not affect errno setting by non-math functions.
6.4 Controls for tradeoff between speed and floatingpoint behavior
Other controls, principally the +Ofltacc=strict|default|limited|relaxed compile option, are designed specifically to allow a tradeoff between performance and floating-point behavior. Such options should be considered for performance-hungry applications that are known to be tolerant of a less rigorous floating-point model or can be thoroughly tested.
6.4.1 Valuechanging optimization
+Ofltacc=default (the default) disallows optimizations that might change result values from those the programming language specifies, except that it allows contractions (including FMA synthesis). This option balances speed and behavior considerations and is suitable for most applications.
+Ofltacc=strict disallows all optimizations that might change result values, including contractions (notably FMA synthesis). This option provides the most predictable floatingpoint behavior, while admitting compatible optimizations, but loses the performance benefits of contractions. This option is equivalent to the legacy +Ofltacc option.
+Ofltacc=limited is equivalent to the default, except that infinities, NaNs, and the sign of zero may not be treated as specified. This option might provide some minor speedup for applications that do not depend on special values. Note that careful treatment of special values may improve robustness and facilitate debugging, even if the program is not specifically coded with special values in mind.
+Ofltacc=relaxed allows code transformations based on mathematical identities for real numbers, even if the transformations may change result values. With this option, the compiler may invoke slightly less accurate versions of math functions (see 5.4 and 5.5) and may inline math function sequences that yield slightly different result values than their counterparts in the libm library. This option delivers the best performance. Beginning with the AR0606 release, the +Ofltacc=relaxed option implies +Ocxlimitedrange , unless the +Onocxlimitedrange option or #pragma STDC CX_LIMITED_RANGE OFF directive is used. The legacy +Onofltacc option has the same effect as +Ofltacc=relaxed .
6.4.2 Flushtozero vs. gradual underflow
The +FPD option (for a link or compile command that produces an executable) causes program startup to install flush-to-zero underflow mode instead of the default gradual underflow mode. In flush-to-zero mode, floating-point instructions return zero in lieu of nonzero results with magnitude less than 2^{emin}. Some codes are more robust when run with gradual underflow. If tiny values will not be encountered, or will not be significant in the calculation or logic, then flush-to-zero is fine.
On current Itanium-based systems, use of flush-to-zero mode significantly improves the performance of some single-precision calculations; it rarely affects the performance of calculations in wider precisions.
6.4.3 Limited range for complex operations
Implementations of complex multiply, divide, and absolute value by the usual textbook formulas
(a+bi)(c+di) = (ac–bd) + (bc+ad)i
(a+bi)/(c+di) = ((ac+bd) + (bc–ad)i) / (c^{2} + d^{2})
|a+bi| = sqrt(a^{2} + b^{2})
are prone to intermediate overflow and underflow and tend to return NaN results rather than potentially useful values. The a^{2} + b^{2} in the formula for the complex absolute value overflows if the exponent of either a or b is more than half the maximum exponent for the evaluation type. Using the division formula, (1+0i)/(∞+∞i) yields NaN+NaN·i, whereas a zero would be better. However, producing more reasonable results for these cases slows down the operations, even when the troublesome cases are not encountered.
The compile option +O[no]cxlimitedrange and the C99 directive
#pragma STDC CX_LIMITED_RANGE ON|OFF
allow trading off careful behavior of the complex multiply and divide operations and cabs functions for performance.
C99 provides the CX_LIMITED_RANGE ON pragma to inform the compiler that the application code does not depend on operands beyond those that could be handled by the textbook formulas. This allows the compiler to deploy faster code for these operations than would be possible if the full range of inputs must be handled. With HP compilers and libm on Integrity servers, these performance controls are safe for float complex and double complex multiply, divide, and absolute value operations with extreme finite operands, because the implementation avoids intermediate overflow and underflow by using wide evaluation internally.
With the +Onocxlimitedrange compile option and the CX_LIMITED_RANGE OFF pragma, the complex multiply, divide, or absolute value operations avoid spurious overflow and underflow and honor the infinity properties specified in C99.
6.4.4 Fast and faster options
The +Ofast (or -fast) and +Ofaster options imply the +Ofltacc=relaxed and +FPD options, among others not specific to floating-point. The floating-point effects of these options can be overridden: for example, the combination +Ofast +FPd +Ofltacc=default specifies the +Ofast set of options, except with gradual (instead of flush-to-zero) underflow and with default instead of relaxed floating-point optimization.
6.5 Other techniques for performance
6.5.1 Inline assembly
HP C and C++ compilers support function-like intrinsics that can be used to inline Itanium instructions into the code stream. This feature is called inline assembly. Inline assembly offers the direct system accessibility of assembly language programming without the burdens of scheduling, register allocation, and predicate management, which are particularly daunting for Itanium. Itanium offers several features designed for floating-point efficiency that are not accessible through high-level language operations and libraries, but are through inline assembly. For example, the FMA instructions take three operands, each in any of the three Itanium-supported floating-point types (float, double, extended), plus a completer that determines the type for the rounded result. In the following code, the entire computation of res is effected with one fma instruction—without requiring additional instructions to convert the arguments or the result. The instruction reads the float, double, and extended arguments directly from the floating-point registers and delivers a result rounded to double (precision and exponent range) into the destination floating-point register.
#include <machine/sys/inline.h>
...
float f;
double d, res;
extended w;
...
res = _Asm_fma(_PC_D, f, d, w);
The implementation of the libm library makes extensive use of inline assembly, and achieves performance comparable to hand-tuned assembly language code.
Programming with inline assembly can be tricky, because certain assumptions valid with standard programming languages do not apply. Also, of course, code containing inline assembly is not portable to other architectures. Programmers who are very familiar with the Itanium architecture might consider inline assembly for performance-critical codes intended specifically for Integrity servers.
6.5.2 Estimated frequency pragmas
Numerical codes commonly test for cases that can reasonably be assumed to be rare, such as inputs with extreme or unintended values. The Estimated_Frequency pragma can be used to provide the compiler with an estimate of the relative execution frequency—as a decimal fraction between 0.0 and 1.0—of a conditional block of code, compared to the immediately surrounding block. The compiler uses this information to optimize for paths through the code that the pragma has indicated will be frequent. Opportune use of the pragma for code that is very frequent or very infrequent can improve performance significantly. In the following, the pragma estimates that the code for the usual cases (where x is finite) will be executed 9999 times out of 10000:
if (isfinite(x)) {
#pragma Estimated_Frequency 0.9999
/* code for usual cases */
}
else /* x is infinite or NaN */ {
/* code for rare cases */
}
Equivalently, the code for the rare cases could have been estimated at 0.0001.
Also, the Estimated_Frequency pragma (with an integral estimate) can be placed in the body of a loop to estimate the trip count for the loop, for example:
for (...) {
#pragma Estimated_Frequency 1000
...
}
6.5.3 Avoiding C/C++ aliasing problems
The capability of HP compilers to inline math functions and to software pipeline loops makes it easy to implement fast vector functions. For example, the following code implements a vector version of the fmax function:
#include <math.h>
void vfmax(int n, const double * a, const double * b, double * y) {
int i;
for (i=0; i<n; i++) *y++ = fmax(*a++, *b++);
return;
}
Given no additional information (and without interprocedural optimization enabled), the compiler is severely limited in its ability to software pipeline, because it must assume that a , b , and y might point to overlapping storage; hence, it must take care to assure that loads through a and b are not affected by any preceding stores through y. If you inform the compiler that a , b , and y will not point to overlapping storage, the compiler can produce much more efficient code.
With HP C/C++ compilers, you can use the +Onoparmsoverlap option to declare that function parameters do not point to overlapping storage.
Alternatively, you can use the C99 restrict typequalifier for this purpose, as shown in the following version of code:
#include <math.h>
void vfmax(int n, const double * restrict a,
const double * restrict b, double * restrict y) {
int i;
for (i=0; i<n; i++) *y++ = fmax(*a++, *b++);
return;
}
Compilers that do not yet support the C99 restrict keyword may provide similar functionality under another name. For example, the HP C compiler does support C99 restrict, but also provides __restrict, even in non-C99 modes, and can handle the code above if invoked with the -Drestrict=__restrict option.
6.5.4 Optimizing sum reductions
By default, the HP C/C++ compilers do not reassociate floating-point operations. However, the stakes for reassociating are particularly high in the case of loop summations, called “sum reductions”. For example:
sum = 0;
for (i=0; i<n; i++) sum = sum + x[i];
The source code specifies the following calculation order:
...(((x[0] + x[1]) + x[2]) + x[3]) + ...
Doing the calculation this way might be crucial, for example, if same-sign values of x had been ordered from least to greatest magnitude to minimize accumulation of rounding errors. However, more often the values of x are not specially ordered, nor does the programmer intend any particular order of summation. The compiler can greatly speed up the calculation of sum reductions by grouping the terms and summing the groups in parallel (tree fashion).
When AR0709 and later compilers reassociate terms in a float (single precision) or double sum reduction, they evaluate partial sums to the wider range and precision of the 80-bit extended format, before delivering the final sum in float or double. This greatly reduces variations due to reordering.
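The difference between source-order and tree-order summation, and the stabilizing effect of wide partial sums, can be sketched in portable C. The helper names are illustrative, and long double stands in here for the 80-bit extended format used on Integrity servers:

```c
#include <stddef.h>

/* Left-to-right summation, exactly as the source order specifies. */
static double sum_sequential(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Tree-fashion summation: each half summed independently, as a
   compiler allowed to reassociate a sum reduction might do. */
static double sum_tree(const double *x, size_t n)
{
    if (n == 0) return 0.0;
    if (n == 1) return x[0];
    return sum_tree(x, n / 2) + sum_tree(x + n / 2, n - n / 2);
}

/* Sequential summation with a wide accumulator, analogous to the
   extended partial sums described above. */
static double sum_wide(const double *x, size_t n)
{
    long double s = 0.0L;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return (double)s;
}
```

For x = {1, 2^-53, ..., 2^-53} the sequential sum absorbs every tiny term into 1.0, the tree sum keeps most of them (the tiny terms combine with each other before meeting the 1.0), and the wide accumulator keeps all of them until the final rounding.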
HP compilers offer the +Osumreduction option to allow reassociation for computing sum reductions specifically, while not licensing the compiler to reassociate generally. The +Osumreduction option is in effect by default with the HP Fortran compiler. (The Fortran default can be overridden with +Onosumreduction.)



7 Using wide types for robust code 


HP's floating-point facilities on Integrity servers enable programming techniques for simpler, more robust code. Describing code as more robust means that it delivers useful results for a broader range of inputs.
7.1 Why floating-point calculations fail
Results from floating-point calculations may not be useful because the input is too close to where the problem, or the method chosen to solve the problem, is mathematically ill-posed. A general rule applicable in various problem areas is
error ≈ (precision roundoff) / (distance from ill-posed)
(This rule is discussed in W. Kahan's paper Marketing versus Mathematics, in the section entitled "How do Errors get Amplified? By Singularities Near the Data...".)



Proportionality to the precision roundoff implies that using greater precision in the calculation reduces the error.
Also, calculations may fail because of intermediate overflow or underflow, though the desired result would be within range. In these cases, overflow and underflow might be avoided if intermediate calculations use greater exponent range.
7.2 Wide types help
Using a wider type for internal calculations is often an easy and effective strategy for improving robustness. The alternative of careful error analysis might show problems in the calculation that could be circumvented without (or with more local use of) wide precision; however, error analysis for even modest calculations typically proves daunting, even for numerical analysts.
To help with wide evaluation, HP offers two wide types, extended and quad, each with more range and precision than the float (single) and double types typically used for input data and results (see 2.1). The extended type has 11 more bits of precision than double, which generally reduces the number of failures due to precision roundoff by a factor of roughly 2000. The quad type has 49 more bits of precision than extended. Both extended and quad provide 4 more exponent bits than double, increasing the exponent range by a factor of 16, which is usually sufficient to eliminate overflow and underflow in calculations on float or double inputs. The extended type is particularly attractive in this context because extended arithmetic on Itanium systems is as fast as float or double. Even quad might be considered for performance-sensitive applications that need more precision than extended for a small number of operations and functions. Elementary functions for extended are about 0.7 times as fast as corresponding ones for double when both are calls to libm, rather than inlined. Elementary functions for quad are about 0.25 times as fast as corresponding ones for extended.
Wide floating-point types are a notable advantage for Itanium-based systems over most RISC systems. RISC systems typically have no hardware support for types wider than double. They may provide the IEEE quad type implemented in software, or they may provide a non-IEEE, double-double type, using two double representations for the head and tail of a value. The double-double type is problematic because its exponent range is no wider than double, its precision depends on the value being represented, and regular treatment for edge cases is difficult. In contrast, Itanium-based systems have full hardware support for the extended type, and have features that have been used to implement substantially faster IEEE quad functions than RISC systems can offer.
7.3 Extended and quad nomenclatures
HP C/C++ on Integrity servers provide several features that facilitate using the wide types. First, they support nomenclature for the various programming constructs involving extended and quad . The following prints the extended and quad logarithms of 1/3 of the maximum extended and quad values:
cc -fpwidetypes
#include <stdio.h>
#include <math.h>
#include <float.h>
int main ( ) {
printf("%.20hLe\n", logw((1.0W/3.0W) * EXT_MAX));
printf("%.35lLe\n", logq((1.0Q/3.0Q) * QUAD_MAX));
}
The code illustrates C I/O specifiers, function suffixes, constant suffixes, and macros for the extended and quad types. Note that use of the nomenclature requires the -fpwidetypes option.
7.4 Option for automatic wide evaluation
The C/C++ compile option
-fpeval={float|double|extended}
causes evaluation of narrower precision binary operations and floating constants to the type specified in the option.
-fpeval=float — binary operations and floating constants are evaluated to their semantic type
-fpeval=double — binary operations and floating constants of type float are evaluated to double
-fpeval=extended — binary operations and floating constants of type float or double are evaluated to extended
The evaluation methods for -fpeval=float and -fpeval=double adhere to the C99 specification (corresponding to FLT_EVAL_METHOD = 0 and FLT_EVAL_METHOD = 1, respectively). The evaluation method for -fpeval=extended follows the C99 specification for evaluation to long double, except that evaluation is to extended instead of long double. HP's default is -fpeval=float (consistent with PA-RISC compilers).
The irregular use of wide types by some compilers (for example, narrowing wide intermediate values when registers are “spilled”) has unfairly blemished the reputation of wide evaluation. HP compilers do wide evaluation based on the well-defined methods in C99. Results are just as predictable with wide evaluation as with any other method.
Note: HP C/aC++ Patches PHSS_39823 (for 11i v2) and PHSS_39824 (for 11i v3), June 2009, provided some important defect repairs for automatic wide evaluation. (See the HP-UX patch database.)
7.5 Evaluation types: float_t and double_t
The <math.h> header defines the C99 types float_t and double_t, which are the types used for evaluating float and double operations and floating constants. For example, by default float_t is float and double_t is double, but under -fpeval=extended both are extended. The float_t and double_t types can be used for declaring intermediate variables that retain the extra range and precision in wide-evaluated expressions (see Example 9.2.2).
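A minimal portable illustration (the function is hypothetical): declaring an intermediate as double_t lets the same source keep wide intermediates under a wide evaluation method, while remaining plain double under the default method:

```c
#include <math.h>    /* float_t, double_t, FLT_EVAL_METHOD */

/* double_t matches the evaluation type of double expressions:
   plain double when FLT_EVAL_METHOD is 0, a wider type otherwise.
   The intermediate t therefore never discards precision that the
   evaluation method had already provided. */
static double scaled_sum(double a, double b, double scale)
{
    double_t t = a * scale + b * scale;   /* keeps any extra width */
    return (double)t;                     /* single final narrowing */
}
```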
7.6 Type-generic math functions
When the C99 <tgmath.h> header is included, an unsuffixed math function takes on a type determined by its arguments. The mechanism supports all floating types. This feature is similar to C++ overloading and Fortran intrinsics, but restricted to system library functions of floating type. For example:
cc -fpwidetypes
#include <tgmath.h>
extended x, y;
float complex z, r;
...
y = log(x); // equivalent to logw(x)
r = log(z); // equivalent to clogf(z)
Example 9.2 illustrates the use of wide types and their supporting features to improve robustness.



8 Porting floating-point code 


The floating-point programming model is not the same on all systems. Programs that compile and run correctly on one system may fail to compile or may produce different results on another. The most common floating-point-related causes are differences in expression evaluation, optimization, floating-point formats, available functionality, and the treatment of special cases (each discussed below).
The HP floatingpoint programming model for Integrity servers is designed to facilitate importing code—most code will port easily and will work at least as well as on other systems.
8.1 Expression evaluation
The “expression evaluation method” refers to the way the compiler determines the format to which floating-point operations are rounded.



Some compilers evaluate each floating-point operation based strictly on the types of the operands. This method is sometimes referred to as “to type”. C99 implementations that define FLT_EVAL_METHOD to be 0 use this method.
Other compilers evaluate floating-point expressions in types that have more exponent range and/or precision than the operands. C99 specifies two fully predictable wide evaluation methods, which the implementation announces by defining FLT_EVAL_METHOD to be 1 or 2, according to whether its minimal evaluation type is double or long double. (Historically, use of wide types has been irregular and unpredictable from the program source, with narrowing conversions occurring simply because the compiler runs out of registers.)
The usual effect of differences in expression evaluation is that floating-point results differ in a few low-order bits. These minor deviations in results are usually not significant, though they commonly cause last-digit differences in printed output. When the method of expression evaluation does matter, wide evaluation is more forgiving. Its extra precision may avoid troublesome accumulation of roundoff errors, particularly in numerically unstable code. Its extra exponent range may avoid intermediate overflow and underflow—calculations with the single-precision type are most susceptible. Thus, programs compiled with wide evaluation and used successfully may fail when ported to a system whose compiler uses narrower evaluation. Porting from narrow to wider evaluation is less risky (ignoring historical irregular use of wide types).
Though uncommon, some codes depend on rounding to a particular precision and may fail totally if operations are rounded to a precision other than the one intended—even if rounded to more precision. Ideally, such code should contain explicit constructs to assure the requisite rounding regardless of the expression evaluation method.
HP C/C++ compilers provide three expression evaluation methods that help with porting risks. To minimize expression evaluation problems when porting code, choose an evaluation width on the port-to system that matches or exceeds the evaluation width used on the port-from system. For example, -fpeval=extended is a good choice for code ported to HP Integrity servers from IA-32 systems, where compilers commonly evaluate to the width, or at least the exponent range, of the IA-32 floating-point registers (which use the 80-bit IEEE extended format).
8.2 Optimization
Some optimization of floating-point code may change results from what the source code and underlying computer arithmetic would imply. Examples include contractions, reassociation, replacing division by multiplication of the reciprocal, distributing and factoring, use of faster but less accurate sequences for built-in functions, and flush-to-zero underflow (each discussed below).
Most compilers perform result-changing optimizations, at least under some options. When porting code, refrain from enabling (non-default) value-changing optimizations until the code is up and running correctly on the new system. A more cautious strategy is to disable all value-changing optimizations for this initial stage of the port.
8.2.1 Contractions
Contractions alter results by omitting rounding errors. They are the only value-changing optimizations that C99 permits and the only ones that HP C and C++ compilers do by default. They are suitable for most applications. However, contractions do alter results in ways not predictable from the source code, and different compilers use different methods to decide what to contract. Therefore, contractions are not suitable where last-bit reproducibility or specified rounding is strictly required.
8.2.2 Reassociation
If reassociation is allowed, the compiler may replace (a + b) + c with a + (b + c) and (a * b) * c with a * (b * c) where more efficient object code might result. Note that C and C++ specify addition and multiplication to be “left associative”, so that a + b + c (without parentheses) is equivalent to (a + b) + c. It is easy to find examples where the associative property fails for floating-point numbers, and applying it subverts some useful coding techniques. For example, if a exceeds b in magnitude, then the rounding error e in a + b can be computed as
s = a + b;
e = (s - a) - b;
provided the compiler does not reassociate the calculation of e and compute s - (a + b), which would evaluate to zero (instead of the rounding error).
Most codes are written without concern for the subtleties of floating-point rounding, and so (a + b) + c is as likely to be useful as a + (b + c), even though the two may well differ in their low-order bits.
HP compilers may reassociate under the (non-default) +Ofltacc=relaxed option. See 6.5.4 regarding reassociation in the computation of sum reductions.
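The error-extraction idiom can be checked with a runnable version, a sketch assuming default, non-reassociating compilation:

```c
/* Exact rounding error of a + b, valid when |a| >= |b|: returns
   s - (a + b) computed exactly, where s = fl(a + b).  The
   statement-by-statement form helps keep a compiler from rewriting
   the expression as s - (a + b), which is identically zero. */
static double sum_error(double a, double b)
{
    double s = a + b;   /* rounded sum */
    double t = s - a;   /* exact when |a| >= |b| */
    return t - b;       /* exact: equals s - (a + b) */
}
```

For a = 1.0 and b = 2^-60, the sum rounds to 1.0 and the extracted error is -2^-60; for an exact sum such as 1.0 + 0.5 the error is zero.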
8.2.3 Replacing division by multiplication of the reciprocal
The expressions a/b and a*(1/b) are not quite equivalent for floating-point numbers. Multiplication is faster than division on Itanium (and many other) processors. Thus, replacing a/b with a*(1/b) can speed up code if b is loop-invariant, so that 1/b can be computed outside the loop, or if b is a constant and 1/b can be computed at compile time.
HP compilers may replace division by multiplication of the reciprocal under the +Ofltacc=relaxed option. They can achieve much of the benefit of this optimization even under +Ofltacc=strict (which disallows all value-changing optimizations), because the Itanium instruction set facilitates fully correcting the optimized result with only a modest cost in performance.
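The non-equivalence is easy to observe in portable C. The helper is illustrative, and b = 49 is just an arbitrary divisor whose reciprocal is inexact:

```c
/* Counts numerators a in [1, n] for which a/b and a*(1/b) produce
   different double results, with the reciprocal hoisted out of the
   loop the way an optimizer would hoist it. */
static int count_recip_mismatches(double b, int n)
{
    double r = 1.0 / b;   /* rounded reciprocal, computed once */
    int mismatches = 0;
    for (int a = 1; a <= n; a++)
        if ((double)a / b != (double)a * r)
            mismatches++;
    return mismatches;
}
```

With b = 2.0 the reciprocal is exact and the two forms always agree; with b = 49.0 they disagree for a substantial fraction of numerators, which is why an uncorrected reciprocal substitution is a value-changing optimization.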
8.2.4 Distributing and factoring
The distributive law a*(b + c) == a*b + a*c does not hold for floating-point arithmetic. Particularly bad is the case where b and c are of nearly equal magnitudes but with opposite signs. Then the replacement of a*(b + c) by a*b + a*c replaces a cancellation of the exact values b and c, which is itself exact, with a cancellation of the inexact values a*b and a*c, which may lose precision. For example, if double precision a = 3, b = 2^{52} + 1, and c = −2^{52}, then a*(b + c) evaluates to 3 exactly; however, a × b is 3 × 2^{52} + 3, which rounds to 3 × 2^{52} + 4 in double precision, and so a*b + a*c evaluates to 4—a 33% error.
HP compilers may replace a*b + a*c by a*(b + c) under the +Ofltacc=relaxed option.
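The worked example can be checked directly; volatile is used here only to keep a compiler from contracting the distributed form into an FMA, which would mask the error:

```c
/* a*(b + c) with a = 3, b = 2^52 + 1, c = -2^52: b + c == 1 exactly,
   so the factored form is exact. */
static double factored(double a, double b, double c)
{
    return a * (b + c);
}

/* The distributed form a*b + a*c: a*b rounds away its low bits, and
   the subsequent cancellation exposes the rounding error. */
static double distributed(double a, double b, double c)
{
    volatile double p1 = a * b;   /* 3*2^52 + 3 rounds to 3*2^52 + 4 */
    volatile double p2 = a * c;   /* -3*2^52, exact */
    return p1 + p2;
}
```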
8.2.5 Use of faster but less accurate sequences for built-in functions
Most system math functions are not correctly rounded for all domain values; their results vary from system to system, a little or a lot depending on the quality of the implementations. HP's and other very high-quality implementations are nearly correctly rounded: they achieve the ideal result in most cases and miss only where the mathematical result is close to halfway between the ideal and delivered results. Two such implementations will produce mostly the same results, or results that differ only in the low-order bit.
HP compilers may invoke faster, slightly less accurate math functions under the +Ofltacc=relaxed option. Even the “relaxed” functions are accurate to common industry practice.
8.2.6 Flush-to-zero underflow
Intel Itanium processor systems, among others, require extra processing time to handle subnormal values in the manner prescribed by the IEEE floating-point standard. Therefore, they provide a mode that trades off the benefits of gradual underflow in order to improve performance by “flushing” subnormal result values to zero.
Occasionally, a program that runs correctly with gradual underflow fails under flush-to-zero. More rarely does a program's correctness depend on having flush-to-zero behavior. Also, flush-to-zero modes do not behave the same on all systems—some flush both subnormal operands and subnormal results to zero, and others flush only results.
HP-UX provides gradual underflow by default, a linker option +FPD to install flush-to-zero mode at program startup, and an <fenv.h> function fesetflushtozero to set flush-to-zero mode at execution time. Flush-to-zero mode on Itanium systems flushes subnormal results, but not operands, to zero. On PA-RISC systems, it flushes subnormal operands, as well as results, to zero.
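Under the default gradual underflow, results just below the normal range keep (reduced) precision as subnormals. The portable C99 check below shows exactly what a flush-to-zero mode would discard:

```c
#include <math.h>    /* fpclassify, FP_SUBNORMAL */
#include <float.h>   /* DBL_MIN */

/* Returns 1 when underflow is gradual: half the smallest normal
   double is a nonzero subnormal.  Under a flush-to-zero mode the
   same quotient would instead be replaced by 0.0. */
static int underflow_is_gradual(void)
{
    volatile double tiny = DBL_MIN;     /* smallest normal double */
    volatile double half = tiny / 2.0;  /* underflows the normal range */
    return fpclassify(half) == FP_SUBNORMAL && half > 0.0;
}
```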
8.3 Floating-point formats
Today, most implementations use the IEEE standard single and double formats for the standard float (real) and double (real*8) types. However, differences abound in the support of wider floating-point types. The C/C++ long double format is required only to be at least as wide as the double one. Formats in use for long double include:
 128-bit IEEE double-extended
 80-bit IEEE double-extended
 64-bit IEEE double
 “double-double”—a head-tail representation using a pair of IEEE doubles
Faithful porting is easiest between systems using the same formats for the same types. Generally, having port-to formats that are at least as wide as the port-from formats helps preserve correctness. However, performance-sensitive code that uses the long double type where just a little extra precision or range is needed might be better served by faster 80-bit arithmetic than slower 128-bit arithmetic.
Of course, a port-to platform that uses the double format for the long double type will not be adequate for code that depends on long double having extra precision or range.
The (non-IEEE) double-double format provides more precision than double, but is problematic because it has no extra exponent range beyond double, it has no fixed precision, and it does not support regular treatment of special cases.
HP-UX uses a 128-bit IEEE double-extended format for long double, which allows importing long double code without loss of precision or exponent range. The HP C/C++ compilers for Integrity servers also fully support both 80-bit and 128-bit types, under the names extended and quad.
8.4 Functionality
Not all systems offer the same functionality. Missing types or interfaces usually cause compiler diagnostics when programs are ported. Behavior differences between the port-from and port-to platforms may cause more subtle problems that can be detected only through extensive testing or use.
Standards help. Older standards such as C89, UNIX 95, and Fortran 90 are widely implemented. The C99/UNIX 03 standard offers extensive functionality for numerical programming, including support for the IEEE floating-point standard, and is mostly—if not entirely—available for major systems.
HP-UX for Integrity servers provides the standard math-related interfaces as described in Chapter 1, as well as other frequently used (non-standard) math functions.
8.5 Treatment of special cases
Special cases (for example, division by zero, overflow, and so forth) are inherent in floating-point arithmetic, and robust code must handle them in a reasonable way—deliver a useful result or an intelligible diagnostic. The IEEE standard provides special values (infinity and NaN) and exception flags (invalid, overflow, underflow, and division-by-zero) to help deal with special cases. However, historically, programming language and library support for these features has been irregular, incomplete, and inconsistent from one system to the next. Thus programs, particularly ones that need to be ported, have been unable to use the features.
The C99/UNIX 03 support for the IEEE standard, including the features for handling special cases, is now becoming widely available. HP C compilers and libraries fully adhere to the C99 specification.
With the +Olibmerrno option, HP math functions set errno according to the UNIX 95 standard and the optional errno specification in C99. Additionally, they return special-case results according to UNIX 95 instead of C99, in the few cases where the standards differ. Because of the performance costs, this option should be used only if truly required.



9 Examples 

9.1 Example: Using wide accumulators
Wide accumulators are effective for improving accuracy and avoiding intermediate overflow and underflow. Following are three coding approaches to using wide accumulators that have different implications for compatibility, quality, and performance.
9.1.1 Using automatic wide evaluation
The following code computes the double-precision dot product of vectors x and y :
#include <math.h>
double ddot(int n, const double * x,
const double * y) {
double_t sum = 0;
int i;
for (i=0; i<n; i++)
sum += (*x++) * (*y++);
return sum;
}
If expressions are automatically evaluated to the 80-bit extended format, as under HP's “-fpeval=extended -fpwidetypes” options, then double_t is extended (per C99), and the extra 11 bits of significance in extended vs. double will reduce the accumulated roundoff error by a factor of about 2000—a factor of 2 for each extra bit.
The following computes the double-precision 2-norm of the vector x :
#include <tgmath.h>
double dnorm2(int n, const double * x) {
double_t sum = 0;
int i;
for (i=0; i <n; i++, x++)
sum += (*x) * (*x);
return sqrt(sum);
}
If expressions are automatically evaluated to the 80-bit extended format, as under HP's “-fpeval=extended -fpwidetypes” options, then double_t is extended (per C99), the type-generic sqrt is actually the extended sqrtw, and the extra exponent range of extended versus double essentially eliminates the possibility of intermediate overflow or underflow. Without wide evaluation, this code will fail if all x components are too small (so their squares underflow significantly) or at least one is too large (so its square overflows).



On Itanium systems, arithmetic with extended is essentially as fast as with double ; therefore, with HP compilers for Integrity servers, any performance cost for the quality benefits of using extended is negligible in these examples. Moreover, without wide evaluation, substantially more complicated code is required for comparable quality. Robust dnorm2 code for a system that does not support a wider type typically involves scaling.
The codes above will compile and run on any C99 implementation. However, the described benefits depend on evaluation to a format with more precision and range than double , and C99 does not require that implementations provide such evaluation.
9.1.2 Using long double
The following codes use the long double type explicitly for internal computation; hence, their quality does not rely on the implementation to provide wide evaluation, but only on long double being wider than double. The performance cost is significant on HP-UX and other systems that implement long double as quad.
double ddot(int n, const double * x, const double * y) {
long double sum = 0;
int i;
for (i=0; i<n; i++) sum += (long double)(*x++) * (*y++);
return sum;
}
#include <math.h>
double dnorm2(int n, const double * x) {
long double sum = 0;
int i;
for (i=0; i<n; i++, x++) sum += (long double)(*x) * (*x);
return sqrtl(sum);
}
9.1.3 Using extended
For HP and other systems that support the same nomenclature for fast extended types, the codes can be written explicitly for extended evaluation, thus assuring both quality and performance, without relying on implicit wide evaluation.
#include <float.h>
double ddot(int n, const double * x, const double * y) {
extended sum = 0;
int i;
for (i=0; i<n; i++) sum += (extended)(*x++) * (*y++);
return sum;
}
#include <math.h>
double dnorm2(int n, const double * x) {
extended sum = 0;
int i;
for (i=0; i<n; i++, x++) sum += (extended)(*x) * (*x);
return sqrtw(sum);
}
With HP compilers, these codes must be compiled with the -fpwidetypes option, and ddot must include <float.h> or another header that defines the extended type. Table 14 summarizes the three approaches:
Table 14: Comparison of approaches for wide accumulation

Approach                         | Portability                    | Quality                                           | High performance
Using automatic wide evaluation  | Yes (assuming C99)             | Yes for HP—requires evaluation wider than double  | Yes for HP—requires fast evaluation type
Using long double                | Yes                            | Yes for HP—requires long double wider than double | No for HP—requires fast long double
Using extended                   | Requires extended nomenclature | Yes                                               | Yes for HP—requires fast extended
9.1.4 Sum reduction
The +Osumreduction option is appropriate for the examples above (unless the data has been ordered to control roundoff or lastbit reproducibility is required) and improves performance substantially by allowing the compiler to evaluate parts of the sum concurrently. Wide evaluation hides most of the result variability from summing in different orders.
9.2 Example: Using wide types for robust log(ab + cd)
The problem is to calculate log(ab + cd) where double is sufficient to represent the inputs a, b, c, and d, and ab + cd is known to be nonnegative. Here is a straightforward solution using double :
#include <math.h>
double a, b, c, d, res;
double s;
...
s = a*b + c*d;
res = log(s);
However, this code is problematic. First, a log function is necessarily unstable where a*b + c*d is inexact and near 1. For example, the calculation loses all significant bits if a*b = 1 and c*d < 2^{−53}, because then a*b + c*d yields 1 and log of 1 returns 0, whereas log(1 + x) is approximately x when x is small.
Also, a*b + c*d can lose all significant bits from cancellation if a*b and c*d are nearly negatives of each other and one or both are inexact.
Finally, the computation of a*b + c*d can overflow or underflow the double format, though the true value of log(ab + cd) is always safely in the double range.
9.2.1 Wide expression evaluation
Simply compiling with -fpeval=extended makes the code in 9.2 somewhat more robust. The effect is that a*b + c*d is computed in extended, retaining 11 extra bits in the products, which greatly reduces the number of possible inputs that would cause harmful cancellation. However, the assignment to s still yields a double value. The log arguments near 1 are not represented with any extra precision. And the extended sum a*b + c*d can overflow or underflow when coerced to the range of double.
9.2.2 Wide local variables and type-generic math functions
The following code to calculate log(ab + cd) addresses the three robustness problems noted in 9.2:
cc -fpwidetypes -fpeval=extended ...
#include <tgmath.h>
double a, b, c, d, res;
double_t s;
...
s = a*b + c*d;
res = log(s);
Here the -fpeval=extended option causes double_t to be defined as extended, and the inclusion of <tgmath.h> causes the reference to log to be type generic. Thus, s is computed fully in extended, the extended log function is used, and coercion to double occurs only in the final assignment to res. With this version, the likelihood of encountering each of the two precision problems has been reduced by a factor of roughly 2000, and the risk of intermediate overflow and underflow has been completely eliminated.
Note that writing
res = log(a*b + c*d);
does not suffice for fully wide calculation, because the semantic type of the argument, not its evaluation format, determines the type of the generic log function. (This specification was chosen for C99 so that enabling wide evaluation would not automatically incur the extra performance cost of wider library functions.)
9.3 Example: Using FMA to simplify algorithms
9.3.1 Computing multiplication error
Fused Multiply-Add can be used to compute the exact roundoff error e in the product x * y:
p = x * y;
e = fma(x, y, -p);
Note that the error cannot be computed reliably with
e = x * y - p;
Under +Ofltacc=default and more lenient options, the compiler is allowed, but not required, to synthesize FMAs. The compiler might evaluate x * y - p with a multiply and a subtract instead of with an FMA, which would always produce zero instead of the error in p.
On HP systems, the fma function (for float, double, and extended) is implemented with an inlined Intel Itanium Floating-point Multiply Add instruction. The fma function, together with the facility to disable compiler synthesis of FMA, provides the control needed for algorithm development (see 6.3.3). The compiler may generate a Floating-point Multiply Subtract instruction for fma(x, y, -p). Likewise, the compiler may generate a Floating-point Negative Multiply Add instruction for a call such as fma(-x, y, z).
9.3.2 Computing e^{xy} accurately and efficiently
This example shows the analysis behind the use of FMA to compute e^{xy} to extended precision, accurately and efficiently, without use of a wider type. Consider this straightforward program:
extended x, y, r;
r = expw(x*y);
The relative error in the exponential function is proportional to the absolute error in the argument. The rounded result of x*y can be in error by as much as about 2^{-50} (which is a unit in the last place of 2^{14}, the largest value for which expw does not overflow). An absolute error of 2^{-50} in x*y induces a relative error of about 2^{-50} in expw. Thus, the low-order 14 bits of the 64-bit result significand may be corrupted.
The product xy can be written as the exact sum high + low, where high is the computed x*y and low is the error, which fmaw(x, y, -x*y) computes exactly. Because
e^{low} = 1 + low + low^{2}/2 + ..., where |low| < 2^{-50}
we have the approximation
e^{xy} = e^{high+low} = e^{high}×e^{low} ≈ e^{high}×(1+low) = e^{high}+low×e^{high}
Computation of e^{high} produces, at most, just slightly more than .5 ulp error and the multiply-add produces, at most, .5 ulp error. Therefore, the following code calculates e^{xy} with, at most, just slightly more than 1 ulp error.
#include <math.h>
extended x, y, r, high, low;
high = x*y;
low = fmaw(x, y, -high);
r = expw(high);
r = fmaw(r,low,r);
9.4 Example: Double rounding error
The function
float ssqf (float x, float y) {
return (double)x * x + (double)y * y;
}
discussed in 9.5.3 does not always round correctly. The double format is enough wider than float that both products are exact, but the sum can produce a double-precision rounding error and the conversion to float can produce a second rounding error. In most cases, the double-precision rounding error is inconsequential to the float result, but not always. For example, if x = 1 + 2^{-12} and y = 2^{-27}, then ssqf returns the value 1 + 2^{-11}, though the correctly rounded float sum of squares would be 1 + 2^{-11} + 2^{-23}. An error incurred as a result of rounding to a wider format and then rounding again to a narrower format is sometimes referred to as a “double rounding error.” The following uses an HP inline assembly intrinsic to round the sum directly to single precision, thereby avoiding the double rounding error and producing a correctly rounded result:
#include <machine/sys/inline.h>
float crssqf (float x, float y) {
return _Asm_fadd(_PC_S, (double)x * x, (double)y * y);
}
Either implementation could use FMA to save an operation, though this would have no effect on the numerical result, since the products are computed exactly with or without FMA. Here is an implementation equivalent to ssqf but using the double fma function:
#include <math.h>
float ssqf (float x, float y) {
return fma(x, x, (double)y * y);
}
9.5 Example: Managing the floating-point environment
The dynamic, global nature of the control modes in the floating-point environment is potentially hazardous in that a function might be called with different modes in effect than the function was designed and tested for. The dynamic, global nature of the exception flags is potentially confusing (or worse) because exceptions that may be spurious byproducts of a function’s internal calculations are globally visible to the function’s callers. For HP and other systems that support trap enable modes, the management of modes and the management of flags are closely related, because exception signals are affected by the modes and affect the flags.
To help deal with the hazards of the floating-point environment, C99 specifies the following programming conventions:
(A) A function call does not alter its caller’s floating-point control modes, clear its caller’s floating-point status flags, or depend on the state of its caller’s floating-point status flags, unless the function is so documented.
(B) A function call is assumed to require default floating-point control modes, unless its documentation promises otherwise.
(C) A function call is assumed to have the potential for raising floating-point exceptions, unless its documentation promises otherwise.
There are three basic approaches a function writer might take for dealing with the floating-point environment: ignore the environment (9.5.1), guard against alternate modes while signaling deserved exceptions (9.5.2), or honor the environment (9.5.3). All three approaches are consistent with the C99 programming conventions.
9.5.1 Ignoring the floating-point environment
Because of programming convention (B), most functions can be written and tested assuming the default to-nearest rounding and disabled traps, without particular concern for how the function would behave if called when alternate modes are in effect. If called (contrary to the convention) with a different rounding mode, such a function might produce errors somewhat different than expected; if called (contrary to the convention) with traps enabled, whether traps occur might not be predictable from the function’s documentation.
As anticipated by programming convention (C), exceptions may be signaled simply as a natural byproduct of floating-point operations. This will automatically flag problematic conditions, though some of these exceptions may be spurious in the sense that they do not convey anything meaningful about the result values of the function (see the example in 9.5.3).
Runtime performance is not an issue with this approach (9.5.1).
9.5.2 Guarding against alternate modes, signaling deserved exceptions
Facilities in the <fenv.h> header can be used to assure that a function will not be affected by changes to the dynamic modes. The following shows a basic scheme:
#include <fenv.h>
... func(...) {
    #pragma STDC FENV_ACCESS ON
    fenv_t caller_env;
    float result;
    fegetenv(&caller_env);    // save caller's environment
    fesetenv(FE_DFL_ENV);     // install default environment
    ...                       // calculate result
    fesetenv(&caller_env);    // restore caller's environment
    return result;
}
Once installed, the default environment assures that default control modes are in effect for the function’s calculation. The save and restore of the caller’s environment ensures that the call does not perturb the caller’s control modes, in accord with programming convention (A).
One objection to this version of the scheme is that any floating-point exceptions signaled by the function are not detectable outside the function. Restoring the caller’s environment includes restoring the caller’s exception flag state, thus losing information about exceptions signaled by the called function. This objection can be addressed by replacing the last fesetenv (which restores the caller’s environment) with feupdateenv. This, too, restores the caller’s environment, but then also resignals (in the caller’s environment) any exceptions whose flags were raised by the function’s calculation.
As intended, this scheme assures that the function will produce identical results—regardless of what modes are in effect when the function is called. However, this constancy might defeat an attempt to use the sudden underflow mode to improve performance. It might also undermine the technique of running and comparing results under each of the rounding modes to assess numerical stability, as in Example 9.9.
See 6.3.2 regarding the performance implications of using #pragma STDC FENV_ACCESS ON and functions that access the floating-point environment.
9.5.3 Honoring the floating-point environment
Ideally, functions would honor modes and signal exceptions just like the basic floating-point operations do. High-quality library functions approach this ideal. Such functions are implemented to behave in a manner consistent with whatever modes are in effect when the function is called.
Honoring trap enablement modes means taking enabled traps when and only when the function, considered as an atomic operation, encounters the corresponding exceptional condition. The following function computes the float sum of two squares:
float ssqf (float x, float y) {
return (double)x * x + (double)y * y;
}
Since the code evaluates in double precision, the calculation of the sum of squares cannot overflow or underflow. Any overflow or underflow in the conversion to float for the return value is appropriate to signal (because then the sum of squares really is too large or too small for the float format). This code also correctly signals the other exceptions, even inexact.
For functions that produce correctly rounded results, the rounding modes have precisely defined effects. For other functions, the directed rounding modes (upward, downward, toward zero) have the general meaning that the delivered result is in the specified direction (upward, downward, toward zero) from the mathematical result. The ssqf function does not always produce a correctly rounded result, but it does honor the general meaning of the directed rounding modes. Most calculations do not naturally honor the directed rounding modes, and implementing them to do so can be a considerable challenge. Note that to round the expression f(x) – g(x) in one direction requires rounding g(x) in the opposite direction.
Coding to signal only those exceptions that are deserved is typically much trickier than the preceding example might suggest, especially if wide evaluation is not an option, as even the simple calculation of x*x + y*y below illustrates. Without wide evaluation, a product x*x or y*y might underflow even though the sum of the products is not small. The following computes a long double sum of squares and hides any underflow signals that occur in the calculation of a final result that is not subnormal or zero:
#include <fenv.h>
#include <float.h>
long double ssql(long double x, long double y) {
    #pragma STDC FENV_ACCESS ON
    fenv_t caller_env;
    long double res;
    feholdexcept(&caller_env);   // save caller environment,
                                 // disable all traps, and
                                 // clear all flags
    res = x*x + y*y;
    if (!isless(res, LDBL_MIN)) feclearexcept(FE_UNDERFLOW);
    feupdateenv(&caller_env);    // restore caller environment,
                                 // then resignal exceptions
    return res;
}
The feholdexcept and feupdateenv functions work together to control the exceptions that the function exposes to its callers. The pragma is required in order to restrict the optimizer from moving the arithmetic instructions across the instructions that effect the environment changes. The !isless construct is used to handle NaN inputs in a reasonable way: the alluring alternative
if (res >= LDBL_MIN) feclearexcept(FE_UNDERFLOW);
would erroneously signal invalid if x or y is a (quiet) NaN and would erroneously fail to clear the underflow flag in the case that one of x or y is a NaN and the square of the other underflows.
See 6.3.2 regarding the performance implications of using #pragma STDC FENV_ACCESS ON and functions that access the floating-point environment.
9.6 Example: Calling C library functions from Fortran
To call a C math library function from a Fortran program, do the following:
- Use an !$HP$ ALIAS directive to tell the compiler that the function’s arguments are passed by value.
- Declare the function with the correct return type. (The C return types are listed in Table 4.)
- Link in the libm library. Linking with f90 does this automatically.
The following Fortran program calls j0 , one of the Bessel functions in the C math library:
C bessel.f
!$HP$ ALIAS J0 = 'j0'(%VAL)
PROGRAM BESSEL
DOUBLE PRECISION A, B, J0
A = 1.0
B = J0(A)
WRITE(*,*)"Bessel of", A, "is", B
END
The %VAL argument indicates that the argument is passed by value. For details on the !$HP$ ALIAS directive, see the HP Fortran Programmer's Reference or the HP Fortran Programmer's Guide.
Compiling and running the program shows the following:
% f90 bessel.f
bessel.f
program BESSEL
10 Lines Compiled
% a.out
Bessel of 1.0 is 0.765197686557967
9.7 Example: Using C99 complex arithmetic
C99 features allow for more natural and efficient code than is possible with more traditional complex arithmetic facilities. The following simple example illustrates:
Let z = x + y i, where x and y are real, and compute
w = (z – i) / (z + 2i) if (z – i) / (z + 2i) lies inside the unit circle;
w = 0 otherwise.
The code might contain
#include <complex.h>
double x, y;
double complex z, w;
...
z = x + y*I;
w = (z - I) / (z + 2*I);
if (cabs(w) > 1) w = 0;
A complex value is constructed from the real variables x and y by means of the natural notation x + y*I, which the compiler implements without the use of actual floating-point instructions. The quotient (z - I) / (z + 2*I), too, is expressed in mathematical-style notation. Because I has imaginary type and the promotion rules do not require real and imaginary operands to be converted to complex, the numerator z - I and the denominator z + 2*I can each be evaluated with just one real floating-point instruction. In contrast, more traditional promotion rules require that the product 2*I be promoted to (2+0i)(0+i), implying four real multiplies and two adds. Thus, C99 features for complex arithmetic allow natural, mathematical-style notation and entail built-in execution-time efficiency.
C99 infinity properties guarantee that evaluation of the quotient (z - I) / (z + 2*I) at the pole z = -2i will yield a complex infinity. Without the infinity property nonzero / zero = infinite, the quotient in this case might yield NaN+NaN i (and would, if computed with the ordinary division formula); then cabs(w) > 1 would be false (and would signal the invalid exception), and the result from the code above would be incorrect. Without the infinity property, the programmer would have to add extra code to test for z = -2i and handle that case separately. Thus, C99 features for complex arithmetic facilitate simpler algorithms.
9.8 Example: Precise representation for floating-point constants and data
Some numerical codes depend on floating-point constants or data being precise to the last bit. A careful algorithm may require precise values to preserve mathematical properties, or to match assumptions used in error analysis. Precise values may be necessary for results that are reproducible in other programming or execution environments. Historically, creating such constants has been problematic, forcing programmers to resort to mechanisms like type casts and unions, which might not be portable. Two C99 features help solve this problem:
- hexadecimal floating-point representation
- correct rounding between binary floating types and decimal character sequences
9.8.1 Hexadecimal floating-point representation
A number in a floating type can be printed with hexadecimal representation to obtain a character sequence that can be incorporated into a source constant or ASCII input datum that will result in a value equivalent to the original number. For example, to produce a hexadecimal constant equal to the float value sqrtf(2), execute
#include <stdio.h>
#include <math.h>
int main() { printf("%a\n", sqrtf(2)); }
to obtain
0x1.6a09e60000000p+0
This character sequence can be incorporated into a float constant, 0x1.6a09e6p0F , or used as an input datum, to obtain the desired value. (Trailing zeros in the significand are optional.)
A hexadecimal floating constant produced as in the example above results in the same numerical value even if compiled using a wider evaluation method, which is not generally true for decimal floating constants (see 9.8.3). Likewise, a hexadecimal floating-point ASCII datum will result in the same value if scanned into a wider floating-point object.
9.8.2 Correct rounding between binary floating types and decimal character sequences
C99 Annex F requires correct rounding for conversions between IEEE floating types and decimal character sequences with at most DECIMAL_DIG significant digits. DECIMAL_DIG is defined in <float.h> to be a number of decimal digits sufficient to distinguish all numbers in the widest supported IEEE binary floating type. On HP-UX, DECIMAL_DIG is 36, which is sufficient to distinguish all long double (quad) numbers.
Correct rounding implies a useful fact for devising decimal floating constants or ASCII input data: a number in a (binary) floating-point type can be converted to a decimal character sequence that can be converted back to recover the original number, provided the decimal character sequence contains enough significant digits. Table 15 shows the requisite number of significant decimal digits for each HP-UX floating type.
Table 15: Number of digits to distinguish floating-point numbers

HP-UX floating type | Significant decimal digits
float               | 9
double              | 17
extended            | 21
long double (quad)  | 36
To produce a decimal floating constant equal to the float value sqrtf(2), execute
#include <stdio.h>
#include <math.h>
int main() { printf("%.8e\n", sqrtf(2)); }  // 9 significant digits
to obtain
1.41421354e+00
This character sequence can be incorporated into a float constant, 1.41421354F, to obtain the desired result, provided a wide evaluation method is not in effect. Similarly, 1.41421354 will produce the desired result when converted into an object of type float, but not a wider type.
9.8.3 Wide evaluation effects on floating decimal constants
Wide evaluation methods widen the evaluation format for floating constants of a narrower type, which generally produces a different value. For example, if FP_EVAL_METHOD == 0, the HP-UX default, then the constant 0.1F has the float value nearest to one tenth. But if FP_EVAL_METHOD == 1, as under the HP-UX option -fpeval=double, then 0.1F has the double value nearest to one tenth, which is a more accurate representation of one tenth. The more accurate wider value is generally more useful, though not if an exact narrower value is required.
Floating decimal constants can be chosen to produce the same numerical value even when wide evaluation renders them in a wider format. The floating constants in <float.h> are chosen this way: thus, for example, FLT_MAX will have the maximum finite value in the float type regardless of the evaluation method. Alternatively, floating decimal constants can be chosen to take advantage of wide evaluation to produce a more accurate representation of a mathematical value. Examples below show both approaches.
To obtain a floating decimal constant that would not be affected by wider evaluation methods, use enough decimal digits to distinguish the desired value in the widest evaluation format. On HP-UX, the widest evaluation format is extended. Table 15 above shows that 21 decimal digits distinguish extended numbers. The following examples use 36 decimal digits, which are enough to distinguish even quad numbers.
To produce a float constant equal to the float value sqrtf(2), for any evaluation format precision up to quad, execute
#include <stdio.h>
#include <math.h>
int main() { printf("%.35e\n", sqrtf(2)); } // 36 significant digits
to obtain
1.41421353816986083984375000000000000e+00
The float constant 1.41421353816986083984375F serves the purpose. As the trailing zeros might suggest, the float value sqrtf(2) is exactly 1.41421353816986083984375.
Typically, a more accurate representation of a mathematical value is more helpful than a less accurate one, and having a value that is independent of the evaluation method is not necessary in most cases. In the following code, sqrtl(2) produces the best possible long double (quad ) approximation of √2 (sqrtl is a correctly rounded function). Printing with 36 significant decimal digits produces a decimal character sequence that will convert back to this same best long double approximation.
#include <stdio.h>
#include <math.h>
int main() { printf("%.35Le\n", sqrtl(2)); }
produces
1.41421356237309504880168872420969798e+00
1.41421356237309504880168872420969798F is a floating constant, of semantic type float , that will evaluate to an approximation of √2 accurate to the full precision of the evaluation format.
An L instead of F suffix would give the constant the semantic type long double . Then the constant would produce the best long double approximation of √2 regardless of the evaluation method. However, a long double constant might cause operations and functions to be evaluated to long double (even with narrow evaluation methods), which might have undesirable performance implications.
In rare cases, use of extra digits to express a floating decimal constant results in a very slightly less accurate value. This effect is similar to the double rounding errors that can occur with wide evaluation. See 9.4.
9.8.4 Constants for mathematical expressions
In the examples above, the square root functions were used to calculate correctly rounded approximations of √2 as values in floating types. These values were printed to obtain floating decimal constants or data that would result in precise approximations of √2. A program might require floating constants representing more complicated expressions involving several operations and function values. The calculation of such an expression may suffer multiple roundoff errors. Also, most math library functions are not correctly rounded for all inputs (see 5.5). The accuracy of the calculated expression, and hence of the printed decimal character sequence, may be exceedingly difficult to estimate. In many cases, calculating to wider precision will produce a value that can be printed to obtain a constant or datum, for a narrower type, that is correctly rounded, or very nearly so. See 7.
9.9 Example: Technique for assessing roundoff
As noted in 3.3, roundoff error is inherent in floatingpoint arithmetic. In Reference 9, W. Kahan discusses various schemes for assessing the effects of roundoff. He argues that only thorough error analysis can be foolproof, but notes that error analysis for reallife problems is generally not practical. Among all approaches, he deems the following the most cost effective: “Repeat the computation in arithmetic of the same precision but rounded differently, say Down, and then Up, and maybe Towards Zero too, and compare three or four results.” Inordinately different results suggest that the problem, or the method of computation, is susceptible to serious roundoff error.
To support this technique for assessing the effects of roundoff, HP provides a compile/link option to install any of the IEEE rounding modes at program startup, as shown in Table 16:
Table 16: Presetting the rounding mode

Rounding    | Option
to nearest  | +FP RN
upward      | +FP RU
downward    | +FP RD
toward zero | +FP RZ
Note that the +FP option affects the link step and has no effect on -c compilations that produce only an object file. (The +FP option can also be used to pre-enable trapping or sudden underflow mode.)
To demonstrate the technique, consider the following program, which computes the roots of the quadratic equation 1.2x^{2} - 1.1x + 10^{-7} = 0 using the familiar quadratic formula (with b holding the negated linear coefficient):
#include <math.h>
#include <stdio.h>
int main() {
float a = 1.2, b = 1.1, c = 1e-7, r1, r2;
r1 = (b + sqrtf(b*b - 4*a*c)) / (2*a);
r2 = (b - sqrtf(b*b - 4*a*c)) / (2*a);
printf("r1 = %g\n", r1);
printf("r2 = %g\n", r2);
}
Building and running the program, pre-enabling each of the rounding modes in turn, produces
% cc unstable.c -c
% cc unstable.o -lm +FP RN ; a.out
r1 = 0.916667
r2 = 9.93411e-08
% cc unstable.o -lm +FP RU ; a.out
r1 = 0.916667
r2 = 1.49011e-07
% cc unstable.o -lm +FP RD ; a.out
r1 = 0.916666
r2 = 4.96706e-08
% cc unstable.o -lm +FP RZ ; a.out
r1 = 0.916666
r2 = 4.96705e-08
The extreme inconsistency of r2 suggests a serious roundoff problem. The next program uses a method that is known to be stable for computing the roots:
#include <math.h>
#include <stdio.h>
int main() {
float a = 1.2, b = 1.1, c = 1e-7, r1, r2;
r1 = (b + copysignf(sqrtf(b*b - 4*a*c), b)) / (2*a);
r2 = c/(a*r1);
printf("r1 = %g\n", r1);
printf("r2 = %g\n", r2);
}
With this program, varying the rounding modes has little effect on the results:
% cc stable.c -c
% cc stable.o -lm +FP RN ; a.out
r1 = 0.916667
r2 = 9.09091e-08
% cc stable.o -lm +FP RU ; a.out
r1 = 0.916667
r2 = 9.0909e-08
% cc stable.o -lm +FP RD ; a.out
r1 = 0.916666
r2 = 9.09092e-08
% cc stable.o -lm +FP RZ ; a.out
r1 = 0.916666
r2 = 9.0909e-08
(Note that the numerically unstable quadratic formula computed an erroneous r2 in all rounding modes.)
Examples 9.1 and 9.2 show how use of wider types for intermediate computation can reduce the effects of roundoff errors. The next program employs the unstable quadratic formula, but attempts to avoid the roundoff problems by using wider precision:
#include <tgmath.h>
#include <stdio.h>
int main() {
float a = 1.2, b = 1.1, c = 1e-7, r1, r2;
r1 = (b + sqrt((float_t)(b*b - 4*a*c))) / (2*a);
r2 = (b - sqrt((float_t)(b*b - 4*a*c))) / (2*a);
printf("r1 = %g\n", r1);
printf("r2 = %g\n", r2);
}
The (float_t) casts are necessary so that the type-generic sqrt will invoke a square root routine of the evaluation type. In this example, evaluation to double (using HP's -fpeval=double option) successfully reduces roundoff effects:
% cc stable2.c -c -fpeval=double
% cc stable2.o -lm +FP RN ; a.out
r1 = 0.916667
r2 = 9.09091e-08
% cc stable2.o -lm +FP RU ; a.out
r1 = 0.916667
r2 = 9.0909e-08
% cc stable2.o -lm +FP RD ; a.out
r1 = 0.916666
r2 = 9.09091e-08
% cc stable2.o -lm +FP RZ ; a.out
r1 = 0.916666
r2 = 9.0909e-08
As shown in the examples above, to change the startup rounding mode, it suffices to relink with the appropriate +FP option; recompiling is not required.


References 


1. Accuracy and Stability of Numerical Algorithms, by N. Higham, Society for Industrial and Applied Mathematics (SIAM), 1996
2. ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic
3. ANSI/ISO/IEC 1539-1:1997, Information Technology - Programming Languages - Fortran (Fortran 90)
4. ANSI/ISO/IEC 14882:1998, Programming Languages - C++
5. ANSI X3.159-1989, Programming Language C (C89)
6. "Branch Cuts for Complex Elementary Functions, or Much Ado About Nothing's Sign Bit," by W. Kahan, in The State of the Art in Numerical Analysis (eds. Iserles and Powell), Clarendon Press, Oxford, 1987
7. "The Computation of Transcendental Functions on the IA-64 Architecture," by J. Harrison, T. Kubaska, S. Story, and P. Tang, Intel Technology Journal, Q4, 1999
8. Elementary Functions: Algorithms and Implementation, by J.-M. Muller, 2nd ed., Birkhäuser, Boston, 2006
9. "How Futile are Mindless Assessments of Roundoff in Floating-Point Computation?" by W. Kahan
10. "How JAVA's Floating-Point Hurts Everyone Everywhere," by W. Kahan
11. HP C/ANSI C Developer's Bundle, Hewlett-Packard Company
12. HP aC++, Hewlett-Packard Company
13. "HP Compilers for Itanium-based HP-UX," Hewlett-Packard Company
14. HP Fortran compilers, Hewlett-Packard Company
15. HP Fortran Programmer's Guide, Hewlett-Packard Company
16. HP Fortran Programmer's Reference, Hewlett-Packard Company
17. HP MLIB User's Guide, Hewlett-Packard Company
18. HP-UX floating-point guide (HP-UX 11.0), Hewlett-Packard Company
19. IA-64 and Elementary Functions: Speed and Precision, by P. Markstein, Prentice Hall PTR, 2000
20. "IA-64 Floating-Point Operations and the IEEE Standard for Binary Floating-Point Arithmetic," by M. Cornea-Hasegan and B. Norin, Intel Technology Journal, Q4, 1999
21. IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems (previously designated IEC 559:1989)
22. "Inline assembly for Itanium-based HP-UX," Hewlett-Packard Company
23. "Inlining of Mathematical Functions in HP-UX for Itanium® 2," by J. Thomas, Proceedings of the 2003 CGO, The International Symposium on Code Generation and Optimization, IEEE Computer Society, 2003
24. Intel® Itanium® Architecture Software Developer's Manual, Intel Corporation
25. ISO/IEC C - Approved standards (C99)
26. "Marketing versus Mathematics," by W. Kahan
27. "OpenVMS floating-point arithmetic on the Intel Itanium architecture," Hewlett-Packard Company
28. The Single UNIX® Specification, Version 2 (UNIX 95), The Open Group, 1997
29. The Single UNIX® Specification, Version 3 (UNIX 03), The Open Group, 2002
30. Software Optimization for High Performance Computing, by K. Wadleigh and I. Crawford, Prentice Hall PTR, 2000
31. C/C++ Programmer's Guide and Common Run-Time Environment (CRE) Programmer's Guide, NonStop Technical Library, Hewlett-Packard Company
32. Decimal floating point for HP-UX, Hewlett-Packard Company
