What Every Computer Scientist Should Know About Floating-Point Arithmetic

DAVID GOLDBERG

Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, California 94304
Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising, because floating-point is ubiquitous in computer systems: Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on the aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating-point standard, and concludes with examples of how computer system builders can better support floating point.
Categories and Subject Descriptors: (Primary) C.0 [Computer Systems Organization]: General – instruction set design; D.3.4 [Programming Languages]: Processors – compilers, optimization; G.1.0 [Numerical Analysis]: General – computer arithmetic, error analysis, numerical algorithms; (Secondary) D.2.1 [Software Engineering]: Requirements/Specifications – languages; D.3.1 [Programming Languages]: Formal Definitions and Theory – semantics; D.4.1 [Operating Systems]: Process Management – synchronization

General Terms: Algorithms, Design, Languages
Additional Key Words and Phrases: Denormalized number, exception, floating-point, floating-point standard, gradual underflow, guard digit, NaN, overflow, relative error, rounding error, rounding mode, ulp, underflow
INTRODUCTION

Builders of computer systems often need information about floating-point arithmetic. There are, however, remarkably few sources of detailed information about it. One of the few books on the subject, Floating-Point Computation by Pat Sterbenz, is long out of print. This paper is a tutorial on those aspects of floating-point arithmetic (floating-point hereafter) that have a direct connection to systems building. It consists of three loosely connected parts. The first (Section 1) discusses the implications of using different rounding strategies for the basic operations of addition, subtraction, multiplication, and division. It also contains background information on the two methods of measuring rounding error, ulps and relative error. The second part discusses the IEEE floating-point standard, which is becoming rapidly accepted by commercial hardware manufacturers. Included in the IEEE standard is the rounding method for basic operations; therefore, the discussion of the standard draws on the material in Section 1. The third part discusses the connections between floating point and the design of various aspects of computer systems. Topics include instruction set design, optimizing compilers, and exception handling.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1991 ACM 0360-0300/91/0300-0005 $01.50
ACM Computing Surveys, Vol. 23, No. 1, March 1991
CONTENTS

INTRODUCTION
1. ROUNDING ERROR
   1.1 Floating-Point Formats
   1.2 Relative Error and Ulps
   1.3 Guard Digits
   1.4 Cancellation
   1.5 Exactly Rounded Operations
2. IEEE STANDARD
   2.1 Formats and Operations
   2.2 Special Quantities
   2.3 Exceptions, Flags, and Trap Handlers
3. SYSTEMS ASPECTS
   3.1 Instruction Sets
   3.2 Languages and Compilers
   3.3 Exception Handling
4. DETAILS
   4.1 Rounding Error
   4.2 Binary-to-Decimal Conversion
   4.3 Errors in Summation
5. SUMMARY
APPENDIX
ACKNOWLEDGMENTS
REFERENCES

All the statements made about floating-point are provided with justifications, but those explanations not central to the main argument are in a section called The Details and can be skipped if desired. In particular, the proofs of many of the theorems appear in this section. The end of each proof is marked with the ∎ symbol; when a proof is not included, the ∎ appears immediately following the statement of the theorem.

1. ROUNDING ERROR

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore, the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. The resulting rounding error is the characteristic feature of floating-point computation. Section 1.2 describes how it is measured.

Since most floating-point calculations have rounding error anyway, does it matter if the basic arithmetic operations introduce a bit more rounding error than necessary? That question is a main theme throughout Section 1. Section 1.3 discusses guard digits, a means of reducing the error when subtracting two nearby numbers. Guard digits were considered sufficiently important by IBM that in 1968 it added a guard digit to the double precision format in the System/360 architecture (single precision already had a guard digit) and retrofitted all existing machines in the field. Two examples are given to illustrate the utility of guard digits.

The IEEE standard goes further than just requiring the use of a guard digit. It gives an algorithm for addition, subtraction, multiplication, division, and square root and requires that implementations produce the same result as that algorithm. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard. This greatly simplifies the porting of programs. Other uses of this precise specification are given in Section 1.5.

1.1 Floating-Point Formats

Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation.¹ Floating-point representations have a base β (which is always assumed to be even) and a precision p. If β = 10 and p = 3, the number 0.1 is represented as 1.00 × 10⁻¹. If β = 2 and p = 24, the decimal number 0.1 cannot

¹ Examples of other representations are floating slash and signed logarithm [Matula and Kornerup 1985; Swartzlander and Alexopoulos 1975].
Figure 1. Normalized numbers when β = 2, p = 3, e_min = −1, e_max = 2. [The original figure is a number line from 0 to 7 with a hash mark at each normalized number.]

be represented exactly but is approximately 1.10011001100110011001101 × 2⁻⁴. In general, a floating-point number will be represented as ±d.dd⋯d × βᵉ, where d.dd⋯d is called the significand² and has p digits. More precisely, ±d₀.d₁d₂⋯dₚ₋₁ × βᵉ represents the number

±(d₀ + d₁β⁻¹ + ⋯ + dₚ₋₁β^(−(p−1))) βᵉ,  0 ≤ dᵢ < β.  (1)

The term floating-point number will be used to mean a real number that can be exactly represented in the format under discussion. Two other parameters associated with floating-point representations are the largest and smallest allowable exponents, e_max and e_min. Since there are βᵖ possible significands and e_max − e_min + 1 possible exponents, a floating-point number can be encoded in

⌈log₂(e_max − e_min + 1)⌉ + ⌈log₂(βᵖ)⌉ + 1

bits, where the final + 1 is for the sign bit. The precise encoding is not important for now.
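For the parameters used in Figure 1, the normalized numbers described by eq. (1) can be enumerated by brute force. The following sketch is an illustration added here, not part of the paper; it confirms that with β = 2, p = 3, e_min = −1, and e_max = 2 there are 16 positive normalized numbers, ranging from 1.00 × 2⁻¹ = 0.5 up to 1.11 × 2² = 7.

```python
# Enumerate the positive normalized numbers d0.d1d2 × 2^e of eq. (1)
# for the toy format of Figure 1: beta = 2, p = 3, e_min = -1, e_max = 2.
beta, p, e_min, e_max = 2, 3, -1, 2

normalized = set()
for e in range(e_min, e_max + 1):
    for d1 in range(beta):
        for d2 in range(beta):
            # Normalization requires a nonzero leading digit, so d0 = 1.
            significand = 1 + d1 / beta + d2 / beta**2
            normalized.add(significand * beta**e)

print(len(normalized))                   # 16
print(min(normalized), max(normalized))  # 0.5 7.0
```

The encoding-size formula agrees: ⌈log₂(e_max − e_min + 1)⌉ + ⌈log₂ βᵖ⌉ + 1 = 2 + 3 + 1 = 6 bits for this format.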
There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number 0.1. Although it has a finite decimal representation, in binary it has an infinite repeating representation. Thus, when β = 2, the number 0.1 lies strictly between two floating-point numbers and is exactly representable by neither of them. A less common situation is that a real number is out of range; that is, its absolute value is larger than β × β^(e_max) or smaller than 1.0 × β^(e_min). Most of this paper discusses issues due to the first reason. Numbers that are out of range will, however, be discussed in Sections 2.2.2 and 2.2.4.

Floating-point representations are not necessarily unique. For example, both 0.01 × 10¹ and 1.00 × 10⁻¹ represent 0.1. If the leading digit is nonzero [d₀ ≠ 0 in eq. (1)], the representation is said to be normalized. The floating-point number 1.00 × 10⁻¹ is normalized, whereas 0.01 × 10¹ is not. When β = 2, p = 3, e_min = −1, and e_max = 2, there are 16 normalized floating-point numbers, as shown in Figure 1. The bold hash marks correspond to numbers whose significand is 1.00. Requiring that a floating-point representation be normalized makes the representation unique. Unfortunately, this restriction makes it impossible to represent zero! A natural way to represent 0 is with 1.0 × β^(e_min − 1), since this preserves the fact that the numerical ordering of nonnegative real numbers corresponds to the lexicographical ordering of their floating-point representations.³ When the exponent is stored in a k bit field, that means that only 2ᵏ − 1 values are available for use as exponents, since one must be reserved to represent 0.

Note that the × in a floating-point number is part of the notation and different from a floating-point multiply operation. The meaning of the × symbol should be clear from the context. For example, the expression (2.5 × 10⁻³) × (4.0 × 10²) involves only a single floating-point multiplication.

² This term was introduced by Forsythe and Moler [1967] and has generally replaced the older term mantissa.
³ This assumes the usual arrangement where the exponent is stored to the left of the significand.
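The infinite repeating binary expansion of 0.1 is easy to observe in practice. The short illustration below, using Python's standard library (an addition to this text, not from the paper), prints the exact value that an IEEE binary64 variable actually stores for the literal 0.1, and reproduces the β = 2, p = 24 significand 1.10011001100110011001101 quoted above.

```python
from decimal import Decimal
from fractions import Fraction

# The exact value of the binary64 double nearest to 0.1: the stored
# number is slightly larger than one tenth.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# Round 0.1 to a p = 24 bit significand. Since 0.1 = significand x 2^-4,
# the significand carries 23 fractional bits: scale by 2^4 * 2^23.
sig = round(Fraction(1, 10) * 2**4 * 2**23)
print(bin(sig))  # 0b110011001100110011001101, i.e., 1.10011001100110011001101 in binary
```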
1.2 Relative Error and Ulps

Since rounding error is inherent in floating-point computation, it is important to have a way to measure this error. Consider the floating-point format with β = 10 and p = 3, which will be used throughout this section. If the result of a floating-point computation is 3.12 × 10⁻² and the answer when computed to infinite precision is .0314, it is clear that this is in error by 2 units in the last place. Similarly, if the real number .0314159 is represented as 3.14 × 10⁻², then it is in error by .159 units in the last place. In general, if the floating-point number d.d⋯d × βᵉ is used to represent z, it is in error by |d.d⋯d − (z/βᵉ)| βᵖ⁻¹ units in the last place.⁴ The term ulps will be used as shorthand for "units in the last place." If the result of a calculation is the floating-point number nearest to the correct result, it still might be in error by as much as 1/2 ulp. Another way to measure the difference between a floating-point number and the real number it is approximating is relative error, which is the difference between the two numbers divided by the real number. For example, the relative error committed when approximating 3.14159 by 3.14 × 10⁰ is .00159/3.14159 ≈ .0005.

To compute the relative error that corresponds to 1/2 ulp, observe that when a real number is approximated by the closest possible floating-point number d.dd⋯dd × βᵉ, the absolute error can be as large as 0.00⋯0β′ × βᵉ, where β′ is the digit β/2. This error is ((β/2)β⁻ᵖ) × βᵉ. Since numbers of the form d.dd⋯dd × βᵉ all have this same absolute error but have values that range between βᵉ and β × βᵉ, the relative error ranges between ((β/2)β⁻ᵖ) × βᵉ/βᵉ and ((β/2)β⁻ᵖ) × βᵉ/βᵉ⁺¹. That is,

(1/2)β⁻ᵖ ≤ 1/2 ulp ≤ (β/2)β⁻ᵖ.  (2)

In particular, the relative error corresponding to 1/2 ulp can vary by a factor of β. This factor is called the wobble. Setting ε = (β/2)β⁻ᵖ to the largest of the bounds in (2), we can say that when a real number is rounded to the closest floating-point number, the relative error is always bounded by ε, which is referred to as machine epsilon.

In the example above, the relative error was .00159/3.14159 ≈ .0005. To avoid such small numbers, the relative error is normally written as a factor times ε, which in this case is ε = (β/2)β⁻ᵖ = 5(10)⁻³ = .005. Thus, the relative error would be expressed as ((.00159/3.14159)/.005)ε ≈ 0.1ε.

To illustrate the difference between ulps and relative error, consider the real number x = 12.35. It is approximated by x̃ = 1.24 × 10¹. The error is 0.5 ulps; the relative error is 0.8ε. Next consider the computation 8x. The exact value is 8x = 98.8, whereas the computed value is 8x̃ = 9.92 × 10¹. The error is now 4.0 ulps, but the relative error is still 0.8ε. The error measured in ulps is eight times larger, even though the relative error is the same. In general, when the base is β, a fixed relative error expressed in ulps can wobble by a factor of up to β. Conversely, as eq. (2) shows, a fixed error of 1/2 ulps results in a relative error that can wobble by β.

The most natural way to measure rounding error is in ulps. For example, rounding to the nearest floating-point number corresponds to 1/2 ulp. When analyzing the rounding error caused by various formulas, however, relative error is a better measure. A good illustration of this is the analysis immediately following the proof of Theorem 10. Since ε can overestimate the effect of rounding to the nearest floating-point number by the wobble factor of β, error estimates of formulas will be tighter on machines with a small β.

⁴ Unless the number z is larger than β^(e_max + 1) or smaller than β^(e_min). Numbers that are out of range in this fashion will not be considered until further notice.
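The x = 12.35 example can be checked mechanically. The helpers below are a sketch added to this text (the function names are mine, not the paper's): one measures error in ulps using the |d.d⋯d − (z/βᵉ)| βᵖ⁻¹ formula above, and the other expresses relative error as a multiple of ε = (β/2)β⁻ᵖ = .005. Exact rational arithmetic avoids introducing new rounding error into the measurement itself.

```python
from fractions import Fraction as F

def ulps_error(significand, exponent, z, beta=10, p=3):
    # Error in units in the last place: |d.d...d - z/beta^e| * beta^(p-1).
    return abs(significand - z / F(beta)**exponent) * F(beta)**(p - 1)

def error_in_eps(approx, z, beta=10, p=3):
    # Relative error expressed as a multiple of eps = (beta/2) * beta^-p.
    eps = F(beta, 2) * F(beta)**(-p)
    return (abs(approx - z) / z) / eps

# x = 12.35, approximated by 1.24 x 10^1 = 12.4:
print(ulps_error(F("1.24"), 1, F("12.35")))          # 1/2, i.e., 0.5 ulps
print(float(error_in_eps(F("12.4"), F("12.35"))))    # ~0.81 (the paper's 0.8 eps)

# 8x = 98.8, computed as 9.92 x 10^1 = 99.2: 8x the ulp error, same relative error.
print(ulps_error(F("9.92"), 1, F("98.8")))           # 4
print(float(error_in_eps(F("99.2"), F("98.8"))))     # ~0.81
```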
When only the order of magnitude of rounding error is of interest, ulps and ε may be used interchangeably since they differ by at most a factor of β. For example, when a floating-point number is in error by n ulps, that means the number of contaminated digits is log_β n. If the relative error in a computation is nε, then

contaminated digits ≈ log_β n.  (3)

1.3 Guard Digits

One method of computing the difference between two floating-point numbers is to compute the difference exactly, then round it to the nearest floating-point number. This is very expensive if the operands differ greatly in size. Assuming p = 3, 2.15 × 10¹² − 1.25 × 10⁻⁵ would be calculated as

x = 2.15 × 10¹²
y = .0000000000000000125 × 10¹²
x − y = 2.1499999999999999875 × 10¹²,

which rounds to 2.15 × 10¹². Rather than using all these digits, floating-point hardware normally operates on a fixed number of digits. Suppose the number of digits kept is p and that when the smaller operand is shifted right, digits are simply discarded (as opposed to rounding). Then 2.15 × 10¹² − 1.25 × 10⁻⁵ becomes

x = 2.15 × 10¹²
y = 0.00 × 10¹²
x − y = 2.15 × 10¹².

The answer is exactly the same as if the difference had been computed exactly then rounded. Take another example: 10.1 − 9.93. This becomes

x = 1.01 × 10¹
y = 0.99 × 10¹
x − y = .02 × 10¹.

The correct answer is .17, so the computed difference is off by 30 ulps and is wrong in every digit! How bad can the error be?

Theorem 1

Using a floating-point format with parameters β and p and computing differences using p digits, the relative error of the result can be as large as β − 1.

Proof. A relative error of β − 1 in the expression x − y occurs when x = 1.00⋯0 and y = .ρρ⋯ρ, where ρ = β − 1. Here y has p digits (all equal to ρ). The exact difference is x − y = β⁻ᵖ. When computing the answer using only p digits, however, the rightmost digit of y gets shifted off, so the computed difference is β⁻ᵖ⁺¹. Thus, the error is β⁻ᵖ⁺¹ − β⁻ᵖ = β⁻ᵖ(β − 1), and the relative error is β⁻ᵖ(β − 1)/β⁻ᵖ = β − 1. ∎

When β = 2, the absolute error can be as large as the result, and when β = 10, it can be nine times larger. To put it another way, when β = 2, (3) shows that the number of contaminated digits is log₂(1/ε) = log₂(2ᵖ) = p. That is, all of the p digits in the result are wrong!
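The truncating subtraction just described is easy to simulate. In the sketch below (an addition to this text; the helpers and their names are mine, not the paper's), the smaller operand is aligned to the larger one and truncated before subtracting, and the guard_digits parameter implements the extra digit that Section 1.3 discusses next. It reproduces the β = 10, p = 3 examples of this section exactly.

```python
import math
from fractions import Fraction

def round_sig(v, digits):
    # Round v to `digits` significant decimal digits (ties to even,
    # as in Python's round; sufficient for these examples).
    if v == 0:
        return v
    e = math.floor(math.log10(float(v)))      # exponent of the leading digit
    scale = Fraction(10) ** (e - digits + 1)  # weight of the last kept digit
    return round(v / scale) * scale

def fp_subtract(x, y, p=3, guard_digits=0):
    # Subtract y from x (0 < y < x), keeping p significant digits.
    # The smaller operand is aligned to x's exponent and truncated to
    # p + guard_digits digits (extra digits discarded, not rounded);
    # the difference is then rounded back to p digits.
    e_x = math.floor(math.log10(float(x)))
    keep = Fraction(10) ** (e_x - (p + guard_digits) + 1)
    y_truncated = (y // keep) * keep
    return round_sig(x - y_truncated, p)

F = Fraction
print(fp_subtract(F("10.1"), F("9.93")))                  # 1/5 = .02 x 10^1, off by 30 ulps
print(fp_subtract(F("10.1"), F("9.93"), guard_digits=1))  # 17/100 = .017 x 10^1, exact
print(fp_subtract(F("110"), F("8.59"), guard_digits=1))   # 102, vs. the true 101.41
```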
Suppose one extra digit is added to guard against this situation (a guard digit). That is, the smaller number is truncated to p + 1 digits, then the result of the subtraction is rounded to p digits. With a guard digit, the previous example becomes

x = 1.010 × 10¹
y = 0.993 × 10¹
x − y = .017 × 10¹,

and the answer is exact. With a single guard digit, the relative error of the result may be greater than ε, as in 110 − 8.59:

x = 1.10 × 10²
y = .085 × 10²
x − y ≈ 1.015 × 10².

This rounds to 1.02 × 10², compared with the correct answer of 101.41, for a relative error of .006, which is greater than ε = .005.