What Every Computer Scientist Should Know About Floating-Point Arithmetic

DAVID GOLDBERG

Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, California 94304
Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising, because floating-point is ubiquitous in computer systems: Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on the aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating-point standard, and concludes with examples of how computer system builders can better support floating point.
Categories and Subject Descriptors: (Primary) C.0 [Computer Systems Organization]: General – instruction set design; D.3.4 [Programming Languages]: Processors – compilers, optimization; G.1.0 [Numerical Analysis]: General – computer arithmetic, error analysis, numerical algorithms; (Secondary) D.2.1 [Software Engineering]: Requirements/Specifications – languages; D.3.1 [Programming Languages]: Formal Definitions and Theory – semantics; D.4.1 [Operating Systems]: Process Management – synchronization

General Terms: Algorithms, Design, Languages
Additional Key Words and Phrases: Denormalized number, exception, floating-point, floating-point standard, gradual underflow, guard digit, NaN, overflow, relative error, rounding error, rounding mode, ulp, underflow
INTRODUCTION

Builders of computer systems often need information about floating-point arithmetic. There are, however, remarkably few sources of detailed information about it. One of the few books on the subject, Floating-Point Computation by Pat Sterbenz, is long out of print. This paper is a tutorial on those aspects of floating-point arithmetic (floating-point hereafter) that have a direct connection to systems building. It consists of three loosely connected parts. The first (Section 1) discusses the implications of using different rounding strategies for the basic operations of addition, subtraction, multiplication, and division. It also contains background information on the two methods of measuring rounding error, ulps and relative error. The second part discusses the IEEE floating-point standard, which is becoming rapidly accepted by commercial hardware manufacturers. Included in the IEEE standard is the rounding method for basic operations; therefore, the discussion of the standard draws on the material in Section 1. The third part discusses the connections between floating point and the design of various aspects of computer systems. Topics include instruction set design, optimizing compilers, and exception handling.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1991 ACM 0360-0300/91/0300-0005 $01.50
ACM Computing Surveys, Vol. 23, No. 1, March 1991
CONTENTS

INTRODUCTION
1. ROUNDING ERROR
   1.1 Floating-Point Formats
   1.2 Relative Error and Ulps
   1.3 Guard Digits
   1.4 Cancellation
   1.5 Exactly Rounded Operations
2. IEEE STANDARD
   2.1 Formats and Operations
   2.2 Special Quantities
   2.3 Exceptions, Flags, and Trap Handlers
3. SYSTEMS ASPECTS
   3.1 Instruction Sets
   3.2 Languages and Compilers
   3.3 Exception Handling
4. DETAILS
   4.1 Rounding Error
   4.2 Binary-to-Decimal Conversion
   4.3 Errors in Summation
5. SUMMARY
APPENDIX
ACKNOWLEDGMENTS
REFERENCES

All the statements made about floating-point are provided with justifications, but those explanations not central to the main argument are in a section called The Details and can be skipped if desired. In particular, the proofs of many of the theorems appear in this section. The end of each proof is marked with the ∎ symbol; when a proof is not included, the ∎ appears immediately following the statement of the theorem.

1. ROUNDING ERROR

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore, the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. The resulting rounding error is the characteristic feature of floating-point computation. Section 1.2 describes how it is measured.

Since most floating-point calculations have rounding error anyway, does it matter if the basic arithmetic operations introduce a bit more rounding error than necessary? That question is a main theme throughout Section 1. Section 1.3 discusses guard digits, a means of reducing the error when subtracting two nearby numbers. Guard digits were considered sufficiently important by IBM that in 1968 it added a guard digit to the double precision format in the System/360 architecture (single precision already had a guard digit) and retrofitted all existing machines in the field. Two examples are given to illustrate the utility of guard digits.

The IEEE standard goes further than just requiring the use of a guard digit. It gives an algorithm for addition, subtraction, multiplication, division, and square root and requires that implementations produce the same result as that algorithm. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard. This greatly simplifies the porting of programs. Other uses of this precise specification are given in Section 1.5.

1.1 Floating-Point Formats

Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation.¹ Floating-point representations have a base β (which is always assumed to be even) and a precision p. If β = 10 and p = 3, the number 0.1 is represented as 1.00 × 10⁻¹. If β = 2 and p = 24, the decimal number 0.1 cannot

¹ Examples of other representations are floating slash and signed logarithm [Matula and Kornerup 1985; Swartzlander and Alexopoulos 1975].
Figure 1. Normalized numbers when β = 2, p = 3, e_min = −1, e_max = 2. [The original figure is a number line from 0 to 7 with a hash mark at each normalized number.]

be represented exactly but is approximately 1.10011001100110011001101 × 2⁻⁴. In general, a floating-point number will be represented as ±d.dd⋯d × βᵉ, where d.dd⋯d is called the significand² and has p digits. More precisely, ±d₀.d₁d₂⋯dₚ₋₁ × βᵉ represents the number

±(d₀ + d₁β⁻¹ + ⋯ + dₚ₋₁β^(−(p−1))) βᵉ,  0 ≤ dᵢ < β.  (1)

The term floating-point number will be used to mean a real number that can be exactly represented in the format under discussion. Two other parameters associated with floating-point representations are the largest and smallest allowable exponents, e_max and e_min. Since there are βᵖ possible significands and e_max − e_min + 1 possible exponents, a floating-point number can be encoded in

⌈log₂(e_max − e_min + 1)⌉ + ⌈log₂(βᵖ)⌉ + 1

bits, where the final + 1 is for the sign bit. The precise encoding is not important for now.
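For the parameters used in Figure 1, the normalized numbers described by eq. (1) can be enumerated by brute force. The following sketch is an illustration added here, not part of the paper; it confirms that with β = 2, p = 3, e_min = −1, and e_max = 2 there are 16 positive normalized numbers, ranging from 1.00 × 2⁻¹ = 0.5 up to 1.11 × 2² = 7.

```python
# Enumerate the positive normalized numbers d0.d1d2 × 2^e of eq. (1)
# for the toy format of Figure 1: beta = 2, p = 3, e_min = -1, e_max = 2.
beta, p, e_min, e_max = 2, 3, -1, 2

normalized = set()
for e in range(e_min, e_max + 1):
    for d1 in range(beta):
        for d2 in range(beta):
            # Normalization requires a nonzero leading digit, so d0 = 1.
            significand = 1 + d1 / beta + d2 / beta**2
            normalized.add(significand * beta**e)

print(len(normalized))                   # 16
print(min(normalized), max(normalized))  # 0.5 7.0
```

The encoding-size formula agrees: ⌈log₂(e_max − e_min + 1)⌉ + ⌈log₂ βᵖ⌉ + 1 = 2 + 3 + 1 = 6 bits for this format.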
There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number 0.1. Although it has a finite decimal representation, in binary it has an infinite repeating representation. Thus, when β = 2, the number 0.1 lies strictly between two floating-point numbers and is exactly representable by neither of them. A less common situation is that a real number is out of range; that is, its absolute value is larger than β × β^(e_max) or smaller than 1.0 × β^(e_min). Most of this paper discusses issues due to the first reason. Numbers that are out of range will, however, be discussed in Sections 2.2.2 and 2.2.4.

Floating-point representations are not necessarily unique. For example, both 0.01 × 10¹ and 1.00 × 10⁻¹ represent 0.1. If the leading digit is nonzero [d₀ ≠ 0 in eq. (1)], the representation is said to be normalized. The floating-point number 1.00 × 10⁻¹ is normalized, whereas 0.01 × 10¹ is not. When β = 2, p = 3, e_min = −1, and e_max = 2, there are 16 normalized floating-point numbers, as shown in Figure 1. The bold hash marks correspond to numbers whose significand is 1.00. Requiring that a floating-point representation be normalized makes the representation unique. Unfortunately, this restriction makes it impossible to represent zero! A natural way to represent 0 is with 1.0 × β^(e_min − 1), since this preserves the fact that the numerical ordering of nonnegative real numbers corresponds to the lexicographical ordering of their floating-point representations.³ When the exponent is stored in a k bit field, that means that only 2ᵏ − 1 values are available for use as exponents, since one must be reserved to represent 0.

Note that the × in a floating-point number is part of the notation and different from a floating-point multiply operation. The meaning of the × symbol should be clear from the context. For example, the expression (2.5 × 10⁻³) × (4.0 × 10²) involves only a single floating-point multiplication.

² This term was introduced by Forsythe and Moler [1967] and has generally replaced the older term mantissa.
³ This assumes the usual arrangement where the exponent is stored to the left of the significand.
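The infinite repeating binary expansion of 0.1 is easy to observe in practice. The short illustration below, using Python's standard library (an addition to this text, not from the paper), prints the exact value that an IEEE binary64 variable actually stores for the literal 0.1, and reproduces the β = 2, p = 24 significand 1.10011001100110011001101 quoted above.

```python
from decimal import Decimal
from fractions import Fraction

# The exact value of the binary64 double nearest to 0.1: the stored
# number is slightly larger than one tenth.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# Round 0.1 to a p = 24 bit significand. Since 0.1 = significand x 2^-4,
# the significand carries 23 fractional bits: scale by 2^4 * 2^23.
sig = round(Fraction(1, 10) * 2**4 * 2**23)
print(bin(sig))  # 0b110011001100110011001101, i.e., 1.10011001100110011001101 in binary
```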
1.2 Relative Error and Ulps

Since rounding error is inherent in floating-point computation, it is important to have a way to measure this error. Consider the floating-point format with β = 10 and p = 3, which will be used throughout this section. If the result of a floating-point computation is 3.12 × 10⁻² and the answer when computed to infinite precision is .0314, it is clear that this is in error by 2 units in the last place. Similarly, if the real number .0314159 is represented as 3.14 × 10⁻², then it is in error by .159 units in the last place. In general, if the floating-point number d.d⋯d × βᵉ is used to represent z, it is in error by |d.d⋯d − (z/βᵉ)| βᵖ⁻¹ units in the last place.⁴ The term ulps will be used as shorthand for "units in the last place." If the result of a calculation is the floating-point number nearest to the correct result, it still might be in error by as much as 1/2 ulp. Another way to measure the difference between a floating-point number and the real number it is approximating is relative error, which is the difference between the two numbers divided by the real number. For example, the relative error committed when approximating 3.14159 by 3.14 × 10⁰ is .00159/3.14159 ≈ .0005.

To compute the relative error that corresponds to 1/2 ulp, observe that when a real number is approximated by the closest possible floating-point number d.dd⋯dd × βᵉ, the absolute error can be as large as 0.00⋯0β′ × βᵉ, where β′ is the digit β/2. This error is ((β/2)β⁻ᵖ) × βᵉ. Since numbers of the form d.dd⋯dd × βᵉ all have this same absolute error but have values that range between βᵉ and β × βᵉ, the relative error ranges between ((β/2)β⁻ᵖ) × βᵉ/βᵉ and ((β/2)β⁻ᵖ) × βᵉ/βᵉ⁺¹. That is,

(1/2)β⁻ᵖ ≤ 1/2 ulp ≤ (β/2)β⁻ᵖ.  (2)

In particular, the relative error corresponding to 1/2 ulp can vary by a factor of β. This factor is called the wobble. Setting ε = (β/2)β⁻ᵖ to the largest of the bounds in (2), we can say that when a real number is rounded to the closest floating-point number, the relative error is always bounded by ε, which is referred to as machine epsilon.

In the example above, the relative error was .00159/3.14159 ≈ .0005. To avoid such small numbers, the relative error is normally written as a factor times ε, which in this case is ε = (β/2)β⁻ᵖ = 5(10)⁻³ = .005. Thus, the relative error would be expressed as ((.00159/3.14159)/.005)ε ≈ 0.1ε.

To illustrate the difference between ulps and relative error, consider the real number x = 12.35. It is approximated by x̃ = 1.24 × 10¹. The error is 0.5 ulps; the relative error is 0.8ε. Next consider the computation 8x. The exact value is 8x = 98.8, whereas the computed value is 8x̃ = 9.92 × 10¹. The error is now 4.0 ulps, but the relative error is still 0.8ε. The error measured in ulps is eight times larger, even though the relative error is the same. In general, when the base is β, a fixed relative error expressed in ulps can wobble by a factor of up to β. Conversely, as eq. (2) shows, a fixed error of 1/2 ulps results in a relative error that can wobble by β.

The most natural way to measure rounding error is in ulps. For example, rounding to the nearest floating-point number corresponds to 1/2 ulp. When analyzing the rounding error caused by various formulas, however, relative error is a better measure. A good illustration of this is the analysis immediately following the proof of Theorem 10. Since ε can overestimate the effect of rounding to the nearest floating-point number by the wobble factor of β, error estimates of formulas will be tighter on machines with a small β.

⁴ Unless the number z is larger than β^(e_max + 1) or smaller than β^(e_min). Numbers that are out of range in this fashion will not be considered until further notice.
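The x = 12.35 example can be checked mechanically. The helpers below are a sketch added to this text (the function names are mine, not the paper's): one measures error in ulps using the |d.d⋯d − (z/βᵉ)| βᵖ⁻¹ formula above, and the other expresses relative error as a multiple of ε = (β/2)β⁻ᵖ = .005. Exact rational arithmetic avoids introducing new rounding error into the measurement itself.

```python
from fractions import Fraction as F

def ulps_error(significand, exponent, z, beta=10, p=3):
    # Error in units in the last place: |d.d...d - z/beta^e| * beta^(p-1).
    return abs(significand - z / F(beta)**exponent) * F(beta)**(p - 1)

def error_in_eps(approx, z, beta=10, p=3):
    # Relative error expressed as a multiple of eps = (beta/2) * beta^-p.
    eps = F(beta, 2) * F(beta)**(-p)
    return (abs(approx - z) / z) / eps

# x = 12.35, approximated by 1.24 x 10^1 = 12.4:
print(ulps_error(F("1.24"), 1, F("12.35")))          # 1/2, i.e., 0.5 ulps
print(float(error_in_eps(F("12.4"), F("12.35"))))    # ~0.81 (the paper's 0.8 eps)

# 8x = 98.8, computed as 9.92 x 10^1 = 99.2: 8x the ulp error, same relative error.
print(ulps_error(F("9.92"), 1, F("98.8")))           # 4
print(float(error_in_eps(F("99.2"), F("98.8"))))     # ~0.81
```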
When only the order of magnitude of rounding error is of interest, ulps and ε may be used interchangeably since they differ by at most a factor of β. For example, when a floating-point number is in error by n ulps, that means the number of contaminated digits is log_β n. If the relative error in a computation is nε, then

contaminated digits ≈ log_β n.  (3)

1.3 Guard Digits

One method of computing the difference between two floating-point numbers is to compute the difference exactly, then round it to the nearest floating-point number. This is very expensive if the operands differ greatly in size. Assuming p = 3, 2.15 × 10¹² − 1.25 × 10⁻⁵ would be calculated as

x = 2.15 × 10¹²
y = .0000000000000000125 × 10¹²
x − y = 2.1499999999999999875 × 10¹²,

which rounds to 2.15 × 10¹². Rather than using all these digits, floating-point hardware normally operates on a fixed number of digits. Suppose the number of digits kept is p and that when the smaller operand is shifted right, digits are simply discarded (as opposed to rounding). Then 2.15 × 10¹² − 1.25 × 10⁻⁵ becomes

x = 2.15 × 10¹²
y = 0.00 × 10¹²
x − y = 2.15 × 10¹².

The answer is exactly the same as if the difference had been computed exactly then rounded. Take another example: 10.1 − 9.93. This becomes

x = 1.01 × 10¹
y = 0.99 × 10¹
x − y = .02 × 10¹.

The correct answer is .17, so the computed difference is off by 30 ulps and is wrong in every digit! How bad can the error be?

Theorem 1

Using a floating-point format with parameters β and p and computing differences using p digits, the relative error of the result can be as large as β − 1.

Proof. A relative error of β − 1 in the expression x − y occurs when x = 1.00⋯0 and y = .ρρ⋯ρ, where ρ = β − 1. Here y has p digits (all equal to ρ). The exact difference is x − y = β⁻ᵖ. When computing the answer using only p digits, however, the rightmost digit of y gets shifted off, so the computed difference is β⁻ᵖ⁺¹. Thus, the error is β⁻ᵖ⁺¹ − β⁻ᵖ = β⁻ᵖ(β − 1), and the relative error is β⁻ᵖ(β − 1)/β⁻ᵖ = β − 1. ∎

When β = 2, the absolute error can be as large as the result, and when β = 10, it can be nine times larger. To put it another way, when β = 2, (3) shows that the number of contaminated digits is log₂(1/ε) = log₂(2ᵖ) = p. That is, all of the p digits in the result are wrong!
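The truncating subtraction just described is easy to simulate. In the sketch below (an addition to this text; the helpers and their names are mine, not the paper's), the smaller operand is aligned to the larger one and truncated before subtracting, and the guard_digits parameter implements the extra digit that Section 1.3 discusses next. It reproduces the β = 10, p = 3 examples of this section exactly.

```python
import math
from fractions import Fraction

def round_sig(v, digits):
    # Round v to `digits` significant decimal digits (ties to even,
    # as in Python's round; sufficient for these examples).
    if v == 0:
        return v
    e = math.floor(math.log10(float(v)))      # exponent of the leading digit
    scale = Fraction(10) ** (e - digits + 1)  # weight of the last kept digit
    return round(v / scale) * scale

def fp_subtract(x, y, p=3, guard_digits=0):
    # Subtract y from x (0 < y < x), keeping p significant digits.
    # The smaller operand is aligned to x's exponent and truncated to
    # p + guard_digits digits (extra digits discarded, not rounded);
    # the difference is then rounded back to p digits.
    e_x = math.floor(math.log10(float(x)))
    keep = Fraction(10) ** (e_x - (p + guard_digits) + 1)
    y_truncated = (y // keep) * keep
    return round_sig(x - y_truncated, p)

F = Fraction
print(fp_subtract(F("10.1"), F("9.93")))                  # 1/5 = .02 x 10^1, off by 30 ulps
print(fp_subtract(F("10.1"), F("9.93"), guard_digits=1))  # 17/100 = .017 x 10^1, exact
print(fp_subtract(F("110"), F("8.59"), guard_digits=1))   # 102, vs. the true 101.41
```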
Suppose one extra digit is added to guard against this situation (a guard digit). That is, the smaller number is truncated to p + 1 digits, then the result of the subtraction is rounded to p digits. With a guard digit, the previous example becomes

x = 1.010 × 10¹
y = 0.993 × 10¹
x − y = .017 × 10¹,

and the answer is exact. With a single guard digit, the relative error of the result may be greater than ε, as in 110 − 8.59:

x = 1.10 × 10²
y = .085 × 10²
x − y ≈ 1.015 × 10².

This rounds to 1.02 × 10², compared with the correct answer of 101.41, for a relative error of .006, which is greater than ε = .005.