Running and tuning OpenBSD network servers
in a production environment
Philipp Bühler
sysfive.com GmbH
pb@sysfive.com
Henning Brauer
BS Web Services
hb@bsws.de
October 8, 2002
Abstract

Heavily loaded network servers can experience resource exhaustion. At best, resource exhaustion will slow server response, but left uncorrected, it can result in a crash of the server.

In order to understand and prevent such situations, a knowledge of the internal operation of the operating system is required, especially how memory management works.

This paper will provide an understanding of the memory management of OpenBSD, how to monitor the current status of the system, why crashes occur and how to prevent them.

1 Motivation

Our main motivation for this paper was the lack of comprehensive documentation about tuning network servers running under OpenBSD [Ope02], especially with regard to the memory usage of the networking code in the kernel.

Either one can get general information, or one is "left alone" with the source code. This paper outlines how to deal with these issues without reading the source code. At least one does not need to start in "nowhere-land" and dig through virtually everything.

This paper aims to give a deeper understanding of how the kernel handles connections and interacts with userland applications like the Apache webserver.

2 Resource Exhaustions

Running a publicly accessible server can always lead to unexpected problems. Typically it happens that resources get exhausted. There are numerous reasons for this, including:

Low Budget: There's not enough money to buy "enough" hardware which would run an untuned OS.

Peaks: Overload situations which can be expected (e. g. special use) or not (e. g. getting "slashdotted").

DoS: Denial-of-Service by attackers flooding the server.

No matter what reason leads to an exhaustion, there are also different types of resources which can suffer from such a situation. We briefly show common types and countermeasures. Afterwards we go into detail about memory exhaustion.

2.1 I/O Exhaustion

It's very typical for network servers to suffer in this area. Often people just add more CPU to "help" a slowly reacting server, but this wouldn't help in such a case.

Usually one can detect such an exhaustion by using vmstat(8) or systat(8); detailed usage is shown in Section 5.1. There are numerous possible I/O "bottlenecks", but one typical indication is the CPU being mostly idle while blocked processes wait for resources. Further distinctions can be made:

Disk
The process is waiting for blocks from (or to) the disk and cannot run on the CPU, even if the CPU is idle. This case could be resolved by moving from IDE to SCSI, and/or using RAID technology. If repetitive writes/reads are being done, an increase of the filesystem-cache could also help 1. The filesystem-cache can be configured with the kernel option BUFCACHEPERCENT 2.
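As an illustration only: the option name comes from options(4), but the chosen value and the comment are hypothetical, and the accepted range and default depend on the release in use. Such a line would be added to a copy of the GENERIC configuration and the kernel rebuilt with config(8) and make:

    # added to a copy of the GENERIC kernel configuration (i386)
    option  BUFCACHEPERCENT=20   # let the filesystem-cache use up to 20% of RAM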
NIC

Choosing the right network card is important for busy servers. There are lots of low-end models like the whole Realtek range. These cards are relatively dumb themselves. On the other hand, there are chipsets with more intelligence. DEC's 21143, supported by the dc(4) driver, and Intel's newer chipsets, supported by the fxp(4) driver, have been proven to work well in high-load circumstances.

Low-end cards usually generate an interrupt for every packet received, which leads to the problems we describe in the next subsection. By using better cards, like the mentioned DEC and Intel ones, packets are getting combined, thus reducing the number of interrupts.

Another important point is the physical media interface, e. g. sqphy(4). Noise and distortion are a normal part of network communications; a good PHY will do a better job of extracting the data from the noise on the wire than a poor PHY will, reducing the number of network retransmissions required.

It might be a good idea to use Gigabit cards, even when running 100 MBit/s only. They are obviously built for much higher packet rates (and this is the real problem, not bandwidth) than FastEthernet ones, thus have more intelligence of their own and deal better with high loads.

IRQ

Every interrupt requires a context switch, from the process running when the IRQ took place to the interrupt handler. As a number of things must be done upon entering the interrupt handler, a large quantity of interrupts can result in excess time required for context switching. One non-obvious way to reduce this load is to share interrupts between the network adapters, something permitted on the PCI bus. As many people are not even aware of the possibility of interrupt sharing, and the benefits are not obvious, let's look at this a little closer.

With separate adapters on separate interrupt lines, when the first interrupt comes in, a context switch to the interrupt handler takes place. If another interrupt comes in from the other adapter while the first interrupt is still being handled, it will either interrupt the first handler, or be delayed until the first handler has completed, depending on priority; but regardless, two additional context switches will take place: one into the second handler, one back out.

In the case of the PCI and EISA busses, interrupts are level triggered, not edge triggered, which makes interrupt sharing possible. As long as the interrupt line is held active, a device needs servicing, even if the first device which triggered the interrupt has already been serviced. So, in this case, when the first adapter triggers the interrupt, there will be a context switch to the handler. Before the handler returns, it will see if any other devices need servicing, before doing a context switch back to the previous process.

In a busy environment, when many devices need service, saving these context switches can significantly improve performance by permitting the processor to spend more time processing data rather than switching between tasks. In fact, in a very high load situation, it may be desirable to switch the adapters and drivers from an interrupt-driven mode to a polling mode, though this is not supported on OpenBSD at this time.
1 Though this has implications on the KVM, see the appropriate section.
2 For most kernel configurations, see options(4) and config(8).
2.2 CPU Exhaustion

Of course the CPU can also be overloaded while other resources are still fine. Besides buying more CPU power, which is not always possible, there are other ways to resolve this problem. The most common cases are:

CGI: Excessive usage of CGI scripts, usually written in interpreted languages like PHP or Perl. Better (resource-wise) coding can help, as well as using modules like mod_perl 3 to reduce load.

RDBM: Usually those CGI scripts use a database. Optimization of the connections and queries (indexing, ..) is one way. There is also the complete offloading of the database to a different machine 4.

SSL: Especially e-commerce systems or online banking sites suffer here. OpenBSD supports hardware accelerators 5. Typical cryptographic routines used for SSL/TLS can be offloaded to such cards in a transparent manner, thus freeing CPU time for processing requests.

3 Memory Exhaustion

Another case of overloading can be the exhaustion of memory resources. The speed of the allocator for memory areas also has significant influence on the overall performance of the system.

3.1 Virtual Memory (VM)

VM is comprised of the physical RAM and possible swap space(s). Processes are loaded into this area and use it for their data structures. While the kernel doesn't really care about the current location of the process' memory space (or address space), it is recommended that especially the most active tasks (like the webserver application) never be swapped out or even subjected to paging.

With regard to reliability it's not critical if the amount of physical RAM is exhausted and heavy paging occurs, but performance-wise this should not happen. The paging could compete for disk I/O with the server task, thus slowing down the general performance of the server. And, naturally, harddisks are slower than RAM by orders of magnitude.

It's most likely that countermeasures are taken after the server starts heavy paging, but it could happen that the swap space, and thus the whole VM, is also exhausted. If this occurs, sooner or later the machine will crash.

Even if one doesn't plan for the server to start paging out memory from RAM to swap, there should be some swap space. This prevents a direct crash if the VM is exhausted. If swap is being used, one has to determine whether this was a one-time-only peak, or whether there is a general increase of usage on the paging server. In the latter case one should upgrade RAM as soon as possible.

In general it's good practice to monitor the VM usage, especially to track down when the swap space is being touched. See Section 5 for details.

3.2 Kernel Virtual Memory (KVM)

Besides VM there is a reserved area solely for kernel tasks. On the common i386 architecture (IA-32) the virtual address space is 4GB. Since the 3.2 release, the OpenBSD/i386 kernel reserves 768MB (formerly 512MB) of this space for kernel structures, called KVM.

KVM is used for addressing the needs of managing any hardware in the system and for small allocations 6 needed by syscalls. The biggest chunks being used are the management of the VM (RAM and swap), the filesystem-cache and the storage of network buffers (mbuf).

3 This can have security implications, but this is another story.
4 This could be unfeasible due to an already overloaded network or due to budget constraints.
5 crypto(4)
6 like pathname translations
Contrary to userland, kernel allocations cannot be paged out ("wired pages"). Actually it's possible to have pageable kernel memory, but this is rarely used (e. g. for pipe buffers) and not a concern in the current context. Thus, if the KVM is exhausted, the server will immediately crash. Of course 768MB is the limit, but if there is less RAM available, that is the absolute limit for wired pages. Non-interrupt-safe pages could be paged out, but this is a rare exception.

Since RAM has to be managed by kernel maps as well, it's not wise to just upgrade RAM without need. More RAM leaves less space for other maps in KVM. Monitoring the "really" needed amount of RAM is recommended if KVM exhaustions occur. For example, 128MB for a firewall is usually more than enough. Look at Section 7.2 for a typical hardware setup of a busy firewall.

This complete area is called kernel_map in the source and has several "submaps" 7. One main reason for this is the locking of the address space. By this mapping, other areas of the kernel can stay unlocked while another map is locked.

The main submaps are kmem_map, pager_map, mb_map and exec_map. The allocation is done at boot-time and is never freed; the size is either a compile-time or a boot-time option to the kernel.

7 see /sys/uvm/uvm_km.c

4 Resource Allocation

Since the exhaustion of KVM is the most critical situation one can encounter, we will now concentrate on how those memory areas are allocated.

Userland applications cannot allocate the KVM needed for network routines directly. KVM is completely protected from userland processes, thus there have to be routines to pass data over this border. Userland can use a syscall(2) to accomplish that. For the case of networking the process would use socket(2) related calls, like bind(2), recv(2), etc.

Having this layer between userland and kernel, we will concentrate on how the kernel is allocating memory; the userland process has no direct influence on this. The indirect influence is the sending and receiving of data to or from the kernel by the userland process. For example, the server handles a lot of incoming network data, which will fill up buffer space (mbufs) within the KVM. If the userland process is not handling this data fast enough, KVM could be exhausted. Of course the same is true if the process is sending data faster than the kernel can release it to the media, thus freeing KVM buffers.

4.1 mbuf

Historically, BSD uses mbuf(9) 8 routines to handle network related data. An mbuf is a data structure of a fixed size of 256 bytes 9. Since there is overhead for the mbuf header (m_hdr{}) itself, the payload is reduced by at least 20 bytes and up to 40 bytes 10.

The additional 20 bytes of overhead appear if the requested data doesn't fit within two mbufs. In such a case an external buffer, called a cluster, with a size of 2048 bytes 11, is allocated and referenced by the mbuf (m_ext{}).

Mbufs belonging to one payload packet are "chained" together by a pointer mh_next. mh_nextpkt points to the next chain, forming a queue of network data which can be processed by the kernel. The first member of such a chain has to be a "packet header" (mh_type M_PKTHDR).

Allocations of mbufs and clusters are obtained by macros (MGET, MCLGET, ..). Before the release of OpenBSD 3.0 those macros used malloc(9) to obtain memory resources.

If there were a call to MGET but no more space were left in the corresponding memory map, the kernel would panic 12.

8 memory buffer
9 defined by MSIZE
10 see /usr/include/sys/mbuf.h for details
11 defined by MCLBYTES
12 "malloc: out of space in kmem_map"
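To visualize the mbuf bookkeeping described above, the following simplified C sketch shows the fields the text refers to. It is an illustration only, not the real declarations from /usr/include/sys/mbuf.h; the name mbuf_sketch is invented for this example, while MSIZE, MCLBYTES, mh_next, mh_nextpkt, mh_type and M_PKTHDR are the identifiers mentioned in the text.

    /* Simplified sketch -- not the real /usr/include/sys/mbuf.h declarations. */
    #define MSIZE    256                /* fixed size of every mbuf               */
    #define MCLBYTES 2048               /* size of an external cluster buffer     */

    struct mbuf_sketch {
        struct mbuf_sketch *mh_next;    /* next mbuf of the same packet (chain)   */
        struct mbuf_sketch *mh_nextpkt; /* first mbuf of the next packet (queue)  */
        int                 mh_type;    /* M_PKTHDR marks the head of a chain     */
        int                 mh_len;     /* payload bytes stored in this mbuf      */
        char               *mh_data;    /* points into the internal storage, or
                                         * into an attached 2048-byte cluster
                                         * (m_ext) if the data does not fit       */
        /* internal storage: MSIZE minus 20 to 40 bytes of header overhead        */
    };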
4.2 pool

Nowadays OpenBSD uses pool(9) routines to allocate kernel memory. This system is designed for fast allocation (and freeing) of fixed-size structures, like mbufs.

There are several advantages in using pool(9) routines instead of the ones around malloc(9):

- faster than malloc by caching constructed objects

- cache coloring (using offsets to use the processor cache more efficiently with real-world hardware and programming techniques)

- avoids heavy fragmentation of available memory, thus wasting less of it

- provides watermarks and callbacks, giving feedback about pool usage over time

- only needs to be in kmem_map if used from interrupts

- can use different backend memory allocators per pool

- the VM can reclaim free chunks before paging occurs, though not beyond a limit (Maxpg)

If userland applications are running on OpenBSD (> 3.0), pool(9) routines will be used automatically. But it's interesting for people who plan (or already do) to write their own kernel routines, where using pool(9) could gain significant performance improvements.

Additionally, large chunks formerly in the kmem_map have been relocated to the kernel_map by using pools. Allocations for inodes, vnodes, .. have been removed from kmem_map, thus there is more space for mbufs, which need protection against interrupt reentrancy if used for e. g. incoming network data from the NIC 13.

13 kmem_map has to be protected by splvm(), see spl(9).
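For readers who plan to write their own kernel routines with pool(9), here is a hedged sketch of how the interface is typically used. The routines pool_init(), pool_get(), pool_put() and pool_setlowat() and the flag PR_NOWAIT come from pool(9); everything else (struct foo, the pool name "foopl", the chosen values) is invented for illustration. The exact pool_init() argument list has changed between OpenBSD releases, so consult the manual page of the release in use rather than copying this verbatim.

    #include <sys/param.h>
    #include <sys/pool.h>

    struct foo {                         /* hypothetical fixed-size object        */
        int  f_id;
        char f_payload[64];
    };

    static struct pool foo_pool;

    void
    foo_pool_init(void)
    {
        /* create one pool of fixed-size objects; "foopl" is its name */
        pool_init(&foo_pool, sizeof(struct foo), 0, 0, 0, "foopl", NULL);
        pool_setlowat(&foo_pool, 8);     /* keep some constructed objects cached  */
    }

    struct foo *
    foo_alloc(void)
    {
        /* PR_NOWAIT: fail instead of sleeping, e. g. from interrupt context */
        return (pool_get(&foo_pool, PR_NOWAIT));
    }

    void
    foo_free(struct foo *f)
    {
        pool_put(&foo_pool, f);          /* object returns to the pool's cache    */
    }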
5 Memory Measurement

Obviously one wants to know about memory exhaustion before it occurs. Additionally it can be of interest which process or task is using memory. There are several tools provided in the base OpenBSD system for a rough monitoring of what is going on. For detailed analysis one has to be able to read and interpret the values provided by those tools, but sometimes one needs more detail and can then rely on 3rd-party tools.

Example outputs of the tools mentioned can be found in the Appendix.

5.1 Common tools

These are tools provided with OpenBSD, of which some are rather well-known, but some are not. In any case, we have found that the tools are often used in a wrong fashion or the outputs are misinterpreted. It's quite important to understand what is printed out, even if it's a "known tool".

top

One of the most used tools is top(1). It shows the current memory usage of the system. In detail one can see the following entries:

Real: 68M/117M act/tot, where 68MB are currently used and another 49MB are allocated, but not currently used and may be subject to being freed.

Free: 3724K, shows the amount of free physical RAM.

Swap: 24M/256M used/tot, 24MB of the 256MB of currently available swap space is used.
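Put together, these three entries form the memory status line of top(1). Reassembled from the values quoted above (the exact layout differs slightly between OpenBSD versions, and the numbers are only those of this example machine), it looks roughly like this:

    Memory: Real: 68M/117M act/tot  Free: 3724K  Swap: 24M/256M used/tot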
If one adds 3724kB to 117MB, the machine would have nearly 122MB of RAM. This is, of course, not true. It has 128MB of RAM; the "missing" 6MB are used as filesystem-cache 14.

14 dmesg: using 1658 buffers containing 6791168 bytes (6632K) of memory