Running and tuning OpenBSD network servers
in a production environment
Philipp Bühler
sysfive.com GmbH
pb@sysfive.com
Henning Brauer
BS Web Services
hb@bsws.de
October 8, 2002
Abstract

Heavily loaded network servers can experience resource exhaustion. At best, resource exhaustion will slow server response, but left uncorrected, it can result in a crash of the server.

In order to understand and prevent such situations, a knowledge of the internal operation of the operating system is required, especially how memory management works.

This paper will provide an understanding of the memory management of OpenBSD, how to monitor the current status of the system, why crashes occur and how to prevent them.

1 Motivation

Our main motivation for this paper was the lack of comprehensive documentation about tuning network servers running under OpenBSD [Ope02], especially with regard to the memory usage of the networking code in the kernel.

Either one can get general information, or one is "left alone" with the source code. This paper outlines how to deal with these issues without reading the source code. At least one does not need to start in "nowhere-land" and dig through virtually everything.

This paper aims to give a deeper understanding of how the kernel handles connections and interacts with userland applications like the Apache webserver.

2 Resource Exhaustions

Running a publicly accessible server can always lead to unexpected problems. Typically it happens that resources get exhausted. There are numerous reasons for this, including:

Low Budget: There's not enough money to buy "enough" hardware which would run an untuned OS.

Peaks: Overload situations which can be expected (e. g. special use) or not (e. g. getting "slashdotted").

DoS: Denial-of-Service by attackers flooding the server.

No matter what reason leads to an exhaustion, there are also different types of resources which can suffer from such a situation. We briefly show common types and countermeasures. Afterwards we go into detail about memory exhaustion.

2.1 I/O Exhaustion

It's very typical for network servers to suffer in this area. Often people just add more CPU to "help" a slowly reacting server, but this wouldn't help in such a case.

Usually one can detect such an exhaustion by using vmstat(8) or systat(8); detailed usage is shown in Section 5.1. There are numerous possible I/O "bottlenecks", but one typical indication is the CPU being mostly idle while blocked processes wait for resources. Further distinctions can be made:

Disk
The process is waiting for blocks from (or to) the disk and cannot run on the CPU, even if the CPU is idle. This case could be resolved by moving from IDE to SCSI, and/or using RAID technology. If repetitive writes/reads are being done, an increase of the filesystem-cache could also help 1. The filesystem-cache can be configured with the kernel option BUFCACHEPERCENT 2.
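As an illustration only: the option name comes from options(4), but the chosen value and the comment are hypothetical, and the accepted range and default depend on the release in use. Such a line would be added to a copy of the GENERIC configuration and the kernel rebuilt with config(8) and make:

    # added to a copy of the GENERIC kernel configuration (i386)
    option  BUFCACHEPERCENT=20   # let the filesystem-cache use up to 20% of RAM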
NIC

Choosing the right network card is important for busy servers. There are lots of low-end models like the whole Realtek range. These cards are relatively dumb themselves. On the other hand, there are chipsets with more intelligence. DEC's 21143, supported by the dc(4) driver, and Intel's newer chipsets, supported by the fxp(4) driver, have been proven to work well in high-load circumstances.

Low-end cards usually generate an interrupt for every packet received, which leads to the problems we describe in the next subsection. By using better cards, like the mentioned DEC and Intel ones, packets are getting combined, thus reducing the number of interrupts.

Another important point is the physical media interface, e. g. sqphy(4). Noise and distortion are a normal part of network communications; a good PHY will do a better job of extracting the data from the noise on the wire than a poor PHY will, reducing the number of network retransmissions required.

It might be a good idea to use Gigabit cards, even when running 100 MBit/s only. They are obviously built for much higher packet rates (and this is the real problem, not bandwidth) than FastEthernet ones, thus have more intelligence of their own and deal better with high loads.

IRQ

Every interrupt requires a context switch, from the process running when the IRQ took place to the interrupt handler. As a number of things must be done upon entering the interrupt handler, a large quantity of interrupts can result in excess time required for context switching. One non-obvious way to reduce this load is to share interrupts between the network adapters, something permitted on the PCI bus. As many people are not even aware of the possibility of interrupt sharing, and the benefits are not obvious, let's look at this a little closer.

With separate adapters on separate interrupt lines, when the first interrupt comes in, a context switch to the interrupt handler takes place. If another interrupt comes in from the other adapter while the first interrupt is still being handled, it will either interrupt the first handler, or be delayed until the first handler has completed, depending on priority; but regardless, two additional context switches will take place: one into the second handler, one back out.

In the case of the PCI and EISA busses, interrupts are level triggered, not edge triggered, which makes interrupt sharing possible. As long as the interrupt line is held active, a device needs servicing, even if the first device which triggered the interrupt has already been serviced. So, in this case, when the first adapter triggers the interrupt, there will be a context switch to the handler. Before the handler returns, it will see if any other devices need servicing, before doing a context switch back to the previous process.

In a busy environment, when many devices need service, saving these context switches can significantly improve performance by permitting the processor to spend more time processing data rather than switching between tasks. In fact, in a very high load situation, it may be desirable to switch the adapters and drivers from an interrupt-driven mode to a polling mode, though this is not supported on OpenBSD at this time.
1 Though this has implications on the KVM, see the appropriate section.
2 For most kernel configurations, see options(4) and config(8).
2.2 CPU Exhaustion

Of course the CPU can also be overloaded while other resources are still fine. Besides buying more CPU power, which is not always possible, there are other ways to resolve this problem. The most common cases are:

CGI: Excessive usage of CGI scripts, usually written in interpreted languages like PHP or Perl. Better (resource-wise) coding can help, as well as using modules like mod_perl 3 to reduce load.

RDBM: Usually those CGI scripts use a database. Optimization of the connections and queries (indexing, ..) is one way. There is also the complete offloading of the database to a different machine 4.

SSL: Especially e-commerce systems or online banking sites suffer here. OpenBSD supports hardware accelerators 5. Typical cryptographic routines used for SSL/TLS can be offloaded to such cards in a transparent manner, thus freeing CPU time for processing requests.

3 Memory Exhaustion

Another case of overloading can be the exhaustion of memory resources. The speed of the allocator for memory areas also has significant influence on the overall performance of the system.

3.1 Virtual Memory (VM)

VM is comprised of the physical RAM and possible swap space(s). Processes are loaded into this area and use it for their data structures. While the kernel doesn't really care about the current location of the process' memory space (or address space), it is recommended that especially the most active tasks (like the webserver application) never be swapped out or even subjected to paging.

With regard to reliability it's not critical if the amount of physical RAM is exhausted and heavy paging occurs, but performance-wise this should not happen. The paging could compete for disk I/O with the server task, thus slowing down the general performance of the server. And, naturally, harddisks are slower than RAM by orders of magnitude.

It's most likely that countermeasures are taken after the server starts heavy paging, but it could happen that the swap space, and thus the whole VM, is also exhausted. If this occurs, sooner or later the machine will crash.

Even if one doesn't plan for the server to start paging out memory from RAM to swap, there should be some swap space. This prevents a direct crash if the VM is exhausted. If swap is being used, one has to determine whether this was a one-time-only peak, or whether there is a general increase of usage on the paging server. In the latter case one should upgrade RAM as soon as possible.

In general it's good practice to monitor the VM usage, especially to track down when the swap space is being touched. See Section 5 for details.

3.2 Kernel Virtual Memory (KVM)

Besides VM there is a reserved area solely for kernel tasks. On the common i386 architecture (IA-32) the virtual address space is 4GB. Since the 3.2 release, the OpenBSD/i386 kernel reserves 768MB (formerly 512MB) of this space for kernel structures, called KVM.

KVM is used for addressing the needs of managing any hardware in the system and for small allocations 6 needed by syscalls. The biggest chunks being used are the management of the VM (RAM and swap), the filesystem-cache and the storage of network buffers (mbuf).

3 This can have security implications, but this is another story.
4 This could be unfeasible due to an already overloaded network or due to budget constraints.
5 crypto(4)
6 like pathname translations
Contrary to userland, kernel allocations cannot be paged out ("wired pages"). Actually it's possible to have pageable kernel memory, but this is rarely used (e. g. for pipe buffers) and not a concern in the current context. Thus, if the KVM is exhausted, the server will immediately crash. Of course 768MB is the limit, but if there is less RAM available, that is the absolute limit for wired pages. Non-interrupt-safe pages could be paged out, but this is a rare exception.

Since RAM has to be managed by kernel maps as well, it's not wise to just upgrade RAM without need. More RAM leaves less space for other maps in KVM. Monitoring the "really" needed amount of RAM is recommended if KVM exhaustions occur. For example, 128MB for a firewall is usually more than enough. Look at Section 7.2 for a typical hardware setup of a busy firewall.

This complete area is called kernel_map in the source and has several "submaps" 7. One main reason for this is the locking of the address space. By this mapping, other areas of the kernel can stay unlocked while another map is locked.

The main submaps are kmem_map, pager_map, mb_map and exec_map. The allocation is done at boot-time and is never freed; the size is either a compile-time or a boot-time option to the kernel.

7 see /sys/uvm/uvm_km.c

4 Resource Allocation

Since the exhaustion of KVM is the most critical situation one can encounter, we will now concentrate on how those memory areas are allocated.

Userland applications cannot allocate the KVM needed for network routines directly. KVM is completely protected from userland processes, thus there have to be routines to pass data over this border. Userland can use a syscall(2) to accomplish that. For the case of networking the process would use socket(2) related calls, like bind(2), recv(2), etc.

Having this layer between userland and kernel, we will concentrate on how the kernel is allocating memory; the userland process has no direct influence on this. The indirect influence is the sending and receiving of data to or from the kernel by the userland process. For example, the server handles a lot of incoming network data, which will fill up buffer space (mbufs) within the KVM. If the userland process is not handling this data fast enough, KVM could be exhausted. Of course the same is true if the process is sending data faster than the kernel can release it to the media, thus freeing KVM buffers.

4.1 mbuf

Historically, BSD uses mbuf(9) 8 routines to handle network related data. An mbuf is a data structure of a fixed size of 256 bytes 9. Since there is overhead for the mbuf header (m_hdr{}) itself, the payload is reduced by at least 20 bytes and up to 40 bytes 10.

The additional 20 bytes of overhead appear if the requested data doesn't fit within two mbufs. In such a case an external buffer, called a cluster, with a size of 2048 bytes 11, is allocated and referenced by the mbuf (m_ext{}).

Mbufs belonging to one payload packet are "chained" together by a pointer mh_next. mh_nextpkt points to the next chain, forming a queue of network data which can be processed by the kernel. The first member of such a chain has to be a "packet header" (mh_type M_PKTHDR).

Allocations of mbufs and clusters are obtained by macros (MGET, MCLGET, ..). Before the release of OpenBSD 3.0 those macros used malloc(9) to obtain memory resources.

If there were a call to MGET but no more space were left in the corresponding memory map, the kernel would panic 12.

8 memory buffer
9 defined by MSIZE
10 see /usr/include/sys/mbuf.h for details
11 defined by MCLBYTES
12 "malloc: out of space in kmem_map"
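To visualize the mbuf bookkeeping described above, the following simplified C sketch shows the fields the text refers to. It is an illustration only, not the real declarations from /usr/include/sys/mbuf.h; the name mbuf_sketch is invented for this example, while MSIZE, MCLBYTES, mh_next, mh_nextpkt, mh_type and M_PKTHDR are the identifiers mentioned in the text.

    /* Simplified sketch -- not the real /usr/include/sys/mbuf.h declarations. */
    #define MSIZE    256                /* fixed size of every mbuf               */
    #define MCLBYTES 2048               /* size of an external cluster buffer     */

    struct mbuf_sketch {
        struct mbuf_sketch *mh_next;    /* next mbuf of the same packet (chain)   */
        struct mbuf_sketch *mh_nextpkt; /* first mbuf of the next packet (queue)  */
        int                 mh_type;    /* M_PKTHDR marks the head of a chain     */
        int                 mh_len;     /* payload bytes stored in this mbuf      */
        char               *mh_data;    /* points into the internal storage, or
                                         * into an attached 2048-byte cluster
                                         * (m_ext) if the data does not fit       */
        /* internal storage: MSIZE minus 20 to 40 bytes of header overhead        */
    };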
4.2 pool

Nowadays OpenBSD uses pool(9) routines to allocate kernel memory. This system is designed for fast allocation (and freeing) of fixed-size structures, like mbufs.

There are several advantages in using pool(9) routines instead of the ones around malloc(9):

- faster than malloc by caching constructed objects

- cache coloring (using offsets to use the processor cache more efficiently with real-world hardware and programming techniques)

- avoids heavy fragmentation of available memory, thus wasting less of it

- provides watermarks and callbacks, giving feedback about pool usage over time

- only needs to be in kmem_map if used from interrupts

- can use different backend memory allocators per pool

- the VM can reclaim free chunks before paging occurs, though not beyond a limit (Maxpg)

If userland applications are running on OpenBSD (> 3.0), pool(9) routines will be used automatically. But it's interesting for people who plan (or already do) to write their own kernel routines, where using pool(9) could gain significant performance improvements.

Additionally, large chunks formerly in the kmem_map have been relocated to the kernel_map by using pools. Allocations for inodes, vnodes, .. have been removed from kmem_map, thus there is more space for mbufs, which need protection against interrupt reentrancy if used for e. g. incoming network data from the NIC 13.

13 kmem_map has to be protected by splvm(), see spl(9).
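For readers who plan to write their own kernel routines with pool(9), here is a hedged sketch of how the interface is typically used. The routines pool_init(), pool_get(), pool_put() and pool_setlowat() and the flag PR_NOWAIT come from pool(9); everything else (struct foo, the pool name "foopl", the chosen values) is invented for illustration. The exact pool_init() argument list has changed between OpenBSD releases, so consult the manual page of the release in use rather than copying this verbatim.

    #include <sys/param.h>
    #include <sys/pool.h>

    struct foo {                         /* hypothetical fixed-size object        */
        int  f_id;
        char f_payload[64];
    };

    static struct pool foo_pool;

    void
    foo_pool_init(void)
    {
        /* create one pool of fixed-size objects; "foopl" is its name */
        pool_init(&foo_pool, sizeof(struct foo), 0, 0, 0, "foopl", NULL);
        pool_setlowat(&foo_pool, 8);     /* keep some constructed objects cached  */
    }

    struct foo *
    foo_alloc(void)
    {
        /* PR_NOWAIT: fail instead of sleeping, e. g. from interrupt context */
        return (pool_get(&foo_pool, PR_NOWAIT));
    }

    void
    foo_free(struct foo *f)
    {
        pool_put(&foo_pool, f);          /* object returns to the pool's cache    */
    }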
5 Memory Measurement

Obviously one wants to know about memory exhaustion before it occurs. Additionally it can be of interest which process or task is using memory. There are several tools provided in the base OpenBSD system for a rough monitoring of what is going on. For detailed analysis one has to be able to read and interpret the values provided by those tools, but sometimes one needs more detail and can then rely on 3rd-party tools.

Example outputs of the tools mentioned can be found in the Appendix.

5.1 Common tools

These are tools provided with OpenBSD, of which some are rather well-known, but some are not. In any case, we have found that the tools are often used in a wrong fashion or the outputs are misinterpreted. It's quite important to understand what is printed out, even if it's a "known tool".

top

One of the most used tools is top(1). It shows the current memory usage of the system. In detail one can see the following entries:

Real: 68M/117M act/tot, where 68MB are currently used and another 49MB are allocated, but not currently used and may be subject to being freed.

Free: 3724K, shows the amount of free physical RAM.

Swap: 24M/256M used/tot, 24MB of the 256MB of currently available swap space is used.
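Put together, these three entries form the memory status line of top(1). Reassembled from the values quoted above (the exact layout differs slightly between OpenBSD versions, and the numbers are only those of this example machine), it looks roughly like this:

    Memory: Real: 68M/117M act/tot  Free: 3724K  Swap: 24M/256M used/tot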
If one adds 3724kB to 117MB, the machine would have nearly 122MB of RAM. This is, of course, not true. It has 128MB of RAM; the "missing" 6MB are used as filesystem-cache 14.

14 dmesg: using 1658 buffers containing 6791168 bytes (6632K) of memory