Adrian Cockcroft's Blog

Monday, October 31, 2005

Help! I've lost my memory! Updated Sunworld Column

Originally published in Unix Insider 10/1/95
Stripped of adverts, url references fixed and comments added in red type to bring it up to date ten years later.

Dear Adrian, After a reboot I saw that most of my computer's memory was
free, but when I launched my application it used up almost all the
memory. When I stopped the application the memory didn't come back!
Take a look at my vmstat output:

% vmstat 5
procs     memory            page            disk          faults      cpu
r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id

This is before the program starts:

0 0 0 330252 80708   0   2  0  0  0  0  0  0  0  0  1   18  107  113  0  1 99
0 0 0 330252 80708   0   0  0  0  0  0  0  0  0  0  0   14   87   78  0  0 99

I start the program and it runs like this for a while:

0 0 0 314204  8824   0   0  0  0  0  0  0  0  0  0  0  414  132   79 24  1 74
0 0 0 314204  8824   0   0  0  0  0  0  0  0  0  0  0  411   99   66 25  1 74

I stop it, then almost all the swap space comes back, but the free memory does not:

0 0 0 326776 21260   0   3  0  0  0  0  0  0  1  0  0  420  116   82  4  2 95
0 0 0 329924 24396   0   0  0  0  0  0  0  0  0  0  0  414   82   77  0  0 100
0 0 0 329924 24396   0   0  0  0  0  0  0  0  2  0  1  430   90   84  0  1 99

I checked that there were no application processes running. It looks like a huge memory leak in the operating system. How can I get my memory back?
--RAMless in Ripon

Update

This remains one of the most frequently asked questions of all time. The
original answer is still true for many Unix variants. However while
writing his book on Solaris Internals, Richard McDougall worked out
how to fix Solaris to make it work better, and to make this aparrent
problem go away. The result was one of the most significant
performance improvements in Solaris 8, but the first edition of his
book was written before Solaris 8 came out, so doesn't describe the
fix!

The short answer

Launch your application again. Notice that it starts up more quickly than it did the first time, and with less disk activity. The application code and its data files are still
in memory, even though they are not active. The memory they occupy is
not "free." If you restart the same application it finds
the pages that are already in memory. The pages are attached to the
inode cache entries for the files. If you start a different
application, and there is insufficient free memory, the kernel will
scan for pages that have not been touched for a long time, and "free"
them. Once you quit the first application, the memory it occupies is
not being touched, so it will be freed quickly for use by other
applications.

In 1988, Sun introduced this feature in SunOS 4.0. It still applies to
all versions of Solaris 1 and 2. The kernel is trying to avoid disk
reads by caching as many files as possible in memory. Attaching to a
page in memory is around 1,000 times faster than reading it in from
disk. The kernel figures that you paid good money for all of that
RAM, so it will try to make good use of it by retaining files you
might need.

Since Solaris 8, the memory in the file cache is actually also on the free
list, so you do see vmstat free memory reduce when you quit a
program. You also should expect large amounts of file I/O to cause
high scan rates on older Solaris releases, and for there to be no
scanning at all on Solaris 8 systems. If Solaris 8 scans at all, then
it has truly run out of memory and is overloaded.

By contrast, Memory leaks appear as a shortage of swap space after the
misbehaving program runs for a while. You will probably find a
process that has a larger than expected size. You should restart the
program to free up the swap space, and check it with a debugger that
offers a leak-finding feature (run it with the libumem version of the malloc library that instruments memory leaks).

The long (and technical) answer

To understand how Sun's operating systems handle memory, I will explain how the inode cache works, how the buffer cache fits into the picture, and how the life
cycle of a typical page evolves as the system uses it for several
different purposes.

The inode cache and file data caching

Whenever you access a file, the kernel needs to know the size, the access permissions,
the date stamps and the locations of the data blocks on disk.
Traditionally, this information is known as the inode for the file.
There are many filesystem types. For simplicity I will assume we are
only interested in the Unix filesystem (UFS) on a local disk. Each
filesystem type has its own inode cache.

The filesystem stores inodes on the disk; the inode must be read into
memory whenever an operation is performed on an entity in the
filesystem. The number of inodes read per second is reported as
iget/s by the sar -a command. The inode read from disk is cached in case
it is needed again, and the number of inodes that the system will
cache is influenced by a kernel parameter called ufs_ninode.
The kernel keeps inodes on a linked list, rather than in a fixed-size
table.

As I mention each command I will show you what the output looks like. In
my case I'm collecting sar
data automatically using cron.
sar, which defaults to
reading the stored data for today. If you have no stored data,
specify a time interval and sar
will show you current activity.

% sar -a

SunOS hostname 5.4 Generic_101945-32 sun4c    09/18/95

00:00:01  iget/s namei/s dirbk/s
01:00:01       4       6       0

All reads or writes to UFS files occur by paging from the filesystem. All
pages that are part of the file and are in memory will be attached to
the inode cache entry for that file. When a file is not in use, its
data is cached in memory, using an inactive inode cache entry. When
the kernel reuses an inactive inode cache entry that has pages
attached, it puts the pages on the free list; this case is shown by
sar -g as %ufs_ipf.
This number is the percentage of UFS inodes that were overwritten in
the inode cache by iget and
that had reusable pages associated with them. The kernel flushes the
pages, and updates on disk any modified pages. Thus, this %ufs_ipf
number is the percentage of igets with page flushes. Any non-zero
values of %ufs_ipf reported by sar -g
indicate that the inode cache is too small for the current workload.

% sar -g

SunOS hostname 5.4 Generic_101945-32 sun4c    09/18/95

00:00:01  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
01:00:01     0.02     0.02     0.08     0.12     0.00

For SunOS 4 and releases up to Solaris 2.3, the number of inodes that the
kernel will keep in the inode cache is set by the kernel variable
ufs_ninode. To simplify: When a file is opened, an inactive
inode will be reused from the cache if the cache is full; when an
inode becomes inactive, it is discarded if the cache is over-full. If
the cache limit has not been reached then an inactive inode is placed
at the back of the reuse list and invalid inodes (inodes for files
that longer exist) are placed at the front for immediate reuse. It is
entirely possible for the number of open files in the system to cause
the number of active inodes to exceed ufs_ninode; raising
ufs_ninode allows more inactive inodes to be cached in case
they are needed again.

Solaris 2.4 uses a more clever inode cache algorithm. The kernel maintains a
reuse list of blank inodes for instant use. The number of active
inodes is no longer constrained, and the number of idle inodes
(inactive but cached in case they are needed again) is kept between
ufs_ninode and 75 percent of ufs_ninode by a new
kernel thread that scavenges the inodes to free them and maintains
entries on the reuse list. If you use sar -v to look at the inode cache, you may see a larger
number of existing inodes than the reported "size."

% sar -v

SunOS hostname 5.4 Generic_101945-32 sun4c    09/18/95

00:00:01  proc-sz    ov  inod-sz    ov  file-sz    ov   lock-sz
01:00:01   66/506     0 2108/2108    0  353/353     0    0/0

Buffer cache

The buffer cache is used to cache filesystem
data in SVR3 and BSD Unix. In SunOS 4, generic SVR4, and Solaris 2,
it is used to cache inode, indirect block, and cylinder group blocks
only. Although this change was introduced in 1988, many people still
incorrectly think the buffer cache is used to hold file data. Inodes
are read from disk to the buffer cache in 8-kilobyte blocks, then the
individual inodes are read from the buffer cache into the inode
cache.

Life cycle of a typical physical memory page

This section provides additional insight into the way memory is used. The sequence
described is an example of some common uses of pages; many other
possibilities exist.

1. Initialization -- A page is born: When the system boots, it forms all free memory into pages, and allocates
a kernel data structure to hold the state of every page in the
system.
2. Free -- An untouched virgin page: All the memory is put onto the free list to start with. At this stage the
content of the page is undefined.
3. ZFOD -- Joining an uninitialized data segment: When a program accesses data that is preset to zero for the very first
time, a minor page fault occurs and a Zero Fill On Demand (ZFOD)
operation takes place. The page is taken from the free list,
block-cleared to contain all zeroes, and added to the list of
anonymous pages for the uninitialized data segment. The program then
reads and writes data to the page.
4. Scanned -- The pagedaemon awakes: When the free list gets below a certain size, the pagedaemon starts to
look for memory pages to steal from processes. It looks at all pages
in physical memory order; when it gets to the page, the page is
synchronized with the memory management unit (MMU) and a reference
bit is cleared.
5. Waiting -- Is the program really using this page right now?: There is a delay that varies depending upon how quickly the pagedaemon
scans through memory. If the program references the page during this
period, the MMU reference bit is set.
6. Pageout Time -- Saving the contents: The pageout daemon returns and checks the MMU reference bit to find that
the program has not used the page so it can be stolen for reuse. The
pagedaemon checks to see if anything had been written to the page;
if it contains no data, a page-out occurs. The page is moved to the
pageout queue and marked as I/O pending. The swapfs code clusters
the page together with other pages on the queue and writes the
cluster to the swap space. The page is then free and is put on the
free list again. It remembers that it still contains the program
data.
7. Reclaim -- Give me back my page!: Belatedly, the program tries to read the page and takes a page fault. If the
page had been reused by someone else in the meantime, a major fault
would occur and the data would be read from the swap space into a
new page taken from the free list. In this case, the page is still
waiting to be reused, so a minor fault occurs, and the page is moved
back from the free list to the program's data segment.
8. Program Exit -- Free again: The program finishes running and exits. The data segments are private to
that particular instance of the program (unlike the shared-code
segments), so all the pages in the data segment are marked as
undefined and put onto the free list. This is the same state as Step
2.
9. Page-in -- A shared code segment: A page fault occurs in the code segment of a window system shared library.
The page is taken off the free list, and a read from the filesystem
is scheduled to get the code. The process that caused the page fault
sleeps until the data arrives. The page is attached to the inode of
the file, and the segments reference the inode.
10. Attach -- A popular page: Another process using the same shared-library page faults in the same place.
It discovers that the page is already in memory and attaches to the
page, increasing its inode reference count by one.
11. COW -- Making a private copy: If one of the processes sharing the page tries to write to it, a
copy-on-write (COW) page fault occurs. Another page is grabbed from
the free list, and a copy of the original is made. This new page
becomes part of a privately mapped segment backed by anonymous
storage (swap space) so it can be changed, but the original page is
unchanged and can still be shared. Shared libraries contain jump
tables in the code that are patched, using COW as part of the
dynamic linking process.
12. File Cache -- Not free: The entire window system exits, and both processes go away. This time
the page stays in use, attached to the inode of the shared library
file. The inode is now inactive but will stay in the inode cache
until it is reused, and the pages act as a file cache in case the
user is about to restart the window system again. The
change made in Solaris 8 was that the file cache is the tail of the
free list, and any file cache page can be reused immediately for
something else without needing to be scanned first.
13. fsflush -- Flushed by the sync: Every 30 seconds all the pages in the system are examined in physical page
order to see which ones contain modified data and are attached to a
vnode. The details differ between SunOS 4 and Solaris 2, but
essentially any modified pages will be written back to the
filesystem, and the pages will be marked as clean.
This example sequence can continue from Step 4 or
Step 9 with minor variations. The fsflush
process occurs every 30 seconds by default for all pages, and
whenever the free list size drops below a certain value, the
pagedaemon scanner wakes up and reclaims some pages. A
recent change in Solaris 10, backported to Solaris 8 and 9 patches,
makes fsflush run much more efficiently on machines with very large
amounts of memory. However, if you see fsflush using an excessive
amount of CPU time you should increase “autoup” in /etc/system
from its default of 30s, and you will see fsflush usage reduce
proportionately.

Now you know

I have seen this missing-memory question
asked about once a month since 1988! Perhaps the manual page for
vmstat should include a better explanation of what the
values are measuring. This answer is based on some passages from my
book Sun Performance and Tuning. The book explains in detail how the
memory algorithms work and how to tune them. However the book doesn't cover the changes made in Solaris 8.

Tuesday, October 25, 2005

SunWorld Columns at ITworld

I found that the columns I wrote between 1995 and 1999 all seems to be online, but its hard to find, so I'm going to start by just listing everything I can find in its original form. BEWARE! Many of these articles were obsoleted by developments in later releases of Solaris, or refer to dead URL's so I'll work through them in subsequent blog entries and provide commentary and updates.

2001/03: Collected Short Questions and Answers
1999/08: What does 100 percent busy mean?
1999/07: Disk Error Detection
1999/03: Digging into the details of WorkShop 5.0
1999/02: SyMON and SE get upgraded
1998/12: Out and about at conferences
1998/11: What's new in Solaris 7?
1998/10: IOwait, what's the holdup?
1998/09: Do collision levels accurately tell you the real story about your Ethernet?
1998/08: Unlocking the kernel
1998/07: Clearing up swap space confusion
1998/06: How busy is the CPU, really?
1998/05: Processor partitioning
1998/04: Prying into processes and workloads
1998/03: Sizing up memory in Solaris
1998/02: Perfmeter unmasked
1998/01: SE Toolkit FAQ
1997/12: At last! The updated SE release has arrived
1997/11: Performance perplexities: Help! Where do I start?
1997/10: Learn to performance tune your Java programs
1997/09: Clarifying disk measurements and terminology
1997/08: How does Solaris 2.6 improve performance stats and Web performance?
1997/07: Dissecting proxy Web cache performance
1997/06: Analysis of TCP transfer characteristics for Web servers made easier
1997/05: The memory go round
1997/04: Craving more books on Solaris? Look no further
1997/03: How to optimize caching file accesses
1997/02: Increase system performance by maximizing your cache
1997/01: Design your cache to match your applications
1996/12: Tips for TCP/IP monitoring and tuning
1996/11: The right disk configurations for servers
1996/10: Solving the iostat disk mystery
1996/09: Unveiling vmstat's charms
1996/08: What's the best way to probe processes?
1996/07 Monitor your Web server in realtime, Part 2
1996/07: How can I optimize my programs for UltraSPARC?
1996/06: How do disks really work?
1996/05: How much RAM is enough?
1996/04: Monitor your Web server in real time, Part 1
1996/04: How does swap space work?
1996/03: Watching your Web server
1996/02: Which is better, static or dynamic linking?
1996/01: What are the tunable kernel parameters for Solaris 2?
1995/12: Performance Q&A Compendium
1995/11: When is it faster to have 64 bits?
1995/10 Help! I've lost my memory!

Sunday, October 02, 2005

How busy is your CPU, really?

Just in case you thought that you could compare your CPU utilization data across Solaris releases I have a few words of caution.

To start with there is the whole problem of interrupts, do they count as system time, or do they just make whatever they interrupted take longer?

Then there is the question of wait-for-io, and who is waiting for which io? This is a form of idle time that tends to confuse people as it doesn't really mean anything once you have more than one CPU.

There is one mechanism used to report the systemwide CPU utilization data. This data is reported in a per-cpu kstat data structure and is used by every tool that ever reports CPU usr, sys, wio, idle etc. Most tools sum the data over all the CPUs, mpstat gives you the per CPU data. The form of the data is a number of ticks of CPU time that accumulates, starting at zero at boot time. To measure CPU utilization over a time interval, you measure the difference in the number of ticks and divide by the time, and the tick rate. The tick rate is set by the clock interrupt, it defaults to 100Hz, but can be set to 1000Hz.

There are two mechanisms used to measure CPU usage, one is the old method of using the clock interrupt to see what is running every 100Hz. This is low resolution, and since the clock wakes up jobs, its possible for jobs to hide between the ticks, so its a statistically biased measure.
The other mechanism is microstate accounting, where every change of state is timed with a hires clock. This was only used for tracking CPU usage by processes, and needed special calls to get the data in releases up to Solaris 9.

So how do those nice numbers you get from vmstat or your favourite tool vary?

Solaris 8 and earlier releases: interrupt time does not get classified as system time

Solaris 9: Interrupt time is counted as system time.

Solaris 10: wait-for-io time will always be zero. The ratio of confusion to enlightenment was too high, so the entire concept has been removed from Solaris. This is a good thing.

Solaris 10 initial release: systemwide CPU time is now measured using microstates, this is far more accurate than ticks, but somehow the interrupt time ended up spread out rather than in system time.

This is unfortunate, but I'm told that an update of Solaris 10 will again classify interrupts as system time, and then it will all be just about as accurate as it can be.

You may be curious about the size of errors in these measurements, the answer is that "it depends". There is an SE toolkit script called cpuchk.se that looks at the process data and compares the ticks with the microstate data. However, I don't have any measures of interrupt time differences.

SE toolkit 3.4 on Solaris 10 Opteron Workaround

The current 3.4 build of SE is available from the sunfreeware site and supports Solaris 8, 9 and 10 for SPARC and x86. It also includes full source code of the interpreter for the first time.
However it only includes 32bit x86 support, and when run on a 64bit Solaris kernel on an Opteron system, it will fail to run. This is due to the return value from the isalist command returning amd64 as the first word, rather than pentium. You can get it going with a workaround by changing the startup script /opt/RICHPse/bin/se to match the string "*pentium*" rather than "pentium*". This lets the 32bit x86 binary run on the 64bit amd64 system, and most of the scripts will still work. Some that try to access the kernel directly viakvm will fail, but most scripts use kstat which doesn't need 64bit accesses.
The real fix is to use the source to compile an amd64 build....

Monday, September 12, 2005

SunWorld Offline

The monthly columns I wrote for SunWorld Online between 1995 and 1999 seem to have gone offline. At one point after SunWorld shut down I got permission from the editors at ITworld to take back the content, remove the adverts, and host it on the Sun site. Since I left Sun in 2004 the link has been closed down. I think there is still some useful content and historical interest in the columns, which seem to have been copied to a few other places already. I'm going to dig them out of my archives, and repost them here, along with comments and perhaps a few corrections and updates.

The original idea for the column came from a featured article on the Sun homepage in 1995 where I talked about the release of my first book, Sun Performance and Tuning - SPARC and Solaris. This generated an offer from the editors at SunWorld Online to do a monthly Performance Q&A column. I thought it worked out very well. I got paid for the column, which forced me to write something every month, and I based the columns on questions from the readers, or parts of the original book that needed to be updated, or parts of the second edition that I was writing during this period.

The monthly column seemed to have the effect of positioning me as the expert to quite a wide public, and I'm sure it helped me get promoted to Distinguished Engineer in 1999. I stopped writing the column due to a combination of being too busy, running out of topics to write about, and the introduction of the BluePrint program.

In effect, several engineers had created books and columns in their spare time, and we worked with management to make it an official program for publishing Sun's best practices on how to configure products and combinations of products. This made it our day job, which was easier in some ways, and we had support staff and more detailed reviews, but we didn't get paid royalties for doing the BluePrint columns or books. After a few years and several books (The very first BluePrint on Resource Management, and one on Capacity Planning) I moved on from the BluePrint group, but the program is still running and there is a large body of work there now.

There are over forty columns. I could post them en-masse unedited but I think they can already be found lurking in Google. It seems like more fun to try and remember what was going on at the time and make it an anecdotal reminiscence.

Monday, August 15, 2005

Solving storage tuning problems

I wrote a while ago about Dave Fisk's Ortera Atlas tool for storage analysis. I recently had a chance to use a beta release of Atlas on a real problem, and they are about to do a GA release, its ready for prime time.

Like most tools, it can produce masses of numbers and graphs, but compared to other storage analysis tools I've seen it goes further in three ways:

1) It collects critically important data that is not provided by the OS
2) It processes the data to tell you exactly what is wrong
3) It runs heuristics to tell you how to fix the problem

I wish more tools spent this much effort on solving the actual problem rather than making pretty graphs that only an expert would understand.

What we actually did was run the tool on a pre-production Oracle system using Veritas Filesystem and Volume Manager with Solaris on a SAN connected to a Hitachi storage array. Atlas starts off by looking at all the active processes on the system, and ignoring any that are not doing any I/O. It collects data on which files are being read or written by which process, and what the pattern and sizes are at the system call, file system and device level. You can also set the tool to focus on a set of devices, and gather information on the processes that actually talk to those devices.

Atlas immediately pointed out that two volumes had been concatenated to form a filesystem, and that 98% of the accesses were to one of the volumes. It recommended that the volumes be striped together for better overall performance.

It also pointed out that some of the I/O accesses were taking two seconds to complete at the filesystem level, but only two milliseconds at the device level. I guessed this was CPU starvation caused by fsflush running flat out on this machine which had over 50GB of RAM. Adding set autoup=600 to /etc/system and rebooting made the problem go away. We also saw this effect in the terminal window, where our typing would stop echoing for a few seconds every now and again. I've been told by Sun that the very latest patches finally fix fsflush so that it can't use a lot of CPU time, so large memory machines will finally work properly without needing this tweak.

Finally Atlas showed that the filesystem block size was set too small and Oracle was doing large reads that were being chopped into smaller reads by the filesystem layer before being sent to the device. It gave a specific recommendation for the block size that should be used. Reconfiguring the disks takes a long time to do, but we'll fix it before it goes into production.

We could have figured out the concatenation problem using iostat data, but the other two problems are normally invisible, and the topic of what filesystem block size to use can generate masses of discussion and confusion, so having "virtual Dave Fisk" tell you what blocksize to use can save a lot of time :-)

Saturday, July 23, 2005

Extracct code posted on orcaware.com

Since the code I was writing to extract useful information from extended accounting logs is intended to end up as a data source for Orca, it made sense to host it on that site. I haven't had a chance to do any more development on it for a few months, so the source, including SPARC and x86 binaries is now posted there. Thanks to Blair Zajac for the space.

extracct - extract useful data from extended Solaris accounting

If anyone feels inspired to develop it further, please send me any changes you make.

Cheers Adrian

Update Feb 2006: I just downloaded it and untar'd it with no problem, so try downloading it again if you have any problems, should be about 54K long.

Adrian

Friday, July 15, 2005

Free performance and capacity tools

What free tools do you use for performance and capacity planning work? I'm particularly interested in figuring out combinations of tools that work well together.

Mario Jauvin and I are signed up to present a workshop this december at CMG05 in Florida on performance monitoring and capacity planning with free tools. I know a lot about some of the tools, but it would be useful to get some input on the relative usage level of what is available.

Here is a list of the kind of tools I'm interested in:

Orca, www.orcaware.com data collection using SE toolkit etc. and web display using rrdtool
BigBrother www.bb4.org and BigSister, bigsister.graeff.com which do you use and why?
OpenNMS, opennms.org/wiki Large scale network management system
Nagios, www.nagios.org/about Simpler and smaller scale network element management
SE Toolkit, www.sunfreeware.com/setoolkit.html Solaris specific instrumentation and tools
R, www.r-project.org statistical analysis and graphing package
MySQL, www.mysql.com database for storing large amounts of performance data
PDQ, www.perfdynamics.com toolkit for building analytical queueing models in C or Perl
Ganglia, ganglia.sourceforge.net large scale distributed system monitoring

Does the license model matter? There is a wide variety of licenses for free tools, from an absence of any license, through GPL to free use of commercially owned products. Do you use anything that is free, or does it have to meet some criteria?

I'm inviting a lot more input than I have before, add comments to this post or send me messages, I'll summarize, and I'll describe the tools in more detail and how they inter-relate in future postings.

Cheers Adrian

Wednesday, June 15, 2005

Playing with Fire

What would you do if your house burned down? Build another one, but that takes a while, so next time it catches fire you call the fire brigade. They get there in time to save some stuff, but you still have to rebuild it.

Since fires seem to happen rather often (bear with me on this, I'll get to my point eventually) you hire your own fireman. Thats better, you are putting out fires early enough now that the house remains habitable more often.

But the fireman is only there during working hours, so you hire a team of firemen to give you 24x7 coverage, and tidy up the burned patches that appear every few days.

However your spouse is not happy with the mess, the disruption and the cost of a team of firemen, and a nice salesman comes by and sells you a set of smoke detectors, alarm bells and water sprinklers, and you hire a cleanup crew with mops.

Its still messy and inconvenient, and everything is slightly soggy so you ask around. You find that some people are having fires much less often than you do, and they think its because they had their house checked over by a building standards inspector.

The inspector looked around your house and predicted the next few things that would start a fire - so you could fix them in advance. The inspector also advised you on how to build a new house that wouldn't catch fire so easily in the first place.

So my point is that if you are running a datacenter things will go wrong, and from a performance and capacity perspective rapid growth rates, fast changing applications and sudden changes in user activity levels can all put you into a fire fighting situation.
You can develop a fire fighting mentality, or a fire prevention mentality. One problem is that fire fighters are heroes, but when did you last hear the story of a heroic building inspector saving thousands of lives? Few people are fans of building inspectors, but they are mandated by local government to keep people safe.

The recent upsurge of interest in the Information Technology Infrastructure Library (ITIL) seems to be a response to the auditing requirements of Sarbanes-Oxley (SOX). ITIL specifies lots of best practices for capacity planning amongst other things, as my co-author Bill Walker described in our Sun Blueprint on Capacity Planning Book. So it may be SOX that gets you to look at ITIL, and helps you justify your building inspector status as the hero that helped pass a SOX audit perhaps? OK, its a stretch, but perhaps it helps get beyond the fire fighting mentality towards predictive capacity modelling.

So how well are you doing? One way to find out is to rate yourself on a maturity model, and there is a good paper on this in the current issue (3.06) of the CMG newsletter Measure-IT.

Friday, May 20, 2005

Performance monitoring with Hyperthreading

Hyperthreading is used by most recent Intel servers, Intel describe it here.

I tried searching for information on the performance impact and performance monitoring impact and found several studies that describe improvements in terms of throughput that ranged from negative impact (a slowdown) to speedups of 15-20% per CPU. I didn't see much discussion of the effects on observability or of effects on response time, so thats what I'm going to concentrate on writing about.

First, to summarize the information that I could find, the early measurements of performance included most of the negative impact cases. This is due to two separate effects, software and hardware improvements mean that the latest operating systems on the most recent Intel Pentium 4 and Xeon chipsets have a larger benefit and minimise any negative effects.

From a software point of view, Intel advises that Hyperthreading should be disabled for anything other than Windows XP and recent releases of RedHat and SuSe Linux. Older operating systems are naiive about Hyperthreading and while they run, they don't schedule jobs optimally. I would add that older Solaris x86 releases are also unaware of Hyperthreading and Solaris 9 and 10 for x86 include optimized support. I have also been told that Linux 2.6 includes optimizations for Hyperthreading that have been backported to recent patches for Linux 2.4.

From a hardware point of view, the benefit of Hyperthreading increases as CPU clock rates and pipeline lengths increase. Longer waits for pipeline stalls due to memory references and branch mispredictions create larger "pipeline bubbles" for the second Hyperthread to run in. It is possible that there may also be improvements in the implementation of Hyperthreading at the silicon level as Intel learns from early experiences and improves its designs.

The fundamental problem with monitoring a Hyperthreaded system is that one of the most basic and common assumptions made by capacity planners is invalid. For many years the CPU utilization of a normal system has been used as a primary capacity metric since it is assumed to be linearly related to the capacity and throughput of a normal system that has not saturated. In other words, if you keep the CPU usage below 80-90% busy, you expect that the throughput at 60% busy is about twice the throughput at 30% busy for a constant workload.

Hyperthreaded CPUs are non-linear, in other words they do not "scale" properly. A typical two CPU Hyperthreaded system behaves a bit like two normal fast CPUs with two very slow CPUs which are only active when the system is busy. The OS will normally report it as a four CPU system.

Whereas the service time (CPU used) for an average transaction remains constant on a normal CPU as the load increases, the service time for a Hyperthreaded CPU increases as the load increases.

If all you are looking at is the CPU %busy reported by the OS, you will see a Hyperthreaded system perform reasonably well up to 50% busy then as a small amount of additional load is added the CPU usage shoots up and the machine saturates.

To be clear, the Hyperthreads do normally increase the peak throughput of the system, they often provide a performance benefit, but they make it extremely difficult to manage the capacity of a system.

I've been looking at calibration methods and models that can deal with the capacity measurement problem, and will discuss them in future posts.

Thursday, May 12, 2005

Seeing storage problems

Change of topic...

Those of you who have dealt with Sun storage performance over the years might have heard of Dave Fisk, we became good friends when I worked together with him in the same group many years ago, then he went to work for the Sun storage business unit, and finally he left Sun to build a storage performance analysis product. After a lot of work over several years he has finally got his product ready, and it cracks open the storage stack so you can see what is really happening.

Dave's company is called ORtera and they have a downloadable beta release available which you can get a free one month evaluation license for.

If you ever wondered why the I/O traffic your application (e.g. Oracle) is generating don't seem to match up with the stats you see in iostat, then this tool will show you why. For example, small filesystem writes can cause larger read-modify-write operations to occur and the tool will show you this, point it out with a heuristic and suggest a filesystem tuning fix.

Its a useful addition to the toolbox, I think they should have called it "virtual Dave" :-)

Wednesday, May 04, 2005

extraccting on SPARC

I made it with no problems on SPARC Solaris 10 (using /usr/sfw/bin/gcc) and it seems to be working fine. The binary doesn't work on Solaris 8 because its looking for a later version SUNW_1.2 of libexacct.so.1 and I don't have a compiler on this Solaris 8 machine, so thats enough testing for today...

extracct 1.0

I cleaned up the build directory and renamed the program extracct - extract useful info from exacct - and its been running fine on Solaris 10/x64 for a while. If anyone wants to try it out let me know and I'll email a tarfile, I haven't yet got around to testing it on SPARC. I built it on Opteron using gmake/gcc from /usr/sfw/bin.

I'm also wondering how best to make it more generally available, perhaps via orcaware.com or sunfreeware.com, or I could post it to BigAdmin or the OpenSolaris source code as an updated demo. Any opinions out there?

Thursday, April 28, 2005

Pesky style sheet format.....

Sorry that some of the lines are wrapped and others run off the edge, I'm still figuring out how to make the composition tools and template do what I want, rather than what they want....

Automated rotating exdump

I've added an option that automates the processing of exacct data to text file format and rotates the logs, it is hard wired to create a new log file in /var/adm/exacct (which is an empty directory created during Solaris install as standard) and it includes the hostname and the date+time in the new file.

The previous log name is used as the base of the output file, which is written to a specified directory with a .txt extension.

The file /etc/acctadm.conf is maintained by the acctadm command, I read the log file names from it, and if there is no log file I don't start one. i.e. you need to manually start accounting with acctadm the first time to decide which logs you want to generate. The command syntax now looks like this:

# ./exdump
Usage: exdump [-vwr] [ file | -a dir ]

-v: verbose
-w: wracct all processes first
-r: rotate logs
-a dir: use acctadm.conf to get input logs, and write output files to dir

The usual way to run the command will be from cron as shown

0 * * * * /opt/exdump/exdump -war /var/tmp/exacct > /dev/null 2>&1
2 * * * * /bin/find /var/adm/exacct -ctime +7 -exec rm {} \;

This also shows how to clean up old log files, I only delete the binary files in this example, and I created /var/tmp/exacct to hold the text files. The process data in the text file looks like this:

timestamp locltime duration procid ppid uid usr sys majf rwKB vcxK icxK sigK sycK arMB mrMB command
1114734370 17:26:10 0.0027 16527 16526 0 0.000 0.002 0 0.53 0.00 0.00 0.00 0.1 0.7 28.9 acctadm
1114734370 17:26:10 0.0045 16526 16525 0 0.000 0.001 0 0.00 0.00 0.00 0.00 0.1 1.1 28.9 sh
1114734370 17:26:10 0.0114 16525 8020 0 0.001 0.005 0 1.71 0.00 0.00 0.00 0.3 1.0 28.9 exdump
1109786952 10:09:12 -1.0000 0 0 0 0.012 118.883 131 0.00 0.33 0.00 0.00 0.0 0.0 0.0 sched
1109786952 10:09:12 -1.0000 0 0 0 0.012 118.883 131 0.00 0.33 0.00 0.00 0.0 0.0 0.0 sched
1109786952 10:09:12 -1.0000 0 0 0 0.012 118.883 131 0.00 0.33 0.00 0.00 0.0 0.0 0.0 sched
1109786959 10:09:19 -1.0000 1 0 0 4.311 3.066 96 47504.69 49.85 0.18 0.34 456.2 0.9 1.0 init
1109786959 10:09:19 -1.0000 2 0 0 0.000 0.000 0 0.00 0.00 0.00 0.00 0.0 0.0 0.0 pageout

Since the -w option forced long running processes to write an accounting record they have not finished, and I chose to mark this by showing the duration as -1. One odd thing shown here is that the sched process generated three identical accounting records, this needs some more investigation.

I'm going to run this for a while, and install it on a few other machines including SPARC Solaris 10 and Solaris 8 to make sure its robust.

Monday, April 18, 2005

Flow accounting printout

I went to edit the code and realized that I hadn't finished some previous work, so I added a printout of the flow accounting data, an example is shown below.

The timestamp is the start of the flow, I translate to local timezone and show that as well. I difference the two timestamps to get the duration of the flow in seconds; add the uid and project id (which are only present for outbound flows); show the protocol; decode the IP addresses and look them up as names if possible, along with their port numbers; list the number of packets and bytes in the flow; and end with the filter name from the ipqos definition.

timestamp  locltime dur   uid  proj prot           srcip:port            dstip:port  npkts  nbytes aname
1113850483 11:54:43  13     0     0  tcp   66.102.15.100:80               crun:55783    21    4148  acct
1113850483 11:54:43  13   100    10  tcp            crun:55783   66.102.15.100:80       13   15082  acct
1113850483 11:54:43  13     0     0  tcp            crun:55783   66.102.15.100:80        6     240  acct
1113850480 11:54:40  16   100    10  tcp            crun:55782   66.102.15.101:80        3    1253  acct
1113850480 11:54:40  16     0     0  tcp            crun:55782   66.102.15.101:80        7     280  acct

Thursday, April 14, 2005

Sequencing exacct logs

I decided to simplify the next step and to just get raw data logged in a useful manner. To do this there needs to be a periodic process that generates readable log files. After thinking about this for a while, I think I have a simple and effective way to do it.

I will modify the exdump code to add another option "-a". This will invoke acctadm to switch to a new set of log files, with a datestamped name. Before it does the switch, it will need to cause all current processes to write accounting entries, so "-a" will normally be used in conjunction with the "-w" option I added already. After the log switch, the old log files will be processed from their binary form to a text file with one record per line, ready for consumption by futher processing steps.
When "-a" is specified, a directory can be specified on the command line to hold output files, the input filename does not need to be specified as it was for the original version. This makes it easy to invoke exdump directly from cron without needing a wrapper script.

The sequence is:


Obtain existing log files by running acctadm or reading /etc/acctadm.conf
Note which logs are being generated and their filenames
Run wracct on all existing processes
Run acctadm to switch to new log files with date-time-stamped names
Process existing log files to text form in the current directory or a 
   specified directory using same base filenames

To start exacct the user needs to manually run acctadm once to enable the logs of interest (flow, task or proc) then exdump will generate the output logs at the cron interval. A find command can be used from cron to remove old logs and keep disk space managed.

Since many commands run on the hour, it might be worth starting exdump a few seconds early, and have the file timestamp round up to the next minute so that any command sequences started from cron on the hour are less likely to get split up.

This still leaves the problem of processing individual accounting records into useful data, but at least the one line per record text form with interval data will be more tractable than the binary file.

Friday, April 01, 2005

Writing accounting records at time intervals

A major new feature of the exacct system is the ability to get an accounting record logged without terminating the process. There are two forms of this, for tasks you can get the record to dump the delta since the last record was logged. Somehow the task remembers the data each time it cuts a record so it can do the differencing. This seems to be too much overhead at the process level, so the other option is to cut a record that logs the same data as if the process had just exited, and this option is available for both tasks and processes.

The command that causes a record to be written is "wracct" and it takes a list of process or task id's and makes a system call to cause the record to be written. You have to be root to do this. The wracct command line syntax is a pain if you want to get it to dump multiple processes, as shown in this example from the manpage:


# /usr/sbin/wracct -i "`pgrep sendmail`" process

I want to make every process cut a record, and if you attempt to do this with wracct you need to form a list of every process id on the system. I tried to do this by listing all the entries in the /proc filesystem, but if any of the pid's exit before wracct gets to them it gets an error and quits. This is stupid, because if a process has exited, it has already cut an accounting record! The wracct command should have a "-a" option that writes all records and ignores errors.

I modified the exdump command to have a "-w" option that loops over all processes and forces an accounting record to be written before it reads the accounting file. If you aren't root, it has no effect. The code looks like this:


 if (wflag) {
  DIR *dirp;
  struct dirent *dp;

  dirp = opendir("/proc");
              do {
                 if ((dp = readdir(dirp)) != NULL)
     (void) wracct(P_PID, atoi(dp->d_name), EW_PARTIAL);
              } while (dp != NULL);
               (void) closedir(dirp);
                
 }

The next step is a bit more complex. Both for flow accounting and interval process accounting, I need code that remembers previous records and matches them with subsequent ones so that they can be accumulated (flows) or differenced (process intervals).

Thursday, March 24, 2005

Extracting useful exacct process data

I modified the /usr/demo/libexacct/exdump.c example code to include the data structures described in previous posts, and made the code update each item in the structure. Then I added a printf from the data structure so I could check that all the data is being captured correctly. Part of the output is shown below. I hand edited the column headers a bit for the first two lines to make them line up.


procid ppid uid  usr  sys time      majf minf rwKB     vcxK   icxK sigK sycK  arMB mrMB command
1693  1679 100 40.23 3.53 416930.11 237    0 58745.32 261.17 4.22 0.00 801.7 38.4 44.6 mozilla-bin
procid ppid uid  usr  sys time      majf minf rwKB    vcxK icxK sigK sycK arMB mrMB command
1679  1647 100 0.00 0.00 416930.16    0    0 15.11   0.02 0.00 0.00  0.6  0.6 26.4 run-mozilla.sh
procid ppid uid  usr  sys time      majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1647  1646 100 0.00 0.01 416930.30    0    0 8.29 0.03 0.00 0.00  0.8  0.4 26.4 mozilla

Running the modified exdump with the -v option shows the data objects in detail so I can check that the right data is appearing in the right place. Here is the same data as the second line above. It looks as if the minor faults counter really is zero, which is probably a Solaris bug.


   100  group-proc                      [group of 35 object(s)]
 1000   pid                            1679
 1001   uid                            100
 1002   gid                            10             staff
 1004   projid                         10             group.staff
 1003   taskid                         54
 100b   cpu-user-sec                   0
 100c   cpu-user-nsec                  2861447
 100d   cpu-sys-sec                    0
 100e   cpu-sys-nsec                   4150875
 1007   start-sec                      1109810711     03/02/05 16:45:11
 1008   start-nsec                     813223175
 1009   finish-sec                     1110227641     03/07/05 12:34:01
 100a   finish-nsec                    968607249
 1006   command                        "run-mozilla.sh"
 100f   tty-major                      4294967295
 1010   tty-minor                      4294967295
 1005   hostname                       "crun"
 1011   faults-major                   0
 1012   faults-minor                   0
 1014   msgs-snd                       0
 1013   msgs-recv                      0
 1015   blocks-in                      0
 1016   blocks-out                     0
 1017   chars-rdwr                     15105
 1018   ctxt-vol                       21
 1019   ctxt-inv                       0
 101a   signals                        0
 101b   swaps                          0
 101c   syscalls                       586
 101d   acctflags                      0
 101f   ppid                           1647
 1020   wait-status                    0              exit
 1021   zone                           "global"
 1022   memory-rss-avg-k               584
 1023   memory-rss-max-k               27004
procid ppid uid  usr  sys time      majf minf rwKB  vcxK icxK sigK sycK arMB mrMB command
1679  1647 100 0.00 0.00 416930.16    0    0 15.11 0.02 0.00 0.00  0.6  0.6 26.4 run-mozilla.sh

Wednesday, March 16, 2005

An exercise in complexity....

Time for a grumble....

My plan was to take the libexacct.so API and expose it as an SE toolkit class. After looking at what it would take to do this I have come to the conclusion that the data structure definitions and API for reading the data are too complex.

The design is so abstract that it seems that reading meaningful data out of the log file is some obscure side effect of the code. You can read the data, but there is no guarantee that any specific item of data will be present. The accounting system has various options to send more or less data to the file, so it needs to be flexible, but the important thing is the meaning of the data being logged. I care about the semantic and informational content of the data source. What I get from exacct is "there are some tagged typed objects in this file". I can't consume the data without making assumptions about it, and the API doesn't embody those assumptions.

Some of the data being reported is useless (blocks in and blocks out are archaic measures that are always zero) and other stuff is missing - like the good microstate information on CPU wait, page-in wait etc.

I'm going to take the exdump.c code and turn it into a library that exports a sane and simple set of abstractions (like the data structures I defined in earlier posts).

Archive