Thursday, October 23, 2008

What exactly is a load average?

If you’ve spent some time on a Unix or Unix-like machine (e.g., Linux, OS X, Solaris, etc.) then you’re probably at least vaguely familiar with the concept of a load average. A system’s load average can be easily determined from the Unix shell by running the uptime command:

mmalone@www:~$ uptime
15:37:38 up 133 days, 3:37, 3 users,
load average: 0.37, 0.37, 0.41
The load average is also displayed by the w and top commands, and by pretty much every system monitoring package on the planet. But what the heck is a load average, exactly?
To most people, a load average is some mysterious number that is somehow related to the amount of work that their computer is currently handling. But what is a good load average, and how high is too high? The answer is actually quite simple. But first you have to understand what the load average is actually measuring.
Without getting into the vagaries of every Unix-like operating system in existence, the load average more or less represents the average number of processes that are in the running (using the CPU) or runnable (waiting for the CPU) states. One notable exception exists: Linux includes processes in uninterruptible sleep states, typically waiting for some I/O activity to complete. This can markedly increase the load average on Linux systems.
The load average is calculated as an exponential moving average of the load number (the number of processes that are running or runnable). The three numbers returned as the system’s load average represent the one, five, and fifteen minute moving load average of the system.
So, for a single processor machine a load average of 1 means that, on average, there is always a process in the running or runnable state. Thus, the CPU is being utilized 100% of the time and is at capacity. If you tried to run another process, it would have to wait in the run queue before being executed. For multiprocessor systems, however, the system isn’t CPU bound until the load average equals the number of processors (or cores, for multi-core processors) in the machine. My database server, for example, has two dual core processors. Thus, the system isn’t fully utilized until the load average reaches 4.
In summary, the load average is a moving average of the number of processes in the running or runnable states. You shouldn’t be worried about your system’s load unless it is consistently higher than the number of processors (or cores) in your machine. In general, you can calculate a system’s CPU utilization by dividing the load average by the number of processors/cores in the system.
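As a rough sketch of that calculation, here is a short Python example. The sample line and the core count of 4 are hypothetical; on a real Linux box the same three fields can be read from /proc/loadavg:

```python
def parse_loadavg(text):
    """Parse the first three fields of a /proc/loadavg-style line
    into the (1, 5, 15)-minute load averages."""
    one, five, fifteen = (float(f) for f in text.split()[:3])
    return one, five, fifteen

def utilization(load, ncpus):
    """Approximate CPU utilization as load average divided by core count."""
    return load / ncpus

# Hypothetical captured sample line (same format as /proc/loadavg):
sample = "0.37 0.37 0.41 1/123 4567"
one, five, fifteen = parse_loadavg(sample)
print(utilization(one, ncpus=4))  # 0.0925, i.e. well under capacity
```

A value near or above 1.0 from this division would indicate the run queue is at or beyond CPU capacity.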

================================================================

1 UNIX Commands
Actually, load average is not a UNIX command in the conventional sense. Rather, it's an embedded metric that appears in the output of other UNIX commands like uptime and procinfo. These commands are commonly used by UNIX sysadmins to observe system resource consumption. Let's look at some of them in more detail.
1.1 Classic Output
The generic ASCII textual format appears in a variety of UNIX shell commands. Here are some common examples.
uptime
The uptime shell command produces the following output:
[pax:~]% uptime
9:40am up 9 days, 10:36, 4 users, load average: 0.02, 0.01, 0.00
It shows the time since the system was last booted, the number of active user processes and something called the load average.
procinfo
On Linux systems, the procinfo command produces the following output:
[pax:~]% procinfo
Linux 2.0.36 (root@pax) (gcc 2.7.2.3) #1 Wed Jul 25 21:40:16 EST 2001 [pax]
Memory: Total Used Free Shared Buffers Cached
Mem: 95564 90252 5312 31412 33104 26412
Swap: 68508 0 68508
Bootup: Sun Jul 21 15:21:15 2002 Load average: 0.15 0.03 0.01 2/58 8557
...
The load average appears in the lower left corner of this output.
w
The w(ho) command produces the following output:
[pax:~]% w
9:40am up 9 days, 10:35, 4 users, load average: 0.02, 0.01, 0.00
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
mir ttyp0 :0.0 Fri10pm 3days 0.09s 0.09s bash
neil ttyp2 12-35-86-1.ea.co 9:40am 0.00s 0.29s 0.15s w
...
Notice that the first line of the output is identical to the output of the uptime command.
top
The top command is a more recent addition to the UNIX command set that ranks processes according to the amount of CPU time they consume. It produces the following output:
4:09am up 12:48, 1 user, load average: 0.02, 0.27, 0.17
58 processes: 57 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: 0.5% user, 0.9% system, 0.0% nice, 98.5% idle
Mem: 95564K av, 78704K used, 16860K free, 32836K shrd, 40132K buff
Swap: 68508K av, 0K used, 68508K free 14508K cached
PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
5909 neil 13 0 720 720 552 R 0 1.5 0.7 0:01 top
1 root 0 0 396 396 328 S 0 0.0 0.4 0:02 init
2 root 0 0 0 0 0 SW 0 0.0 0.0 0:00 kflushd
3 root -12 -12 0 0 0 SW< 0 0.0 0.0 0:00 kswapd
...
In each of these commands, note that there are three numbers reported as part of the load average output. Quite commonly, these numbers show a descending order from left to right. Occasionally, however, an ascending order appears e.g., like that shown in the top output above.
1.2 GUI Output
The load average can also be displayed as a time series, like that shown here in some output from a tool called ORCA.

Figure 1: ORCA plot of the 3 daily load averages.
Although such visual aids help us to see that the green curve is spikier and has more variability than the red curve, and they allow us to see a complete day's worth of data, it's not clear how useful this is for capacity planning or performance analysis. We need to understand more about how the load average metric is defined and calculated.
2 So What Is It?
So, exactly what is this thing called load average that is reported by all these various commands? Let's look at the official UNIX documentation.
2.1 The man Page
[pax:~]% man "load average"
No manual entry for load average
Oops! There is no man page! The load average metric is an output embedded in other commands so it doesn't get its own man entry. Alright, let's look at the man page for uptime, for example, and see if we can learn more that way.
...
DESCRIPTION
uptime gives a one line display of the following informa-
tion. The current time, how long the system has been run-
ning, how many users are currently logged on, and the sys-
tem load averages for the past 1, 5, and 15 minutes.
...
So, that explains the three metrics. They are the "... load averages for the past 1, 5, and 15 minutes."
These are the GREEN, BLUE and RED curves, respectively, in Figure 1 above. Unfortunately, that still begs the question: "What is the load?"
2.2 What the Gurus Have to Say
Let's turn to some UNIX hot-shots for more enlightenment.
Tim O'Reilly and Crew
The book UNIX Power Tools [POL97] tells us on p. 726, The CPU:
The load average tries to measure the number of active processes at any time. As a measure of CPU utilization, the load average is simplistic, poorly defined, but far from useless.
That's encouraging! Anyway, it does help to explain what is being measured: the number of active processes. On p. 720, 39.07 Checking System Load: uptime, it continues ...
... High load averages usually mean that the system is being used heavily and the response time is correspondingly slow.
What's high? ... Ideally, you'd like a load average under, say, 3, ... Ultimately, 'high' means high enough so that you don't need uptime to tell you that the system is overloaded.
Hmmm ... where did that number "3" come from? And which of the three averages (1, 5, 15 minutes) are they referring to?
Adrian Cockcroft on Solaris
In Sun Performance and Tuning [Coc95], in the section on p. 97 entitled Understanding and Using the Load Average, Adrian Cockcroft states:
The load average is the sum of the run queue length and the number of jobs currently running on the CPUs. In Solaris 2.0 and 2.2 the load average did not include the running jobs but this bug was fixed in Solaris 2.3.
So, even the "big boys" at Sun can get it wrong. Nonetheless, the idea that the load average is associated with the CPU run queue is an important point.
O'Reilly et al. also note some potential gotchas with using load average ...
...different systems will behave differently under the same load average. ... running a single cpu-bound background job .... can bring response to a crawl even though the load avg remains quite low.
As I will demonstrate, this depends on when you look. If the CPU-bound process runs long enough, it will drive the load average up because it is always either running or runnable. The obscurities stem from the fact that the load average is not your average kind of average. As we alluded to in the introduction above, it's a time-dependent average. Not only that, but it's a damped time-dependent average. To find out more, let's do some controlled experiments.
3 Performance Experiments
The experiments described in this section involved running some workloads in the background on a single-CPU Linux box. The test had a duration of one hour, split into two phases: the CPU was pegged for 2,100 seconds and then the processes were killed, leaving the CPU quiescent for the remaining 1,500 seconds. A Perl script sampled the load average every 5 minutes using the uptime command. Here are the details.
3.1 Test Load
Two hot-loops were fired up as background tasks on a single-CPU Linux box. There were two phases in the test:
The CPU is pegged by these tasks for 2,100 seconds.
The CPU is (relatively) quiescent for the remaining 1,500 seconds.
The 1-minute average reaches a value of 2 around 300 seconds into the test. The 5-minute average reaches 2 around 1,200 seconds into the test and the 15-minute average would reach 2 at around 3,600 seconds but the processes are killed after 35 minutes (i.e., 2,100 seconds).
3.2 Process Sampling
As the authors of [BC01] explain about the Linux kernel, because both of our test processes are CPU-bound they will be in a TASK_RUNNING state. This means they are either:
running, i.e., currently executing on the CPU
runnable, i.e., waiting in the run_queue for the CPU
The Linux kernel also checks to see if there are any tasks in a short-term sleep state called TASK_UNINTERRUPTIBLE. If there are, they are also included in the load average sample. There were none in our test load.
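As a toy illustration of this counting rule, here is a Python sketch; the one-letter state codes follow the ps convention ('R' for running/runnable, 'D' for uninterruptible sleep), and the sample list is synthetic, not data from the test:

```python
def count_active(states):
    """Count the process states that Linux folds into a load sample:
    'R' (running or runnable) and 'D' (uninterruptible sleep)."""
    return sum(1 for s in states if s in ("R", "D"))

# One-letter state codes, e.g. as reported by `ps -eo state` (synthetic):
print(count_active(["R", "S", "D", "R", "S"]))  # 3
```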
The following source fragment reveals more details about how this is done:

/*
 * Nr of active tasks - counted in fixed-point numbers
 */
static unsigned long count_active_tasks(void)
{
    struct task_struct *p;
    unsigned long nr = 0;

    read_lock(&tasklist_lock);
    for_each_task(p) {
        if ((p->state == TASK_RUNNING ||
             (p->state & TASK_UNINTERRUPTIBLE)))
            nr += FIXED_1;
    }
    read_unlock(&tasklist_lock);
    return nr;
}
So, the count of active tasks is sampled every 5 seconds, which is the Linux kernel's intrinsic timebase for updating the load average calculations.
3.3 Test Results
The results of these experiments are plotted in Fig. 2. NOTE: These colors do not correspond to those used in ORCA plots like Figure 1.
Although the workload starts up instantaneously and is abruptly stopped at 2,100 seconds, the load average values have to catch up with the instantaneous state. The 1-minute samples track most quickly, while the 15-minute samples lag the furthest.
Figure 2: Linux load average test results.
For comparison, here's how it looks for a single hot-loop running on a single-CPU Solaris system.
Figure 3: Solaris load average test results.
You would be forgiven for jumping to the conclusion that the "load" is the same thing as the CPU utilization. As the Linux results show, when two hot processes are running, the maximum load is two (not one) on a single CPU. So, load is not equivalent to CPU utilization.
From another perspective, Fig. 2 resembles the charging and discharging of a capacitive RC circuit.
Figure 4: Charging and discharging of a capacitor.
4 Kernel Magic
Now let's go inside the Linux kernel and see what it is doing to generate these load average numbers.
unsigned long avenrun[3];

static inline void calc_load(unsigned long ticks)
{
    unsigned long active_tasks; /* fixed-point */
    static int count = LOAD_FREQ;

    count -= ticks;
    if (count < 0) {
        count += LOAD_FREQ;
        active_tasks = count_active_tasks();
        CALC_LOAD(avenrun[0], EXP_1, active_tasks);
        CALC_LOAD(avenrun[1], EXP_5, active_tasks);
        CALC_LOAD(avenrun[2], EXP_15, active_tasks);
    }
}
The countdown is over a LOAD_FREQ of 5 HZ. How often is that?
1 HZ = 100 ticks
5 HZ = 500 ticks
1 tick = 10 milliseconds
500 ticks = 5000 milliseconds (or 5 seconds)
So, 5 HZ means that CALC_LOAD is called every 5 seconds.
4.1 Magic Numbers
The function CALC_LOAD is a macro defined in sched.h:

extern unsigned long avenrun[]; /* Load averages */

#define FSHIFT    11            /* nr of bits of precision */
#define FIXED_1   (1<<FSHIFT)   /* 1.0 as fixed-point */
#define LOAD_FREQ (5*HZ)        /* 5 sec intervals */
#define EXP_1     1884          /* 1/exp(5sec/1min) as fixed-point */
#define EXP_5     2014          /* 1/exp(5sec/5min) */
#define EXP_15    2037          /* 1/exp(5sec/15min) */

#define CALC_LOAD(load,exp,n) \
    load *= exp; \
    load += n*(FIXED_1-exp); \
    load >>= FSHIFT;
A notable curiosity is the appearance of those magic numbers: 1884, 2014, 2037. What do they mean? If we look at the preamble to the code, we learn:
/*
 * These are the constant used to fake the fixed-point load-average
 * counting. Some notes:
 *  - 11 bit fractions expand to 22 bits by the multiplies: this gives
 *    a load-average precision of 10 bits integer + 11 bits fractional
 *  - if you want to count load-averages more often, you need more
 *    precision, or rounding will get you. With 2-second counting freq,
 *    the EXP_n values would be 1981, 2034 and 2043 if still using only
 *    11 bit fractions.
 */
These magic numbers are a result of using a fixed-point (rather than a floating-point) representation.
Using the 1-minute sampling as an example, the conversion of exp(5/60) into base-2 with 11 bits of precision occurs like this:

e^(5/60) → e^(5/60) × 2^11   (1)

But EXP_M represents the inverse function, exp(-5/60). Therefore, we can calculate these magic numbers directly from the formula:

EXP_M = 2^11 / 2^(5 log2(e) / 60M)   (2)

where M = 1 for 1-minute sampling. Table 1 summarizes some relevant results.

T        EXP_T      Rounded
5/60     1884.25    1884
5/300    2014.15    2014
5/900    2036.65    2037
2/60     1980.86    1981
2/300    2034.39    2034
2/900    2043.45    2043

Table 1: Load average magic numbers.
These numbers are in complete agreement with those mentioned in the kernel comments above. The fixed-point representation is used presumably for efficiency reasons since these calculations are performed in kernel space rather than user space.
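Equation (2) can also be checked numerically. The following Python sketch (the function name is my own) reproduces both the 5-second constants and the hypothetical 2-second constants from the kernel comment:

```python
import math

FSHIFT = 11

def magic(sample_sec, window_sec):
    """Fixed-point decay constant: exp(-sample/window) scaled by 2^FSHIFT."""
    return round(2 ** FSHIFT * math.exp(-sample_sec / window_sec))

print([magic(5, w) for w in (60, 300, 900)])  # [1884, 2014, 2037]
print([magic(2, w) for w in (60, 300, 900)])  # [1981, 2034, 2043]
```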
One question still remains, however. Where do the ratios like exp(5/60) come from?
4.2 Magic Revealed
Taking the 1-minute average as the example, CALC_LOAD is identical to the mathematical expression:

load(t) = load(t-1) e^(-5/60) + n (1 - e^(-5/60))   (3)

If we consider the case n = 0, eqn. (3) becomes simply:

load(t) = load(t-1) e^(-5/60)   (4)

If we iterate eqn. (4) between t = t0 and t = T, we get:

load(tT) = load(t0) e^(-5t/60)   (5)

which is pure exponential decay, just as we see in Fig. 2 for times between t0 = 2100 and tT = 3600. Conversely, when n = 2, as it was in our experiments, the load average is dominated by the second term, such that:

load(tT) = 2 load(t0) (1 - e^(-5t/60))   (6)

which is a monotonically increasing function, just like that in Fig. 2 between t0 = 0 and tT = 2100.
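A floating-point simulation of these equations reproduces the shape of Fig. 2. This Python sketch (with step counts chosen to match the 2,100 s and 1,500 s phases) shows the rise toward 2 while n = 2 and the exponential decay back toward 0 after the hot-loops are killed:

```python
import math

DECAY = math.exp(-5.0 / 60.0)      # per 5-second step, 1-minute average

def step(load, n):
    """One floating-point update of eqn. (3) for the 1-minute average."""
    return load * DECAY + n * (1.0 - DECAY)

load = 0.0
for _ in range(420):               # 2100 s with two pegged hot-loops (n = 2)
    load = step(load, 2)
peak = load                        # rises toward 2
for _ in range(300):               # 1500 s of idle (n = 0): pure decay
    load = step(load, 0)
print(round(peak, 6), round(load, 6))  # 2.0 0.0
```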
5 Summary
So, what have we learned? Those three innocuous-looking numbers in the LA triplet have a surprising amount of depth behind them.
The triplet is intended to provide you with some kind of information about how much work has been done on the system in the recent past (1 minute), the past (5 minutes) and the distant past (15 minutes).
As you will have discovered if you tried the LA Triplets quiz, there are problems:
The "load" is not the utilization but the total queue length.
They are point samples of three different time series.
They are exponentially-damped moving averages.
They are in the wrong order to represent trend information.
These inherited limitations are significant if you try to use them for capacity planning purposes. I'll have more to say about all this in the next online column Load Average Part II: Not Your Average Average.

1 Recap of Part 1
This is the second in a two-part series where I explore the use of averages in performance analysis and capacity planning. There are many manifestations of averages, e.g., the arithmetic average (the usual one), the moving average (often used in financial planning), the geometric average (used in the SPEC CPU benchmarks), and the harmonic average (not used enough), just to name a few.
In Part 1, I described some simple experiments that revealed how the load averages (the LA Triplets) are calculated in the UNIX® kernel (well, the Linux kernel anyway since that source code is available online). We discovered a C-macro called CALC_LOAD that does all the work. Taking the 1-minute average as the example, CALC_LOAD is identical to the mathematical expression:
load(t) = load(t-1) e^(-5/60) + n (1 - e^(-5/60))   (1)
which corresponds to an exponentially-damped moving average. It says that your current load is equal to the load you had last time (decayed by an exponential factor appropriate for 1-minute reporting) plus the number of currently active processes (weighted by an exponentially increasing factor appropriate for 1-minute reporting). The only difference between the 1-minute load average shown here and the 5- and 15-minute load averages is the value of the exponential factors; the magic numbers I discussed in Part 1.
Another point I made in Part 1 was that we, as performance analysts, would be better off if the LA Triplets were reported in the reverse order: 15, 5, 1, because that ordering concurs with the usual convention that temporal order flows left to right. In this way it would be easier to read the LA Triplets as a trend (which was part of the original intent, I suspect). Trending information could be enhanced even further by representing the LA Triplets using animation (of the type I showed in the Quiz).
Here, in Part 2, I'll compare the UNIX load averaging approach with other averaging methods as they apply to capacity planning and performance analysis.
2 Exponential Smoothing
Exponential smoothing (also called filtering by electrical engineering types) is a general purpose way of prepping highly variable data before further analysis. Filters of this type are available in most data analysis tools such as: EXCEL, Mathematica, and Minitab.
The smoothing equation is an iterative function that has the general form:
Y(t) = Y(t-1) + a [X(t) - Y(t-1)]   (2)

where X(t) is the raw input data, Y(t-1) is the value due to the previous smoothing iteration, Y(t) is the new smoothed value, and a is the smoothing constant. If it looks a little incestuous, it's supposed to be.
2.1 Smooth Loads
Expressing the UNIX load average method (see equation (1)) in the same format produces:

load(t) = load(t-1) + (1 - EXP_R) [n(t) - load(t-1)]   (3)
Eqn. (3) is equivalent to (2) if we choose EXP_R = 1 - a. The constant a is called the smoothing constant and can range between 0.0 and 1.0 (in other words, you can think of it as a percentage). EXCEL uses the terminology damping factor for the quantity (1 - a).
The value of a determines the percentage by which the current smoothing iteration should compensate for changes in the data that produced the previous smoothing iteration. Larger values of a yield a more rapid response to changes in the data but produce coarser rather than smoother resultant curves (less damped). Conversely, smaller values of a produce much smoother curves but take much longer to compensate for fluctuations in the data (more damped). So, what value of a should be used?
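The effect of the smoothing constant can be demonstrated with a short Python sketch; the step-change data set here is made up purely for illustration:

```python
def smooth(raw, a):
    """Exponential smoothing: Y(t) = Y(t-1) + a * (X(t) - Y(t-1))."""
    y = raw[0]
    out = [y]
    for x in raw[1:]:
        y += a * (x - y)
        out.append(y)
    return out

step_data = [0.0] * 5 + [1.0] * 15   # a step change in the raw data
fast = smooth(step_data, a=0.5)      # lightly damped: tracks the step quickly
slow = smooth(step_data, a=0.08)     # heavily damped, like the 1-minute EXP_1
print(round(fast[-1], 3), round(slow[-1], 3))  # 1.0 0.714
```

After fifteen samples of the new level, the lightly damped series has essentially converged while the heavily damped one is still roughly 29% short of it.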
2.2 Critical Damping
The EXCEL documentation suggests that 0.20 to 0.30 are ``reasonable'' values to choose for a. This is a patently misleading statement because it does not take into account how much variation in the data (e.g., error) you are prepared to tolerate.
From the analysis in Section 1 we can now see that EXP_R plays the role of a damping factor in the UNIX load average. The UNIX load average is therefore equivalent to an exponentially-damped moving average. The more usual moving average (of the type often used by financial analysts) is just a simple arithmetic average computed over some number of data points.
The following Table 1 shows the respective smoothing and damping factors that are based on the magic numbers described in Part 1.

LA Factor    Damping (1 - aR)    Correction (aR)
EXP_1        0.9200 (≈ 92%)      0.0800 (≈ 8%)
EXP_5        0.9835 (≈ 98%)      0.0165 (≈ 2%)
EXP_15       0.9945 (≈ 99%)      0.0055 (≈ 1%)

Table 1: UNIX load average damping factors.
The value of aR is calculated from 1 - exp(-5/(60R)), where R = 1, 5 or 15. From Table 1 we see that the bigger the correction for variation in the data (i.e., aR), the more responsive the result is to those variations, and therefore we see less damping (1 - aR) in the output.
This is why the 1-minute reports respond more quickly to changes in load than do the 15-minute reports. Note also that the largest correction for the UNIX load average is about 8%, for the 1-minute report, and is nowhere near the 20% or 30% suggested by EXCEL.
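These damping and correction factors can be recomputed directly from the exponential form, rather than from the fixed-point constants, as this Python sketch shows:

```python
import math

for minutes in (1, 5, 15):
    # Damping factor 1 - a_R is exp(-5 sec / R min); correction is a_R.
    damping = math.exp(-5.0 / (60.0 * minutes))
    correction = 1.0 - damping
    print(f"EXP_{minutes}: damping {damping:.4f}, correction {correction:.4f}")
```

The three printed damping values match the EXP_R column of Table 1 (0.9200, 0.9835, 0.9945).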
3 Other Averages
Next, we compare these time-dependent smoothed averages with some of the more familiar forms of averaging used in performance analysis and capacity planning.
3.1 Steady-State Averages
The most commonly used average in capacity planning, benchmarking and other kinds of performance modeling is the steady-state average.
Figure 1: Load averages represented as a time series.
In terms of the UNIX load average, this would correspond to observing the reported loads over a sufficiently long time (T) much as shown in Fig. 1.
Note that sysadmins almost never use the load average metrics in this way. Part of the reason for that avoidance lies in the fact that the LA metrics are embedded inside other commands (which vary across UNIX platforms) and need to be extracted. TeamQuest View is an excellent example of the way in which such classic limitations in traditional UNIX performance tools have been partially circumvented.
3.2 Example Application
To determine the steady-state average for the above time series, we would first need to break up the area under the plot into a set of uniform columns of equal width.
The width of each column corresponds to the uniform time step Δt.
The height of each column corresponds to Q(Δt), the instantaneous queue length.
The area of each column is given by Q(Δt) * Δt (width * height).
The total area under the curve is Σ Q(Δt) * Δt.
The time-averaged queue length Q (the steady-state value) is then approximated by the fraction:

Q = [ Σ Q(Δt) * Δt ] / T
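This fraction is easy to compute. The following Python sketch uses made-up 5-second queue-length samples, not measured data:

```python
def steady_state_q(samples, dt):
    """Time-averaged queue length: sum(Q(dt) * dt) / T for uniform steps."""
    total_time = len(samples) * dt
    return sum(q * dt for q in samples) / total_time

# Instantaneous queue lengths sampled every 5 seconds (synthetic values):
q_samples = [0, 1, 2, 2, 1, 0, 0, 3, 2, 1]
print(steady_state_q(q_samples, dt=5))  # 1.2
```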
The longer the observation period, the more accurate the steady-state value. Fig. 2 makes this idea more explicit. It shows a time period where six requests become enqueued (the black curve representing approximate columns).
Figure 2: Toy model with exponential smoothing for the 1-minute load average.
Superimposed over the top is the curve that corresponds to the 1-minute load average.

Figure 3: All three load average curves.
Fig. 3 shows all three load average metrics superimposed as well as the 5-second sample average.
3.3 Little's Law
Consider the UNIX timeshare scheduler.

Figure 4: Simple model of the UNIX scheduler.
The schematic in Fig. 4 depicts the scheduler states according to the usual UNIX conventions:
N: processes in the system
R: running or runnable processes
D: uninterruptible processes
S: processes in a sleep state
Applying steady-state averages of the type defined in Section 3.1 to other well-known performance metrics, such as:
Z: average time spent in sleep state
X: average process completion rate
S: average execution quantum (in CPU ticks)
W: total time spent in both running and runnable states
allows us to express some very powerful relationships between them and Q (the steady-state queue length). One such relationship is Little's Law:
Q = X W

which relates the average queue length (Q) to the average throughput (X) and the time (W):

W = N/X - Z

In some sense, Q is the average of the load average. These same kinds of averages are used in performance analyzer tools like Pretty Damn Quick and TeamQuest Model. Note that such insightful relationships are virtually impossible to recognize without taking steady-state averages. Little's law is a case in point. It had existed as a piece of performance folklore many years prior to 1961, when J. D. Little published his now famous proof of the relationship.
4 Summary
So, what have we learnt from all this? Those three little numbers tucked away innocently in certain UNIX commands are not so trivial after all. The first point is that load in this context refers to run-queue length (i.e., the sum of the number of processes waiting in the run-queue plus the number currently executing). Therefore, the number is absolute (not relative) and thus it can be unbounded; unlike utilization (AKA ``load'' in queueing theory parlance).
Moreover, they have to be calculated in the kernel and therefore they must be calculated efficiently. Hence the use of fixed-point arithmetic, and that gives rise to those very strange looking constants in the kernel code. At the end of Part 1 I showed you that the magic numbers are really just exponential decay and rise constants expressed in fixed-point notation.
In Part 2 we found out that these constants are actually there to provide exponential smoothing of the raw instantaneous load values. More formally, the UNIX load average is an exponentially smoothed moving average function. In this way sudden changes can be damped so that they don't contribute significantly to the longer term picture. Finally, we compared the exponentially damped average with the more common type of averages that appear as metrics in benchmarks and performance models.
On average, the UNIX load average metrics are certainly not your average average.
