If you want to know the total amount of resources consumed, per-process accounting records can be kept. This facility is not normally turned on. The records are gathered, then processed after the fact. The impact of the data-gathering procedure itself on the system is minimal, and this is important: you want to *observe* the behavior of the system while *affecting* that behavior as little as possible in the act of doing so. So you run the system, gather the results to a file, and then process them using any suitable statistics package.

One of the first and most important tasks in getting *meaningful* results from your study will be to appropriately __classify__ the various types of workload that you observe. You must separate out the various types of jobs and consider each group's behavior separately from that of all the other groups; otherwise the outcome will be murky and no usable results will be obtained. All "processes" are not created equal. And even though all the members of "type #1 activities" (whatever they are) are also not created equal, *the members of each group* share common characteristics that allow them to be compared *with each other.* The proportionate size of each group, at any particular point in time, can also be measured ... and it may be necessary to slice the data further so that you only consider periods in time when a certain proportion was found to exist.
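The classification step can be sketched in code. This is a minimal illustration, not a prescription: the record fields (`command`, `cpu_seconds`, `io_blocks`) and the thresholds are invented for the example, since real accounting records differ from system to system.

```python
# Hypothetical sketch: sorting per-process accounting records into
# workload groups. Field names and thresholds are invented examples.

def classify(record):
    """Assign one accounting record to a workload group."""
    cpu = record["cpu_seconds"]
    io = record["io_blocks"]
    if record["command"].startswith("httpd"):
        return "web"                      # known server processes
    if io > 0 and cpu / io < 0.001:
        return "io-bound-batch"           # almost all I/O, little CPU
    if cpu > 60:
        return "cpu-bound-batch"          # long CPU burners
    return "interactive"

records = [
    {"command": "httpd", "cpu_seconds": 0.2,  "io_blocks": 40},
    {"command": "sort",  "cpu_seconds": 90.0, "io_blocks": 10},
    {"command": "cp",    "cpu_seconds": 0.01, "io_blocks": 5000},
]

# Bucket every record, then examine each group separately.
groups = {}
for r in records:
    groups.setdefault(classify(r), []).append(r)

for name, members in sorted(groups.items()):
    print(name, len(members))
```

Once the buckets are defined, each group's records can be analyzed on their own, and the relative size of each bucket over time falls out of the same loop.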

You might have to experiment with various approaches, but you'll know when you've "got it." Suddenly, *crisply*-definable characteristics appear, and the various observations neatly and *clearly* fall into one bucket or another, without "straddling the lines." But now... what should you __do__ with it?

A good strategy for evaluating the data is to define some "goal," then measure the number of times that the goal was or was not met; or, establish the *probability* that the goal will be met. To illustrate the difference, consider these:

- The system will maintain 90% CPU utilization.
- The system will process at least 1,000 transactions per second.
- All class-1 transactions will be completed within 0.01 seconds, with a P=0.95 (95% probability that this goal will be met, by any such transaction selected at random).

The first and second expressions of a "goal" talk only about *the hardware,* basically measuring "how loudly it hums." They measure only an *average,* when what we are usually much more interested in is *variance.* (Recall the stats professor's classic example: "Three women whose average age is 20 are about to enter the classroom" ... "A 59-year-old grandmother and her twin six-month-old granddaughters." 59 + 0.5 + 0.5 = 60, and 60 / 3 = 20.) If a processing system is degrading, we want it to do so "gracefully."
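The average-versus-variance point can be made concrete with two invented samples of transaction times (in milliseconds, purely illustrative numbers): identical averages, very different experiences for the user.

```python
# Two hypothetical samples of class-1 transaction times, in milliseconds.
# Both have the same average, but very different variance.
from statistics import mean

steady = [8, 9, 10, 9, 9]    # low variance: every transaction near the mean
spiky  = [2, 2, 2, 2, 37]    # same mean, but one user waited 37 ms

GOAL_MS = 10                 # "completed within 0.01 seconds"

def fraction_meeting_goal(times, goal=GOAL_MS):
    """Fraction of transactions that met the goal: a per-user view."""
    return sum(t <= goal for t in times) / len(times)

print(mean(steady), mean(spiky))          # identical averages: 9 and 9
print(fraction_meeting_goal(steady))      # 1.0  (every user was served in time)
print(fraction_meeting_goal(spiky))       # 0.8  (one user in five was not)
```

An average-only goal rates both systems identically; the probability-based goal immediately distinguishes them.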

Many of the commonly available statistical methods also assume a *normal distribution* ("the bell curve"), and are completely invalid if it has not first been established that such a distribution actually exists. When measuring a production process, normality often does __not__ exist. We are taking a high-level sample from a complex system whose parts are causally related, not independent. Therefore the samples are not, and cannot be, truly "independent," and a test of normality will generally show that they are not. Even basic statistics like "standard deviation" therefore *cannot* be used meaningfully ... not on the raw data itself, anyway.
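One cheap check before trusting any bell-curve statistic is sample skewness (a normal distribution has skewness near zero). As a stand-in for real measurements, the sketch below uses exponentially distributed service times, a common shape for queueing delays; the 5 ms mean is an invented example.

```python
# Sketch: raw service times from a skewed process fail a basic
# symmetry check. Pure-stdlib sample skewness; the exponential data
# is a stand-in for real (hypothetical) measurements.
import random
from statistics import mean, pstdev

random.seed(42)
times = [random.expovariate(1.0 / 0.005) for _ in range(10_000)]  # mean ~5 ms

def skewness(xs):
    """Standardized third moment: ~0 for normal data."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

# An exponential distribution has theoretical skewness 2, nowhere
# near the 0 that normal-theory statistics quietly assume.
print(round(skewness(times), 2))
```

A result far from zero is a warning that the standard deviation of the raw data is describing a bell curve that isn't there.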

Only the third "goal" is both measurable and useful to the end user: we don't just want to know what *the system* is doing; we want to know what it is doing *for the user.* This is __exactly__ the same approach that a light-bulb manufacturer might use in saying that "every light-bulb tested will light up and remain lit for at least 24 hours, with P=0.99." Statistical process control is the same for a computer as for a light-bulb production line or a factory that manufactures potato chips. You *randomly sample* a small __group__ of records from the data stream, evaluate the *group* to see (yes or no) whether the goal is being met, and respond accordingly. This is a *binomial* test: "yes, or no;" "pass, or fail." From the characteristics of the sample ("in a sample of *X* observations, there were *Y* failures"), you can predict the behavior of the entire population, *and* you can establish the confidence interval ("P") ... how confident you actually can be that your prediction would prove to be correct.
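The "X observations, Y failures" step can be sketched with a binomial confidence interval. One common choice is the Wilson score interval; the sample figures (12 failures in 400 records) are invented for the example.

```python
# Sketch: estimating the population failure rate from a pass/fail
# sample, using the Wilson score interval (normal approximation).
# The sample numbers below are hypothetical.
from math import sqrt

def wilson_interval(failures, n, z=1.96):
    """~95% confidence interval (z=1.96) for a binomial proportion."""
    p = failures / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Example: in a sample of 400 records, 12 failed to meet the goal.
lo, hi = wilson_interval(12, 400)
print(f"failure rate between {lo:.3f} and {hi:.3f}, with ~95% confidence")
```

The observed failure rate here is 3%, but the honest statement is the interval: with this sample size, the population rate could plausibly be anywhere inside it.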

Notice that, even for a system that itself does not exhibit a normal distribution, the statistical properties of *grouped samples* (as above) taken from that system __may__ exhibit "normality" *against each other.* Nonetheless, any time that you intend to use a statistic that is based upon the hypothesis that such-and-so distribution exists, you __must__ test to see if it actually does. Most people don't.
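This grouped-samples effect (the central limit theorem) can be demonstrated directly: raw draws from a skewed process are far from normal, but the means of fixed-size groups of them are much more symmetric. The group size of 100 is an arbitrary choice for the sketch.

```python
# Sketch: the central limit theorem in miniature. Raw exponential
# observations are strongly skewed; the means of groups of 100 of
# them are much closer to symmetric ("bell-shaped").
import random
from statistics import mean, pstdev

random.seed(7)
raw = [random.expovariate(1.0) for _ in range(30_000)]
group_means = [mean(raw[i:i + 100]) for i in range(0, len(raw), 100)]

def skewness(xs):
    """Standardized third moment: ~0 for normal data."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

print(round(skewness(raw), 2))          # strongly skewed (theory says 2)
print(round(skewness(group_means), 2))  # much closer to 0
```

This is why the binomial, grouped-sample evaluation described above is defensible even when the raw measurements are not normal; but as the text says, you still test the assumption rather than take it on faith.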

You can certainly devise a system that collects data continuously, pipes it to another machine for storage, and then, *on a separate machine* (so as to avoid interference with the web server and so on), draws a random sample of those records, evaluates them, and presents them on an HTML screen as a running, real-time display.
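One way to implement the "draw a random sample" step over a stream that never ends is reservoir sampling, which keeps a fixed-size, uniformly random sample without ever storing the whole stream. A minimal sketch, with an integer stream standing in for real records:

```python
# Sketch: reservoir sampling (Algorithm R) keeps a fixed-size uniform
# random sample of a stream without storing the whole stream.
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly at random from the stream."""
    rng = rng or random.Random(0)   # fixed seed here only for repeatability
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            sample.append(record)           # fill the reservoir first
        else:
            j = rng.randrange(i + 1)        # keep record with probability k/(i+1)
            if j < k:
                sample[j] = record
    return sample

# Example: a 10-record snapshot of a million-record stream.
picked = reservoir_sample(range(1_000_000), 10)
print(picked)
```

The sampling machine can rerun this over each collection window and feed the evaluated pass/fail result straight to the monitoring display.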