If you want to know the total amount of resources consumed, per-process accounting records can be kept. This facility is not normally turned on. The records are gathered, then processed after the fact. The impact of the data-gathering procedure itself on the system is minimal, and this is important: you want to observe the behavior of the system while affecting that behavior as little as possible in the act of doing so. So you run the system, gather the results to a file, and then process them using any suitable statistics package.
One of the first and most important tasks in getting meaningful results from your study will be to appropriately classify the various types of workload that you observe. You must separate out the various types of jobs and consider each group's behavior separately from all the other groups; otherwise the outcome will be murky and no usable results will be obtained. All "processes" are not created equal. And even though all the members of "type #1 activities" (whatever they are) are also not created equal, the members of each group share common characteristics that allow them to be compared with one another.
The proportionate size of each group at any particular point in time can also be measured ... and it may be necessary to slice the data further so that you consider only those periods when a certain proportion was found to exist.
You might have to experiment with various approaches, but you'll know when you've "got it." Suddenly, crisply definable characteristics appear, and the various observations neatly and clearly fall into one bucket or another, without "straddling the lines." But now... what should you do?
A good strategy for evaluating the data is to define some "goal," then measure the number of times the goal was or was not met, or establish the probability that the goal will be met. To illustrate the difference, consider these:
- The system will maintain 90% CPU utilization.
- The system will process at least 1,000 transactions per second.
- All class-1 transactions will be completed within 0.01 seconds, with P=0.95 (a 95% probability that this goal will be met by any such transaction selected at random).
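The third kind of goal is straightforward to check mechanically. As a minimal sketch, assuming transaction records are available as (class, elapsed-seconds) pairs (a made-up layout, not any real accounting format):

```python
# Sketch: evaluating the third kind of goal against a batch of
# transaction records. The record layout -- (class label, elapsed
# seconds) tuples -- is an invented example for illustration.

GOAL_SECONDS = 0.01   # class-1 transactions must finish within this
TARGET_P = 0.95       # ... with at least this probability

def goal_met(records, goal=GOAL_SECONDS, target=TARGET_P):
    """Return (observed fraction meeting the goal, goal satisfied?)."""
    class1 = [elapsed for cls, elapsed in records if cls == 1]
    if not class1:
        return None, False
    frac = sum(1 for t in class1 if t <= goal) / len(class1)
    return frac, frac >= target

records = [(1, 0.004), (1, 0.009), (2, 0.300), (1, 0.012), (1, 0.006)]
frac, ok = goal_met(records)
print(frac, ok)   # 3 of the 4 class-1 transactions met the goal
```

The first two goals, by contrast, reduce to a single number with no per-transaction pass/fail structure to evaluate.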
The first and second expressions of a "goal" talk only about the hardware, basically measuring "how loudly it hums." They measure only an average, when what we are usually much more interested in is variance.
(Recall the stats professor's classic example: "Three women whose average age is 20 are about to enter the classroom" ... "A 59-year-old grandmother and her twin six-month-old granddaughters." 59 + 0.5 + 0.5 = 60; 60 / 3 = 20.) If a processing system is degrading, we want it to do so "gracefully."
Many of the commonly available statistical methods also assume a normal distribution ("the bell curve"), and are completely invalid unless it has first been established that such a distribution actually exists. When measuring a production process, normality often does not exist. We are taking a high-level sample from a complex system whose parts are causally related, not independent. Therefore the samples are not, and cannot be, truly "independent," and a test of normality will generally show that it does not hold. Even basic statistics like "standard deviation" therefore cannot meaningfully be used, at least not on the raw data itself.
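To see why, here is a small simulated illustration (the data is artificial, not from any real system): for skewed response times, "mean + 2 × standard deviation" does not cover the roughly 97.7% of observations that a bell curve would promise.

```python
import random
import statistics

# Simulated, heavily skewed "response times" (exponential, mean ~0.01 s).
# Under a normal distribution, mean + 2*stdev would cover ~97.7% of
# observations; for skewed data it covers noticeably less.
random.seed(42)
times = [random.expovariate(100.0) for _ in range(10_000)]

mu = statistics.mean(times)
sigma = statistics.stdev(times)
covered = sum(1 for t in times if t <= mu + 2 * sigma) / len(times)
print(f"mean={mu:.4f}  stdev={sigma:.4f}  within mean+2sd: {covered:.1%}")
```

For an exponential distribution the true coverage of mean + 2·sd is 1 − e⁻³ ≈ 95.0%, not the ≈97.7% a normal curve would give, so the familiar "two sigma" intuition quietly misleads.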
Only the third "goal" is both measurable and useful to the end user: we don't just want to know what the system is doing; we want to know what it is doing for the user.
This is exactly the same approach that a light-bulb manufacturer might use in saying that "every light-bulb tested will light up and remain lit for at least 24 hours, with P=0.99." This is statistical process control, the same for a computer as for a light-bulb production line or a factory that manufactures potato chips. You randomly sample a small group of records from the data stream, evaluate the group to see (yes or no) whether the goal is being met, and respond accordingly. This is a binomial test: "yes or no," "pass or fail." From the characteristics of the sample ("in a sample of X observations, there were Y failures"), you can predict the behavior of the entire population, and you can establish the confidence interval ("P"): how confident you actually can be that your prediction would prove correct.
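As a sketch of that prediction step, here is one standard way to turn "Y failures in a sample of X observations" into bounds on the population failure rate, using the Wilson score interval (the sample numbers below are invented):

```python
import math

# Sketch: from a pass/fail sample, bound the population failure rate
# with a Wilson score interval, a standard binomial confidence interval.
def wilson_interval(failures, n, z=1.96):   # z=1.96 -> ~95% confidence
    """Return (low, high) bounds on the true failure proportion."""
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# Invented example: "in a sample of 400 observations, 12 failures."
low, high = wilson_interval(12, 400)
print(f"observed failure rate {12/400:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

The Wilson form is chosen here because, unlike the naive normal approximation, it behaves sensibly even when failures are rare, which is exactly the regime a well-run system lives in.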
Notice that, even for a system that itself does not exhibit a normal distribution, the statistics of grouped samples (as above) taken from that system may themselves be approximately normally distributed; this is the central limit theorem at work.
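A quick simulation illustrates this (again with artificial data): raw exponential samples are strongly skewed, but the means of fixed-size groups drawn from the same stream are far more symmetric.

```python
import random
import statistics

# Raw exponential data is heavily skewed; means of groups of 50 drawn
# from the same stream are much closer to a bell curve (central limit
# theorem).
random.seed(7)
raw = [random.expovariate(1.0) for _ in range(50_000)]
group_means = [statistics.mean(raw[i:i + 50]) for i in range(0, 50_000, 50)]

def skewness(xs):
    """Sample skewness: 0 for a symmetric (e.g. normal) distribution."""
    m = statistics.mean(xs)
    s = statistics.stdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

print(f"raw skewness:        {skewness(raw):.2f}")          # near 2 for exponential
print(f"group-mean skewness: {skewness(group_means):.2f}")  # far smaller
```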
Nonetheless, any time you intend to use a statistic that rests on the hypothesis that such-and-such a distribution exists, you must test to see whether it actually does. Most people don't.
You can certainly devise a system that collects data continuously, pipes it to another machine for storage, and then, on a separate machine (so as to avoid interference caused by running a web server and so on), draws a random sample of those records, evaluates them, and presents the results as a running display on an HTML screen for real-time monitoring.
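One standard way to draw that random sample from a stream without storing every record is reservoir sampling. A minimal sketch (the record source here is a hypothetical stand-in for the accounting-record stream):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of
    unknown length (Algorithm R, reservoir sampling)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)           # fill the reservoir first
        else:
            j = rng.randrange(i + 1)      # replace with prob. k/(i+1)
            if j < k:
                sample[j] = item
    return sample

random.seed(1)
# Hypothetical stand-in for the incoming accounting-record stream:
records = range(100_000)
print(reservoir_sample(records, 10))
```

Because the reservoir is fixed-size and the pass is single and sequential, the sampler adds almost no load to the machine doing the evaluation, in keeping with the observe-without-disturbing principle above.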