LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Linux Load average concern (https://www.linuxquestions.org/questions/linux-newbie-8/linux-load-average-concern-4175611603/)

sampy12345 08-09-2017 05:10 AM

Linux Load average concern
 
Hi all,

I have recently started using the Nagios monitoring tool, which requires Linux, so I have started using Linux, and I am running into more and more doubts.

->What does "load average" mean when I run the htop command?
->If anyone is using Nagios, can you explain exactly what check_load monitors, and confirm whether it covers CPU utilization, memory utilization, and I/O wait?

sundialsvcs 08-09-2017 08:36 AM

You'd love to see your system "working hard, instead of 'hardly working.'" The thing to watch out for is an excess amount of "involuntary waiting," where processes want to do this-or-that but are "stuck in rush-hour traffic."

It is perfectly ordinary to see, for example, a certain amount of "I/O wait," because I/O devices take time to respond. Likewise, a certain amount of swapping and paging activity. One of the best ways to observe the pragmatic performance of a system is to monitor the progress of something which this system routinely has to do: for instance, "how long does it take to produce a reasonably-complex web page, without caching?" A monitoring tool (such as Nagios) can be programmed to periodically try this and to record how long it took each time. This is a "boots-on-the-ground" measure of the system's perceived performance, and the first indication that you might have some problem requiring further investigation.
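
As a rough sketch (the URL here is a stand-in for a page of your own), curl can report the total time of a single fetch, and if I remember the plugin's options correctly, Nagios' check_http can run the same probe with warning/critical thresholds:

Code:

# time one uncached page fetch, printing the total seconds
$ curl -o /dev/null -s -w '%{time_total}\n' http://example.com/some-page

# roughly the same probe as a Nagios check: warn at 2 s, critical at 5 s
$ check_http -H example.com -u /some-page -w 2 -c 5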

suicidaleggroll 08-09-2017 01:34 PM

https://www.howtogeek.com/194642/und...-like-systems/

In general you'd want your load average to be less than the number of physical cores on your system, otherwise chances are things are backing up for one reason or another. There are of course exceptions to this though.
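
You can eyeball that comparison directly on any Linux box:

Code:

# how many cores are online?
$ nproc

# the 1-, 5-, and 15-minute load averages, the same numbers htop shows;
# the last two fields are runnable/total tasks and the most recent PID
$ cat /proc/loadavg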

syg00 08-09-2017 08:10 PM

That article perpetuates a Unix description of load average from a long time ago.

In what can only be described as unbelievable timing, Brendan Gregg has just posted *THE* definitive article on loadavg here. It's long-ish and somewhat technical, but it is all there is to say.

And note the quote at the end.

frankbell 08-09-2017 09:35 PM

These three man pages will tell you more about load averages:

Code:

$ apropos load | grep average
getloadavg (3)       - get system load averages
tload (1)            - graphic representation of system load average
xload (1)            - system load average display for X

Also see

Code:

$ man apropos

One of the most useful commands to know is apropos. If you are not sure which man page to consult, it can help you find the most relevant ones.

sundialsvcs 08-10-2017 09:06 AM

Since "load average" now includes processes that are in a wait-state, it no longer relates to the number of cores.

Personally, I don't find such metrics – however calculated – to be useful at all, because they're an abstract description of what the system is doing as a whole. A system might be doing many different things and there is no separating the wheat from the chaff.

That's why I rig up external processes (like Nagios) to probe the system ... or, I perform statistical analysis of log files (that are appropriately designed). I want to measure the time that it takes for the system to do a particular(!) thing. (And, I want the probes to be taken infrequently, often randomly within some time-range, so that the probing process itself will not skew the results.)
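
A minimal sketch of such a probe, with a made-up URL and log path, and a randomized pause so that the sampling doesn't itself become part of the load being measured:

Code:

#!/bin/bash
# probe one specific task at random intervals and log how long it took
URL="http://example.com/probe-page"      # hypothetical target
LOG="/var/log/probe-times.log"           # hypothetical log location
while true; do
    t=$(curl -o /dev/null -s -w '%{time_total}' "$URL")
    echo "$(date -Is) $t" >> "$LOG"
    sleep $(( 300 + RANDOM % 600 ))      # wait 5 to 15 minutes, randomized
done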

Sometimes, I'll do a "bang bang bang" probe: after a lengthy pause which allows the process that I'm probing to have been swapped out (the process is a dummy designed only to respond to my probe and to otherwise behave in some certain way), I'll hit it three times in quick succession, knowing that the second and third hits represent a process that is seen to be "actively running." Then, there's a much longer (random) wait before I do it (three times ...) again. I mainly look at the standard deviation of the three, and the min/max value.
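
Sketched in shell, with ./probe-dummy standing in for that dummy process:

Code:

#!/bin/bash
# hit the dummy three times in quick succession, then summarize
for i in 1 2 3; do
    /usr/bin/time -f '%e' -o /tmp/hit$i.txt ./probe-dummy
done
# min, max, and standard deviation of the three elapsed times
cat /tmp/hit?.txt | awk '{ s+=$1; ss+=$1*$1;
    if (NR==1 || $1<min) min=$1; if ($1>max) max=$1 }
    END { printf "min %.3f  max %.3f  sd %.3f\n",
          min, max, sqrt(ss/NR - (s/NR)^2) }'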

Log files, however, are best, because they represent a constant load on the system. On a very busy system you can use a (fairly small) ring-buffer in a shared memory segment, and scoop up a copy of the buffer content at periodic or random intervals.

The most useful results can be obtained with simple "statistical quality-control" analyses – like the ones used at the hypothetical light-bulb factory you learned about in school. For instance, you might say, "All class-A activities should complete in less than 2 seconds, 95% of the time, and with a standard deviation of no more than 1.7." Now, you measure how many times the system under test "passed," and how many times it "failed." (Notice that this is no longer a continuous-valued metric, but a binary one.) You can get results with a very small random sample. Any and all of the necessary statistics can be done with a spreadsheet, or with an open-source tool such as the "R" statistical programming language.
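
Given a file of measured times, one per line in seconds (a hypothetical probe-times file), the tally is a one-liner; the 2-second limit is just the example figure above:

Code:

# count class-A samples under the 2-second limit, and report the spread
$ awk '{ n++; s+=$1; ss+=$1*$1; if ($1 < 2) pass++ }
      END { printf "passed %d/%d (%.1f%%), sd %.2f\n",
            pass, n, 100*pass/n, sqrt(ss/n - (s/n)^2) }' probe-times.txt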

If you discover a problem, sometimes there is a recognizable pattern to it, such as time of day. Or, some other metric (number of jobs, number of logged-in users, number of web requests) that can be shown to be correlated to it. You can often predict where the problem actually is, and what to do about it, based on thoughtful ex post facto analyses of such data, which are cheap and easy to collect. Random sampling makes the actual number of observations manageable. (You can also calculate the sample-size that you need.)
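
For the time-of-day case, grouping a timestamped log by hour is usually enough to make the pattern visible (this assumes the "ISO-timestamp seconds" lines from the probe sketched earlier):

Code:

# average probe time per hour of day
$ awk '{ split($1, d, "T"); h = substr(d[2], 1, 2);
        n[h]++; s[h] += $2 }
      END { for (h in n) printf "%s:00  %.3f s over %d samples\n",
            h, s[h]/n[h], n[h] }' /var/log/probe-times.log | sort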

