How to check all queued processes during a specific period

cesarsj · 07-30-2019, 05:45 PM

Zabbix warns us several times during the night that there are, on average, more than 5 queued processes per CPU on a storage server with 2 CPUs. The trigger is this one below:

{<server>:system.cpu.load[percpu,avg1].last(0)}>5

I would like to save the list of all queued processes to a file during the alert period, so that I can manage them. How could I do that?

The ps manual page says that there is this state:

R running or runnable (in the execution queue)

Can't I filter to see only the runnable processes?

The following source says that the command below shows only the processes in the queue:
https://www.commandlinefu.com/comman...th-a-status-of

ps -eo stat,pid,user,command | egrep "^STAT|^D|^R"

Is that so?

Is there another command that would better suit my needs?

Another solution would be for me to adjust this trigger. But first, I think I had better see what the processes are.
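One way to capture this is a small script that cron runs every minute and that appends a snapshot only while the load is high. The following is a minimal sketch under some assumptions: a Linux host (it reads /proc/loadavg), a hypothetical log path, and hypothetical function names (`filter_queued`, `load_exceeds`, `snapshot_if_busy`) that are not part of any existing tool. Note that the `ps` process itself will usually show up in state R.

```shell
#!/bin/sh
# Keep the ps header plus every process whose state starts with R or D.
# Reads ps output on stdin so it can be tried with canned input.
filter_queued() {
    awk 'NR == 1 || $1 ~ /^[RD]/'
}

# Succeed (exit status 0) when the load per CPU is above the threshold.
# $1 = 1-minute load average, $2 = number of CPUs, $3 = threshold
load_exceeds() {
    awk -v load="$1" -v cpus="$2" -v max="$3" \
        'BEGIN { exit !(load / cpus > max) }'
}

# Append one timestamped snapshot to a log file if the load is high.
# $1 = log file (hypothetical path), $2 = threshold per CPU
snapshot_if_busy() {
    load=$(cut -d' ' -f1 /proc/loadavg)      # Linux only
    cpus=$(getconf _NPROCESSORS_ONLN)
    if load_exceeds "$load" "$cpus" "$2"; then
        {
            echo "=== $(date '+%F %T') load=$load cpus=$cpus"
            ps -eo stat,pid,user,command | filter_queued
        } >> "$1"
    fi
}
```

To use it, add a line such as `snapshot_if_busy /var/log/queued-processes.log 5` at the end of the script and schedule the script from cron once a minute. The threshold of 5 mirrors the trigger above, and the log path is only an example.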

syg00 · 07-30-2019, 06:28 PM

At any given instant, you can't know (in advance) the difference between running or runnable - hence the single metric.
The command from the fu site looks ok; note the comments re state "D" - likely to be your concern. These types of alerts are largely misleading IMHO.
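On Linux the load average counts processes in uninterruptible sleep (state D) as well as runnable ones (state R), so it can help to see which of the two dominates at a given moment. A rough sketch, where `count_states` is a hypothetical helper name:

```shell
# Count processes per leading state letter.
# Expects one state string per line on stdin, as printed by `ps -eo stat=`.
count_states() {
    cut -c1 | sort | uniq -c
}
```

Run it as `ps -eo stat= | count_states`. A high count for D relative to R would point at processes waiting on I/O rather than on CPU.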

rnturn · 08-01-2019, 03:55 PM

Quote:

Originally Posted by cesarsj

Zabbix warns us several times during the night that there are, on average, more than 5 queued processes per CPU on a storage server with 2 CPUs. The trigger is this one below:

{strg05.unipam.edu.br:system.cpu.load[percpu,avg1].last(0)}>5

I would like to save the list of all queued processes to a file during the alert period, so that I can manage them. How could I do that?

Just about any process could be waiting for CPU time.

Frankly, a queue of five waiting jobs doesn't sound so bad---it surely doesn't seem like the system's being Slashdotted or anything like that. At one site, we used to see alerts like this all the time in Nagios.

Users always wanted to schedule all of their jobs to run at midnight because a.) they were processing the previous day's transactions so running them at midnight was necessary (though running them some time after midnight never occurred to them) and b.) they assumed they'd be the only ones using the system at night, conveniently forgetting about the database loads that ran all night, the system backups, etc., etc. As a result we had four beefy CPUs that were, towards the end of the month, saturated with a system load often over 30 for several hours at a time.

Even during the day, if response time was slower than normal, some users of the middle part of the three-tier application would resubmit insanely complex ad hoc database queries thinking that because it didn't return results immediately, it must not have "taken" (you know, like a failed vaccination)---now two of them are running.

It was a mess until we educated the user community about the ways to run their jobs sequentially rather than in parallel, spread the job start times, and, best of all, let the people whose job it was to schedule jobs within the job scheduler do their job. Fortunately, we weren't getting calls in the wee hours about the load though I got more than one call during the day about why sendmail wasn't emailing job results during the periods of high load.

Cheers...

cesarsj · 08-01-2019, 07:07 PM

I was wrong; the trigger is actually the one below:

({TRIGGER.VALUE}=0 and {<server>:system.cpu.load[percpu,avg1].last(0)}>5) or ({TRIGGER.VALUE}=1 and {<server>:system.cpu.load[percpu,avg1].min(10m)}>5)

What this trigger does is this: it goes into the problem state if the last collected value is greater than 5, and it recovers as soon as any one of the values collected in the last 10 minutes is 5 or less.

What I would like is for it to recover only if ALL values collected in the last 10 minutes were 5 or less. How could I adjust the trigger for this case?

Any better ideas?

cesarsj · 08-01-2019, 07:31 PM

I think I now understand the min and max functions, and I think the expression below will be better!

({TRIGGER.VALUE} = 0 and {<server>:system.cpu.load[percpu,avg1].last(0)}>5) or ({TRIGGER.VALUE} = 1 and {<server>:system.cpu.load[percpu,avg1].max(10m)}>5)