Nice History
Quick question: we have a box which we monitor via Nagios. Sometimes we get an alert saying "High CPU", but by the time we log on to the box, the process that caused the high CPU has come back to normal, let's say after 10 minutes or so. How can we see what spiked the CPU 2-3 hours ago?
|
Nagios check_procs doesn't give you much info, so AFAIK you can't, unless you had already sent or saved detailed process information.
|
It depends on what you're monitoring performance for. If you just need detailed info anyway, maybe you should look into some kind of database-backed solution (search Freshmeat, Sourceforge, Savannah, Berlios). If, OTOH, you only need it for assessing what's going wrong *now*, you could run something like 'atop', which writes detailed stats to a file you can step through and replay later on, or have, for instance, check_load trigger something that polls the box over SNMP or HTTP and returns the output of something like: /bin/ps -eo %C -eo pid,command | grep -v '^ 0.0'. OTOH if you have no idea what is or are the bottlenecks you may want to look into more generic stats first like Atsar or SAR or Dstat or Collectl (which both combine output from the sysstat package tools).
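A rough sketch of the ps idea above, assuming a Linux box with procps and a cron job that fires once a minute. The function names and the log path are made up for illustration; they are not part of Nagios or any of its plugins.

```shell
#!/bin/sh
# Hypothetical helpers; names and default log path are assumptions.

# Keep only rows whose first column (%CPU) is above a threshold
# (default 0). Reads "pcpu pid command..." rows on stdin; the
# non-numeric header row is dropped.
busy_only() {
    awk -v min="${1:-0}" '$1 ~ /^[0-9.]+$/ && $1 + 0 > min + 0'
}

# Append a timestamped list of busy processes to a log file.
snapshot_cpu() {
    log="${1:-/var/tmp/cpu-history.log}"
    {
        date '+=== %Y-%m-%d %H:%M:%S ==='
        ps -eo pcpu,pid,args | busy_only 0
    } >> "$log"
}
```

Calling snapshot_cpu from a one-minute cron entry would leave a log you can search when an alert turns up hours later. Keep in mind that the %CPU column from procps ps is CPU time averaged over the lifetime of each process, so a short burst in a long-running process may barely show; atop or sar capture that kind of spike better.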
|
This is one reason most places I've worked ditched Nagios for OpenNMS, which has the capability to graph resources. So you can view a complete history of CPU usage, disk space, network traffic, etc.
|
Well, Hobbit is better than Nagios in this respect, since it lets us keep a snapshot of top.
But I got your point, guys... I am not looking for some specific process here, so a general top in non-interactive mode will do for me. Thanks.
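For what it's worth, "top in non-interactive mode" could look something like the sketch below. It assumes procps top, where -b selects batch mode and -n 1 exits after one iteration. The function names, log path and line count are only illustrative.

```shell
#!/bin/sh
# Hypothetical helpers; names, paths and defaults are assumptions.

# Append one timestamped batch-mode top snapshot to a log file,
# keeping only the first N lines (summary area plus the top few processes).
top_snapshot() {
    log="${1:-/var/tmp/top-history.log}"
    lines="${2:-20}"
    {
        date '+=== %Y-%m-%d %H:%M:%S ==='
        top -b -n 1 | head -n "$lines"
    } >> "$log"
}

# Print every snapshot recorded during a given hour.
# $1 = log file, $2 = hour prefix in the form "YYYY-MM-DD HH".
show_hour() {
    awk -v hour="$2" '
        /^=== / { keep = (index($0, "=== " hour ":") == 1) }
        keep
    ' "$1"
}
```

Run top_snapshot from cron every minute or so, and when an alert mentions a spike a few hours back, show_hour with the log file and that hour pulls out just the matching snapshots.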
|
Oh wait, you wanted to know other details of each process... ;)
|
I thought that providing information about OpenNMS, which graphs these resources, could give some insight into the user's problem; they could at least see whether the CPU load actually did spike. In my experience, Nagios sometimes produced false alerts. With OpenNMS monitoring not only CPU but networking, processes and just about anything else, it would be easier to narrow down the culprit if there was indeed a CPU load spike.
|
So what kind of response is that? What kind of value does a reply like that have?
|
That portion of my reply was half sarcastic, and it was also me realizing that by *zoom* in on gory process details you meant individual processes, not just taking a snapshot of the load. That's all. But with some custom graphing and monitors, I'm sure it's possible with OpenNMS. Does that satisfy you as a valuable response? I'll just be sure to stop any light-hearted discussions in any threads you participate in, okay?
|
The answer will do, thanks.
|
Hmm... I just saw the note from unSpawn which said "OTOH if you have no idea what is or are the bottlenecks you may want to look into more generic stats first like Atsar or SAR or Dstat or Collectl (which both combine output from the sysstat package tools)."
As the author of collectl I just want to say collectl has nothing to do with sysstat - it's a completely separate, standalone tool. Also, since the posting of this note I've been adding a lot of extra goodies such as monitoring process I/O stats if you have the right kernel. Someone had mentioned detailed process monitoring and while collectl by default only looks at processes once every 60 seconds to keep the load down, if you tell it to look at specific processes you can monitor them every second or so and not generate any appreciable load. That means you can watch memory, cpu, i/o, page faults over time. There was also mention about watching memory, and while slab monitoring is system-wide, if you do have a few slabs that are growing uncontrolled you can sometimes figure out who's using them just by their name or you can google them and learn more too. Anyhow be sure to check out http://collectl.sourceforge.net/ and within the next couple of days of this posting I expect to release version 2.6.4 which will have the capability of showing top I/O users in much the same way the top command can show top cpu consumers. Stay tuned... -mark
|