Nice History

rajwinder · 06-06-2008, 07:34 AM

Quick question: we have a box which we monitor via Nagios. Sometimes we get an alert saying "High CPU", but by the time we log on to the box, the process that caused the high CPU has gone back to normal, say after 10 minutes or so. How can we see what spiked the CPU 2-3 hours back?

unSpawn · 06-06-2008, 08:14 AM

Nagios check_procs doesn't give you too much info, so AFAIK you can't, unless you already sent or saved detailed process information.

rajwinder · 06-06-2008, 08:26 AM

Quote:

Originally Posted by unSpawn

Nagios check_procs doesn't give you too much info, so AFAIK you can't, unless you already sent or saved detailed process information.

Yep, Nagios wouldn't provide that... so the only solution is to write a custom script that runs every minute or so, takes a snapshot of top and stores it?

unSpawn · 06-06-2008, 10:11 AM

It depends on what you're monitoring performance for. If you just need to have detailed info anyway, maybe you should look into some kind of database-backed solution (search Freshmeat, Sourceforge, Savannah, Berlios). If you OTOH only need it for assessing what's going wrong *now*, you could run something like 'atop', which writes detailed stats to a file you can step through and replay later on, or have for instance check_load trigger something polling the box over SNMP or HTTP and return output from something like '/bin/ps -eo %C -eo pid,command | grep -v '^ 0.0''. OTOH if you have no idea what is or are the bottlenecks you may want to look into more generic stats first like Atsar or SAR or Dstat or Collectl (which both combine output from the sysstat package tools).
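The ps one-liner above can be made a little more robust by filtering on a numeric threshold instead of matching the text ' 0.0'. This is only a minimal sketch, assuming a procps-style ps that accepts `-eo pcpu,pid,args`; the function name `busy_procs` and its threshold argument are made up for illustration.

```shell
#!/bin/sh
# busy_procs: read "ps -eo pcpu,pid,args" style output on stdin and print
# only the processes at or above a CPU threshold (default 0.1 percent),
# busiest first. The function name and the threshold are illustrative.
busy_procs() {
    threshold="${1:-0.1}"
    # NR > 1 skips the ps header line; adding 0 forces a numeric comparison.
    # LC_ALL=C keeps "." as the decimal separator for both awk and sort.
    LC_ALL=C awk -v min="$threshold" 'NR > 1 && $1 + 0 >= min + 0' |
        LC_ALL=C sort -k1,1 -rn
}

# Typical use on the monitored box (assumes a procps-style ps):
#   ps -eo pcpu,pid,args | busy_procs 1.0
```

Whatever check_load triggers could return this output, so the alert carries a list of the processes that were busy at that moment.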

trickykid · 06-06-2008, 11:42 AM

This is a reason most places I've worked ditched Nagios for OpenNMS which has the capability to graph resources. So you can view a complete history of cpu usage, disk space, network traffic, etc.

unSpawn · 06-06-2008, 11:53 AM

Quote:

Originally Posted by trickykid

OpenNMS which has the capability to graph resources. So you can view a complete history of cpu usage, disk space, network traffic, etc.

Sure, shiny graphs are easy for overview, but can you *still* zoom in on the gory per-process details with it?..

rajwinder · 06-06-2008, 12:32 PM

Well, Hobbit is better than Nagios in this respect, since it can keep a snapshot of top.

But I got your point, guys... I am not looking for any specific process here, so a general top in non-interactive mode will do for me.
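Running top in non-interactive (batch) mode on a schedule could look roughly like the sketch below. It assumes a procps-style top that supports `-b` and `-n 1`; the function name `snapshot_top`, the `SNAPSHOT_CMD` variable and the log directory are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# snapshot_top: append one timestamped batch-mode top snapshot to a
# per-day log file. The function name, the SNAPSHOT_CMD variable and the
# log directory layout are illustrative assumptions, not an existing tool.
SNAPSHOT_CMD="${SNAPSHOT_CMD:-top -b -n 1}"   # -b: batch mode, -n 1: one iteration

snapshot_top() {
    log_dir="${1:-/var/tmp/top-history}"
    mkdir -p "$log_dir" || return 1
    log_file="$log_dir/top-$(date +%Y%m%d).log"
    {
        # Header line makes it easy to find the snapshot nearest an alert.
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        # Left unquoted on purpose so the command splits into words.
        $SNAPSHOT_CMD | head -n 30
    } >> "$log_file"
}

# Example crontab entry (the script path is hypothetical), once a minute:
#   * * * * * . /usr/local/bin/top-snapshot.sh && snapshot_top /var/log/top-history
```

When an alert arrives hours later, searching the day's log for the "===" header closest to the alert time shows what was at the top of the list then. Old log files still need to be rotated or pruned separately.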

Thanks

trickykid · 06-07-2008, 08:45 AM

Quote:

Originally Posted by unSpawn

Sure, shiny graphs are easy for overview, but can you *still* zoom in on the gory per-process details with it?..

Yup, it supports zooming in on the graph to see smaller time increments within the given window you're viewing..

Oh wait, you wanted to know other details of each process...

unSpawn · 06-07-2008, 11:47 AM

Quote:

Originally Posted by trickykid

Oh wait

I'd rather not wait until you manage to add another of your "invaluable expert" replies.

trickykid · 06-08-2008, 12:35 AM

Quote:

Originally Posted by unSpawn

I'd rather not wait until you manage to add another of your "invaluable expert" replies.

What's that supposed to mean? Are you joking or were you actually serious?

I thought that providing information about OpenNMS and its graphs would or could give insight into the user's problem: they could at least see if the CPU load actually did spike. In my experience Nagios sometimes produced false alerts. With OpenNMS monitoring not only CPU but also networking, processes and just about anything else, it would at least be easier to narrow down the culprit if there was indeed a CPU load or spike.

unSpawn · 06-08-2008, 06:25 AM

Quote:

Originally Posted by trickykid

I thought that providing information about OpenNMS and its graphs would or could give insight into the user's problem: they could at least see if the CPU load actually did spike.

That would only apply if I reacted to something in your reply to the OP, which I did not.

Quote:

Originally Posted by trickykid

What's that supposed to mean?

I asked you a question, to which you replied:

Quote:

Originally Posted by trickykid

Oh wait, you wanted to know other details of each process...

So what kind of response is that? What kind of value does a reply like that have?

trickykid · 06-08-2008, 08:33 AM

Quote:

Originally Posted by unSpawn

That would only apply if I reacted to something in your reply to the OP, which I did not.

I asked you a question to which you replied .
So what kind of response is that? What kind of value does a reply like that have?

So only half of my reply gets a reply from you? I'm actually offended by your first response to it, which is what I questioned. You make it sound as if *all* my replies on this forum are "invaluable expert" ones. If that's the case, and you honestly feel that way, I'll just stop contributing.

That portion of my reply was half sarcastic, and also me realizing that by *zoom* in on gory process details you meant individual processes, not just a snapshot of the load. That's all. But with some custom graphing and monitors, I'm sure it's possible with OpenNMS. Does that satisfy you as a valuable response? I'll just be sure to stop any light-hearted discussions in any threads you participate in, okay?

unSpawn · 06-09-2008, 12:13 PM

The answer will do, thanks.

markseger · 06-09-2008, 03:27 PM

Hmm... I just saw the note from unSpawn which said "OTOH if you have no idea what is or are the bottlenecks you may want to look into more generic stats first like Atsar or SAR or Dstat or Collectl (which both combine output from the sysstat package tools)."

As the author of collectl I just want to say collectl has nothing to do with sysstat - it's a completely separate, standalone tool. Also, since the posting of this note I've been adding a lot of extra goodies such as monitoring process I/O stats if you have the right kernel. Someone had mentioned detailed process monitoring and while collectl by default only looks at processes once every 60 seconds to keep the load down, if you tell it to look at specific processes you can monitor them every second or so and not generate any appreciable load. That means you can watch memory, cpu, i/o, page faults over time.
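To illustrate the idea of watching one specific process at a short interval, here is a rough stand-in that reads the Linux /proc/<pid>/stat file directly. It is not collectl and does not use any collectl options; the function names `proc_sample` and `watch_proc` and the optional proc-root argument are hypothetical. The values printed are utime and stime in clock ticks and the resident set size in pages.

```shell
#!/bin/sh
# proc_sample: print "utime stime rss_pages" for one PID, read from the
# Linux /proc/<pid>/stat file. This is NOT collectl, only a rough stand-in
# for the idea of watching one process closely. The function names and the
# optional proc-root argument are illustrative.
proc_sample() {
    pid="$1"
    proc_root="${2:-/proc}"
    stat_line=$(cat "$proc_root/$pid/stat" 2>/dev/null) || return 1
    # Drop "pid (comm) " first: the command name may itself contain spaces.
    rest="${stat_line##*) }"
    # Split the remaining fields into positional parameters.
    set -- $rest
    # After the cut, $1 is field 3 (state), so utime (14), stime (15) and
    # rss (24) are positional parameters 12, 13 and 22.
    echo "${12} ${13} ${22}"
}

# watch_proc: sample a PID once per interval (default 1 second) until the
# process exits or its stat file can no longer be read.
watch_proc() {
    while line=$(proc_sample "$1"); do
        echo "$(date '+%H:%M:%S') $line"
        sleep "${2:-1}"
    done
}

# Example: watch_proc 4242 1 >> /var/tmp/pid-4242.log
```

Because utime and stime are cumulative counters, the difference between two consecutive samples gives the CPU time used during that interval.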

There was also mention about watching memory, and while slab monitoring is system-wide, if you do have a few slabs that are growing uncontrolled you can sometimes figure out who's using them just by their name or you can google them and learn more too.

Anyhow be sure to check out http://collectl.sourceforge.net/ and within the next couple of days of this posting I expect to release version 2.6.4 which will have the capability of showing top I/O users in much the same way the top command can show top cpu consumers. Stay tuned...

-mark

unSpawn · 06-09-2008, 04:18 PM

Quote:

Originally Posted by markseger

Hmm... I just saw the note from unSpawn which said "OTOH if you have no idea what is or are the bottlenecks you may want to look into more generic stats first like Atsar or SAR or Dstat or Collectl (which both combine output from the sysstat package tools)."

As the author of collectl I just want to say collectl has nothing to do with sysstat - it's a completely separate, standalone tool.

You just misread my remark. If I re-phrase it like "... (Atsar or SAR) or (Dstat or Collectl), the last two aggregate output somewhat similar to running all tools from the sysstat package at once." it should be more clear I think.