Nagios

bpwoods · 10-05-2009, 10:43 AM

I would like Nagios to go to a HARD state on the first try if a server reports WARNING or CRITICAL, but to retry four times if the result is UNKNOWN because the check timed out.

Right now I have the template set to check once every minute and to retry three times. This is working OK, but we have so many servers (thousands) that we still get false hits. It also adds an extra two-minute lag to notifications...

Suggestions?

Doculus · 10-06-2009, 12:51 AM

For thousands of hosts I would definitely consider a distributed Nagios setup, and/or using NRPE on the servers themselves to check services locally, so the Nagios server(s) can get the results in a more timely way, with all tests for a host in one request.

bpwoods · 10-06-2009, 02:57 PM

Yep, and yep. We currently have five high-end servers (8xCPU, 8xGBs RAM, RAID 0+1x15K) running Nagios. We are also using NSClient, as the majority of our servers are Windows. We can do around 20,000 service-level checks per minute at last testing, and we use every bit of it.

But because of the sheer volume of checks, statistically we will jump above the three-check limit from time to time. And we are on the hook 24/365.
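To put a rough number on that, here is a back-of-the-envelope sketch. It assumes every service is checked once a minute and that each attempt times out independently with the same probability; the 1% timeout rate is a made-up figure for illustration, not something measured on this setup. Under those assumptions, three attempts give roughly 29 false HARD states a day across 20,000 checks, and a fourth attempt brings that to well under one.

```python
def expected_false_hard_states_per_day(services, p_timeout, max_attempts):
    """Rough expected number of spurious HARD states per day.

    Assumes each service is checked once a minute and that every attempt
    times out independently with probability p_timeout (both assumptions).
    """
    minutes_per_day = 24 * 60
    # A spurious HARD state needs max_attempts consecutive timeouts.
    p_hard = p_timeout ** max_attempts
    return services * minutes_per_day * p_hard


if __name__ == "__main__":
    # 20,000 is the checks-per-minute figure from the post; 1% is a made-up rate.
    for attempts in (3, 4):
        rate = expected_false_hard_states_per_day(20_000, 0.01, attempts)
        print(f"max attempts {attempts}: about {rate:.2f} false HARD states per day")
```

Real timeouts tend to cluster (a busy server stays busy), so the independence assumption likely understates the problem.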

So, we are REALLY hoping to find a way for Nagios to distinguish between "was unable to check" and "was able to check, and there was a failure".

Because sometimes a server is just busy. And that is not necessarily something we want to alert on...
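One possible way to get that distinction, sketched here as a hypothetical wrapper plugin rather than a built-in Nagios feature: the wrapper runs the real check command, retries only when the result is UNKNOWN (including its own timeout), and returns OK, WARNING, or CRITICAL immediately. With max_check_attempts set to 1 on the service, any non-OK result the wrapper returns would then go HARD on the first try. The function names, the 10-second timeout, and the attempt count of 4 are all assumptions for illustration; the exit codes 0 to 3 are the standard Nagios plugin return codes.

```python
#!/usr/bin/env python3
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_once(cmd, timeout):
    """Run the plugin once and return (exit_code, output).

    A timeout or a failure to launch the plugin is reported as UNKNOWN.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return UNKNOWN, "UNKNOWN - check timed out"
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - could not run check: {exc}"
    # Any exit code outside 0..2 is treated as UNKNOWN.
    code = proc.returncode if proc.returncode in (OK, WARNING, CRITICAL) else UNKNOWN
    return code, proc.stdout.strip()


def check_with_retries(cmd, attempts=4, timeout=10.0, run=run_once):
    """Retry only while the result is UNKNOWN; return any real result at once."""
    code, output = UNKNOWN, "UNKNOWN - no attempts made"
    for _ in range(attempts):
        code, output = run(cmd, timeout)
        if code != UNKNOWN:
            break
    return code, output


# Only act when a check command was passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    exit_code, text = check_with_retries(sys.argv[1:])
    print(text)
    sys.exit(exit_code)
```

The wrapper would be used as the check command, with the real plugin and its arguments passed after the wrapper's path. The worst-case runtime is attempts times timeout (40 seconds with these assumed values), which has to stay below the service_check_timeout in the main Nagios config, or Nagios will kill the wrapper itself. Some plugins report their own timeouts as CRITICAL, so the wrapper's timeout would need to be shorter than the plugin's for this to work.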

bpwoods · 10-13-2009, 12:19 PM

No takers on this?