Quote:
Originally Posted by AST
Code:
Check_interval - 1
retry_interval - 1
Max_check_attempts - 2
notification_interval - 1
notification_options - d,r
|
You've configured Nagios to alert you upon recovery (good), so you'll get an alert saying all is well. That is especially useful when the fix would mean driving ten miles and the problem recovers before you've gone two blocks.
Quote:
Originally Posted by AST
1. 1st email to be sent out after 3 minutes and then every 1 minute until recovery (like it is now)
|
Careful... you don't want to go insane with 180 texts "reminding" you a host is down. Saving SMS money is good, but after half the night awake with ten text messages, you won't care about your wireless invoice as much as your pillow. This is not a "set-it-and-forget-it" tool--it's very hands-on.
Nagios is my buddy; it has been watching my cable modem for about four years. I liked it so much I built and maintained a Nagios server (around 200 hosts, lots of ping, some telnet, SNMP, ssh) at my last job.
First, and this is easier to configure than escalation: increase your notification interval. Certainly, critical systems might require a lower interval (our UPSes were 5 min), but the interval is there to give you time to log into Nagios, not to nag you constantly that something is still broken. Then you can acknowledge the service issue.
Notice I didn't say "disable" the alert; disabling a notification is a VERY rare need. The "acknowledge" button is key to a successful Nagios installation. You should acknowledge every alert as opposed to waiting; the very idea of monitoring is to help you become more proactive in providing a network service. It's Nagios's job to tell you there's a problem, and yours is to fix what's needed. Whether you handle it now or in the morning, Nagios will notice once the problem is gone or gets worse.
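To put the interval advice in config terms, here is a rough sketch of a host definition that keeps your check settings and only raises the notification interval. The host name, address, contact name, and the value of 15 are made up for illustration, and I'm assuming the generic-host template and check-host-alive command from the sample config that ships with Nagios, plus the default time unit of one minute.

```cfg
# Sketch only: names and the interval value are placeholders, not your real config.
define host {
    use                     generic-host        ; assumed: template from the sample config
    host_name               example-router      ; hypothetical host name
    address                 192.0.2.1           ; documentation-range address, replace with yours
    check_command           check-host-alive    ; assumed: command from the sample config
    check_interval          1                   ; unchanged from your settings
    retry_interval          1                   ; unchanged
    max_check_attempts      2                   ; unchanged
    notification_interval   15                  ; was 1; re-notify every 15 minutes instead of every minute
    notification_options    d,r                 ; down and recovery, as you have now
    contacts                admin-email         ; hypothetical contact name
}
```

Pick whatever interval gives you enough time to log in and acknowledge before the next reminder goes out.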
Quote:
Originally Posted by AST
2. An SMS to be sent out the same time as the first host down email is sent
3. Then an SMS alert to be sent again 3 and 6 minutes after the first notification.
4. And then no more SMS sent until the host is recovered. (me or my other admin should be working on the problem after 3 SMS alerts!)
|
You can define multiple contacts for a service/host/group, including a real email account or a pager or phone. Each contact can have its own notification period and options; as far as I know the notification interval itself lives on the host or service, so giving one contact a different interval means using an escalation. Most wireless providers have an email portal, which is simpler to configure than SMS. Nagios sends my Verizon phone a short message by mailing 1234567890@vtext.com.**
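As a rough sketch, two contacts for the same person could look like this: one for regular email and one that mails the carrier's gateway. The contact names and the example.com address are placeholders, and I'm assuming the 24x7 timeperiod and the notify-host-by-email / notify-service-by-email commands from the sample config.

```cfg
# Sketch only: contact names and the email address are placeholders.
define contact {
    contact_name                    admin-email             ; hypothetical name
    alias                           Admin (email)
    email                           admin@example.com       ; placeholder address
    host_notifications_enabled      1
    service_notifications_enabled   1
    host_notification_period        24x7                    ; assumed: timeperiod from the sample config
    service_notification_period     24x7
    host_notification_options       d,r                     ; down and recovery
    service_notification_options    c,r                     ; critical and recovery
    host_notification_commands      notify-host-by-email    ; assumed: sample config command
    service_notification_commands   notify-service-by-email
}

define contact {
    contact_name                    admin-sms               ; hypothetical name
    alias                           Admin (phone via email gateway)
    email                           1234567890@vtext.com    ; the gateway address, reached by ordinary mail
    host_notifications_enabled      1
    service_notifications_enabled   1
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,r
    service_notification_options    c,r
    host_notification_commands      notify-host-by-email
    service_notification_commands   notify-service-by-email
}
```

The stock email notification text is fairly long for a phone, so you may want a separate, shorter notification command for the phone contact.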
Note that Nagios is precisely the sort of uber-cool project that can be "over-geeked." If you're not going to do anything about an alert, consider whether you really need it. The first time you receive a page at 3am when a printer goes offline is a geeky thrill, but that wears off.
:-)
Let Nagios do the boring part for you by watching services and hosts, but then resolve to react when it does. If you get a notification every few minutes, you become desensitized to the very alerts which are designed to help you fix a problem quickly, ideally before anyone but you knows about it.
Other notes: Scheduled downtime is a great way to avoid getting a ton of pages during an OS upgrade or a router replacement. The parent/child relationships are well worth a look if you have multiple switches and routers, but not really for a flat LAN. The documentation that comes with Nagios is extensive; you'll see plenty on escalation there.
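If you do go the escalation route for the "three texts, then quiet" request, an untested sketch could look like the following. It assumes a hypothetical host example-router whose normal contact is only admin-email, plus a hypothetical admin-sms contact for the phone; check it against the escalation docs before relying on it.

```cfg
# Untested sketch: host and contact names are hypothetical.
define hostescalation {
    host_name               example-router          ; hypothetical host
    contacts                admin-email,admin-sms   ; while an escalation is valid, only these contacts are notified
    first_notification      1                       ; applies from the first notification...
    last_notification       3                       ; ...through the third
    notification_interval   3                       ; 3 minutes apart: first alert, then +3 and +6
    escalation_period       24x7                    ; assumed: timeperiod from the sample config
    escalation_options      d,r                     ; down and recovery
}
```

From the fourth notification on, no escalation matches, so Nagios goes back to the host's own contacts and notification_interval. Two trade-offs to be aware of: the email contact is also on the 3-minute interval while the escalation is valid, and as I read the docs a recovery goes to whoever received the most recent problem notification, so verify that the phone still gets the recovery message after a long outage.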
There's also an Android Nagios client, but I haven't gotten to try it yet. Finally, congrats on deploying Nagios. It's a complex beast, but very powerful and completely worth it.
** my real number. Go ahead, try it.