Anyone using Nagios?

creatorrr · 01-21-2008, 06:46 PM

If so, I'd like to know something. Suppose you are monitoring a switch with, let's say, 50 interfaces (monitoring the interfaces as well as the switch itself). What happens if the switch is shut down for a while? Would you get only one notification that the switch is down, or 51 notifications: one that the switch is down plus one for each interface?

Thank you

JimBass · 01-22-2008, 07:25 AM

It depends on how you have Nagios set up. It could go either way. The way I configure it, I would get a "host down" for the switch itself and then "host unreachable" for all the interfaces. I also set the switch itself as the parent of the interfaces. However, you could define each interface as an individual host, in which case you'd get 51 down messages; or, if you disable the unreachable notifications or play with the dependencies.cfg file, you could get just one "down" for the switch itself.
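A minimal sketch of the parent setup described above, assuming each interface is defined as its own host. The switch name and the 10.10.1.254 address come from later in this thread; the host name Sydswcore01-Gi2-1 is hypothetical, and the generic-switch template is assumed to supply the remaining required directives. The notification_options line is one way to "disable the unreachable messages" for that host.

```
# Interface defined as its own host, with the switch as its parent.
# When Sydswcore01 is down, Nagios marks this host UNREACHABLE rather than DOWN.
define host{
    use                   generic-switch
    host_name             Sydswcore01-Gi2-1   ; hypothetical name for the Gi2/1 interface
    alias                 Sydney Core Switch 01 Gi2/1
    address               10.10.1.254
    parents               Sydswcore01
    notification_options  d,r                 ; down and recovery only, no unreachable (u)
    }
```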

You'd have to post at least your hosts.cfg and services.cfg files for me to answer for sure, and probably dependencies.cfg as well.

Like most Linux apps, Nagios can do damn near anything; it's all in how you set it up.

Peace,
JimBass

creatorrr · 01-22-2008, 03:56 PM

Thanks for your reply,

I have set up the switch as a host, using its management VLAN IP as the address, and each interface as a service, e.g.:

define host{
    use          generic-switch
    host_name    Sydswcore01
    alias        Sydney Core Switch 01
    address      10.10.1.253
    hostgroups   switches
    }

define service{
    use                  generic-service
    host_name            Sydswcore01
    service_description  Gi2/1 IP:10.10.1.254 WAN-01 Link Status
    check_command        check_snmp!-C public -o ifOperStatus.1 -r 1 -m RFC1213-MIB
    }

I couldn't find the files you mentioned above. I can't remember seeing them before; my version is probably different. I have files like switch.cfg, windows.cfg, commands.cfg, etc.

I guess I could test it and see what happens, but I wanted to avoid getting 50 emails for the same device...

JimBass · 01-23-2008, 08:44 AM

Yeah, I compile Nagios from source, and the files I mentioned get installed if you go through the full install. I have never worked with a packaged version of Nagios. Test it and see. Block the Nagios box's ability to reach the switch's VLAN address, or simply change the address to something else (i.e., tell Nagios the switch lives at 172.16.0.235 instead of 10.10.1.253, assuming nothing actually uses 172.16.0.235). If you start getting emails, shut down Nagios, change the configs back, and restart.
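A minimal sketch of the address-swap test, based on the host definition creatorrr posted. 172.16.0.235 is the placeholder JimBass suggests and must be an address nothing on your network answers on. Edit the existing definition in place rather than adding a second one with the same host_name.

```
# Temporary test version of the switch definition.
# Only "address" differs from the real one; put 10.10.1.253 back after the test.
define host{
    use          generic-switch
    host_name    Sydswcore01
    alias        Sydney Core Switch 01
    address      172.16.0.235   ; unused address so host checks fail (real: 10.10.1.253)
    hostgroups   switches
    }
```

Whether the interface services fail as well depends on your check_snmp command definition using $HOSTADDRESS$; that is an assumption worth checking in your own commands.cfg.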

Peace,
JimBass

lord-fu · 01-23-2008, 09:31 AM

Nagios is amazing and takes some time to learn completely, but "playing" with it is what will get you there.
I say mess around and get the 50 emails, as JimBass suggested. In one location we monitor over 500 hosts and over 1000 services (yes, the 3D graph for this is amazing :] ). We receive maybe 10-15 notifications a week.

http://www.nagios.org/faqs/viewfaq.php?faq_id=145

Good luck, Nagios is w00t!!

JimBass · 01-23-2008, 09:47 AM

Man, I laughed when you worried about 50 emails. I have 8 Nagios installs, each monitoring between 20 and 110+ hosts, with multiple services on most hosts. I probably average 100-150 emails a day, and that's assuming nothing actually goes "hard down". I do have mine set up so that when a front host goes down (like a firewall or router), we get "host unreachable" messages for all the hosts behind it.

But yeah, Nagios is absolutely great. I use it to keep an eye on everything, and with the nice web interface, my boss can look at a webpage and see exactly where a problem is and what needs to be fixed. Unquestionably a great piece of work.

Peace,
JimBass

creatorrr · 01-23-2008, 05:41 PM

I agree. I started using Nagios a few months ago and I have to say I am more and more satisfied. I guess the best way to learn what Nagios can do is to test it. I also want to avoid whinging from the other sysadmins about how many emails they're getting from Nagios.

creatorrr · 01-23-2008, 06:08 PM

By the way, how many devices/services is it recommended to monitor with one Nagios instance? For example, we have 8 sites with a total of 1500 users. Would one Nagios instance be capable of monitoring all this?

JimBass · 01-23-2008, 06:21 PM

Yeah, one instance of Nagios can easily do that. I don't know what you'd be monitoring in relation to users, and I don't think the user count matters anyway; Nagios can handle it easily. You might want to install it at different locations just for IP reachability. All of the things I monitor are on private IPs, so I need one server on each of those LANs. If everything you're monitoring can be reached over the network from one place, then one machine can easily handle it.

Regarding your testing worries, define a new contact group with yourself as the only member, and assign it to the switch and its services. Then when you fail the switch, only you'll get the notifications, not all the sysadmins.
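A minimal sketch of that test-only contact group. "yourname" is a hypothetical placeholder for an existing contact definition, and nagios-test is a hypothetical group name. Services carry their own contact_groups setting, so the interface services on the switch need the same override as the host.

```
# Test-only contact group; "yourname" stands for your existing contact_name
define contactgroup{
    contactgroup_name  nagios-test
    alias              Nagios test notifications
    members            yourname
    }

# While testing, point the switch and its interface services at the test group only
define host{
    use             generic-switch
    host_name       Sydswcore01
    alias           Sydney Core Switch 01
    address         10.10.1.253
    hostgroups      switches
    contact_groups  nagios-test
    }

define service{
    use                  generic-service
    host_name            Sydswcore01
    service_description  Gi2/1 IP:10.10.1.254 WAN-01 Link Status
    check_command        check_snmp!-C public -o ifOperStatus.1 -r 1 -m RFC1213-MIB
    contact_groups       nagios-test
    }
```

Once the test is done, restore the original contact_groups (or remove the override so the templates apply again).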

Peace,
JimBass

trickykid · 01-23-2008, 06:28 PM

Quote:

Originally Posted by creatorrr

By the way, how many devices/services is it recommended to monitor with one Nagios instance? For example, we have 8 sites with a total of 1500 users. Would one Nagios instance be capable of monitoring all this?

Don't leave out OpenNMS, though. Where I work, we're currently dropping Nagios for OpenNMS. Don't get me wrong, I like Nagios, but OpenNMS is way better, especially when upper management likes pretty graphs. It's like Nagios and Cacti rolled into one application, and it's smarter too, with autodiscovery.

OpenNMS will also notify you of only one outage when other things depend on the device that failed. Say a router or switch goes down and you monitor ports on that switch: it won't page you for every single port that is down.

creatorrr · 01-23-2008, 08:58 PM

Hmmm, I've spent far too much time with Nagios and Cacti to swap them for other monitoring software. I will have a look, though...

Anyway, these two do everything we need so far...

creatorrr · 01-23-2008, 09:02 PM

Quote:

Originally Posted by JimBass

Regarding your testing worries, define a new contact group with yourself as the only member, and assign it to the switch and its services. Then when you fail the switch, only you'll get the notifications, not all the sysadmins.

Yeah, I can do that.

Another question: what practice do you follow on your network when monitoring devices/services? For example, for the more sensitive devices I am configuring notifications to go out the first time the device is unreachable, whereas the default is 3 attempts. I am wondering how other people set this up on their networks...

JimBass · 01-23-2008, 09:44 PM

I mostly monitor IP cameras over wireless networks, so I usually let a check fail 10 times before a notification is sent. I think you'll drive yourself crazy with a first-miss message. Even something working well will miss from time to time.

Take, for example, two computers right next to each other, plugged into a hub. Say you ping the second machine from the first all night long. When you check it in the morning, you'll find there were a few missed pings. It might have missed only 10 out of 100,000, but it won't be exactly 100%.

I wouldn't go lower than 3 failures, but it's your system. Set it up as you want and see if it works. If you get "false downs" because notifications go out after too few failures, increase the number of failures required before you get notified. If it is something like a production e-commerce server, then you might well want a notice at the first failure.
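The number of failed checks before a notification is controlled per host (and per service) by max_check_attempts. A minimal sketch of the two extremes discussed above; the host names and 192.0.2.x addresses are hypothetical placeholders, and generic-host is assumed to be a template that supplies the other required directives (check command, periods, contacts).

```
# IP camera on a flaky wireless link: allow up to 10 failed checks
# before the host enters a HARD down state and a notification goes out
define host{
    use                 generic-host   ; assumed template
    host_name           camera01       ; hypothetical
    alias               Wireless IP camera 01
    address             192.0.2.10     ; placeholder address
    max_check_attempts  10
    }

# Production e-commerce server: the first failure is already a HARD state
define host{
    use                 generic-host
    host_name           shop01         ; hypothetical
    alias               E-commerce server
    address             192.0.2.20
    max_check_attempts  1
    }
```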

Peace,
JimBass

creatorrr · 01-23-2008, 10:15 PM

Yeah, I think you are right. OK, one last question. I can't get monitoring working for, say, the D or E drive on a Windows machine. I have this for the C drive:

check_command check_nt!USEDDISKSPACE!-l c -w 70 -c 80

I tried this for D, but it doesn't work:

check_command check_nt!USEDDISKSPACE!-l d -w 70 -c 80

Any suggestions?

creatorrr · 01-23-2008, 10:27 PM

Quote:

Originally Posted by creatorrr

I can't get monitoring working for, say, the D or E drive on a Windows machine. I have this for the C drive:

check_command check_nt!USEDDISKSPACE!-l c -w 70 -c 80

I tried this for D, but it doesn't work:

check_command check_nt!USEDDISKSPACE!-l d -w 70 -c 80

It didn't work because I'm an idiot: the D drive was the CD-ROM. I changed it to E and it's fine now...
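For reference, a full service definition for the working E drive check might look like the sketch below. The check_command is the one from this thread with the drive letter changed to e; winserver01 is a hypothetical host name, and generic-service is the template creatorrr already uses.

```
# Used-space check on the E: drive via check_nt (talks to NSClient on the Windows box).
# Warning at 70% used, critical at 80% used, same thresholds as the C: check.
define service{
    use                  generic-service
    host_name            winserver01   ; hypothetical Windows host
    service_description  E Drive Space
    check_command        check_nt!USEDDISKSPACE!-l e -w 70 -c 80
    }
```

Pointing -l at a removable or optical drive letter, as happened with D: here, will not give a useful disk-space result, so check which letters are fixed disks first.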