Nagios reports host as down, services as OK
For the past week or two, Nagios has constantly been marking hosts (mainly servers, but also a few Serial-over-IP converters) as down for anything up to a few minutes. Typically, all services stay in OK status. Looking closer at a host in such a state, its status information is
CRITICAL - Packet Filtered (<IP address of host in question>)
in soft state.
Sometimes services are in critical state with the host in OK status. The status information is almost always "No route to host". Further checking shows no problems. This state rarely lasts longer than one check interval.
This started after a link went down, which correctly put all hosts and services on red for being unreachable. The link problems were solved within a few hours, but Nagios only showed this after 2 reboots. Since then the problems have gradually lessened in frequency, from 5 to 10 of the 34 hosts being reported down at any given moment (and the same for the 150-ish services monitored) to where I am now: 1 to 5 problem statuses (counting both hosts and services) and the occasional 'all green' screen.
A week ago, when the problems had already been going on for a week, Nagios was updated from 3.2.0 to 3.2.1. This brought no apparent improvement.
Still, the Host Groups screen is not stable. A red or yellow status used to signify a problem to be looked at; right now it is likely a false alarm that will go away.
Whenever a "packet filtered" or "no route to host" status is followed up with a ping test, no problems are found, not even the slightest delay or packet loss.
Nagios is telling you what it is doing, namely it is running the check-host-alive test and something
is blocking the result from being returned (that's what is meant by Packet filtered). By default the
check-host-alive test is usually a ping. I'd double check what your check-host-alive command is and
then investigate why your network is apparently not allowing responses. You may need to set up more
advanced testing to see what is happening, since Nagios only checks on a schedule and doesn't let you
see what's going on in real time.
Incidentally, it makes sense that the services would appear as okay if pings are being intermittently
dropped, provided that the other tests don't rely on ping.
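To watch what happens between Nagios's scheduled checks, something like the following could be left running on the Nagios box. It is only a rough sketch: it assumes a Linux `ping` that accepts `-c` and `-W`, and the names `ping_once`, `loss_summary` and `watch` are made up for this example and are not part of Nagios.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Send one ICMP echo request; return (ok, combined ping output)."""
    # Assumption: Linux iputils ping, where -c is the packet count and
    # -W is the reply timeout in seconds.
    proc = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def loss_summary(samples):
    """Turn a list of (host, ok) tuples into {host: fraction of failed probes}."""
    sent = {}
    failed = {}
    for host, ok in samples:
        sent[host] = sent.get(host, 0) + 1
        if not ok:
            failed[host] = failed.get(host, 0) + 1
    return {host: failed.get(host, 0) / sent[host] for host in sent}


def watch(hosts, rounds, interval_s=1.0):
    """Probe every host once per round and print each failure with a timestamp."""
    samples = []
    for _ in range(rounds):
        for host in hosts:
            ok, output = ping_once(host)
            samples.append((host, ok))
            if not ok:
                # Keep the raw ping output so the reason for the failure
                # can be compared with what Nagios reports.
                print(datetime.now().isoformat(timespec="seconds"), host, "FAILED")
                print(output.strip())
        time.sleep(interval_s)
    return loss_summary(samples)
```

Calling `watch(["192.0.2.10"], rounds=600)` (the address is a placeholder) would probe once a second for about ten minutes; the timestamps of any failures can then be lined up against the times Nagios flagged the host.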
Sounds to me like there is a problem with the Nagios plugins reaching the host(s) they are to check. Could there be a routing issue, perhaps a bad switch port, maybe even a bad NIC or cable? I bring this up as a possibility because you said the trouble started after a link went down.
One other possible problem comes to mind. A high load of network traffic or CPU usage on the Nagios box could cause packets to be received too late to count, or not at all. Check the server load and the network load on both the Nagios server AND the server that is reported down.
If this is a fairly frequent problem, creative use of tcpdump and grep might help locate it.
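As one possible take on the tcpdump-and-grep idea, the sketch below counts which device is sending ICMP "unreachable" replies, since that is the kind of reply that ends up as "Packet filtered" in ping output. The line format in the comment is an assumption about what `tcpdump -n` prints for ICMP and may differ between versions, and `unreachable_senders` is a name invented for this example.

```python
import re
from collections import Counter

# Assumed shape of a tcpdump ICMP "unreachable" line (illustrative only):
#   12:00:01.000000 IP 10.0.0.1 > 10.0.0.5: ICMP host 10.0.0.9 unreachable - admin prohibited filter, length 36
UNREACHABLE = re.compile(r"IP (\S+) > (\S+): ICMP (.*unreachable[^,]*)")


def unreachable_senders(lines):
    """Count ICMP unreachable messages per (sender address, reason text)."""
    counts = Counter()
    for line in lines:
        match = UNREACHABLE.search(line)
        if match:
            sender, _dest, reason = match.groups()
            counts[(sender, reason.strip())] += 1
    return counts
```

Feeding it the output of `tcpdump -n -l icmp` captured on the Nagios box (for example by passing `sys.stdin` as `lines`) would show whether the rejections come from one router or firewall or from many different devices.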
Nagios schedules checks for services and hosts based on your configuration. This is why a host is sometimes marked down while its services are still marked as up (host checks are just pings).
I have a Nagios Server that runs on a box that also does spam filtering. If I get slammed with mail, I will have these issues.
I made an estimate. 181 hosts and services are monitored and the mentioned statuses are now at an average of 4. There is no host or service being constantly blocked. There is no particular host or service singled out. Having said that, checking the check-host-alive command might give a clue or at the very least I should see what it does.
Initially, after the down-and-up of the WAN link, Nagios kept reporting all hosts and services down. After one reboot the next morning a few came back again; a second reboot that afternoon brought all back, but with 30% dropout. Since then I have done a few more reboots, but they had no immediate effect.
never say never -
The issue might indeed be on the box. The problems occur on hosts on the same subnet, on different subnets within the LAN and on subnets reached through a WAN connection. Also, hosts are device servers, Windows and Linux servers. In the case of the device servers (Quatech ESE-100D), there is no client and only ping and uptime are monitored. The problem is almost certainly just in the traffic between the Nagios box and the clients.
I can't imagine how the sequence of events would have triggered this but there might be something coincidental. I will check the general state of the Nagios box.
I have been seeing similar things regarding a couple of hosts that report to be down but as soon as I see this and check them by trying to manually ping or ssh to them they appear fine. The cpu load is not high and I don't see a lot of traffic.
I was just wondering if you ever came to any conclusions that you could pass on regarding your situation of hosts down but are actually up.
eblonk, are you still seeing the problem? I had forgotten about this
thread, and didn't suggest the next step. See if powercycling the
hubs and/or switches makes any difference. It is possible for some
switches to start acting weird after being on for a long time, especially
if certain network conditions exist.
I came across this post while googling my problem.
I have Nagios 3 installed on CentOS 5, and Nagios shows hosts as down while they are up!
I believe that everything is fine with my configuration. I'm monitoring 3 hosts; 1 of them is on the same network as Nagios, and Nagios can ping it. The others are monitored by Nagios through a DLink router (dir-100) which acts as their parent, so Nagios shows the router (parent) as alive while the hosts behind it are shown as down although they are up!
Would you help me, please?
A long time after this post but I want to finish it properly. I left that place where I had that problem not long after my last post, so it is out of my hands.
I'm bringing this post back up as I'm having the same issue; I found it via Google.
Using tcpdump, I can see that my host (CentOS with Apache and MySQL) is getting packets from Nagios and responding to them, but Nagios is showing it as down. I've reset my CentOS, set everything to default in httpd.config and reconfigured it, but have had no luck. Nagios does not respond to my ping requests, but other hosts on my network do. My guess is that Nagios isn't set to receive my responses or there's an issue with the routing.
Network from outside to in:
Switch --- Vyatta (off) --- NAT --- virtual OS (CentOS and others). Nagios is on a server that is on another Ethernet port on the switch.
I have the same issue.
I have servers on Amazon and Rackspace. My monitoring server is on Rackspace. The servers on Rackspace are fine, but Nagios says that the servers on Amazon are down even though the services are up. I now realise that Amazon disables ping, or maybe disables it from servers outside the Amazon network. That's the reason why Nagios says my Amazon servers are down.
Hope this helps.
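Where ping is blocked like this, one workaround is to treat a host as up if a known TCP port accepts connections. The following is a minimal sketch of that idea, not how Nagios does it internally; `tcp_alive` is a made-up name, and which port to probe is an assumption that depends on what the host actually runs.

```python
import socket


def tcp_alive(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and "No route to host".
        return False
```

For example, `tcp_alive("192.0.2.20", 22)` (placeholder address) would report a host as up whenever its SSH port answers, regardless of whether ICMP is filtered. Within Nagios, the stock `check_tcp` plugin can play a similar role as a host check command.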