LinuxQuestions.org

buee 04-04-2012 09:00 PM

Nagios sending alerts for levels that are too low
 
I have Nagios set up to monitor my network and numerous servers. It runs on a Debian box, and I'm having an issue monitoring an Ubuntu box via NRPE. Every check on this machine works except the total processes check. I've set it up to warn at 200 and go critical at 250. I've even pushed it up to 300/400, but I still get e-mails at 160 processes, etc. I've restarted both machines, the NRPE service, and the Nagios service. None of it matters; it still alerts at a level that's simply too low.

Here is the Nagios definition
Code:

define service {
        use                    generic-service
        host_name              File Server
        service_description    Total Processes
        check_command          check_nrpe_procs
        }

Here is the command call for "check_nrpe_procs"
Code:

# NRPE Total Procs
define command{
        command_name    check_nrpe_procs
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c "check_total_procs"
        }

Here is the NRPE definition
Code:

command[check_total_procs]=/usr/lib64/nagios/plugins/check_procs -w 300 -c 350
It defies logic. Has anyone else run into this? Better yet, does anyone have a fix for it?

MensaWater 04-06-2012 07:49 AM

It seems to work for me from the command line.

What happens on the NRPE client if you run the following from the command line (that is, bypassing the Nagios master for a moment)?

/usr/lib64/nagios/plugins/check_procs -w 300 -c 350

When you get the warning on the Nagios master, what exactly does the full line show?

On the Nagios master, if you run this from the command line:

check_nrpe -H <hostname> -c "check_total_procs"

do you get the same response as when you run it as:

check_nrpe -H <ip address> -c "check_total_procs"

That is, run it from the command line and specify the actual host name in the first invocation and the host's IP address in the second. I've seen many an issue in Nagios caused by hosts.cfg having the wrong IP.

By breaking your testing down into levels you can determine where it is breaking. The first test takes the Nagios master software and the NRPE configuration out of the picture. The latter two tests check that check_nrpe can reach the NRPE client, so they rule out the rest of the Nagios master setup. If it works at one level but not at another, you can focus your efforts on the level where it fails.
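
A rough sketch of how each level could be run and compared, assuming a POSIX shell. The helper names nagios_state and run_layer are made up for illustration. The 0/1/2/3 exit codes are the standard plugin return codes that Nagios acts on, so printing the exit status next to the text shows what Nagios would actually see at each level.

```shell
#!/bin/sh
# Map a plugin exit status to its Nagios state name.
nagios_state() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "OUT-OF-RANGE" ;;
    esac
}

# Run one test level and print its label, state, exit status and output.
# Usage: run_layer LABEL COMMAND [ARGS...]
run_layer() {
    label=$1
    shift
    # Capture output and exit status without aborting under "set -e".
    output=$("$@" 2>&1) && status=0 || status=$?
    printf '%s: %s (exit %s) %s\n' \
        "$label" "$(nagios_state "$status")" "$status" "$output"
}

# Example calls (paths and address taken from this thread):
#   on the NRPE client:
#     run_layer "local plugin" /usr/lib64/nagios/plugins/check_procs -w 300 -c 350
#   on the Nagios master:
#     run_layer "nrpe by IP" /usr/lib/nagios/plugins/check_nrpe -H 192.168.168.3 -c check_total_procs
```

If the local plugin line and the check_nrpe line disagree on state or process count, the problem is likely in the NRPE layer rather than in the plugin itself.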

buee 04-07-2012 02:33 PM


One problem is that it isn't constant. The alert usually comes through later in the evening and is followed 5-10 minutes later by the recovery alert. Running check_procs on the local machine comes back OK, but the process count is below the 150 mark at the moment anyway, so that result doesn't tell me much.

Code:

root@fileserver:~# /usr/lib64/nagios/plugins/check_procs -w 300 -c 350
PROCS OK: 141 processes
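
Since the spike only lasts a few minutes, one option might be to log the process count on the file server whenever it crosses a threshold, then read the log after the next alert. This is only a sketch under assumptions: it is Linux-specific (it counts the numeric directories under /proc and reads /proc/PID/comm), the function names count_procs and log_if_over are made up, and the count may differ slightly from what check_procs reports.

```shell
#!/bin/sh
# Count running processes by counting numeric entries under /proc (Linux only).
count_procs() {
    n=0
    for d in /proc/[0-9]*; do
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Append a timestamped entry to LOGFILE when the count is at or above THRESHOLD.
# Usage: log_if_over THRESHOLD LOGFILE
log_if_over() {
    threshold=$1
    logfile=$2
    current=$(count_procs)
    if [ "$current" -ge "$threshold" ]; then
        {
            printf '%s %s processes\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$current"
            # Ten most common command names, to show what is spawning.
            cat /proc/[0-9]*/comm 2>/dev/null | sort | uniq -c | sort -rn | head -n 10
        } >> "$logfile"
    fi
}

# Example: check once a minute for an evening (the log path is a placeholder):
#   while true; do log_if_over 150 /tmp/proc_spike.log; sleep 60; done
```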

If, from the master, I run the command
Code:

check_nrpe -H fileserver -c "check_total_procs"

-OR-

check_nrpe -H "File Server" -c "check_total_procs"

The host is configured as File Server with an alias of fileserver. Running it by host name brings back invalid host; running it by IP address returns:

Code:

root@monitor:~# /usr/lib/nagios/plugins/check_nrpe -H 192.168.168.3 -c "check_total_procs"
PROCS OK: 136 processes

Unless you know of some way to temporarily create artificial processes, which I'd be open to trying, I have a ~5 minute window of opportunity to figure it out.
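
For what it's worth, a minimal sketch of one way to create artificial processes on demand, assuming a POSIX shell: start a batch of idle sleep processes, record their PIDs, and kill them once the check has been observed. The function names and the PID file path are hypothetical.

```shell
#!/bin/sh
# Start COUNT idle "sleep SECONDS" processes and record their PIDs in PIDFILE.
# Usage: spawn_idle_procs COUNT SECONDS PIDFILE
spawn_idle_procs() {
    count=$1
    seconds=$2
    pidfile=$3
    : > "$pidfile"
    i=0
    while [ "$i" -lt "$count" ]; do
        sleep "$seconds" >/dev/null 2>&1 &
        echo "$!" >> "$pidfile"
        i=$((i + 1))
    done
}

# Kill every process recorded in PIDFILE, then remove the file.
# Usage: kill_idle_procs PIDFILE
kill_idle_procs() {
    pidfile=$1
    while read -r pid; do
        kill "$pid" 2>/dev/null || true
    done < "$pidfile"
    rm -f "$pidfile"
}

# Example: push the count past the thresholds for up to ten minutes:
#   spawn_idle_procs 200 600 /tmp/idle.pids
#   /usr/lib64/nagios/plugins/check_procs -w 300 -c 350
#   kill_idle_procs /tmp/idle.pids
```

Each sleep counts as one process, so the count in the example call decides how far above the thresholds the total goes.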

buee 04-07-2012 10:11 PM

Well, I dropped the '64' off lib64, which I assume loads the 32-bit plugin, and voila! It works. I tested it by starting a few scripts and hitting Ctrl+Z on all of them. It went critical at 200+. Then I killed a few of those spawns to get it under 200 but above 150, and it went to warning. Then I killed the rest to get back to normal operation, and I got the recovery. Must be a bug or something in the 64-bit version of the plugin?

