Nagios

nitin-saxena · 10-25-2006, 08:11 AM

Hi,

I have configured Nagios with the NRPE plugin to check remote hosts.
It is configured and working fine for the common plugins:

NAGIOS SERVER libexec $ /usr/local/nagios/libexec/check_nrpe -H remotehost.net
NRPE v2.5.1

NAGIOS SERVER libexec $ /usr/local/nagios/libexec/check_nrpe -H remotehost.net -c check_load
OK - load average: 0.01, 0.00, 0.00|load1=0.010;15.000;30.000;0; load5=0.000;10.000;25.000;0; load15=0.000;5.000;20.000;0;

I have written a plugin (in bash).

Now, when I check that plugin from the Nagios server, it gives me the old values:

NAGIOS SERVER libexec $ /usr/local/nagios/libexec/check_nrpe -H remotehost.net -c alarm
ALARM OK Errors = 0 Warns = 0

But when I run that plugin manually on the remote host and then try the same check again from the Nagios server,
it gives me the correct values:

REMOTE libexec $ /usr/local/nagios/libexec/alarm.sh
ALARM WARNING Errors = 5 Warns = 2

NAGIOS SERVER libexec $ /usr/local/nagios/libexec/check_nrpe -H remotehost.net -c alarm
ALARM WARNING Errors = 5 Warns = 2

Please suggest what could be wrong.

MensaWater · 10-27-2006, 01:50 PM

Nagios/NRPE determine the state from the exit status of the command rather than from the literal text. That is to say, if the shell script completes successfully it will have an exit status of 0 (successful) even though you echo the word "ALARM".

You have to build your script so it gives the appropriate exit status AND the text you want to see. Another gotcha is to be sure it only returns ONE line of text.

Typically you go ahead and define the various statuses as variables then return them.

A short example script from one of my servers:

Code:

#!/bin/ksh
#
# This script is used to check running tomcat processes.  Written for NAGIOS.
# cnt returns the number of processes.
#
# USAGE: check_tomcat.sh
#
# 01/05/03 jda - Original write of check_process.sh
# 06-Feb-2006 jlightiner - Adapted check_process.sh to check_tomcat.sh
#                          Modified ps command for cnt.
#
#


OK_STATE=0
WARNING_STATE=1
CRITICAL_STATE=2
cnt=0

cnt=`ps -fxu www |grep java |grep tomcat |grep -c catalina`

if [ $cnt -eq 0 ]
then
        echo "CRITICAL: $1 is DOWN!"
        exit $CRITICAL_STATE
fi

echo "OK: $1 is up"
exit $OK_STATE

As you can see in the script the WARNING_STATE was defined as 1 but it is never actually used in this one. In more sophisticated scripts I have used it and also do a fair amount more testing.

So it is the "exit 2" (exit $CRITICAL_STATE) that would exit with a status of 2 which Nagios recognizes as "CRITICAL". (I could have named the variable FUNNY_STATE if I'd wanted - it is the value of the exit not the name that matters.)

Similarly it is the "exit 0" (exit $OK_STATE) that would exit with a status of 0 which Nagios recognizes as "OK". (Again I could have named this variable something like ALL_IS_GOOD because it is the value rather than the name that is important.)

In fact you don't have to create the variable names at all - you can just put "exit 0", "exit 1" or "exit 2" at the appropriate places. The variables are used mainly so when you look back at the script you'll know which area of it likely caused the state you're seeing in your Nagios web page.
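
For a check that does use all three states, a minimal sketch might look like the one below. The function name report_state and the thresholds passed to it are made up for illustration; they are not taken from any real plugin.

```shell
#!/bin/sh
# Hypothetical sketch: map a count to a Nagios state.
# report_state and its thresholds are illustrative assumptions.

OK_STATE=0
WARNING_STATE=1
CRITICAL_STATE=2

# report_state COUNT WARN_AT CRIT_AT
# Prints ONE line of text and returns the matching status.
report_state() {
    count=$1
    warn_at=$2
    crit_at=$3

    if [ "$count" -ge "$crit_at" ]; then
        echo "CRITICAL: $count errors"
        return $CRITICAL_STATE
    elif [ "$count" -ge "$warn_at" ]; then
        echo "WARNING: $count errors"
        return $WARNING_STATE
    fi

    echo "OK: $count errors"
    return $OK_STATE
}
```

A plugin built on this would call report_state with its own count and thresholds and then do an "exit $?" right away, so the status returned by the function becomes the exit status of the script.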

The default exit status for successful commands is 0 and for unsuccessful ones it is nonzero (often 1). If you don't set an exit status explicitly then the status you see is the status of the final command in the script. (This is shell scripting basics - not a Nagios thing per se.)

So say you had a script that did something like:
ls -l /billybob
ls -l /suzybob
ls -l /jimmybob

If there were no /billybob or /suzybob directory on your system each of those commands would exit with a nonzero status (file not found - the exact value depends on your version of ls). However if there IS a /jimmybob on your system the exit status of that command would be 0 (successful - and it would show the file). The status of the script that ran all three lines would be 0 because that was the last status it saw even though two thirds of the commands failed.
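
You can see this for yourself with a small sketch. The function name run_three is made up, /billybob and /suzybob are assumed not to exist, and / stands in for a directory that does exist.

```shell
#!/bin/sh
# Hypothetical sketch: the status of a group of commands is the status
# of the LAST command that ran. /billybob and /suzybob are assumed to
# be missing; / stands in for a directory that is present.

run_three() {
    ls -ld /billybob 2>/dev/null   # fails, status is nonzero
    ls -ld /suzybob  2>/dev/null   # fails, status is nonzero
    ls -ld / >/dev/null            # succeeds, status is 0
}

if run_three; then
    echo "run_three returned 0 even though two commands failed"
else
    echo "run_three returned nonzero"
fi
```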

So if you want the script to fail when ANY of the directories is missing you'd have to do:
if ! ls -l /billybob
then echo it failed
exit 1
fi
if ! ls -l /suzybob
then echo It failed
exit 1
fi
if ! ls -l /jimmybob
then echo It failed
exit 1
fi
echo it succeeded
exit 0

Basically the "exit 1" says to fail with exit code 1 which means the script failed. It would never get to the next name in the list because you told it to exit at the point it failed. It would therefore only go to the "it succeeded" message if all 3 had succeeded and would then do the exit 0. (As noted above you wouldn't need the exit 0 as that would be the state of the last command anyway.)
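
If the list of directories grows, the same stop-at-the-first-failure logic could be written as a loop. This is only a sketch; the function name check_all is made up, and it uses "ls -ld" so that it lists the directory itself rather than its contents.

```shell
#!/bin/sh
# Hypothetical sketch: stop at the first path that cannot be listed.
# check_all is an illustrative name, not part of Nagios or NRPE.

check_all() {
    for dir in "$@"; do
        # -d lists the directory entry itself, not what is inside it
        if ! ls -ld "$dir" >/dev/null 2>&1; then
            echo "it failed: $dir"
            return 1
        fi
    done
    echo "it succeeded"
    return 0
}
```

A script would call it as check_all /billybob /suzybob /jimmybob and then do an "exit $?" so the result becomes the exit status of the script.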