[SOLVED] Nagios/NRPE

jrb328 · 05-21-2012, 01:26 PM

Hello,

I am running a few self-made plugins through NRPE, which is configured to use xinetd to communicate with remote servers. The first few executions of a plugin return the output I expect, but subsequent executions return CHECK_NRPE: Socket timeout after 10 seconds.

I am attempting to use the localhost IP address (127.0.0.1) as the IP for several hosts, each of which would display performance data on Nagiosgraph with rrd files located in the corresponding hosts' rrddir. I can only execute the plugins by passing 127.0.0.1 as the -H argument value (as opposed to using the specific hostname). This is the method of execution which leads to intermittent failures.

Thanks

ratotopi · 05-21-2012, 03:49 PM

Are all of your servers that run NRPE also running Nagios? If not, you will have to configure NRPE to accept the IP address of at least one server that is running Nagios.

jrb328 · 05-24-2012, 09:46 AM

Hey,

Thank you for your response. I am running Nagios 3.3.1 and NRPE 2.13 on my main server, which I will refer to as A1. I have modified nrpe.cfg to let A1 recognize itself as an allowed_host. Here is where my setup differs from a typical case: the plugins I have written use a Perl module to retrieve device statistics from remote servers which do not use NRPE at all. Rather, I use A1 as a hub which locally executes the plugins and thus asks the remote server to send back its statistics, which are then formatted and returned to Nagios.

Another detail worth noting is that since I am using Nagiosgraph and rrdtool to graphically represent my performance data in the web interface, I needed to create a host object for each device which A1 communicates with even though the -H flag of check_nrpe is always 127.0.0.1 (localhost) for these plugins to use the scripts on A1. I have set the IP addresses for each of these hosts as 127.0.0.1 since the services correspond to one of these hosts and the rrd files are thus written to the appropriate host's rrddir.

As I said before, I am using xinetd with NRPE. I have tried changing the definition of check_nrpe with the inclusion of the -t option in commands.cfg and have also changed the value of command_timeout in nrpe.cfg - each to no avail. Is this because -t and command_timeout are only processed if the NRPE daemon is used?
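A minimal sketch of how the two timeout settings mentioned above could be compared, assuming nrpe.cfg uses plain key=value lines. The sample text, the function name read_nrpe_settings, and the fallback of 60 seconds when command_timeout is absent are assumptions for illustration, not values taken from A1.

```python
def read_nrpe_settings(cfg_text):
    """Return the key=value directives from nrpe.cfg-style text as a dict."""
    settings = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Hypothetical excerpt; the real file lives wherever NRPE was installed.
SAMPLE_CFG = """
# nrpe.cfg (excerpt)
allowed_hosts=127.0.0.1
command_timeout=120
"""

settings = read_nrpe_settings(SAMPLE_CFG)
# Assumption: fall back to 60 seconds if the directive is missing.
server_timeout = int(settings.get("command_timeout", 60))
# Value passed to check_nrpe with -t (10 seconds when -t is omitted).
client_timeout = 10

if client_timeout < server_timeout:
    print(f"client gives up after {client_timeout}s; "
          f"server would allow {server_timeout}s")
```

The point of the comparison is that the client-side -t value and the server-side command_timeout are separate limits, so raising only one of them can still leave the shorter one in effect.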

I have also debugged the plugins and checked the logs to find any clues. As expected, the xinetd entry for a failed execution of the plugin contains nrpe signal=13 and its duration exceeds ten seconds. In successful tries, however, nrpe exits with status=0 and the duration is generally 20 seconds or less. This is what confused me: my timeout settings are well over a minute, but NRPE seems to perceive the threshold as about 20-30 seconds. Here are example log entries:

May 24 14:31:35 A1 xinetd[12471]: START: nrpe pid=28120 from=127.0.0.1
May 24 14:31:35 A1 xinetd[12471]: EXIT: nrpe status=0 pid=27640 duration=20(sec)
May 24 14:31:36 A1 xinetd[12471]: EXIT: nrpe status=0 pid=27662 duration=20(sec)
May 24 14:31:41 A1 xinetd[12471]: EXIT: nrpe status=0 pid=27709 duration=14(sec)
May 24 14:31:42 A1 xinetd[12471]: START: nrpe pid=28373 from=127.0.0.1
May 24 14:31:49 A1 xinetd[12471]: EXIT: nrpe status=0 pid=28373 duration=7(sec)
May 24 14:32:36 A1 xinetd[12471]: EXIT: nrpe signal=13 pid=28120 duration=61(sec)
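A rough sketch for picking the failed runs out of log entries like these, assuming the xinetd EXIT line format shown above. The names EXIT_RE, parse_exits and killed_by_signal are made up for this example. On Linux, signal 13 is SIGPIPE, which usually means the process wrote to a connection whose other end had already closed.

```python
import re

# Matches xinetd EXIT lines in the format shown in the log excerpt.
EXIT_RE = re.compile(
    r"EXIT: (?P<service>\S+) (?P<kind>status|signal)=(?P<code>\d+) "
    r"pid=(?P<pid>\d+) duration=(?P<secs>\d+)\(sec\)"
)


def parse_exits(log_text):
    """Return one dict per EXIT line found in the log text."""
    exits = []
    for line in log_text.splitlines():
        m = EXIT_RE.search(line)
        if m:
            exits.append({
                "pid": int(m.group("pid")),
                "kind": m.group("kind"),
                "code": int(m.group("code")),
                "secs": int(m.group("secs")),
            })
    return exits


def killed_by_signal(exits):
    """Keep only the runs that ended on a signal rather than an exit status."""
    return [e for e in exits if e["kind"] == "signal"]


LOG = """\
May 24 14:31:35 A1 xinetd[12471]: START: nrpe pid=28120 from=127.0.0.1
May 24 14:31:35 A1 xinetd[12471]: EXIT: nrpe status=0 pid=27640 duration=20(sec)
May 24 14:31:36 A1 xinetd[12471]: EXIT: nrpe status=0 pid=27662 duration=20(sec)
May 24 14:31:41 A1 xinetd[12471]: EXIT: nrpe status=0 pid=27709 duration=14(sec)
May 24 14:31:42 A1 xinetd[12471]: START: nrpe pid=28373 from=127.0.0.1
May 24 14:31:49 A1 xinetd[12471]: EXIT: nrpe status=0 pid=28373 duration=7(sec)
May 24 14:32:36 A1 xinetd[12471]: EXIT: nrpe signal=13 pid=28120 duration=61(sec)
"""

exits = parse_exits(LOG)
for entry in killed_by_signal(exits):
    print(f"pid {entry['pid']} ended on signal {entry['code']} "
          f"after {entry['secs']}s")
```
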

This has to be an issue with NRPE not accepting my timeout values, right? If I attempt to execute a series of service checks manually, the first few may return valid output and perfdata, but some of these checks return the socket timeout, leaving gaps in my graphs and discontinuities in my data. The output of the plugins themselves is as follows:

[root@A1 libexec]# ./check_nrpe -H 127.0.0.1 -c vperf_mhz_a; ./check_nrpe -H 127.0.0.1 -c vperf_disk_a; ./check_nrpe -H 127.0.0.1 -c vperf_pct_a; ./check_nrpe -H 127.0.0.1 -c vperf_sys_a

OK - 135 MHz; |mhz=102;109;95;173;144;127;126;167;148;167;
OK - 5 disk; |disk=2;2;22;2;2;2;2;5;9;3;
OK - 60%; |pct=46;40;74;61;54;54;71;63;71;75;
CHECK_NRPE: Socket timeout after 10 seconds.

Lastly, I want to note that I have set the correct file permissions so that nagios.nagios can access the plugins and the directories involved in their execution. Thank you in advance for any help you can offer.

-Jeff

jrb328 · 05-30-2012, 08:57 AM

I figured out the issue here. The change to the timeout variable did take effect, but the check_nrpe command had to change as well. The -t 120 flag that I added to the command definition only applies to checks which Nagios itself runs; for a manually run check to use the timeout value which was set, the -t 120 flag must be passed on the command line every time.

In commands.cfg, check_nrpe was changed to:

define command{
    command_name check_nrpe
    command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -t 120 -c $ARG1$
}

If I run ./check_nrpe -c *command_name*, the timeout defaults to 10 seconds.

However, if I run ./check_nrpe -t 120 -c *command_name* the appropriate timeout value is used, which was set in nrpe.cfg as:

command_timeout=120
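To avoid forgetting the flag on manual runs, something like the following sketch could build the command line with -t always included. The helper names check_nrpe_argv and run_check are hypothetical, the ./check_nrpe path assumes the script is started from the libexec directory, and only the -H, -t and -c flags already used in this thread appear.

```python
import subprocess

# Assumption: run from the libexec directory, as in the manual tests above.
CHECK_NRPE = "./check_nrpe"


def check_nrpe_argv(command, host="127.0.0.1", timeout=120):
    """Build a check_nrpe argument list that always carries -t."""
    return [CHECK_NRPE, "-H", host, "-t", str(timeout), "-c", command]


def run_check(command, **kwargs):
    """Run one check and return (exit code, plugin output)."""
    argv = check_nrpe_argv(command, **kwargs)
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


if __name__ == "__main__":
    # Print the command lines only; call run_check() on a host where
    # check_nrpe is actually installed.
    for name in ("vperf_mhz_a", "vperf_disk_a", "vperf_pct_a", "vperf_sys_a"):
        print(" ".join(check_nrpe_argv(name)))
```

Keeping the timeout as a default argument means a manual run and the Nagios command definition use the same value unless it is overridden on purpose.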

-Jeff