LinuxQuestions.org - [SOLVED] Nagios/NRPE

Hey,

Thank you for your response. I am running Nagios 3.3.1 and NRPE 2.13 on my main server, which I will refer to as A1. I have modified nrpe.cfg so that A1 lists itself as an allowed_host. Here is where my setup differs from a typical case: the plugins I have written use a Perl module to retrieve device statistics from remote servers which do not run NRPE at all. Instead, I use A1 as a hub which executes the plugins locally; each plugin asks the remote server to send back its statistics, which are then formatted and returned to Nagios.

Another detail worth noting: since I am using Nagiosgraph and rrdtool to graph my performance data in the web interface, I needed to create a host object for each device that A1 communicates with, even though the -H flag of check_nrpe is always 127.0.0.1 (localhost) so that these plugins run the scripts on A1. I have set the IP address of each of these hosts to 127.0.0.1. Each service is attached to one of these hosts, so the rrd files are written to the appropriate host's rrddir.

As I said before, I am using xinetd with NRPE. I have tried changing the definition of check_nrpe with the inclusion of the -t option in commands.cfg and have also changed the value of command_timeout in nrpe.cfg - each to no avail. Is this because -t and command_timeout are only processed if the NRPE daemon is used?

I have also debugged the plugins and checked the logs for clues. As expected, the xinetd entry for a failed execution of the plugin contains nrpe signal=13 and its duration exceeds ten seconds. Successful runs, however, show nrpe status=0 and generally last 20 seconds or less. This is what confuses me: my timeout settings are well over a minute, but NRPE seems to treat the threshold as about 20-30 seconds. Here are example log entries:

May 24 14:31:35 A1 xinetd[12471]: START: nrpe pid=28120 from=127.0.0.1
May 24 14:31:35 A1 xinetd[12471]: EXIT: nrpe status=0 pid=27640 duration=20(sec)
May 24 14:31:36 A1 xinetd[12471]: EXIT: nrpe status=0 pid=27662 duration=20(sec)
May 24 14:31:41 A1 xinetd[12471]: EXIT: nrpe status=0 pid=27709 duration=14(sec)
May 24 14:31:42 A1 xinetd[12471]: START: nrpe pid=28373 from=127.0.0.1
May 24 14:31:49 A1 xinetd[12471]: EXIT: nrpe status=0 pid=28373 duration=7(sec)
May 24 14:32:36 A1 xinetd[12471]: EXIT: nrpe signal=13 pid=28120 duration=61(sec)
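To make entries like these easier to compare, here is a rough sketch of a filter that reduces each xinetd "EXIT: nrpe" line to its pid, exit reason and duration. The function name nrpe_exits is just something I made up, and the log path in the usage comment is an assumption (it depends on the distribution and syslog setup).

```shell
# Hypothetical helper: read syslog lines on standard input and print
# "pid reason duration" for every xinetd "EXIT: nrpe" entry.
nrpe_exits() {
    awk '/xinetd\[[0-9]+\]: EXIT: nrpe / {
        reason = ""; pid = ""; dur = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^(status|signal)=/) {
                reason = $i                    # e.g. status=0 or signal=13
            } else if ($i ~ /^pid=/) {
                pid = substr($i, 5)            # strip the "pid=" prefix
            } else if ($i ~ /^duration=/) {
                dur = substr($i, 10)           # strip the "duration=" prefix
                sub(/\(sec\)$/, "", dur)       # drop the trailing "(sec)"
            }
        }
        print pid, reason, dur
    }'
}

# Usage (log path is an assumption): show only runs that ended on a signal.
# nrpe_exits < /var/log/messages | grep 'signal='
```

Run against the entries above, this would let me line up the signal=13 exits against the status=0 ones and see at a glance how long each ran.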

This has to be an issue with NRPE not accepting my timeout values, right? If I execute a series of service checks manually, the first few may return valid output and perfdata, but some of the checks return the socket timeout, leaving gaps in my graphs and discontinuities in my data. The output of the checks is as follows:

[root@A1 libexec]# ./check_nrpe -H 127.0.0.1 -c vperf_mhz_a; ./check_nrpe -H 127.0.0.1 -c vperf_disk_a; ./check_nrpe -H 127.0.0.1 -c vperf_pct_a; ./check_nrpe -H 127.0.0.1 -c vperf_sys_a

OK - 135 MHz; |mhz=102;109;95;173;144;127;126;167;148;167;
OK - 5 disk; |disk=2;2;22;2;2;2;2;5;9;3;
OK - 60%; |pct=46;40;74;61;54;54;71;63;71;75;
CHECK_NRPE: Socket timeout after 10 seconds.
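One thing I notice is that the manual commands above do not pass -t at all, so as far as I understand check_nrpe falls back to its default 10-second socket timeout, which matches the message. A small sketch of how the same checks could be rerun with an explicit client-side timeout follows. The run_checks function, the CHECK_NRPE and NRPE_TIMEOUT variables and the 60-second value are my own placeholders, not anything from my actual configuration.

```shell
# Placeholders (assumptions): path to check_nrpe and the timeout in seconds.
CHECK_NRPE=${CHECK_NRPE:-./check_nrpe}
NRPE_TIMEOUT=${NRPE_TIMEOUT:-60}

# Run each named NRPE command against localhost with an explicit -t value.
run_checks() {
    for cmd in "$@"; do
        "$CHECK_NRPE" -H 127.0.0.1 -t "$NRPE_TIMEOUT" -c "$cmd"
    done
}

# Usage, from the libexec directory:
# run_checks vperf_mhz_a vperf_disk_a vperf_pct_a vperf_sys_a
```

If the timeout message then reports the larger value instead of 10 seconds, that would at least show the client side honours -t, and the remaining limit would have to come from the server side.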

Lastly, I want to note that I have set the file permissions so that nagios.nagios can access the plugins and the directories involved in executing them. Thank you in advance for any help you can offer.

-Jeff