I'm configuring Nagios to monitor my HP ProCurve switches. I found excellent command and service definitions at
www.nagiosexchange.org and all is working wonderfully except for the service that monitors free memory.
The definitions I'm using are copied and pasted directly from the above-mentioned site. They are:
command in commands.cfg:
Code:
define command{
command_name check_hpmemoryfree
command_line $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o .1.3.6.1.4.1.11.2.14.11.5.1.1.2.1.1.1.6.1 -t 5 -w $ARG2$ -c $ARG3$ -u bytes -l free
}
service in switch.cfg:
Code:
# Service definition MEM-FREE
define service{
use generic-service ; Name of service template to use
host_name Switch_MDF-1
service_description MEM-FREE
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
notification_interval 240
notification_period 24x7
notification_options c,r
check_command check_hpmemoryfree!nagios!2000:30000000!1000:30000000
}
The switches list about 150MB of total memory, with about 109MB free when I view status from the switch console itself. Nagios is correctly reporting the free 109MB, but is showing the state as critical.
I've done a good bit of googling to try to understand how the "2000:30000000" and "1000:30000000" sections work. I realize that those are ARG2 and ARG3, and that ARG2 is the warning level and ARG3 is the critical level. What I don't understand is how to adjust those numbers so that warning and critical status trigger at the levels I want on my particular switches. I've found info stating that two numbers separated by a colon are a range, and other info saying they are a less-than:higher-than definition for when to return the state defined by the command.
What I'd like is to have the following:
-More than 60MB of free memory = OK
-Between 60MB and 40MB of free memory = Warning
-Less than 40MB of free memory = Critical
I will likely adjust those values once I get a better idea of memory usage under different loads.
I'd like to understand how to adjust the numbers in the service definition so that my service monitors will work as listed above. Can someone explain this, or point me to a resource that explains what the colon-separated numbers mean for this particular command? I haven't had any luck in my searching so far, but I'm still looking for as much information as I can find.
Edit: I found the information I needed. The numbers are an inclusive range that an "OK" value should fall into. Anything outside that range will result in a status change. It looks like the example I copied was designed for a switch with 28MB of memory, so I adjusted the numbers accordingly and then tested with various values for warning and critical, and everything is working like I want it to.
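For anyone who finds this later, here is a rough Python sketch of how I understand the range check, not the actual check_snmp code. The helper names (parse_range, in_range, status) are made up for illustration, it only handles the "min:max", "min:" and ":max" forms, and it assumes 1MB = 1,000,000 bytes. The open-ended thresholds "60000000:" and "40000000:" are my assumption for the levels I listed above; check that your version of check_snmp accepts a range with the upper bound left off, and otherwise put a large upper bound after the colon.

```python
import math


def parse_range(spec):
    """Split 'min:max', 'min:' or ':max' into (low, high); a missing side is unbounded."""
    low_text, _, high_text = spec.partition(":")
    low = float(low_text) if low_text else -math.inf
    high = float(high_text) if high_text else math.inf
    return low, high


def in_range(value, spec):
    """True when value lies inside the inclusive range described by spec."""
    low, high = parse_range(spec)
    return low <= value <= high


def status(value, warn_spec, crit_spec):
    """Critical is checked first; a value outside a range raises that state."""
    if not in_range(value, crit_spec):
        return "CRITICAL"
    if not in_range(value, warn_spec):
        return "WARNING"
    return "OK"


free_bytes = 109_000_000  # roughly the 109MB my switch reports as free

# Original thresholds: 109MB is above the 30000000 upper bound, so it is outside the range.
print(status(free_bytes, "2000:30000000", "1000:30000000"))

# Assumed thresholds for my levels: warn below 60MB free, critical below 40MB free.
print(status(free_bytes, "60000000:", "40000000:"))
print(status(50_000_000, "60000000:", "40000000:"))
print(status(30_000_000, "60000000:", "40000000:"))
```

With those assumed thresholds the check_command line would end in !nagios!60000000:!40000000: instead of the values I copied from the example.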