[SOLVED] Pacemaker is continuously bouncing my lsb-resource

Immunitet · 06-15-2017, 08:14 AM

Hi all!
I have a 2-node failover cluster on RHEL 7.3.
I created my own LSB script with start/stop/monitor parameters.
I checked the script manually and it works fine, according to the LSB specs.

Then I added my script to the cluster resources:
pcs resource create My_application lsb:my_script.sh op monitor interval=40s start timeout=20s stop timeout=20s

Now the issue is that Pacemaker starts and stops my resource continuously.
How can I troubleshoot this? Are there any special requirements for an LSB script in a cluster?

Demosa · 06-15-2017, 01:38 PM

I could be off (been a while since I've dealt with Pacemaker), but your monitor interval is longer than your timeout intervals, so it sounds like you're running into Fencing.

Try changing the start and stop timeouts to something longer, and the monitor interval shorter, see if that fixes it. (30s monitor, 2m start/stop)

Immunitet · 06-19-2017, 06:38 AM

Quote:

Originally Posted by Demosa

your monitor interval is longer than your timeout intervals, so it sounds like you're running into Fencing.

Try changing the start and stop timeouts to something longer, and the monitor interval shorter, see if that fixes it. (30s monitor, 2m start/stop)

It works in a different way:

monitor interval - the interval between runs of the health check command.
start/stop timeout - how long the cluster waits for a response from the script. If the script does not return a result within this period, the cluster treats it as a failure.
So those parameters do not affect each other.

With regard to my problem - I've found the reason.
I used a "monitor" command for the health check, while the LSB standard expects "status" instead. I found this by looking into corosync.log:

Quote:

Initiating monitor operation My_application_monitor_0 locally on Node2
Result of monitor operation for My_application on Node2: 7 (not running)
Initiating start operation My_application_start_0 locally on Node2
Result of start operation for My_application on Node2: 0 (ok)
Initiating monitor operation My_application_monitor_30000 locally on Node2
Result of monitor operation for My_application on Node2: 7 (not running)
Initiating stop operation My_application_stop_0 locally on Node2 | action 2
Result of stop operation for My_application on Node2: 0 (ok)

Pacemaker mainly works with the OCF standard, where the "monitor" command is used. For an LSB resource it calls the script with "status" but still writes "monitor" to the log file.
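For anyone who wants to check a script by hand before handing it to the cluster, here is a rough sketch of the test sequence that, as far as I understand it, the Pacemaker documentation suggests for LSB compatibility: "status" should return 0 while the service runs and 3 once it is stopped, and repeated "start" or "stop" calls should still return 0. The function names `expect` and `lsb_check` are my own, not part of any tool.

```shell
#!/bin/sh
# Sketch of a manual LSB compatibility check. The helper names are made up
# for this example; pass the path to your init script as the first argument.

# expect <wanted-code> <script> <action>: run one action and compare its
# exit code with the wanted one.
expect() {
    want="$1"; script="$2"; action="$3"
    "$script" "$action" >/dev/null 2>&1
    got=$?
    if [ "$got" -eq "$want" ]; then
        echo "ok:   $action returned $got"
    else
        echo "FAIL: $action returned $got, expected $want"
        return 1
    fi
}

# lsb_check <script>: walk through start/status/stop and report mismatches.
lsb_check() {
    script="$1"; failed=0
    expect 0 "$script" start  || failed=1   # start a stopped service
    expect 0 "$script" status || failed=1   # running -> 0
    expect 0 "$script" start  || failed=1   # starting again is still success
    expect 0 "$script" stop   || failed=1   # stop a running service
    expect 3 "$script" status || failed=1   # stopped -> 3
    expect 0 "$script" stop   || failed=1   # stopping again is still success
    return "$failed"
}

# Run the check only when a script path was given on the command line.
if [ "$#" -gt 0 ]; then
    lsb_check "$1"
fi
```

Saved as, say, lsb_check.sh (hypothetical name), it would be run as `sh lsb_check.sh /etc/init.d/my_script.sh`. Note that it really starts and stops the service, so do not run it against a resource the cluster is currently managing.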

I just renamed the "monitor()" function to "status()" and the resource became stable. Actually, it was my fault, since the LSB standard does not have a "monitor" command in its specification.
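In case it helps someone, below is a minimal sketch of the shape the script ends up with. It is not my real script: the `sleep` process stands in for the application, and the `PIDFILE` location and the `is_running` helper are assumptions for the example. The important part is that the health check is reachable as "status" and returns 0 when running and 3 when not running.

```shell
#!/bin/sh
# Minimal LSB-style init script sketch. PIDFILE and the `sleep` "daemon" are
# placeholders (assumptions), not a real application.
PIDFILE="${PIDFILE:-/tmp/my_application.pid}"

is_running() {
    # True if the PID file exists and the recorded process is alive.
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

start() {
    is_running && return 0          # starting a running service is success
    sleep 300 >/dev/null 2>&1 &     # placeholder for the real daemon
    echo $! > "$PIDFILE"
}

stop() {
    if is_running; then
        kill "$(cat "$PIDFILE")"
    fi
    rm -f "$PIDFILE"
    return 0                        # stopping a stopped service is success
}

status() {
    is_running && return 0          # 0 = running
    return 3                        # 3 = not running
}

main() {
    case "$1" in
        start)  start ;;
        stop)   stop ;;
        status) status ;;
        *)      echo "Usage: $0 {start|stop|status}" >&2; return 2 ;;
    esac
}

main "$@"
```

This sketch treats a stale PID file the same as "not running"; a stricter script could report that case separately.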