Pacemaker/Corosync Cluster Monitoring Action Keeps Failing (Wildfly)
Hi all,
I have a 2-node cluster without STONITH based on the following software:
Ubuntu 18.04.1 LTS
Pacemaker 1.1.18
Corosync Cluster Engine, version '2.4.3'
It is not the first cluster I have built, but it is the first one based on Ubuntu 18.04 (so far I have been working with 16.04).
The following resources are configured: DRBD storage, a virtual IP, a database (PostgreSQL), Apache, and a Wildfly server.
Everything works as expected, except that the Wildfly service restarts quite frequently (roughly once every 1 to 5 days) and I see this error message in the crm_mon output:
Migration Summary:
* Node test-node2:
   res_wildfly: migration-threshold=1000000 fail-count=16 last-failure='Sat Jun 8 06:55:20 2019'
* Node test-node1:

Failed Actions:
* res_wildfly_monitor_30000 on test-node2 'unknown error' (1): call=306, status=complete, exitreason='',
    last-rc-change='Sat Jun 8 06:55:20 2019', queued=0ms, exec=0ms
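In case it is useful for reproducing: the fail count shown above can be inspected and cleared with the standard Pacemaker command-line tools (resource and node names as in the output above):

```shell
# Query the current fail count for res_wildfly on test-node2
crm_failcount -G -r res_wildfly -N test-node2

# After investigating, clear the fail count and the failed-action
# history so crm_mon shows a clean state again
crm_resource --cleanup --resource res_wildfly --node test-node2
```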
The corosync log doesn't reveal much more:
Jun 08 06:55:20 [882] test-node2 crmd: info: process_lrm_event: Result of monitor operation for res_wildfly on test-node2: 1 (unknown error) | call=306 key=res_wildfly_monitor_30000 confirmed=false cib-update=2815
---------
The resource is configured with the class systemd.
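For reference, a systemd-class resource of this shape would be defined roughly like this in crm shell syntax (a sketch only: the 30s interval matches the res_wildfly_monitor_30000 operation in the log above, while the unit name and timeouts are placeholders, not my exact configuration):

```
primitive res_wildfly systemd:wildfly \
    op monitor interval=30s timeout=100s \
    op start timeout=120s interval=0 \
    op stop timeout=120s interval=0
```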
Has anyone else experienced this problem, or does anyone have an idea what it could be? Wildfly itself appears to run stably, but it is restarted by the cluster manager because of these apparently spurious monitor failures. Disabling the monitor is not an option, because then we would not notice if the Wildfly service actually became unavailable.
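Since the resource uses the systemd class, the monitor result should mirror the state of the systemd unit, so one thing that can be cross-checked is what systemd itself logged around the failure timestamp (the unit name "wildfly" is an assumption here; substitute whatever unit the resource actually points at):

```shell
# What systemd logged for the unit around the monitor failure
# ("wildfly" is an assumed unit name; use the one from the resource definition)
journalctl -u wildfly --since "2019-06-08 06:50:00" --until "2019-06-08 07:00:00"

# Current view of the unit, roughly what Pacemaker's systemd monitor checks
systemctl status wildfly
```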
Let me know if I can help to understand my situation better with any additional log or configuration information.
Thank you in advance.