RHEL Cluster - Node 2 Auto Reboot
Hi,
I am experiencing an issue in which node2 of a two-node RHEL cluster reboots on its own. It has done so 5-7 times in the last 3-4 days. Here is /var/log/messages from around one of these events:
Code:
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] The token was lost in the OPERATIONAL state.
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] entering GATHER state from 2.
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] Storing new sequence id for ring 7cc
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] entering COMMIT state.
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] entering RECOVERY state.
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] position [0] member node1:
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] previous ring seq 1992 rep node1
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] aru f high delivered f received flag 1
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] position [1] member node2:
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] previous ring seq 1988 rep node1
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] aru 57 high delivered 57 received flag 1
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] Did not need to originate any messages in recovery.
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] CLM CONFIGURATION CHANGE
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] New Configuration:
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ]        r(0) ip(node1)
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ]        r(0) ip(node2)
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] Members Left:
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] Members Joined:
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] CLM CONFIGURATION CHANGE
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] New Configuration:
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ]        r(0) ip(node1)
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ]        r(0) ip(node2)
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] Members Left:
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] Members Joined:
Aug  4 20:20:27 node2-hostname openais[2707]: [SYNC ] This node is within the primary component and will provide service.
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] entering OPERATIONAL state.
Aug  4 20:20:27 node2-hostname xinetd[2982]: START: nrpe pid=1317 from=10.105.32.115
Aug  4 20:20:28 node2-hostname openais[2707]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart
Aug  4 20:20:28 node2-hostname openais[2707]: [CLM  ] got nodejoin message node1
Aug  4 20:20:28 node2-hostname openais[2707]: [CLM  ] got nodejoin message node2
Aug  4 20:20:28 node2-hostname openais[2707]: [CPG  ] got joinlist message from node 1
Aug  4 20:20:28 node2-hostname openais[2707]: [CPG  ] got joinlist message from node 2
Aug  4 20:20:28 node2-hostname dlm_controld[2733]: cluster is down, exiting
Aug  4 20:20:28 node2-hostname gfs_controld[2739]: groupd_dispatch error -1 errno 0
Aug  4 20:20:28 node2-hostname gfs_controld[2739]: groupd connection died
Aug  4 20:20:28 node2-hostname gfs_controld[2739]: cluster is down, exiting
Aug  4 20:20:28 node2-hostname clurgmgrd[3630]: <warning> #67: Shutting down uncleanly
Aug  4 20:20:29 node2-hostname fenced[2727]: cluster is down, exiting
Aug  4 20:20:29 node2-hostname kernel: dlm: closing connection to node 2
Aug  4 20:20:29 node2-hostname kernel: dlm: closing connection to node 1
Aug  4 20:20:32 node2-hostname xinetd[2982]: EXIT: nrpe status=0 pid=1317 duration=5(sec)
Aug  4 20:20:43 node2-hostname clurgmgrd[3630]: <notice> Disconnecting from CMAN
Aug  4 20:20:43 node2-hostname clurgmgrd[3630]: <notice> Exiting
Aug  4 20:20:57 node2-hostname ccsd[2699]: Unable to connect to cluster infrastructure after 30 seconds.
Aug  4 20:21:27 node2-hostname ccsd[2699]: Unable to connect to cluster infrastructure after 60 seconds.
Aug  4 20:21:57 node2-hostname ccsd[2699]: Unable to connect to cluster infrastructure after 90 seconds.
Aug  4 20:22:27 node2-hostname ccsd[2699]: Unable to connect to cluster infrastructure after 120 seconds.
Please suggest.
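To follow the sequence in the node2 log (token lost, Totem walks GATHER, COMMIT, RECOVERY, back to OPERATIONAL, then cman is killed and the cluster daemons exit), here is a minimal Python sketch that pulls just those milestones out of a messages file. It assumes the classic syslog layout seen in the post (`Mon DD HH:MM:SS host process[pid]: message`); the function name `summarize_cluster_events` and the choice of which messages count as "notable" are my own assumptions, and the sample lines are copied from the log above.

```python
import re

# Assumed classic syslog layout, as in the post:
#   "Aug  4 20:20:27 node2-hostname openais[2707]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<proc>[\w-]+)\[\d+\]: (?P<msg>.*)$"
)

# Message fragments treated as milestones (hypothetical selection,
# based only on the lines that appear in the posted log).
PATTERNS = (
    re.compile(r"\[TOTEM\] (?:entering \w+ state|The token was lost.*)"),
    re.compile(r"cman killed by node \d+.*"),
    re.compile(r"cluster is down, exiting"),
)


def summarize_cluster_events(lines):
    """Return (timestamp, process, matched text) for each milestone line."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # e.g. "kernel: ..." lines carry no [pid]; skip them
        for pat in PATTERNS:
            hit = pat.search(m.group("msg"))
            if hit:
                events.append((m.group("ts"), m.group("proc"), hit.group(0)))
                break
    return events


if __name__ == "__main__":
    sample = [
        "Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] The token was lost in the OPERATIONAL state.",
        "Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] entering GATHER state from 2.",
        "Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] entering OPERATIONAL state.",
        "Aug  4 20:20:28 node2-hostname openais[2707]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart",
        "Aug  4 20:20:28 node2-hostname dlm_controld[2733]: cluster is down, exiting",
        "Aug  4 20:20:29 node2-hostname kernel: dlm: closing connection to node 2",
    ]
    for ts, proc, what in summarize_cluster_events(sample):
        print(ts, proc, what)
```

Run against each reboot window, this makes it easier to see whether every reboot follows the same "token lost, rejoin, cman killed" pattern.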
Code:
Aug  4 20:20:28 node2-hostname gfs_controld[2739]: groupd_dispatch error -1 errno 0
Is this an active/passive cluster or ...? I have done some work with RHCluster, but I have never been happy with it.
Code:
Aug  4 20:20:28 node2-hostname openais[2707]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart
Can you check the logs on the other node as well? Also, what does clustat show?
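When comparing the logs from both nodes, it can help to interleave them into one timeline so that node1's side of the membership change sits next to node2's. Below is a minimal Python sketch of that, assuming both files use the same classic syslog format and that the two nodes' clocks are reasonably in sync (for example via NTP). The helper names are hypothetical, and because syslog timestamps carry no year, the ordering is only meaningful within a single year.

```python
import heapq
import re

# Month abbreviations as they appear in classic syslog timestamps.
MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Assumed syslog prefix, e.g. "Aug  4 20:20:28 " (note: no year field).
TS_RE = re.compile(r"^(\w{3})\s+(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")


def sort_key(line):
    """Return a sortable (month, day, h, m, s) tuple, or None if unparsable."""
    m = TS_RE.match(line)
    if not m or m.group(1) not in MONTHS:
        return None
    mon, day, hh, mm, ss = m.groups()
    return (MONTHS[mon], int(day), int(hh), int(mm), int(ss))


def keyed(lines, node):
    """Yield (key, node, line) for every line with a parsable timestamp."""
    for line in lines:
        key = sort_key(line)
        if key is not None:
            yield (key, node, line.rstrip("\n"))


def merge_node_logs(node1_lines, node2_lines):
    """Interleave two time-ordered logs into one (node, line) timeline.

    heapq.merge assumes each input is already sorted, which holds for a
    single syslog file within one year.
    """
    merged = heapq.merge(keyed(node1_lines, "node1"), keyed(node2_lines, "node2"))
    return [(node, line) for _, node, line in merged]
```

With copies of each node's /var/log/messages (the file names are placeholders), `merge_node_logs(open("node1-messages"), open("node2-messages"))` returns the combined timeline. Lines at the same second keep node1 first, so the output around 20:20:27-20:20:28 should show what node1 logged when it killed cman on node2.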