LinuxQuestions.org - RHEL Cluster - Node 2

RHEL Cluster - Node 2 _ Auto Reboot

Hi,

I am seeing an issue where node2 of a two-node RHEL cluster has rebooted on its own 5-7 times in the last 3-4 days.

Below is /var/log/messages from node2 around the time of one of these reboots.

Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] The token was lost in the OPERATIONAL state.
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] entering GATHER state from 2.
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] Storing new sequence id for ring 7cc
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] entering COMMIT state.
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] entering RECOVERY state.
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] position [0] member node1:
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] previous ring seq 1992 rep node1
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] aru f high delivered f received flag 1
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] position [1] member node2:
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] previous ring seq 1988 rep node1
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] aru 57 high delivered 57 received flag 1
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] Did not need to originate any messages in recovery.
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] CLM CONFIGURATION CHANGE
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] New Configuration:
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ]     r(0) ip(node1)
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ]     r(0) ip(node2)
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] Members Left:
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] Members Joined:
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] CLM CONFIGURATION CHANGE
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] New Configuration:
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ]     r(0) ip(node1)
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ]     r(0) ip(node2)
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] Members Left:
Aug  4 20:20:27 node2-hostname openais[2707]: [CLM  ] Members Joined:
Aug  4 20:20:27 node2-hostname openais[2707]: [SYNC ] This node is within the primary component and will provide service.
Aug  4 20:20:27 node2-hostname openais[2707]: [TOTEM] entering OPERATIONAL state.
Aug  4 20:20:27 node2-hostname xinetd[2982]: START: nrpe pid=1317 from=10.105.32.115
Aug  4 20:20:28 node2-hostname openais[2707]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart
Aug  4 20:20:28 node2-hostname openais[2707]: [CLM  ] got nodejoin message node1
Aug  4 20:20:28 node2-hostname openais[2707]: [CLM  ] got nodejoin message node2
Aug  4 20:20:28 node2-hostname openais[2707]: [CPG  ] got joinlist message from node 1
Aug  4 20:20:28 node2-hostname openais[2707]: [CPG  ] got joinlist message from node 2
Aug  4 20:20:28 node2-hostname dlm_controld[2733]: cluster is down, exiting
Aug  4 20:20:28 node2-hostname gfs_controld[2739]: groupd_dispatch error -1 errno 0
Aug  4 20:20:28 node2-hostname gfs_controld[2739]: groupd connection died
Aug  4 20:20:28 node2-hostname gfs_controld[2739]: cluster is down, exiting
Aug  4 20:20:28 node2-hostname clurgmgrd[3630]: <warning> #67: Shutting down uncleanly
Aug  4 20:20:29 node2-hostname fenced[2727]: cluster is down, exiting
Aug  4 20:20:29 node2-hostname kernel: dlm: closing connection to node 2
Aug  4 20:20:29 node2-hostname kernel: dlm: closing connection to node 1
Aug  4 20:20:32 node2-hostname xinetd[2982]: EXIT: nrpe status=0 pid=1317 duration=5(sec)
Aug  4 20:20:43 node2-hostname clurgmgrd[3630]: <notice> Disconnecting from CMAN
Aug  4 20:20:43 node2-hostname clurgmgrd[3630]: <notice> Exiting
Aug  4 20:20:57 node2-hostname ccsd[2699]: Unable to connect to cluster infrastructure after 30 seconds.
Aug  4 20:21:27 node2-hostname ccsd[2699]: Unable to connect to cluster infrastructure after 60 seconds.
Aug  4 20:21:57 node2-hostname ccsd[2699]: Unable to connect to cluster infrastructure after 90 seconds.
Aug  4 20:22:27 node2-hostname ccsd[2699]: Unable to connect to cluster infrastructure after 120 seconds.

Please suggest what could be causing this.

Code:

Aug  4 20:20:28 node2-hostname gfs_controld[2739]: groupd_dispatch error -1 errno 0

Aug  4 20:20:28 node2-hostname gfs_controld[2739]: groupd connection died

Maybe check why this fails?
Is this an active/passive cluster or ...?

I have done some work with RHCluster, but I have never been happy with it.

Code:

Aug  4 20:20:28 node2-hostname openais[2707]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart

The last 4 messages (the ccsd "Unable to connect to cluster infrastructure" lines) are probably because cman is no longer running.

Can you check logs on the other node as well?

Also, what does clustat show?
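
To compare the logs on both nodes, it can help to pull out only the key cluster events. Below is a minimal sketch in Python, assuming the syslog format shown in the post above. The function name `find_cluster_events` and the event labels are made up for illustration and are not part of any cluster tool.

```python
import re

# Patterns taken from the messages quoted in this thread.
# The labels on the left are illustrative names only.
PATTERNS = {
    "token_lost": re.compile(r"\[TOTEM\] The token was lost"),
    "cman_killed": re.compile(r"\[CMAN\s*\] cman killed by node (\d+)"),
    "cluster_down": re.compile(r"cluster is down, exiting"),
}


def find_cluster_events(lines):
    """Return (label, line) pairs for log lines matching a known pattern."""
    events = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((label, line.rstrip("\n")))
                break  # one label per line is enough
    return events


if __name__ == "__main__":
    # Log path assumed from the original post.
    with open("/var/log/messages", errors="replace") as log:
        for label, line in find_cluster_events(log):
            print(f"{label}: {line}")
```

Run it on each node and line up the timestamps: if the token loss always appears just before the "cman killed" message, that would point at the cluster interconnect rather than at GFS.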

Quote:

Originally Posted by rajaniyer123 (Post 4747221)

Hi rajaniyer123, I am also facing the same problem. Could you please share the resolution?