Red Hat cluster malfunctioning
Hi experts,
I am new to this forum. I am here because of a production cluster issue that has left me completely stuck, and I am also new to Red Hat Cluster. This is a 2-node cluster, and the operating system installed on both nodes is RHEL 5.3. The system was running fine until last week, when one of the nodes (node2) suddenly went offline. All the cluster-related services hung and the server would not reboot, so I had to kill the rgmanager service to reboot it. The server did reboot and came up in cluster mode, but that took the other node (node1) offline. All I could understand from this was that the cluster was unable to keep both nodes online at the same time. The same happened when I rebooted node1, which killed node2 when node1 came back up. I have now kept node2 down so that the production application can keep running. I'm looking forward to your replies, as this is a serious issue for me in a production environment. The logs from node1 when node2 was booted into the cluster are pasted below for reference.

MESSAGE FILE OUTPUT
---------------------
Feb 2 15:06:39 htbapp1 openais[3840]: [SYNC ] This node is within the primary component and will provide service.
Feb 2 15:06:39 htbapp1 kernel: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping 05
Feb 2 15:06:39 htbapp1 openais[3840]: [TOTEM] entering OPERATIONAL state.
Feb 2 15:06:39 htbapp1 kernel: Brought up 8 CPUs
Feb 2 15:06:39 htbapp1 openais[3840]: [MAIN ] Killing node htbapp2.ksebnet.com because it has rejoined the cluster with existing state
Feb 2 15:06:39 htbapp1 kernel: testing NMI watchdog ... OK.
Feb 2 15:06:40 htbapp1 kernel: time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer.
Feb 2 15:06:40 htbapp1 kernel: time.c: Detected 2266.835 MHz processor.
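The line in node1's log that explains the "only one node can be up" behaviour is the openais message saying it killed htbapp2.ksebnet.com "because it has rejoined the cluster with existing state". When searching a longer /var/log/messages for this symptom, a small filter can help. This is a minimal Python sketch, assuming syslog-style lines with exactly the message wording pasted above; `find_killed_nodes` is a hypothetical helper, not part of any Red Hat tool.

```python
import re
from typing import Iterable, List

# Assumed wording, copied from the openais message in node1's log above.
# The [MAIN ] tag is matched leniently in case the padding differs.
KILL_RE = re.compile(
    r"openais\[\d+\]:\s*\[MAIN\s*\]\s*Killing node (\S+) because it has rejoined "
    r"the cluster with existing state"
)


def find_killed_nodes(lines: Iterable[str]) -> List[str]:
    """Return node names that openais killed for rejoining with existing state."""
    killed = []
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            killed.append(match.group(1))
    return killed


if __name__ == "__main__":
    sample = [
        "Feb 2 15:06:39 htbapp1 openais[3840]: [TOTEM] entering OPERATIONAL state.",
        "Feb 2 15:06:39 htbapp1 openais[3840]: [MAIN ] Killing node htbapp2.ksebnet.com "
        "because it has rejoined the cluster with existing state",
        "Feb 2 15:06:39 htbapp1 kernel: testing NMI watchdog ... OK.",
    ]
    print(find_killed_nodes(sample))  # ['htbapp2.ksebnet.com']
```

On a live node one would pass an open file instead, e.g. `find_killed_nodes(open("/var/log/messages"))`, assuming read access to that file.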
If this is a production issue, why haven't you raised a ticket with Red Hat support? They are the experts on this technology. Personally, I wouldn't run production without support unless I really knew what I was doing.
Hi Guys,
This issue has been resolved!!! The culprit was the "acpid" (power management) daemon, which is not supposed to be running on cluster nodes and was causing them to malfunction. The cluster started working perfectly once I stopped acpid from starting at boot. Many thanks for all your attempts and help. Regards, Sree
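On RHEL 5, keeping acpid from starting at boot is normally done on each node with `chkconfig acpid off`, plus `service acpid stop` for the instance already running. Red Hat's RHEL 5 cluster documentation recommends disabling ACPI soft-off on nodes fenced through integrated power devices, because otherwise a fence can turn into a slower clean shutdown instead of an immediate power-off. To confirm the change on both nodes, this minimal Python sketch parses one line of `chkconfig --list acpid` output. The helper names and the sample lines are illustrative assumptions, not output captured from these servers.

```python
from typing import Dict, List


def parse_chkconfig_line(line: str) -> Dict[int, bool]:
    """Parse one `chkconfig --list <service>` line into {runlevel: enabled}.

    Assumes the usual layout: the service name, then whitespace-separated
    "N:on" / "N:off" fields for runlevels 0-6.
    """
    levels: Dict[int, bool] = {}
    for field in line.split()[1:]:  # first field is the service name
        level, _, state = field.partition(":")
        if level.isdigit() and state in ("on", "off"):
            levels[int(level)] = state == "on"
    return levels


def enabled_runlevels(line: str) -> List[int]:
    """Return the sorted runlevels in which the service starts at boot."""
    return sorted(lvl for lvl, on in parse_chkconfig_line(line).items() if on)


if __name__ == "__main__":
    # Illustrative lines only: before and after running `chkconfig acpid off`.
    before = "acpid           0:off   1:off   2:on    3:on    4:on    5:on    6:off"
    after = "acpid           0:off   1:off   2:off   3:off   4:off   5:off   6:off"
    print(enabled_runlevels(before))  # [2, 3, 4, 5]
    print(enabled_runlevels(after))   # []
```

An empty list means the service will not be started in any runlevel after the next reboot.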