Red Hat cluster malfunctioning

sree.m · 02-15-2012, 10:34 AM

Hi experts,

I am new to this forum. I am here because of a cluster issue on a production server that has left me completely stumped. I am new to Red Hat Cluster as well.

This is a 2-node cluster. The operating system installed on both nodes is RHEL 5.3.

The system was running fine until last week, when one of the two nodes (node2) suddenly went offline.

All the cluster-related services were hung and the server could not be rebooted. I had to kill the rgmanager service to reboot it. The server then rebooted and came up in cluster mode, which took the other node (node1) offline.

All I could understand from this is that the cluster was unable to keep both nodes online simultaneously. The same thing happened when I rebooted node1: node2 was killed as soon as node1 came back up.

I have now kept node2 down so that the production application installed on this server can keep running.

Looking forward to your valuable reply, as this is a serious concern for me in a production environment.

Logs from node1, taken when node2 was booted into the cluster, are pasted here for your reference.
MESSAGE FILE OUTPUT
---------------------

Feb 2 15:06:39 htbapp1 openais[3840]: [SYNC ] This node is within the primary component and will provide service.
Feb 2 15:06:39 htbapp1 kernel: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping 05
Feb 2 15:06:39 htbapp1 openais[3840]: [TOTEM] entering OPERATIONAL state.
Feb 2 15:06:39 htbapp1 kernel: Brought up 8 CPUs
Feb 2 15:06:39 htbapp1 openais[3840]: [MAIN ] Killing node htbapp2.ksebnet.com because it has rejoined the cluster with existing state
Feb 2 15:06:39 htbapp1 kernel: testing NMI watchdog ... OK.
Feb 2 15:06:40 htbapp1 kernel: time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer.
Feb 2 15:06:40 htbapp1 kernel: time.c: Detected 2266.835 MHz processor.

Thanks in advance
Sree
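
Note on reading the log excerpt above: the openais messages are interleaved with unrelated kernel boot messages, which makes the one decisive line ("Killing node htbapp2.ksebnet.com because it has rejoined the cluster with existing state") easy to miss. A minimal sketch of filtering such a log down to the openais lines is shown here. It assumes standard syslog line formatting and Python 3 on whatever machine you copy the log to; the function name `openais_events` is made up for illustration and is not part of any cluster tool.

```python
import re

# Syslog-style line: "<Mon> <day> <HH:MM:SS> <host> <process>[<pid>]: <message>"
# Kernel lines have no "[<pid>]" part, so that group is optional.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<proc>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:\s+"
    r"(?P<msg>.*)$"
)


def openais_events(lines):
    """Return (timestamp, message) pairs for openais lines only."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("proc") == "openais":
            events.append((match.group("stamp"), match.group("msg")))
    return events


# A few lines copied from the excerpt posted in this thread.
SAMPLE = """\
Feb 2 15:06:39 htbapp1 openais[3840]: [SYNC ] This node is within the primary component and will provide service.
Feb 2 15:06:39 htbapp1 kernel: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping 05
Feb 2 15:06:39 htbapp1 openais[3840]: [TOTEM] entering OPERATIONAL state.
Feb 2 15:06:39 htbapp1 kernel: Brought up 8 CPUs
Feb 2 15:06:39 htbapp1 openais[3840]: [MAIN ] Killing node htbapp2.ksebnet.com because it has rejoined the cluster with existing state
""".splitlines()

for stamp, msg in openais_events(SAMPLE):
    print(stamp, msg)
```

To use it on a real file, pass the lines of a copy of /var/log/messages to `openais_events` instead of `SAMPLE`.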

TenTenths · 02-15-2012, 11:41 AM

Contact Red Hat, that's what you're paying the support for.

sree.m · 02-16-2012, 01:08 AM

My contract expired last month.

John VV · 02-16-2012, 04:03 AM

See post 9 on your other thread:
https://www.linuxquestions.org/quest...7/#post4603818

sree.m · 02-20-2012, 01:19 AM

Has anyone got any clue about this issue?

Sree

rhbegin · 02-20-2012, 10:02 AM

Is this an httpd cluster?

sree.m · 02-20-2012, 11:13 PM

Quote:

Originally Posted by rhbegin

Is this an httpd cluster?

Nope. The application running on this server is JBoss, and it is a production system.

No matter what application is running in the cluster, the issue that worries me is that the cluster processes will not run on both nodes simultaneously.

sree

rhbegin · 02-21-2012, 01:34 PM

I have set up a JBoss server on RHEL5 x86_64, but it has been a couple of years. If I remember correctly, it was challenging, as the setup was pretty complex.

Do you have support with Red Hat? They have JBoss support. When I first started down this path I had to use Red Hat support, since it was new to me and to the company.

sree.m · 02-22-2012, 05:51 AM

Quote:

Originally Posted by rhbegin

I have set up a JBoss server on RHEL5 x86_64, but it has been a couple of years. If I remember correctly, it was challenging, as the setup was pretty complex.

Do you have support with Red Hat? They have JBoss support. When I first started down this path I had to use Red Hat support, since it was new to me and to the company.

The reason I posted this thread here is that my Red Hat support expired last November and this problem occurred last month. So naturally I had to seek help from the Linux experts here. This seems to be a cluster bug and I have no idea how to get rid of it.

sree

rhbegin · 02-22-2012, 10:05 AM

If it is a bug, could you migrate over to CentOS with your existing configs, where it is possible to download updates?

That way you could work towards a resolution even if you cannot download updates from Red Hat. Just something to throw out there.

As with any software clustering suite, it can be very complex, and you may have to break down and purchase support if it is a production system. You have to weigh the cost of being down vs. paying for 1 year of support to get help with it.

sree.m · 02-23-2012, 04:28 AM

Quote:

Originally Posted by rhbegin

If it is a bug, could you migrate over to CentOS with your existing configs, where it is possible to download updates?

That way you could work towards a resolution even if you cannot download updates from Red Hat. Just something to throw out there.

As with any software clustering suite, it can be very complex, and you may have to break down and purchase support if it is a production system. You have to weigh the cost of being down vs. paying for 1 year of support to get help with it.

Since this is a production system, I cannot switch the OS. I could possibly convince my manager to go for a support renewal, but I wonder if I could get the right resolution from here.

sree.m · 04-18-2012, 01:03 AM

Hi guys,

This issue has been resolved! The culprit was the "acpid" (power management) daemon, which is not supposed to be running on cluster nodes and caused them to malfunction. The cluster started working perfectly once acpid was disabled at startup.

Many thanks for all your efforts and help.

Regards,
Sree
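
Note for anyone landing here with the same symptom: the thread does not say which commands were used to disable acpid. On RHEL 5 the stock tools for this are `chkconfig acpid off` (do not start at boot) and `service acpid stop` (stop it now), and `chkconfig --list acpid` shows the per-runlevel state. As a rough sketch, assuming the usual `name 0:off 1:off 2:on ...` output layout, the check can be scripted as below. The function name `enabled_runlevels` and the sample line are illustrative only and are not taken from the poster's servers. The code is written for Python 3, so run it on a workstation against captured output rather than on the RHEL 5 node itself.

```python
def enabled_runlevels(chkconfig_line):
    """Parse one line of `chkconfig --list <service>` output.

    Returns (service_name, [runlevels in which the service is "on"]).
    """
    fields = chkconfig_line.split()
    name, states = fields[0], fields[1:]
    on_levels = []
    for state in states:
        # Each field looks like "3:on" or "3:off".
        level, _, status = state.partition(":")
        if status == "on":
            on_levels.append(int(level))
    return name, on_levels


# Illustrative output line (an assumption, NOT copied from the thread).
sample = "acpid          \t0:off\t1:off\t2:on\t3:on\t4:on\t5:on\t6:off"

name, levels = enabled_runlevels(sample)
if levels:
    print(f"{name} is still enabled in runlevels {levels}")
else:
    print(f"{name} is disabled in every runlevel")
```

If the list of runlevels comes back empty on both nodes, acpid will not start at the next boot.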