Cluster failure with error (unmanaged) FAILED
Hi,
We are using Linux HA to manage our cluster of 2 web servers.
Both web servers run identical software (the OS, HTTP server, Tomcat, etc. are all the same) but on different hardware. Both servers have 64-bit processors.
The following software is in use:
1. CentOS 5.4
2. Pacemaker 1.0.5
3. OpenAIS 0.80
4. Cluster-glue 1.0-12
5. Resource agents: ocf, heartbeat
Under normal circumstances both IPs are accessible and everything works well. Output of the crm_mon command in that state:
============
Last updated: Thu Aug 19 17:24:34 2010
Stack: openais
Current DC: server1 - partition with quorum
Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
2 Nodes configured, 2 expected votes
2 Resources configured.
============
Online: [ server1 server2 ]
ClusterIP1 (ocf::heartbeat:IPaddr2): Started server1
ClusterIP2 (ocf::heartbeat:IPaddr2): Started server2
We need to copy new or updated files to our servers periodically, and the server being updated becomes slow during this operation. So while files are being copied to server1, we put it into standby mode by issuing the command
- crm node standby
Output of crm_mon command during this time:
============
Last updated: Thu Aug 19 17:43:14 2010
Stack: openais
Current DC: server1 - partition with quorum
Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
2 Nodes configured, 2 expected votes
2 Resources configured.
============
Node server1: standby
Online: [ server2 ]
ClusterIP1 (ocf::heartbeat:IPaddr2): Started server2
ClusterIP2 (ocf::heartbeat:IPaddr2): Started server2
So during this time every request is handled by server2. After the files are copied, we bring server1 back online using
- crm node online
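For reference, the routine described above can be sketched roughly as follows. This is only an illustration, not the script we actually run: the names `copy_in_standby`, `run_checked` and `copy_files` are hypothetical, and the only cluster commands used are the two shown in this post, run on the node being updated. The `finally` clause is there so the node is brought back online even if the copy fails.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(list(cmd), check=True)


def copy_in_standby(copy_files: Callable[[], None],
                    runner: Runner = run_checked) -> None:
    """Put the local node in standby, copy files, then bring it back online.

    copy_files is a hypothetical callable that performs the actual copy.
    runner is injectable so the sequence can be exercised without a cluster.
    """
    runner(["crm", "node", "standby"])
    try:
        copy_files()
    finally:
        # Always attempt to return the node to service, even after a failed copy.
        runner(["crm", "node", "online"])
```

Passing a different `runner` (for example one that only logs the commands) makes it possible to dry-run the sequence on a machine without the cluster tools installed.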
This setup has been working well for us: the servers go into standby mode and come back online without much trouble. Recently, however, one of the IPs has started becoming inaccessible, and it is always ClusterIP2. It does not return to normal until that server is restarted.
Output of the crm_mon command when the ClusterIP2 is inaccessible:
============
Last updated: Thu Aug 19 09:12:52 2010
Stack: openais
Current DC: server1 - partition with quorum
Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
2 Nodes configured, 2 expected votes
2 Resources configured.
============
Online: [ server1 server2 ]
ClusterIP1 (ocf::heartbeat:IPaddr2): Started server1
ClusterIP2 (ocf::heartbeat:IPaddr2): Started server1 (unmanaged) FAILED
Failed actions:
ClusterIP2_stop_0 (node=server1, call=5400, rc=1, status=complete): unknown error
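To make the failed-action line above easier to read, here is a small sketch that splits it into its parts. It assumes the line has the shape shown in our output, that the key follows the usual `<resource>_<operation>_<interval>` pattern, and that the `rc` value follows the standard OCF resource agent return codes (where 1 is the generic error). The function name and regular expression are my own and not part of any cluster tool.

```python
import re
from typing import Dict, Optional

# Assumed shape: "<resource>_<operation>_<interval> (k=v, k=v, ...): <reason>"
FAILED_ACTION = re.compile(
    r"^\s*(?P<resource>\S+)_(?P<operation>[a-z]+)_(?P<interval>\d+)\s+"
    r"\((?P<fields>[^)]*)\):\s*(?P<reason>.*)$"
)

# Standard OCF resource agent return codes 0 through 7.
OCF_RC = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
}


def parse_failed_action(line: str) -> Optional[Dict[str, object]]:
    """Return the parts of a crm_mon failed-action line, or None if it does not match."""
    match = FAILED_ACTION.match(line)
    if match is None:
        return None
    # Turn "node=server1, call=5400, rc=1, status=complete" into a dict.
    fields = dict(item.strip().split("=", 1)
                  for item in match.group("fields").split(","))
    rc = int(fields["rc"])
    return {
        "resource": match.group("resource"),
        "operation": match.group("operation"),
        "node": fields.get("node"),
        "rc": rc,
        "rc_name": OCF_RC.get(rc, "unknown"),
        "status": fields.get("status"),
        "reason": match.group("reason"),
    }


line = "ClusterIP2_stop_0 (node=server1, call=5400, rc=1, status=complete): unknown error"
print(parse_failed_action(line))
```

Read this way, the line says that the stop operation of ClusterIP2 failed on server1 with the generic OCF error code, which matches the "unknown error" text.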
We are baffled because this problem is occurring with increasing regularity and we haven't modified any of the cluster settings.
- What usually causes one of the IP addresses to become inaccessible?
- What settings could we change to avoid this situation in the future?
If more information about our configuration or logs is required, please let me know.