Cluster failure with error (unmanaged) FAILED

kumarlimbu · 08-19-2010, 08:40 PM

Hi,

We are using Linux HA to manage our cluster of 2 web servers.

Both web servers run identical software (OS, HTTP server, Tomcat, etc. are all the same) but on different hardware. Both servers have 64-bit processors.

The following software is in use:

1. CentOS 5.4

2. Pacemaker 1.0.5

3. OpenAIS 0.80

4. Cluster-glue 1.0-12

5. Resource agents: ocf, heartbeat

Under normal circumstances both the IPs are accessible and everything seems to be working well.

============

Last updated: Thu Aug 19 17:24:34 2010

Stack: openais

Current DC: server1 - partition with quorum

Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7

2 Nodes configured, 2 expected votes

2 Resources configured.

============

Online: [ server1 server2 ]

ClusterIP1 (ocf::heartbeat:IPaddr2): Started server1

ClusterIP2 (ocf::heartbeat:IPaddr2): Started server2

We need to copy new or updated files to our servers periodically, and during this operation the server becomes slow. So while files are being copied to server1, we put it into standby mode by issuing the command:

- crm node standby

Output of crm_mon command during this time:

============

Last updated: Thu Aug 19 17:43:14 2010

Stack: openais

Current DC: server1 - partition with quorum

Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7

2 Nodes configured, 2 expected votes

2 Resources configured.

============

Node server1: standby

Online: [ server2 ]

ClusterIP1 (ocf::heartbeat:IPaddr2): Started server2

ClusterIP2 (ocf::heartbeat:IPaddr2): Started server2

During this time every request is handled by server2. After the files are copied, we bring server1 back online using:

- crm node online
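The standby, copy, online sequence described above could be scripted roughly as follows. This is only a sketch and not the script actually in use: the function names `run_checked` and `copy_with_standby` are hypothetical, the copy command is a placeholder for whatever transfer tool is used, and the node name is passed explicitly to `crm node standby` and `crm node online`. The `finally` block is there so the node is brought back online even if the copy fails.

```python
import subprocess
from typing import Callable, Sequence

# A "runner" executes one command; it is injectable so the sequence
# can be exercised without a real cluster.
Runner = Callable[[Sequence[str]], None]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def copy_with_standby(node: str, copy_cmd: Sequence[str],
                      run: Runner = run_checked) -> None:
    """Put `node` into standby, run the copy, then bring `node` online.

    `copy_cmd` is a placeholder (assumption): any command that performs
    the file transfer, given as an argument list.
    """
    run(["crm", "node", "standby", node])
    try:
        run(copy_cmd)
    finally:
        # Always return the node to service, even if the copy failed.
        run(["crm", "node", "online", node])


# Example call (hypothetical paths; not executed here):
# copy_with_standby("server1", ["rsync", "-a", "/staging/", "/var/www/"])
```

Passing the runner as a parameter is a design choice for testability; in real use the default `run_checked` would call the `crm` shell directly.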

This setup has been working well for us: the servers go into standby mode and come back online without much trouble. Recently, however, one of the IPs has been becoming inaccessible, and it is always ClusterIP2. It does not return to normal until that server is restarted.

Output of the crm_mon command when the ClusterIP2 is inaccessible:

============

Last updated: Thu Aug 19 09:12:52 2010

Stack: openais

Current DC: server1 - partition with quorum

Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7

2 Nodes configured, 2 expected votes

2 Resources configured.

============

Online: [ server1 server2 ]

ClusterIP1 (ocf::heartbeat:IPaddr2): Started server1

ClusterIP2 (ocf::heartbeat:IPaddr2): Started server1 (unmanaged) FAILED

Failed actions:

ClusterIP2_stop_0 (node=server1, call=5400, rc=1, status=complete): unknown error
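To catch this condition automatically, the "Failed actions" entries could be pulled out of captured crm_mon text with a small parser. The sketch below assumes the line format is exactly the one shown in the output above (other Pacemaker versions may print it differently); the names `FAILED_ACTION` and `parse_failed_actions` are hypothetical.

```python
import re
from typing import Dict, List

# Matches lines such as:
#   ClusterIP2_stop_0 (node=server1, call=5400, rc=1, status=complete): unknown error
# The format is assumed from the crm_mon output quoted in this post.
FAILED_ACTION = re.compile(
    r"^\s*(?P<resource>\S+)_(?P<operation>[a-z]+)_(?P<interval>\d+)\s+"
    r"\(node=(?P<node>[^,]+), call=(?P<call>\d+), rc=(?P<rc>\d+), "
    r"status=(?P<status>[^)]+)\):\s*(?P<reason>.*)$"
)


def parse_failed_actions(crm_mon_text: str) -> List[Dict[str, str]]:
    """Return one dict per failed-action line found in crm_mon output."""
    failures = []
    for line in crm_mon_text.splitlines():
        match = FAILED_ACTION.match(line)
        if match:
            failures.append(match.groupdict())
    return failures
```

Each returned dict carries the resource name, operation, node, return code, and reason as strings, so a monitoring script could alert when the list is non-empty.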

We are baffled because this problem is occurring more and more regularly and we haven't modified any of the cluster settings.

- What usually causes one of the IP addresses to become inaccessible?

- What settings could we change to avoid this situation in the future?

If more information about our configuration or logs is required, please let me know.