Quote:
Originally Posted by brianmcgee
The cluster communication uses a heartbeat to detect if all nodes are alive.
If you forcibly disconnect a node (e.g. by removing the network link), the cluster notices this and wants to make sure that the failed node really is dead. It therefore tries to fence the node to get back to a known state.
That behaviour is not only useful if you configure GFS but also an important necessity for a working application failover when using rgmanager.
If an application hangs, fencing the cluster node makes sure that there are no complications when the application is switched over to another node.
In your case, keepalived [1] may be a better choice. There, the failing node is simply removed from load balancing.
[1] http://www.keepalived.org/
Thank you for your quick outline.
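To check that I've understood the heartbeat-and-fence idea you describe, here is how I picture it, as a rough Python sketch. This is only my mental model, not how the cluster software is actually implemented: the class name HeartbeatMonitor, the timeout parameter and the fence callback are all names I made up for illustration.

```python
import time
from typing import Callable, Dict, List, Optional


class HeartbeatMonitor:
    """Toy model: remember each node's last heartbeat and fence silent nodes.

    All names here are hypothetical; this is not the cluster suite's API.
    """

    def __init__(self, timeout: float, fence: Callable[[str], bool]) -> None:
        self.timeout = timeout          # seconds of silence before fencing
        self.fence = fence              # returns True if fencing succeeded
        self.last_seen: Dict[str, float] = {}
        self.fenced: List[str] = []

    def heartbeat(self, node: str, now: Optional[float] = None) -> None:
        # Record that a heartbeat from `node` arrived at time `now`.
        self.last_seen[node] = time.monotonic() if now is None else now

    def check(self, now: Optional[float] = None) -> List[str]:
        # Try to fence every node that has been silent longer than the timeout.
        now = time.monotonic() if now is None else now
        newly_fenced = []
        for node, seen in list(self.last_seen.items()):
            if now - seen > self.timeout and self.fence(node):
                # Only a successful fence puts the node in a known state.
                del self.last_seen[node]
                self.fenced.append(node)
                newly_fenced.append(node)
        # Nodes whose fencing failed stay in last_seen and are retried
        # on the next call to check().
        return newly_fenced
```

In this sketch, a fence callback that always fails means the silent node is never considered safely dead and the monitor keeps retrying, which is how I imagine a cluster without a working fencing method behaves.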
Regarding my scenario, do you know what node B is actually trying to do when it tries to fence node A? I know that I can define different fencing methods, like reboot and so on, but as I've not defined any fencing methods I'm not sure what node B is trying to do.
I did a failover test in which I ran "ifdown eth0" on node A. As this was the only link from node A to the rest of the world (including node B), node B wasn't really able to fence node A. But node A restarted all by itself. Is this the default behavior, that nodes which lose contact with the cluster reboot themselves?
And one last thing: I did a "shutdown -h now" on node A, and it started the shutdown process. But when it came to bringing down the cluster software, it hung on "Stopping fencing...". To kill the machine I had to press the power button. Why did the node hang at this point?
Phew, this was a lot of info and questions, but I would very much appreciate some input on this. I need to get a better understanding of how fencing actually works, and I haven't found any good resources on this particular subject.
Regards,
kenneho