LinuxQuestions.org - Xen cluster (drbd, ocfs2, pacemaker) auto-recover problem

I have a Xen cluster running drbd, ocfs2 and Pacemaker on Ubuntu 11.10. I can live migrate a VM from node1 to node2 and vice versa without any problems, but if I pull the power cord on the node running the VM, the VM fails to start on the other node.

Here is what crm_mon shows after the failure:

Code:

============
Last updated: Mon Feb 13 13:06:20 2012
Stack: openais
Current DC: clutest2 - partition WITHOUT quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
5 Resources configured.
============

Online: [ clutest2 ]
OFFLINE: [ clutest1 ]

 Master/Slave Set: ms_drbd_master [p_drbd]
    Masters: [ clutest2 ]
    Stopped: [ p_drbd:0 ]
 Clone Set: cl_ocfs2mgmt [g_ocfs2mgmt]
    Started: [ clutest2 ]
    Stopped: [ g_ocfs2mgmt:0 ]
 Clone Set: cl_fs_ocfs2 [p_fs_ocfs2]
    Started: [ clutest2 ]
    Stopped: [ p_fs_ocfs2:0 ]
p_xen-lolwut    (ocf::heartbeat:Xen):  Started clutest2 (unmanaged) FAILED

Failed actions:
    p_xen-lolwut_stop_0 (node=clutest2, call=-1, rc=1, status=Timed Out): unknown error
    p_xen-lolwut_start_0 (node=clutest2, call=-1, rc=1, status=Timed Out): unknown error
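
For anyone who wants to compare the failed actions across the crm_mon captures in this thread, here is a minimal Python sketch (not part of my cluster setup) that pulls them out of pasted crm_mon text. The function name parse_failed_actions and the regular expression are my own assumptions, based only on the line format shown above; other Pacemaker versions may print these lines differently.

```python
import re

# Matches lines such as:
#   p_xen-lolwut_stop_0 (node=clutest2, call=-1, rc=1, status=Timed Out): unknown error
# The pattern is an assumption derived from the crm_mon 1.1.5 output in this thread.
FAILED_ACTION = re.compile(
    r"^\s*(?P<op_key>\S+)\s+\(node=(?P<node>[^,]+), call=(?P<call>-?\d+), "
    r"rc=(?P<rc>-?\d+), status=(?P<status>[^)]+)\):\s*(?P<reason>.*)$"
)


def parse_failed_actions(crm_mon_text):
    """Return one dict per entry listed under 'Failed actions:'."""
    actions = []
    in_section = False
    for line in crm_mon_text.splitlines():
        if line.strip() == "Failed actions:":
            in_section = True
            continue
        if not in_section:
            continue
        match = FAILED_ACTION.match(line)
        if match:
            entry = match.groupdict()
            entry["call"] = int(entry["call"])
            entry["rc"] = int(entry["rc"])
            actions.append(entry)
    return actions


# Sample text copied from the crm_mon output above.
SAMPLE = """\
Failed actions:
    p_xen-lolwut_stop_0 (node=clutest2, call=-1, rc=1, status=Timed Out): unknown error
    p_xen-lolwut_start_0 (node=clutest2, call=-1, rc=1, status=Timed Out): unknown error
"""

if __name__ == "__main__":
    for action in parse_failed_actions(SAMPLE):
        print(action["op_key"], action["node"], action["status"])
```

Feeding it the full crm_mon text works as well, since everything before the "Failed actions:" line is skipped.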

Here is the output from crm configure show:

Code:

node clutest1
node clutest2
primitive p_controld ocf:pacemaker:controld
primitive p_drbd ocf:linbit:drbd \
        params drbd_resource="r0" \
        operations $id="op_drbd" \
        op monitor interval="20" role="Master" timeout="20" \
        op monitor interval="30" role="Slave" timeout="20" \
        meta target-role="started"
primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/r0" directory="/domains" fstype="ocfs2" options="rw,noatime"
primitive p_o2cb ocf:pacemaker:o2cb
primitive p_xen-lolwut ocf:heartbeat:Xen \
        params xmfile="/domains/lolwut/lolwut.cfg" \
        op monitor interval="10s" \
        meta target-role="Started" allow-migrate="true"
primitive xen-domutest ocf:heartbeat:Xen \
        params xmfile="/domains/domutest/domutest.cfg" \
        op monitor interval="10s" \
        meta target-role="Stopped" allow-migrate="true"
group g_ocfs2mgmt p_controld p_o2cb
ms ms_drbd_master p_drbd \
        meta resource-stickiness="100" master-max="2" clone-max="2" notify="true" interleave="true"
clone cl_fs_ocfs2 p_fs_ocfs2
clone cl_ocfs2mgmt g_ocfs2mgmt \
        meta interleave="true"
colocation c_lolwut_fs inf: p_xen-lolwut cl_fs_ocfs2
colocation c_ocfs2 inf: cl_fs_ocfs2 cl_ocfs2mgmt ms_drbd_master:Master
order o_lolwut-after-fs inf: cl_fs_ocfs2:start p_xen-lolwut:start
order o_ocfs2 0: ms_drbd_master:promote cl_ocfs2mgmt:start cl_fs_ocfs2:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        default-resource-stickiness="1000"

Any help would be appreciated!

It looks like the problem might be related to one of the many components involved in running ocfs2. When I try to cd into the /domains directory (the mount point of the ocfs2 filesystem running on drbd), the terminal hangs.

When the node that is running the VM (in this case, clutest2) is shut down using halt -p, the VM tries to migrate to the other node, but the migration fails and crm_mon shows this:

Code:

============
Last updated: Tue Feb 14 10:20:39 2012
Stack: openais
Current DC: clutest2 - partition with quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
5 Resources configured.
============

Online: [ clutest1 clutest2 ]

 Master/Slave Set: ms_drbd_master [p_drbd]
    Masters: [ clutest1 clutest2 ]
 Clone Set: cl_ocfs2mgmt [g_ocfs2mgmt]
    Started: [ clutest1 clutest2 ]
 Clone Set: cl_fs_ocfs2 [p_fs_ocfs2]
    p_fs_ocfs2:1      (ocf::heartbeat:Filesystem):    Started clutest2 (unmanaged) FAILED
    Started: [ clutest1 ]
p_xen-lolwut    (ocf::heartbeat:Xen):  Started clutest1

Failed actions:
    p_fs_ocfs2:1_stop_0 (node=clutest2, call=103, rc=-2, status=Timed Out): unknown exec error
    p_o2cb:1_stop_0 (node=clutest2, call=105, rc=1, status=complete): unknown error

The VM then starts on the other node (it did not live migrate; it was shut down and then started again), but the node that received the halt command never finishes shutting down. It just hangs.

However, if the server that isn't running the VM is shut down, the VM continues humming away on the remaining node without a problem.

Still not sure what's going on...