LinuxQuestions.org


IMNOboist 02-13-2012 03:20 PM

Xen cluster (drbd, ocfs2, pacemaker) auto-recover problem
 
I have a two-node Xen cluster running DRBD, OCFS2 and Pacemaker on Ubuntu 11.10. I can live-migrate a VM from node1 to node2 and vice versa without any problems, but if I pull the power cord on the node running the VM, the VM fails to start on the other node.

Here is what crm_mon shows after the failure:
Code:

============
Last updated: Mon Feb 13 13:06:20 2012
Stack: openais
Current DC: clutest2 - partition WITHOUT quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
5 Resources configured.
============

Online: [ clutest2 ]
OFFLINE: [ clutest1 ]

 Master/Slave Set: ms_drbd_master [p_drbd]
    Masters: [ clutest2 ]
    Stopped: [ p_drbd:0 ]
 Clone Set: cl_ocfs2mgmt [g_ocfs2mgmt]
    Started: [ clutest2 ]
    Stopped: [ g_ocfs2mgmt:0 ]
 Clone Set: cl_fs_ocfs2 [p_fs_ocfs2]
    Started: [ clutest2 ]
    Stopped: [ p_fs_ocfs2:0 ]
p_xen-lolwut    (ocf::heartbeat:Xen):  Started clutest2 (unmanaged) FAILED

Failed actions:
    p_xen-lolwut_stop_0 (node=clutest2, call=-1, rc=1, status=Timed Out): unknown error
    p_xen-lolwut_start_0 (node=clutest2, call=-1, rc=1, status=Timed Out): unknown error

Here is the output from crm configure show:
Code:

node clutest1
node clutest2
primitive p_controld ocf:pacemaker:controld
primitive p_drbd ocf:linbit:drbd \
        params drbd_resource="r0" \
        operations $id="op_drbd" \
        op monitor interval="20" role="Master" timeout="20" \
        op monitor interval="30" role="Slave" timeout="20" \
        meta target-role="started"
primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/r0" directory="/domains" fstype="ocfs2" options="rw,noatime"
primitive p_o2cb ocf:pacemaker:o2cb
primitive p_xen-lolwut ocf:heartbeat:Xen \
        params xmfile="/domains/lolwut/lolwut.cfg" \
        op monitor interval="10s" \
        meta target-role="Started" allow-migrate="true"
primitive xen-domutest ocf:heartbeat:Xen \
        params xmfile="/domains/domutest/domutest.cfg" \
        op monitor interval="10s" \
        meta target-role="Stopped" allow-migrate="true"
group g_ocfs2mgmt p_controld p_o2cb
ms ms_drbd_master p_drbd \
        meta resource-stickiness="100" master-max="2" clone-max="2" notify="true" interleave="true"
clone cl_fs_ocfs2 p_fs_ocfs2
clone cl_ocfs2mgmt g_ocfs2mgmt \
        meta interleave="true"
colocation c_lolwut_fs inf: p_xen-lolwut cl_fs_ocfs2
colocation c_ocfs2 inf: cl_fs_ocfs2 cl_ocfs2mgmt ms_drbd_master:Master
order o_lolwut-after-fs inf: cl_fs_ocfs2:start p_xen-lolwut:start
order o_ocfs2 0: ms_drbd_master:promote cl_ocfs2mgmt:start cl_fs_ocfs2:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        default-resource-stickiness="1000"

Any help would be appreciated!

IMNOboist 02-13-2012 04:53 PM

It looks like the problem is in one of the many components of the OCFS2 stack. When I try to cd into /domains (the mount point of the OCFS2 filesystem running on DRBD), the terminal hangs.

IMNOboist 02-14-2012 12:25 PM

When the node that is running the VM (in this case, clutest2) is shut down using halt -p, the VM tries to migrate to the other node, but the migration fails and crm_mon shows this:
Code:

============
Last updated: Tue Feb 14 10:20:39 2012
Stack: openais
Current DC: clutest2 - partition with quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
5 Resources configured.
============

Online: [ clutest1 clutest2 ]

 Master/Slave Set: ms_drbd_master [p_drbd]
    Masters: [ clutest1 clutest2 ]
 Clone Set: cl_ocfs2mgmt [g_ocfs2mgmt]
    Started: [ clutest1 clutest2 ]
 Clone Set: cl_fs_ocfs2 [p_fs_ocfs2]
    p_fs_ocfs2:1      (ocf::heartbeat:Filesystem):    Started clutest2 (unmanaged) FAILED
    Started: [ clutest1 ]
p_xen-lolwut    (ocf::heartbeat:Xen):  Started clutest1

Failed actions:
    p_fs_ocfs2:1_stop_0 (node=clutest2, call=103, rc=-2, status=Timed Out): unknown exec error
    p_o2cb:1_stop_0 (node=clutest2, call=105, rc=1, status=complete): unknown error

The VM then starts on the other node (it did not live-migrate; it was shut down and then started again), but the node that received the halt command never shuts down. It just hangs.

However, if the server that isn't running the VM is shut down, the VM continues humming away on the remaining node without a problem.

Still not sure what's going on...
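
To compare the failures from the two tests side by side, the "Failed actions" lines can be split into fields with a small script. This is only a rough sketch: it assumes each line has the layout seen in the crm_mon output above, with a name of the form <resource>_<operation>_<interval> followed by node, call, rc and status. The function name parse_failed_actions and the SAMPLE text are made up for illustration and are not part of Pacemaker.

```python
import re

# One crm_mon "Failed actions" entry, as it appears in the output above, e.g.
#   p_xen-lolwut_stop_0 (node=clutest2, call=-1, rc=1, status=Timed Out): unknown error
# Assumption: the name is <resource>_<operation>_<interval>.
FAILED_ACTION = re.compile(
    r"^\s*(?P<resource>\S+)_(?P<operation>[a-z]+)_(?P<interval>\d+)\s+"
    r"\(node=(?P<node>[^,]+), call=(?P<call>-?\d+), rc=(?P<rc>-?\d+), "
    r"status=(?P<status>[^)]+)\):\s*(?P<reason>.*)$"
)


def parse_failed_actions(text):
    """Return one dict per failed-action line found in the given text."""
    actions = []
    for line in text.splitlines():
        match = FAILED_ACTION.match(line)
        if match is None:
            continue  # headers and resource status lines are skipped
        entry = match.groupdict()
        # Convert the numeric fields so they can be compared and sorted.
        for key in ("interval", "call", "rc"):
            entry[key] = int(entry[key])
        actions.append(entry)
    return actions


# Lines copied from the two crm_mon outputs in this thread.
SAMPLE = """\
Failed actions:
    p_xen-lolwut_stop_0 (node=clutest2, call=-1, rc=1, status=Timed Out): unknown error
    p_fs_ocfs2:1_stop_0 (node=clutest2, call=103, rc=-2, status=Timed Out): unknown exec error
    p_o2cb:1_stop_0 (node=clutest2, call=105, rc=1, status=complete): unknown error
"""

if __name__ == "__main__":
    for action in parse_failed_actions(SAMPLE):
        print(action["node"], action["resource"], action["operation"],
              action["status"], sep="\t")
```

Grouping the parsed entries by status makes it easier to see that most of the failures in both tests are stop operations that timed out on clutest2.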

