unpredictable fencing behaviour (cman & fenced)
Hi
We are using two servers running Fedora 14 as hosts for a qemu-kvm (version 0.13) virtualisation environment, together with an MD3000i SAN. The configuration of the two-node cluster is included below.
When one host leaves the cluster (let's say vhost2) and vhost1 cannot fence vhost2, the GFS2 file system hangs and the VMs crash. gfs2_quotad seems to be the first victim (see the messages log below). Is there a way to prevent the remaining node from hanging?
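For what it's worth, this is roughly how I intend to test the DRAC agent by hand to reproduce the "error from agent" result; the address, prompt and credentials are simply the ones from the cluster.conf below, and the exact options may differ between fence-agents versions:

# query the DRAC of vhost2 directly, using the same parameters as in cluster.conf
fence_drac5 -a 10.90.80.102 -l root -p Spijkerdrac -c "/admin1->" -o status

# or let the cluster configuration drive the complete fence action against vhost2
fence_node vhost2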
Another strange behaviour: even though the DRAC module is not connected (no UTP cable), the server still gets rebooted (by the other node?). Can a reboot also be triggered on the other host via the quorum disk?
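As far as I understand it, qdiskd itself can reboot a node that loses access to the quorum disk or keeps failing its heuristics, independently of the DRAC, which might explain that reboot. This is the kind of heuristic I have in mind (purely a sketch; the ping target 10.90.80.1 is a made-up gateway address):

<quorumd interval="1" tko="18" votes="1" label="rac_qdisk">
        <!-- node is considered unfit (and rebooted by qdiskd) when this heuristic keeps failing -->
        <heuristic program="ping -c1 -w1 10.90.80.1" score="1" interval="2" tko="4"/>
</quorumd>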
I also have questions about power fencing (like Dell's DRAC module). Isn't it too drastic to reboot a server that doesn't respond (fast enough)? Would it be more advisable to use a fencing method that blocks a port on the switch instead?
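To make that last question more concrete, something like the following is what I had in mind: fabric fencing that shuts the node's port on the iSCSI switch over SNMP instead of power-cycling the server. The switch address, community string and port name are made up and I have not tested this, so please treat it purely as a sketch:

<!-- hypothetical SNMP-managed switch used as a fence device -->
<fencedevice agent="fence_ifmib" name="iscsi-switch" ipaddr="10.90.80.200" community="private"/>

<!-- the per-node fence method would then reference that node's switch port -->
<clusternode name="vhost2" nodeid="2" votes="1">
        <fence>
                <method name="1">
                        <device name="iscsi-switch" port="Gi0/2"/>
                </method>
        </fence>
</clusternode>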
Thanks in advance!
=========================================
/etc/cluster/cluster.conf:
<?xml version="1.0"?>
<cluster config_version="2" name="spijkercluster">
        <fence_daemon clean_start="1" post_fail_delay="45" post_join_delay="45"/>
        <totem token="40000"/>
        <quorumd interval="1" tko="18" votes="1" label="rac_qdisk">
        </quorumd>
        <clusternodes>
                <clusternode name="vhost1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="vhost1-drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="vhost2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="vhost2-drac"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_drac5" cmd_prompt="/admin1->" ipaddr="10.90.80.101" login="root" name="vhost1-drac" passwd="Spijkerdrac"/>
                <fencedevice agent="fence_drac5" cmd_prompt="/admin1->" ipaddr="10.90.80.102" login="root" name="vhost2-drac" passwd="Spijkerdrac"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
        <dlm plock_ownership="1" plock_rate_limit="0"/>
        <gfs_controld plock_rate_limit="0"/>
</cluster>
/var/log/messages:
Feb 15 18:15:21 vhost1 corosync[3304]: [TOTEM ] A processor failed, forming new configuration.
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] CLM CONFIGURATION CHANGE
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] New Configuration:
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] #011r(0) ip(10.29.253.231)
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] Members Left:
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] #011r(0) ip(10.29.253.211)
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] Members Joined:
Feb 15 18:15:23 vhost1 corosync[3304]: [QUORUM] Members[1]: 1
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] CLM CONFIGURATION CHANGE
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] New Configuration:
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] #011r(0) ip(10.29.253.231)
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] Members Left:
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] Members Joined:
Feb 15 18:15:23 vhost1 corosync[3304]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 15 18:15:23 vhost1 corosync[3304]: [CPG ] chosen downlist from node r(0) ip(10.29.253.231)
Feb 15 18:15:23 vhost1 corosync[3304]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 15 18:15:23 vhost1 kernel: dlm: closing connection to node 2
Feb 15 18:15:23 vhost1 kernel: GFS2: fsid=spijkercluster:gfs_vd1.0: jid=1: Trying to acquire journal lock...
Feb 15 18:15:23 vhost1 kernel: GFS2: fsid=spijkercluster:gfs_vd2.0: jid=1: Trying to acquire journal lock...
Feb 15 18:16:08 vhost1 fenced[3534]: fencing node vhost2
Feb 15 18:16:14 vhost1 fenced[3534]: fence vhost2 dev 0.0 agent fence_drac5 result: error from agent
Feb 15 18:16:14 vhost1 fenced[3534]: fence vhost2 failed
Feb 15 18:16:17 vhost1 fenced[3534]: fencing node vhost2
Feb 15 18:16:22 vhost1 fenced[3534]: fence vhost2 dev 0.0 agent fence_drac5 result: error from agent
Feb 15 18:16:22 vhost1 fenced[3534]: fence vhost2 failed
Feb 15 18:16:25 vhost1 fenced[3534]: fencing node vhost2
Feb 15 18:16:30 vhost1 fenced[3534]: fence vhost2 dev 0.0 agent fence_drac5 result: error from agent
Feb 15 18:16:30 vhost1 fenced[3534]: fence vhost2 failed
Feb 15 18:19:19 vhost1 kernel: INFO: task kslowd001:3753 blocked for more than 120 seconds.
Feb 15 18:19:19 vhost1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 18:19:19 vhost1 kernel: kslowd001 D 0000000000000003 0 3753 2 0x00000080
Feb 15 18:19:19 vhost1 kernel: ffff88041d73f9c8 0000000000000046 0000000000000001 000000000000000c
Feb 15 18:19:19 vhost1 kernel: ffff88041d73ffd8 ffff88040f491770 00000000000153c0 ffff88041d73ffd8
Feb 15 18:19:19 vhost1 kernel: 00000000000153c0 00000000000153c0 00000000000153c0 00000000000153c0
Feb 15 18:19:19 vhost1 kernel: Call Trace:
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81037c57>] ? activate_task+0x2f/0x37
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144c77e>] rwsem_down_failed_common+0x91/0xc1
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144c7fe>] rwsem_down_read_failed+0x26/0x30
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81214af4>] call_rwsem_down_read_failed+0x14/0x30
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144beb4>] ? down_read+0x37/0x3b
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02c9af4>] dlm_lock+0x62/0x14d [dlm]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81213836>] ? vsnprintf+0x3ee/0x42a
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa03048a9>] gdlm_lock+0xef/0x107 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa030498e>] ? gdlm_ast+0x0/0x116 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa03048c1>] ? gdlm_bast+0x0/0x43 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ec6e5>] do_xmote+0xed/0x14f [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ec853>] run_queue+0x10c/0x14a [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ed782>] gfs2_glock_nq+0x282/0x2a6 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ed7f1>] gfs2_glock_nq_num+0x4b/0x73 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02fe5b9>] gfs2_recover_work+0x79/0x5e4 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8104814b>] ? try_to_wake_up+0x304/0x316
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810084fa>] ? __unlazy_fpu+0x78/0x85
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810085ee>] ? __switch_to+0xd7/0x227
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810c885c>] ? perf_event_task_sched_out+0x33/0x24a
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810460c3>] ? finish_task_switch+0x42/0xaf
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ed7e9>] ? gfs2_glock_nq_num+0x43/0x73 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810c236b>] slow_work_execute+0x195/0x2ce
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810c266d>] slow_work_thread+0x1c9/0x310
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81066133>] ? autoremove_wake_function+0x0/0x39
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810c24a4>] ? slow_work_thread+0x0/0x310
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81065cb9>] kthread+0x7f/0x87
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8100aa64>] kernel_thread_helper+0x4/0x10
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81065c3a>] ? kthread+0x0/0x87
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8100aa60>] ? kernel_thread_helper+0x0/0x10
Feb 15 18:19:19 vhost1 kernel: INFO: task gfs2_quotad:3759 blocked for more than 120 seconds.
Feb 15 18:19:19 vhost1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 18:19:19 vhost1 kernel: gfs2_quotad D 000000000000000f 0 3759 2 0x00000080
Feb 15 18:19:19 vhost1 kernel: ffff88040d8c1ab8 0000000000000046 ffff88040f496198 0000000100000000
Feb 15 18:19:19 vhost1 kernel: ffff88040d8c1fd8 ffff88040f495dc0 00000000000153c0 ffff88040d8c1fd8
Feb 15 18:19:19 vhost1 kernel: 00000000000153c0 00000000000153c0 00000000000153c0 00000000000153c0
Feb 15 18:19:19 vhost1 kernel: Call Trace:
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144c77e>] rwsem_down_failed_common+0x91/0xc1
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8103dab8>] ? enqueue_entity+0x2d9/0x2e6
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144c7fe>] rwsem_down_read_failed+0x26/0x30
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81214af4>] call_rwsem_down_read_failed+0x14/0x30
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144beb4>] ? down_read+0x37/0x3b
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02c9af4>] dlm_lock+0x62/0x14d [dlm]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8104816f>] ? default_wake_function+0x12/0x14
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810385f2>] ? __wake_up_common+0x4e/0x84
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa03048a9>] gdlm_lock+0xef/0x107 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa030498e>] ? gdlm_ast+0x0/0x116 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa03048c1>] ? gdlm_bast+0x0/0x43 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ec6e5>] do_xmote+0xed/0x14f [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ec853>] run_queue+0x10c/0x14a [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ed782>] gfs2_glock_nq+0x282/0x2a6 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa03016a2>] gfs2_glock_nq_init+0x1e/0x37 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa0301f4e>] gfs2_statfs_sync+0x44/0x13b [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa030169a>] ? gfs2_glock_nq_init+0x16/0x37 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81059821>] ? process_timeout+0x0/0x10
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02fbfd3>] quotad_check_timeo+0x2b/0x85 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02fc163>] gfs2_quotad+0x136/0x24d [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81066133>] ? autoremove_wake_function+0x0/0x39
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02fc02d>] ? gfs2_quotad+0x0/0x24d [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81065cb9>] kthread+0x7f/0x87
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8100aa64>] kernel_thread_helper+0x4/0x10
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81065c3a>] ? kthread+0x0/0x87
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8100aa60>] ? kernel_thread_helper+0x0/0x10
Feb 15 18:19:19 vhost1 kernel: INFO: task gfs2_quotad:3769 blocked for more than 120 seconds.
Feb 15 18:19:19 vhost1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 18:19:19 vhost1 kernel: gfs2_quotad D 000000000000000b 0 3769 2 0x00000080