Old 03-14-2011, 09:18 AM   #1
boeboe2005
LQ Newbie
 
Registered: Nov 2005
Posts: 13

Rep: Reputation: 0
unpredictable fencing behaviour (cman & fenced)


Hi

We are using two servers running Fedora 14 as hosts for a qemu-kvm (version 0.13) virtualisation environment, with an MD3000i SAN. The configuration of the two-node cluster is shown below.
When one host leaves the cluster (let's say vhost2) and vhost1 can't fence vhost2, the GFS2 file system hangs (the VMs crash). gfs2_quotad seems to be the first victim (see the messages log below). Is there a way to prevent the surviving system from hanging?
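
The only workaround I have found so far is acknowledging the failed fence by hand so that DLM/GFS2 recovery can continue. A rough sketch of what I mean, assuming the fence_ack_manual tool that ships with cman/fenced (the exact syntax seems to differ between cluster versions), and obviously only after making absolutely sure the other node really is down:

# run on the surviving node, ONLY after verifying vhost2 is really powered off
fence_ack_manual -n vhost2      # cluster 2 style syntax
fence_ack_manual vhost2         # cluster 3 style, if -n is not accepted

Is that the intended way to unblock GFS2 in this situation, or is there a cleaner option?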

Another strange behaviour is that even though the DRAC module is not connected (no UTP cable), the server still gets rebooted (by the other host?). Can a reboot also be triggered via the quorum disk?
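
Reading the qdisk(5) man page, it looks like qdiskd itself can reboot a node whose score drops too low, independent of the DRAC fencing, so maybe that explains the reboot. A sketch of what I'm thinking of trying in cluster.conf; the heuristic program, the gateway address and the reboot="0" attribute are my own guesses from the man page, not something we run today:

<quorumd interval="1" tko="18" votes="1" label="rac_qdisk" reboot="0">
  <!-- node keeps its score only while it can ping the iSCSI gateway (address made up) -->
  <heuristic program="ping -c1 -w1 10.90.80.1" score="1" interval="2" tko="10"/>
</quorumd>

Does that sound like a plausible explanation for the reboot?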

I also have a question about power fencing (like Dell's DRAC module). Isn't it too drastic to reboot a server just because it doesn't respond (fast enough)? Would it be more advisable to use a fencing method that blocks the node's port on the switch instead? A rough sketch of what I mean follows.
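
For example, something along these lines with an SNMP agent such as fence_ifmib, which shuts down the node's switch port instead of power-cycling the server; the switch address, community string and interface name below are made up and untested:

<fencedevices>
  <fencedevice agent="fence_ifmib" name="iscsi-switch" ipaddr="10.90.80.200" community="private" snmp_version="2c"/>
</fencedevices>
...
<clusternode name="vhost2" nodeid="2" votes="1">
  <fence>
    <method name="1">
      <!-- "port" is the switch interface the node's iSCSI NIC is plugged into -->
      <device name="iscsi-switch" port="GigabitEthernet0/2"/>
    </method>
  </fence>
</clusternode>

Is that kind of fabric fencing considered good practice for a GFS2 setup like this?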

Thanks in advance!

=========================================

/etc/cluster/cluster.conf:
<?xml version="1.0"?>
<cluster config_version="2" name="spijkercluster">
<fence_daemon clean_start="1" post_fail_delay="45" post_join_delay="45"/>
<totem token="40000" />
<quorumd interval="1" tko="18" votes="1" label="rac_qdisk" >
</quorumd>
<clusternodes>
<clusternode name="vhost1" nodeid="1" votes="1">
<fence>
<method name="1">
<device name="vhost1-drac"/>
</method>
</fence>
</clusternode>
<clusternode name="vhost2" nodeid="2" votes="1">
<fence>
<method name="1">
<device name="vhost2-drac"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_drac5" cmd_prompt="/admin1->" ipaddr="10.90.80.101" login="root" name="vhost1-drac" passwd="Spijkerdrac"/>
<fencedevice agent="fence_drac5" cmd_prompt="/admin1->" ipaddr="10.90.80.102" login="root" name="vhost2-drac" passwd="Spijkerdrac"/>
</fencedevices>
<rm>
<failoverdomains/>
<resources/>
</rm>
<dlm plock_ownership="1" plock_rate_limit="0"/>
<gfs_controld plock_rate_limit="0"/>
</cluster>

/var/log/messages

Feb 15 18:15:21 vhost1 corosync[3304]: [TOTEM ] A processor failed, forming new configuration.
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] CLM CONFIGURATION CHANGE
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] New Configuration:
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] #011r(0) ip(10.29.253.231)
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] Members Left:
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] #011r(0) ip(10.29.253.211)
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] Members Joined:
Feb 15 18:15:23 vhost1 corosync[3304]: [QUORUM] Members[1]: 1
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] CLM CONFIGURATION CHANGE
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] New Configuration:
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] #011r(0) ip(10.29.253.231)
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] Members Left:
Feb 15 18:15:23 vhost1 corosync[3304]: [CLM ] Members Joined:
Feb 15 18:15:23 vhost1 corosync[3304]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 15 18:15:23 vhost1 corosync[3304]: [CPG ] chosen downlist from node r(0) ip(10.29.253.231)
Feb 15 18:15:23 vhost1 corosync[3304]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 15 18:15:23 vhost1 kernel: dlm: closing connection to node 2
Feb 15 18:15:23 vhost1 kernel: GFS2: fsid=spijkercluster:gfs_vd1.0: jid=1: Trying to acquire journal lock...
Feb 15 18:15:23 vhost1 kernel: GFS2: fsid=spijkercluster:gfs_vd2.0: jid=1: Trying to acquire journal lock...
Feb 15 18:16:08 vhost1 fenced[3534]: fencing node vhost2
Feb 15 18:16:14 vhost1 fenced[3534]: fence vhost2 dev 0.0 agent fence_drac5 result: error from agent
Feb 15 18:16:14 vhost1 fenced[3534]: fence vhost2 failed
Feb 15 18:16:17 vhost1 fenced[3534]: fencing node vhost2
Feb 15 18:16:22 vhost1 fenced[3534]: fence vhost2 dev 0.0 agent fence_drac5 result: error from agent
Feb 15 18:16:22 vhost1 fenced[3534]: fence vhost2 failed
Feb 15 18:16:25 vhost1 fenced[3534]: fencing node vhost2
Feb 15 18:16:30 vhost1 fenced[3534]: fence vhost2 dev 0.0 agent fence_drac5 result: error from agent
Feb 15 18:16:30 vhost1 fenced[3534]: fence vhost2 failed
Feb 15 18:19:19 vhost1 kernel: INFO: task kslowd001:3753 blocked for more than 120 seconds.
Feb 15 18:19:19 vhost1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 18:19:19 vhost1 kernel: kslowd001 D 0000000000000003 0 3753 2 0x00000080
Feb 15 18:19:19 vhost1 kernel: ffff88041d73f9c8 0000000000000046 0000000000000001 000000000000000c
Feb 15 18:19:19 vhost1 kernel: ffff88041d73ffd8 ffff88040f491770 00000000000153c0 ffff88041d73ffd8
Feb 15 18:19:19 vhost1 kernel: 00000000000153c0 00000000000153c0 00000000000153c0 00000000000153c0
Feb 15 18:19:19 vhost1 kernel: Call Trace:
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81037c57>] ? activate_task+0x2f/0x37
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144c77e>] rwsem_down_failed_common+0x91/0xc1
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144c7fe>] rwsem_down_read_failed+0x26/0x30
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81214af4>] call_rwsem_down_read_failed+0x14/0x30
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144beb4>] ? down_read+0x37/0x3b
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02c9af4>] dlm_lock+0x62/0x14d [dlm]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81213836>] ? vsnprintf+0x3ee/0x42a
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa03048a9>] gdlm_lock+0xef/0x107 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa030498e>] ? gdlm_ast+0x0/0x116 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa03048c1>] ? gdlm_bast+0x0/0x43 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ec6e5>] do_xmote+0xed/0x14f [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ec853>] run_queue+0x10c/0x14a [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ed782>] gfs2_glock_nq+0x282/0x2a6 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ed7f1>] gfs2_glock_nq_num+0x4b/0x73 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02fe5b9>] gfs2_recover_work+0x79/0x5e4 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8104814b>] ? try_to_wake_up+0x304/0x316
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810084fa>] ? __unlazy_fpu+0x78/0x85
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810085ee>] ? __switch_to+0xd7/0x227
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810c885c>] ? perf_event_task_sched_out+0x33/0x24a
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810460c3>] ? finish_task_switch+0x42/0xaf
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ed7e9>] ? gfs2_glock_nq_num+0x43/0x73 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810c236b>] slow_work_execute+0x195/0x2ce
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810c266d>] slow_work_thread+0x1c9/0x310
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81066133>] ? autoremove_wake_function+0x0/0x39
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810c24a4>] ? slow_work_thread+0x0/0x310
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81065cb9>] kthread+0x7f/0x87
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8100aa64>] kernel_thread_helper+0x4/0x10
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81065c3a>] ? kthread+0x0/0x87
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8100aa60>] ? kernel_thread_helper+0x0/0x10
Feb 15 18:19:19 vhost1 kernel: INFO: task gfs2_quotad:3759 blocked for more than 120 seconds.
Feb 15 18:19:19 vhost1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 18:19:19 vhost1 kernel: gfs2_quotad D 000000000000000f 0 3759 2 0x00000080
Feb 15 18:19:19 vhost1 kernel: ffff88040d8c1ab8 0000000000000046 ffff88040f496198 0000000100000000
Feb 15 18:19:19 vhost1 kernel: ffff88040d8c1fd8 ffff88040f495dc0 00000000000153c0 ffff88040d8c1fd8
Feb 15 18:19:19 vhost1 kernel: 00000000000153c0 00000000000153c0 00000000000153c0 00000000000153c0
Feb 15 18:19:19 vhost1 kernel: Call Trace:
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144c77e>] rwsem_down_failed_common+0x91/0xc1
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8103dab8>] ? enqueue_entity+0x2d9/0x2e6
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144c7fe>] rwsem_down_read_failed+0x26/0x30
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81214af4>] call_rwsem_down_read_failed+0x14/0x30
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8144beb4>] ? down_read+0x37/0x3b
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02c9af4>] dlm_lock+0x62/0x14d [dlm]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8104816f>] ? default_wake_function+0x12/0x14
Feb 15 18:19:19 vhost1 kernel: [<ffffffff810385f2>] ? __wake_up_common+0x4e/0x84
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa03048a9>] gdlm_lock+0xef/0x107 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa030498e>] ? gdlm_ast+0x0/0x116 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa03048c1>] ? gdlm_bast+0x0/0x43 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ec6e5>] do_xmote+0xed/0x14f [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ec853>] run_queue+0x10c/0x14a [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02ed782>] gfs2_glock_nq+0x282/0x2a6 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa03016a2>] gfs2_glock_nq_init+0x1e/0x37 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa0301f4e>] gfs2_statfs_sync+0x44/0x13b [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa030169a>] ? gfs2_glock_nq_init+0x16/0x37 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81059821>] ? process_timeout+0x0/0x10
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02fbfd3>] quotad_check_timeo+0x2b/0x85 [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02fc163>] gfs2_quotad+0x136/0x24d [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81066133>] ? autoremove_wake_function+0x0/0x39
Feb 15 18:19:19 vhost1 kernel: [<ffffffffa02fc02d>] ? gfs2_quotad+0x0/0x24d [gfs2]
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81065cb9>] kthread+0x7f/0x87
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8100aa64>] kernel_thread_helper+0x4/0x10
Feb 15 18:19:19 vhost1 kernel: [<ffffffff81065c3a>] ? kthread+0x0/0x87
Feb 15 18:19:19 vhost1 kernel: [<ffffffff8100aa60>] ? kernel_thread_helper+0x0/0x10
Feb 15 18:19:19 vhost1 kernel: INFO: task gfs2_quotad:3769 blocked for more than 120 seconds.
Feb 15 18:19:19 vhost1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 18:19:19 vhost1 kernel: gfs2_quotad D 000000000000000b 0 3769 2 0x00000080

Last edited by boeboe2005; 03-14-2011 at 09:22 AM. Reason: forgot
 
  

