Old 02-16-2010, 04:27 AM   #1
K_L
LQ Newbie
 
Registered: Feb 2010
Posts: 11

Rep: Reputation: 0
RHEL5 two-node failover cluster configuration with quorum


Hello,

I have been trying to test RHEL5 clustering on VMware ESXi. The idea is to replace a two-node Windows Enterprise setup with Red Hat, but I'm having a problem demoing this environment. I'm using Conga to configure the cluster.

The problem is that if I crash the node running luci, the failover services do not relocate. I probably just don't understand the concepts of votes and health checking.

I have the following configuration:

Shared quorum disk:
Interval: 1
Votes: 1 (the man page says nodes n - 1)
TKO: 5 (should be quick)
Minimum score: 1
Label: qdisk (both nodes see it with mkqdisk -L; a short sketch of creating/checking the label follows after this summary)
Heuristic: I made it ping the gateway

Two nodes

Shared fencing: fence_vmware (had to do a little fine-tuning to get it to work with ESXi 3.5)

Resources:
GFS2 filesystem on a shared VMware disk
IP address
PostgreSQL 8

Service: DB

Failover domain including both nodes, used by the DB service.
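For reference, this is roughly how I created and verified the quorum disk label (the device path here is just an example; the create step is run on one node only):

Quote:
# on one node only: initialize the shared quorum partition (example device)
mkqdisk -c /dev/sdc1 -l qdisk
# on both nodes: confirm the label is visible
mkqdisk -L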

If I shut down node 2, the service relocates to node 1.
If I shut down node 1, the services go offline. On node 2 I can see that the services are still listed as online on node 1, even though node 1 is offline.

I can't understand why this happens.
 
Old 02-16-2010, 09:41 AM   #2
hostmaster
Member
 
Registered: Feb 2007
Posts: 55

Rep: Reputation: 17
Post /etc/cluster/cluster.conf, RHEL version

Quote:
Originally Posted by K_L
If I shut down node 2, the service relocates to node 1.
If I shut down node 1, the services go offline. On node 2 I can see that the services are still listed as online on node 1, even though node 1 is offline.

I can't understand why this happens.
That is very strange, because node 2 should detect it when node 1 fails. Please post the output of the clustat command when this happens, and also /var/log/messages.
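For example, on the surviving node right after you crash the other one, something like this (standard RHEL 5 locations):

Quote:
clustat
tail -n 200 /var/log/messages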
 
Old 02-17-2010, 07:35 AM   #3
K_L
LQ Newbie
 
Registered: Feb 2010
Posts: 11

Original Poster
Rep: Reputation: 0
Node 1 is 192.168.0.5
Node 2 is 192.168.0.6

Cluster.conf
Quote:
<?xml version="1.0"?>
<cluster alias="DarthBane" config_version="48" name="DarthBane">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="192.168.0.6" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="esxi" port="160" secure="1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="192.168.0.5" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="esxi" port="144" secure="1"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="3"/>
  <fencedevices>
    <fencedevice agent="fence_vmware" ipaddr="xxx.yyy.zzz.fff" login="arse" name="esxi" passwd="xxx" vmlogin="root" vmpasswd="xxx"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="RuleOfTwo" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="192.168.0.6" priority="1"/>
        <failoverdomainnode name="192.168.0.5" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <clusterfs device="/dev/sdb1" force_unmount="0" fsid="1313" fstype="gfs2" mountpoint="/data" name="pgdata1" self_fence="0"/>
      <ip address="192.168.0.7" monitor_link="0"/>
      <postgres-8 config_file="/data/postgresql.conf" name="PGDB" postmaster_user="postgres" shutdown_wait="0"/>
    </resources>
    <service autostart="1" domain="RuleOfTwo" exclusive="0" name="DB" recovery="relocate">
      <ip ref="192.168.0.7"/>
      <clusterfs fstype="gfs" ref="pgdata1"/>
      <postgres-8 ref="PGDB"/>
    </service>
    <service autostart="1" domain="RuleOfTwo" exclusive="0" name="TestIP" recovery="relocate">
      <ip address="192.168.0.8" monitor_link="0"/>
    </service>
  </rm>
  <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
  <quorumd interval="1" label="qdisk" min_score="1" tko="5" votes="1">
    <heuristic interval="2" program="ping -c3 -t2 xxx.yyy.zzz.vvv" score="1"/>
  </quorumd>
</cluster>
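If I understand the vote arithmetic right (assuming cman's usual quorum formula of expected_votes/2 + 1, with integer division), this config gives:

Quote:
2 node votes (1 each) + 1 qdisk vote = 3 expected votes
quorum = 3/2 + 1 = 2
one node down: 1 node vote + 1 qdisk vote = 2 >= 2, still quorate

That seems to match the "Member Status: Quorate" in the clustat output below.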
Clustat
Quote:
Cluster Status for DarthBane @ Wed Feb 17 14:30:48 2010
Member Status: Quorate

 Member Name                               ID   Status
 ------ ----                               ---- ------
 192.168.0.6                                  1 Online, Local, rgmanager
 192.168.0.5                                  2 Offline
 /dev/disk/by-path/pci-0000:00:10.0-scsi-     0 Online, Quorum Disk

 Service Name          Owner (Last)          State
 ------- ----          ----- ------          -----
 service:EMSDB         192.168.0.5           started
 service:TestIP        192.168.0.5           started
messages since node 1 went offline:
Quote:
Feb 17 14:30:13 DarthMalak qdiskd[1958]: <notice> Writing eviction notice for node 2
Feb 17 14:30:14 DarthMalak qdiskd[1958]: <notice> Node 2 evicted
Feb 17 14:30:17 DarthMalak openais[1939]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb 17 14:30:17 DarthMalak openais[1939]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Feb 17 14:30:17 DarthMalak openais[1939]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Feb 17 14:30:17 DarthMalak openais[1939]: [TOTEM] entering GATHER state from 2.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] entering GATHER state from 0.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] Creating commit token because I am the rep.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] Saving state aru 82 high seq received 82
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] Storing new sequence id for ring 130
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] entering COMMIT state.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] entering RECOVERY state.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] position [0] member 192.168.0.6:
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] previous ring seq 300 rep 192.168.0.5
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] aru 82 high delivered 82 received flag 1
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] Did not need to originate any messages in recovery.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] Sending initial ORF token
Feb 17 14:30:24 DarthMalak fenced[1974]: 192.168.0.5 not a cluster member after 0 sec post_fail_delay
Feb 17 14:30:24 DarthMalak kernel: dlm: closing connection to node 2
Feb 17 14:30:23 DarthMalak openais[1939]: [CLM ] CLM CONFIGURATION CHANGE
Feb 17 14:30:24 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Trying to acquire journal lock...
Feb 17 14:30:24 DarthMalak openais[1939]: [CLM ] New Configuration:
Feb 17 14:30:24 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Looking at journal...
Feb 17 14:30:24 DarthMalak openais[1939]: [CLM ] r(0) ip(192.168.0.6)
Feb 17 14:30:24 DarthMalak openais[1939]: [CLM ] Members Left:
Feb 17 14:30:24 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Acquiring the transaction lock...
Feb 17 14:30:24 DarthMalak openais[1939]: [CLM ] r(0) ip(192.168.0.5)
Feb 17 14:30:24 DarthMalak openais[1939]: [CLM ] Members Joined:
Feb 17 14:30:25 DarthMalak openais[1939]: [CLM ] CLM CONFIGURATION CHANGE
Feb 17 14:30:25 DarthMalak openais[1939]: [CLM ] New Configuration:
Feb 17 14:30:25 DarthMalak openais[1939]: [CLM ] r(0) ip(192.168.0.6)
Feb 17 14:30:26 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Replaying journal...
Feb 17 14:30:27 DarthMalak openais[1939]: [CLM ] Members Left:
Feb 17 14:30:27 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Replayed 10 of 10 blocks
Feb 17 14:30:27 DarthMalak openais[1939]: [CLM ] Members Joined:
Feb 17 14:30:27 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Found 1 revoke tags
Feb 17 14:30:27 DarthMalak openais[1939]: [SYNC ] This node is within the primary component and will provide service.
Feb 17 14:30:27 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Journal replayed in 2s
Feb 17 14:30:27 DarthMalak openais[1939]: [TOTEM] entering OPERATIONAL state.
Feb 17 14:30:27 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Done
Feb 17 14:30:27 DarthMalak openais[1939]: [CLM ] got nodejoin message 192.168.0.6
Feb 17 14:30:27 DarthMalak openais[1939]: [CPG ] got joinlist message from node 1
If I restart node 1, the services then move to node 2.

Last edited by K_L; 02-17-2010 at 07:37 AM.
 
Old 02-17-2010, 09:33 AM   #4
hostmaster
Member
 
Registered: Feb 2007
Posts: 55

Rep: Reputation: 17
Is that all the logs? There are no logs related to clurgmgrd. What is the maximum time you kept node 2 offline? Which RHEL version (major.minor)?
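For example, the exact version can be read with:

Quote:
cat /etc/redhat-release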
 
Old 02-17-2010, 11:10 AM   #5
K_L
LQ Newbie
 
Registered: Feb 2010
Posts: 11

Original Poster
Rep: Reputation: 0
I'm running RHEL 5.4 x86_64.
The node has been offline since I made that post, and it is still in a similar condition.
There have been no clurgmgrd-related messages since the other node went down.

Maybe I'm missing something on the other node, or the configuration is simply wrong. My luck is that this is not a production environment.
 
Old 02-18-2010, 02:43 AM   #6
hostmaster
Member
 
Registered: Feb 2007
Posts: 55

Rep: Reputation: 17
I had a similar problem with RHEL 5.4 x86_64 recently, where clurgmgrd did not recognize newly added services on all 4 cluster nodes. A reboot of all nodes solved the issue and I have not faced the problem since. Have you tried rebooting both machines? If not, I recommend trying it once and seeing what happens.
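If a full reboot is inconvenient, restarting the cluster stack in order may have the same effect. A sketch using the standard RHEL 5 Cluster Suite init scripts (include qdiskd/gfs2 only if you actually use them):

Quote:
# stop the stack, leaf services first
service rgmanager stop
service gfs2 stop
service qdiskd stop
service cman stop
# then start it again in the reverse order
service cman start
service qdiskd start
service gfs2 start
service rgmanager start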
 
Old 02-18-2010, 08:16 AM   #7
K_L
LQ Newbie
 
Registered: Feb 2010
Posts: 11

Original Poster
Rep: Reputation: 0
Hello, and thanks for answering.

I tried rebooting both nodes and testing again. I still get the same result: the services stay on the offline node.

When the problem node is restarted, the services do move to the online node.

Quote:
Feb 18 15:01:51 DarthMalak kernel: dlm: got connection from 2
Feb 18 15:02:10 DarthMalak clurgmgrd[2499]: <notice> Recovering failed service service:TestIP
Feb 18 15:02:12 DarthMalak avahi-daemon[2419]: Registering new address record for 192.168.0.8 on eth1.
Feb 18 15:02:12 DarthMalak clurgmgrd[2499]: <notice> Recovering failed service service:EMSDB
Feb 18 15:02:17 DarthMalak clurgmgrd[2499]: <notice> Service service:TestIP started
Feb 18 15:02:18 DarthMalak kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "DarthBane:pgdata1"
Feb 18 15:02:18 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.1: Joined cluster. Now mounting FS...
Feb 18 15:02:18 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.1: jid=1, already locked for use
Feb 18 15:02:18 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.1: jid=1: Looking at journal...
Feb 18 15:02:18 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.1: jid=1: Done
Feb 18 15:02:19 DarthMalak avahi-daemon[2419]: Registering new address record for 192.168.0.7 on eth1.
Feb 18 15:02:24 DarthMalak clurgmgrd[2499]: <notice> Service service:EMSDB started
I wanted to see if it still works the other way round:

Quote:
Cluster Status for DarthBane @ Thu Feb 18 15:08:56 2010
Member Status: Quorate

 Member Name                               ID   Status
 ------ ----                               ---- ------
 192.168.0.6                                  1 Offline
 192.168.0.5                                  2 Online, Local, rgmanager
 /dev/disk/by-path/pci-0000:00:10.0-scsi-     0 Online, Quorum Disk

 Service Name          Owner (Last)          State
 ------- ----          ----- ------          -----
 service:EMSDB         192.168.0.5           started
 service:TestIP        192.168.0.5           started
The services moved.

messages log:
Quote:
Feb 18 15:07:14 DarthRevan qdiskd[1990]: <info> Assuming master role
Feb 18 15:07:15 DarthRevan qdiskd[1990]: <notice> Writing eviction notice for node 1
Feb 18 15:07:16 DarthRevan qdiskd[1990]: <notice> Node 1 evicted
Feb 18 15:07:16 DarthRevan openais[1971]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb 18 15:07:16 DarthRevan openais[1971]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Feb 18 15:07:16 DarthRevan openais[1971]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Feb 18 15:07:16 DarthRevan openais[1971]: [TOTEM] entering GATHER state from 2.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] entering GATHER state from 0.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] Creating commit token because I am the rep.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] Saving state aru 73 high seq received 73
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] Storing new sequence id for ring 15c
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] entering COMMIT state.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] entering RECOVERY state.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] position [0] member 192.168.0.5:
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] previous ring seq 344 rep 192.168.0.5
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] aru 73 high delivered 73 received flag 1
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] Did not need to originate any messages in recovery.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] Sending initial ORF token
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] CLM CONFIGURATION CHANGE
Feb 18 15:07:21 DarthRevan fenced[2006]: 192.168.0.6 not a cluster member after 0 sec post_fail_delay
Feb 18 15:07:21 DarthRevan kernel: dlm: closing connection to node 1
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] New Configuration:
Feb 18 15:07:21 DarthRevan fenced[2006]: fencing node "192.168.0.6"
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.5)
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] Members Left:
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.6)
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] Members Joined:
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] CLM CONFIGURATION CHANGE
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] New Configuration:
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.5)
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] Members Left:
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] Members Joined:
Feb 18 15:07:21 DarthRevan openais[1971]: [SYNC ] This node is within the primary component and will provide service.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] entering OPERATIONAL state.
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] got nodejoin message 192.168.0.5
Feb 18 15:07:21 DarthRevan openais[1971]: [CPG ] got joinlist message from node 2
Feb 18 15:07:36 DarthRevan fenced[2006]: agent "fence_vmware" reports: Connection timed out
Feb 18 15:07:36 DarthRevan fenced[2006]: fence "192.168.0.6" failed
Feb 18 15:07:41 DarthRevan fenced[2006]: fencing node "192.168.0.6"
Feb 18 15:07:58 DarthRevan fenced[2006]: fence "192.168.0.6" success
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Trying to acquire journal lock...
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Looking at journal...
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Acquiring the transaction lock...
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Replaying journal...
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Replayed 0 of 0 blocks
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Found 0 revoke tags
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Journal replayed in 0s
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Done
Feb 18 15:07:59 DarthRevan clurgmgrd[2591]: <notice> Taking over service service:EMSDB from down member 192.168.0.6
Feb 18 15:07:59 DarthRevan clurgmgrd[2591]: <notice> Taking over service service:TestIP from down member 192.168.0.6
Feb 18 15:07:59 DarthRevan avahi-daemon[2511]: Registering new address record for 192.168.0.8 on eth1.
Feb 18 15:07:59 DarthRevan avahi-daemon[2511]: Registering new address record for 192.168.0.7 on eth1.
Feb 18 15:08:00 DarthRevan clurgmgrd[2591]: <notice> Service service:TestIP started
Feb 18 15:08:01 DarthRevan clurgmgrd[2591]: <notice> Service service:EMSDB started
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] entering GATHER state from 11.
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] Creating commit token because I am the rep.
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] Saving state aru 26 high seq received 26
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] Storing new sequence id for ring 160
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] entering COMMIT state.
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] entering RECOVERY state.
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] position [0] member 192.168.0.5:
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] previous ring seq 348 rep 192.168.0.5
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] aru 26 high delivered 26 received flag 1
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] position [1] member 192.168.0.6:
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] previous ring seq 348 rep 192.168.0.6
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] aru a high delivered a received flag 1
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] Did not need to originate any messages in recovery.
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] Sending initial ORF token
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] CLM CONFIGURATION CHANGE
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] New Configuration:
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.5)
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] Members Left:
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] Members Joined:
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] CLM CONFIGURATION CHANGE
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] New Configuration:
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.5)
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.6)
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] Members Left:
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] Members Joined:
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.6)
Feb 18 15:09:08 DarthRevan openais[1971]: [SYNC ] This node is within the primary component and will provide service.
Feb 18 15:09:08 DarthRevan openais[1971]: [TOTEM] entering OPERATIONAL state.
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] got nodejoin message 192.168.0.5
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] got nodejoin message 192.168.0.6
Feb 18 15:09:08 DarthRevan openais[1971]: [CPG ] got joinlist message from node 2
Feb 18 15:09:18 DarthRevan qdiskd[1990]: <warning> qdisk cycle took more than 1 second to complete (5.940000)
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <info> Node 1 is the master
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <warning> Master conflict: abdicating
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <warning> qdisk cycle took more than 1 second to complete (1.780000)
Feb 18 15:10:38 DarthRevan kernel: dlm: got connection from 1
Feb 18 15:11:01 DarthRevan clurgmgrd[2591]: <err> #37: Error receiving header from 1 sz=0 CTX 0x29cf9c0
Feb 18 15:11:01 DarthRevan openais[1971]: [TOTEM] Retransmit List: 28
As we can see, after the node fails, fence_vmware kicks in.
The services move to the available node and the failed node is restarted.

Last edited by K_L; 02-18-2010 at 08:17 AM.
 
Old 02-18-2010, 12:47 PM   #8
hostmaster
Member
 
Registered: Feb 2007
Posts: 55

Rep: Reputation: 17
Please attach your /var/log/messages from both machines, along with the output of the following commands from both nodes:
cman_tool status
cman_tool nodes
Have you tried manually relocating the service after this problem occurs (clusvcadm -r [servicename])?
Also, why are you using GFS in a failover environment? GFS is needed when two nodes must access the filesystem simultaneously.
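For example, using the service and member names from your clustat output (the target member is optional):

Quote:
clusvcadm -r EMSDB -m 192.168.0.6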
 
Old 02-18-2010, 04:17 PM   #9
elcody02
Member
 
Registered: Jun 2007
Posts: 52

Rep: Reputation: 17
What makes me wonder are the qdisk problems:

Quote:
Feb 18 15:09:18 DarthRevan qdiskd[1990]: <warning> qdisk cycle took more than 1 second to complete (5.940000)
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <info> Node 1 is the master
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <warning> Master conflict: abdicating
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <warning> qdisk cycle took more than 1 second to complete (1.780000)
Feb 18 15:10:38 DarthRevan kernel: dlm: got connection from 1
Feb 18 15:11:01 DarthRevan clurgmgrd[2591]: <err> #37: Error receiving header from 1 sz=0 CTX 0x29cf9c0
Feb 18 15:11:01 DarthRevan openais[1971]: [TOTEM] Retransmit List: 28
I would say you should configure logging both for rgmanager and for cman/openais, something like the following:

Quote:
<rm log_level="7">
  ..
</rm>
<logging syslog_facility="local4">
  <logger ident="CPG" debug="on" syslog_facility="local4"/>
  <logger ident="CMAN" debug="on" syslog_facility="local4"/>
</logging>
Also add something like

Quote:
*.debug -/var/log/debug
to /etc/syslog.conf and restart syslogd.
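To push the cluster.conf changes to both nodes, something along these lines should do (remember to increase config_version first):

Quote:
# propagate the updated cluster.conf to all cluster members
ccs_tool update /etc/cluster/cluster.conf
# pick up the new syslog rule
service syslog restart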

Then it might be possible to tell more.

And did you have any specific reason to specify those totem protocol parameters?

Last edited by elcody02; 02-18-2010 at 04:32 PM. Reason: totem parameters??
 
Old 02-19-2010, 04:07 AM   #10
K_L
LQ Newbie
 
Registered: Feb 2010
Posts: 11

Original Poster
Rep: Reputation: 0
Hello,

I am able to relocate resources to the problem node manually, and if I do a normal shutdown the services move. The problem seems to exist only when the node crashes. I have simulated the crash with a simple power-off from VMware ESXi.

The reason for using GFS is that I was doing other demoing related to shared disks.
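I understand GFS is not strictly needed here; for a pure failover setup a plain fs resource would presumably be the better fit, something like this (a sketch, assuming the same device held an ext3 filesystem instead):

Quote:
<fs name="pgdata1" device="/dev/sdb1" mountpoint="/data" fstype="ext3" force_unmount="1"/>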

cman_tool status
Quote:
Version: 6.2.0
Config Version: 48
Cluster Name: DarthBane
Cluster Id: 43733
Cluster Member: Yes
Cluster Generation: 372
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Quorum: 2
Active subsystems: 10
Flags: Dirty
Ports Bound: 0 11 177
Node name: 192.168.0.5
Node ID: 2
Multicast addresses: 239.192.170.128
Node addresses: 192.168.0.5


Version: 6.2.0
Config Version: 48
Cluster Name: DarthBane
Cluster Id: 43733
Cluster Member: Yes
Cluster Generation: 372
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Quorum: 2
Active subsystems: 9
Flags: Dirty
Ports Bound: 0
Node name: 192.168.0.6
Node ID: 1
Multicast addresses: 239.192.170.128
Node addresses: 192.168.0.6
cman_tool nodes
Quote:
Node Sts Inc Joined Name
0 M 0 2010-02-19 08:07:28 /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:2:0-part1
1 M 368 2010-02-19 08:16:31 192.168.0.6
2 M 356 2010-02-19 08:07:10 192.168.0.5


Node Sts Inc Joined Name
0 M 0 2010-02-19 08:16:51 /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:2:0-part1
1 M 368 2010-02-19 08:16:31 192.168.0.6
2 M 368 2010-02-19 08:16:31 192.168.0.5
I added the debugging. This line wasn't there in the other tests:

Quote:
Feb 19 09:48:43 DarthMalak clurgmgrd[2495]: <info> Waiting for node #2 to be fenced
I waited a bit over 10 minutes but nothing happened. I checked the fence config and it was correct.

Quote:
[root@DarthMalak ~]# fence_vmware -x -a xxx.yyy.zzz.fff -l user -p pass -L user -P pass -n 144 -o status
Status: OFF
I tried running fence_vmware myself and it hit the timeout. This is because the fence_vmware timeout is too short: the fence command itself was actually successful and the node rebooted.
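For reference, the reboot action can be tested by hand the same way as the status check above, just with -o reboot (same placeholder address and credentials):

Quote:
fence_vmware -x -a xxx.yyy.zzz.fff -l user -p pass -L user -P pass -n 144 -o reboot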

I was unable to attach the entire log files due to size limits.
Malak debug http://pastebin.com/m48b06f35
Revan debug http://pastebin.com/m1ef95d51

Last edited by K_L; 02-19-2010 at 04:12 AM.
 
  

