Linux - Enterprise: This forum is for all items relating to using Linux in the Enterprise.
I have been trying to test RHEL5 clustering on VMware ESXi. The idea is to replace a two-node Windows Enterprise cluster with Red Hat, but I'm having trouble demoing this environment. I'm using Conga to configure the cluster.
The problem is that if I crash the node running luci, the failover services do not relocate. I probably just don't understand the concepts of votes and health checking.
I have the following configuration (see the cluster.conf sketch after this list):
Shared quorum disk:
- Interval: 1
- Votes: 1 (the man page says number of nodes minus 1)
- TKO: 5 (should be quick)
- Minimum score: 1
- Label: qdisk (both nodes see it with mkqdisk -L)
- Heuristic: I made it ping the gateway
Two nodes
Shared fencing: fence_vmware (had to do a little fine-tuning to get it to work with ESXi 3.5)
Resources:
- GFS2 filesystem on a shared VMware disk
- IP address
- PostgreSQL 8
Service: DB
Failover domain including both nodes and the service DB.
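As a sanity check on the votes: two nodes at 1 vote each plus the quorum disk's 1 vote gives 3 votes total, so quorum is 2, and a single surviving node plus the qdisk should stay quorate. A minimal sketch of that quorum-disk piece in cluster.conf, assuming the standard qdisk(5) attributes; the gateway address 192.168.0.1 and the heuristic interval are placeholders I picked, not values from this thread:
Code:
<quorumd interval="1" tko="5" votes="1" min_score="1" label="qdisk">
    <!-- the node keeps its qdisk vote only while the gateway answers ping -->
    <!-- 192.168.0.1 is a placeholder gateway; adjust for your network -->
    <heuristic program="ping -c1 -w1 192.168.0.1" score="1" interval="2"/>
</quorumd>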
If I shut down node 2, the service relocates to node 1.
If I shut down node 1, the services go offline. On node 2 I can see that the services are still marked online on node 1, even though node 1 is offline.
I can't understand why this happens.
It's very strange, because node 2 should detect node 1 if it fails. Please post the result of the clustat command when this happens, and also /var/log/messages.
Quote:
Cluster Status for DarthBane @ Wed Feb 17 14:30:48 2010
Member Status: Quorate

Member Name                               ID   Status
------ ----                               ---- ------
192.168.0.6                                  1 Online, Local, rgmanager
192.168.0.5                                  2 Offline
/dev/disk/by-path/pci-0000:00:10.0-scsi-     0 Online, Quorum Disk

Service Name         Owner (Last)         State
------- ----         ----- ------         -----
service:EMSDB        192.168.0.5          started
service:TestIP       192.168.0.5          started
messages since node 1 went offline:
Quote:
Feb 17 14:30:13 DarthMalak qdiskd[1958]: <notice> Writing eviction notice for node 2
Feb 17 14:30:14 DarthMalak qdiskd[1958]: <notice> Node 2 evicted
Feb 17 14:30:17 DarthMalak openais[1939]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb 17 14:30:17 DarthMalak openais[1939]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Feb 17 14:30:17 DarthMalak openais[1939]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Feb 17 14:30:17 DarthMalak openais[1939]: [TOTEM] entering GATHER state from 2.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] entering GATHER state from 0.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] Creating commit token because I am the rep.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] Saving state aru 82 high seq received 82
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] Storing new sequence id for ring 130
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] entering COMMIT state.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] entering RECOVERY state.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] position [0] member 192.168.0.6:
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] previous ring seq 300 rep 192.168.0.5
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] aru 82 high delivered 82 received flag 1
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] Did not need to originate any messages in recovery.
Feb 17 14:30:23 DarthMalak openais[1939]: [TOTEM] Sending initial ORF token
Feb 17 14:30:24 DarthMalak fenced[1974]: 192.168.0.5 not a cluster member after 0 sec post_fail_delay
Feb 17 14:30:24 DarthMalak kernel: dlm: closing connection to node 2
Feb 17 14:30:23 DarthMalak openais[1939]: [CLM ] CLM CONFIGURATION CHANGE
Feb 17 14:30:24 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Trying to acquire journal lock...
Feb 17 14:30:24 DarthMalak openais[1939]: [CLM ] New Configuration:
Feb 17 14:30:24 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Looking at journal...
Feb 17 14:30:24 DarthMalak openais[1939]: [CLM ] r(0) ip(192.168.0.6)
Feb 17 14:30:24 DarthMalak openais[1939]: [CLM ] Members Left:
Feb 17 14:30:24 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Acquiring the transaction lock...
Feb 17 14:30:24 DarthMalak openais[1939]: [CLM ] r(0) ip(192.168.0.5)
Feb 17 14:30:24 DarthMalak openais[1939]: [CLM ] Members Joined:
Feb 17 14:30:25 DarthMalak openais[1939]: [CLM ] CLM CONFIGURATION CHANGE
Feb 17 14:30:25 DarthMalak openais[1939]: [CLM ] New Configuration:
Feb 17 14:30:25 DarthMalak openais[1939]: [CLM ] r(0) ip(192.168.0.6)
Feb 17 14:30:26 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Replaying journal...
Feb 17 14:30:27 DarthMalak openais[1939]: [CLM ] Members Left:
Feb 17 14:30:27 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Replayed 10 of 10 blocks
Feb 17 14:30:27 DarthMalak openais[1939]: [CLM ] Members Joined:
Feb 17 14:30:27 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Found 1 revoke tags
Feb 17 14:30:27 DarthMalak openais[1939]: [SYNC ] This node is within the primary component and will provide service.
Feb 17 14:30:27 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Journal replayed in 2s
Feb 17 14:30:27 DarthMalak openais[1939]: [TOTEM] entering OPERATIONAL state.
Feb 17 14:30:27 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Done
Feb 17 14:30:27 DarthMalak openais[1939]: [CLM ] got nodejoin message 192.168.0.6
Feb 17 14:30:27 DarthMalak openais[1939]: [CPG ] got joinlist message from node 1
I'm running RHEL 5.4 x86_64
The node has been offline since I made that post, and it is still in the same condition.
There have been no clurgmgrd-related messages since the other node went down.
Maybe I'm missing something on the other node, or the conf is simply wrong. My luck is that this is not a production environment.
I had a similar problem with RHEL 5.4 x86_64 recently, where clurgmgrd did not recognize newly added services on all 4 cluster nodes. A reboot of all nodes solved the issue, and I have not faced the problem since. Have you tried rebooting both machines? If not, I recommend trying it once and seeing what happens.
I rebooted both nodes and tested again. I still get the same result: the services stay on the offline node.
When the problem node is restarted, the services do move to the online node.
Quote:
Feb 18 15:01:51 DarthMalak kernel: dlm: got connection from 2
Feb 18 15:02:10 DarthMalak clurgmgrd[2499]: <notice> Recovering failed service service:TestIP
Feb 18 15:02:12 DarthMalak avahi-daemon[2419]: Registering new address record for 192.168.0.8 on eth1.
Feb 18 15:02:12 DarthMalak clurgmgrd[2499]: <notice> Recovering failed service service:EMSDB
Feb 18 15:02:17 DarthMalak clurgmgrd[2499]: <notice> Service service:TestIP started
Feb 18 15:02:18 DarthMalak kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "DarthBane:pgdata1"
Feb 18 15:02:18 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.1: Joined cluster. Now mounting FS...
Feb 18 15:02:18 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.1: jid=1, already locked for use
Feb 18 15:02:18 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.1: jid=1: Looking at journal...
Feb 18 15:02:18 DarthMalak kernel: GFS2: fsid=DarthBane:pgdata1.1: jid=1: Done
Feb 18 15:02:19 DarthMalak avahi-daemon[2419]: Registering new address record for 192.168.0.7 on eth1.
Feb 18 15:02:24 DarthMalak clurgmgrd[2499]: <notice> Service service:EMSDB started
I wanted to see if it still works the other way round.
Quote:
Cluster Status for DarthBane @ Thu Feb 18 15:08:56 2010
Member Status: Quorate

Member Name                               ID   Status
------ ----                               ---- ------
192.168.0.6                                  1 Offline
192.168.0.5                                  2 Online, Local, rgmanager
/dev/disk/by-path/pci-0000:00:10.0-scsi-     0 Online, Quorum Disk

Service Name         Owner (Last)         State
------- ----         ----- ------         -----
service:EMSDB        192.168.0.5          started
service:TestIP       192.168.0.5          started
The services moved this time.
The messages log:
Quote:
Feb 18 15:07:14 DarthRevan qdiskd[1990]: <info> Assuming master role
Feb 18 15:07:15 DarthRevan qdiskd[1990]: <notice> Writing eviction notice for node 1
Feb 18 15:07:16 DarthRevan qdiskd[1990]: <notice> Node 1 evicted
Feb 18 15:07:16 DarthRevan openais[1971]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb 18 15:07:16 DarthRevan openais[1971]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Feb 18 15:07:16 DarthRevan openais[1971]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Feb 18 15:07:16 DarthRevan openais[1971]: [TOTEM] entering GATHER state from 2.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] entering GATHER state from 0.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] Creating commit token because I am the rep.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] Saving state aru 73 high seq received 73
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] Storing new sequence id for ring 15c
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] entering COMMIT state.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] entering RECOVERY state.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] position [0] member 192.168.0.5:
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] previous ring seq 344 rep 192.168.0.5
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] aru 73 high delivered 73 received flag 1
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] Did not need to originate any messages in recovery.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] Sending initial ORF token
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] CLM CONFIGURATION CHANGE
Feb 18 15:07:21 DarthRevan fenced[2006]: 192.168.0.6 not a cluster member after 0 sec post_fail_delay
Feb 18 15:07:21 DarthRevan kernel: dlm: closing connection to node 1
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] New Configuration:
Feb 18 15:07:21 DarthRevan fenced[2006]: fencing node "192.168.0.6"
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.5)
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] Members Left:
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.6)
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] Members Joined:
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] CLM CONFIGURATION CHANGE
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] New Configuration:
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.5)
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] Members Left:
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] Members Joined:
Feb 18 15:07:21 DarthRevan openais[1971]: [SYNC ] This node is within the primary component and will provide service.
Feb 18 15:07:21 DarthRevan openais[1971]: [TOTEM] entering OPERATIONAL state.
Feb 18 15:07:21 DarthRevan openais[1971]: [CLM ] got nodejoin message 192.168.0.5
Feb 18 15:07:21 DarthRevan openais[1971]: [CPG ] got joinlist message from node 2
Feb 18 15:07:36 DarthRevan fenced[2006]: agent "fence_vmware" reports: Connection timed out
Feb 18 15:07:36 DarthRevan fenced[2006]: fence "192.168.0.6" failed
Feb 18 15:07:41 DarthRevan fenced[2006]: fencing node "192.168.0.6"
Feb 18 15:07:58 DarthRevan fenced[2006]: fence "192.168.0.6" success
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Trying to acquire journal lock...
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Looking at journal...
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Acquiring the transaction lock...
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Replaying journal...
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Replayed 0 of 0 blocks
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Found 0 revoke tags
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Journal replayed in 0s
Feb 18 15:07:58 DarthRevan kernel: GFS2: fsid=DarthBane:pgdata1.0: jid=1: Done
Feb 18 15:07:59 DarthRevan clurgmgrd[2591]: <notice> Taking over service service:EMSDB from down member 192.168.0.6
Feb 18 15:07:59 DarthRevan clurgmgrd[2591]: <notice> Taking over service service:TestIP from down member 192.168.0.6
Feb 18 15:07:59 DarthRevan avahi-daemon[2511]: Registering new address record for 192.168.0.8 on eth1.
Feb 18 15:07:59 DarthRevan avahi-daemon[2511]: Registering new address record for 192.168.0.7 on eth1.
Feb 18 15:08:00 DarthRevan clurgmgrd[2591]: <notice> Service service:TestIP started
Feb 18 15:08:01 DarthRevan clurgmgrd[2591]: <notice> Service service:EMSDB started
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] entering GATHER state from 11.
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] Creating commit token because I am the rep.
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] Saving state aru 26 high seq received 26
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] Storing new sequence id for ring 160
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] entering COMMIT state.
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] entering RECOVERY state.
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] position [0] member 192.168.0.5:
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] previous ring seq 348 rep 192.168.0.5
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] aru 26 high delivered 26 received flag 1
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] position [1] member 192.168.0.6:
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] previous ring seq 348 rep 192.168.0.6
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] aru a high delivered a received flag 1
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] Did not need to originate any messages in recovery.
Feb 18 15:09:07 DarthRevan openais[1971]: [TOTEM] Sending initial ORF token
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] CLM CONFIGURATION CHANGE
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] New Configuration:
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.5)
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] Members Left:
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] Members Joined:
Feb 18 15:09:07 DarthRevan openais[1971]: [CLM ] CLM CONFIGURATION CHANGE
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] New Configuration:
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.5)
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.6)
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] Members Left:
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] Members Joined:
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] r(0) ip(192.168.0.6)
Feb 18 15:09:08 DarthRevan openais[1971]: [SYNC ] This node is within the primary component and will provide service.
Feb 18 15:09:08 DarthRevan openais[1971]: [TOTEM] entering OPERATIONAL state.
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] got nodejoin message 192.168.0.5
Feb 18 15:09:08 DarthRevan openais[1971]: [CLM ] got nodejoin message 192.168.0.6
Feb 18 15:09:08 DarthRevan openais[1971]: [CPG ] got joinlist message from node 2
Feb 18 15:09:18 DarthRevan qdiskd[1990]: <warning> qdisk cycle took more than 1 second to complete (5.940000)
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <info> Node 1 is the master
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <warning> Master conflict: abdicating
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <warning> qdisk cycle took more than 1 second to complete (1.780000)
Feb 18 15:10:38 DarthRevan kernel: dlm: got connection from 1
Feb 18 15:11:01 DarthRevan clurgmgrd[2591]: <err> #37: Error receiving header from 1 sz=0 CTX 0x29cf9c0
Feb 18 15:11:01 DarthRevan openais[1971]: [TOTEM] Retransmit List: 28
As we can see, after the node fails, fence_vmware kicks in.
The services move to the available node and the failed node is restarted.
Please attach your /var/log/messages from both machines, along with the output of the following commands from both nodes:
cman_tool status
cman_tool nodes
Have you tried manually relocating the resource after this problem occurs (clusvcadm -r <servicename>)? An example follows.
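For reference, a manual relocation test might look like this; the service name EMSDB and the member address are taken from the clustat output earlier in the thread, and the -m target is optional:
Code:
# check which member currently owns the services
clustat
# relocate the DB service to a specific member
clusvcadm -r EMSDB -m 192.168.0.6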
Why are you using GFS in a failover environment? GFS is used when you need two nodes to have access to the filesystem simultaneously (see the resource sketch below).
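To illustrate the distinction, a hedged cluster.conf sketch; the device path and mountpoint are placeholders. A clusterfs resource is mounted on every node at once, while a plain fs resource is mounted only on the node that currently owns the service:
Code:
<resources>
    <!-- shared GFS2 filesystem: mounted on all nodes simultaneously -->
    <clusterfs name="pgdata_gfs" mountpoint="/var/lib/pgsql" device="/dev/sdb1" fstype="gfs2"/>
    <!-- failover alternative: mounted only on the current service owner -->
    <fs name="pgdata_ext3" mountpoint="/var/lib/pgsql" device="/dev/sdb1" fstype="ext3" force_unmount="1"/>
</resources>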
Quote:
Feb 18 15:09:18 DarthRevan qdiskd[1990]: <warning> qdisk cycle took more than 1 second to complete (5.940000)
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <info> Node 1 is the master
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <warning> Master conflict: abdicating
Feb 18 15:09:22 DarthRevan qdiskd[1990]: <warning> qdisk cycle took more than 1 second to complete (1.780000)
Feb 18 15:10:38 DarthRevan kernel: dlm: got connection from 1
Feb 18 15:11:01 DarthRevan clurgmgrd[2591]: <err> #37: Error receiving header from 1 sz=0 CTX 0x29cf9c0
Feb 18 15:11:01 DarthRevan openais[1971]: [TOTEM] Retransmit List: 28
I would say you should configure logging for rgmanager as well as for cman/openais, along the following lines:
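The original example was not preserved in this thread; as a minimal sketch of the rgmanager side, assuming the log_level/log_facility attributes documented for RHEL 5 rgmanager (log_level="7" enables debug output), with the cman/openais equivalents left to verify against your cluster.conf schema:
Code:
<!-- /etc/cluster/cluster.conf fragment: raise rgmanager verbosity -->
<rm log_level="7" log_facility="local4">
    <!-- existing failoverdomains, resources and services stay here -->
</rm>
After editing, remember to increment config_version and propagate the file with ccs_tool update /etc/cluster/cluster.conf.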
I am able to relocate resources to the problem node, and if I do a normal shutdown the services move. The problem seems to exist only when the node crashes. I have simulated the crash with a simple power-off from VMware ESXi.
The reason for using GFS is that I was doing other demoing related to shared disks.
Quote:
Node  Sts   Inc   Joined                Name
   0   M      0   2010-02-19 08:07:28   /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:2:0-part1
   1   M    368   2010-02-19 08:16:31   192.168.0.6
   2   M    356   2010-02-19 08:07:10   192.168.0.5

Node  Sts   Inc   Joined                Name
   0   M      0   2010-02-19 08:16:51   /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:2:0-part1
   1   M    368   2010-02-19 08:16:31   192.168.0.6
   2   M    368   2010-02-19 08:16:31   192.168.0.5
I added the debugging. This line wasn't there in the earlier tests:
Quote:
Feb 19 09:48:43 DarthMalak clurgmgrd[2495]: <info> Waiting for node #2 to be fenced
I waited a bit over 10 minutes, but nothing happened. I checked the fence config and it was correct.
Quote:
[root@DarthMalak ~]# fence_vmware -x -a xxx.yyy.zzz.fff -l user -p pass -L user -P pass -n 144 -o status
Status: OFF
I tried running fence_vmware myself and it timed out. This is because the fence_vmware timeout is too short: the fence action itself was successful and the node rebooted.
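For anyone reproducing this, the manual test above maps onto cluster.conf roughly as follows. The ESXi address and credentials are placeholders copied from the command line, and the exact attribute set accepted by a given fence_vmware build is an assumption worth checking with fence_vmware -h:
Code:
<!-- cluster.conf fragments (the two elements live in different sections) -->
<fencedevices>
    <fencedevice agent="fence_vmware" name="esxi" ipaddr="xxx.yyy.zzz.fff" login="user" passwd="pass"/>
</fencedevices>

<clusternode name="192.168.0.5" nodeid="2" votes="1">
    <fence>
        <method name="1">
            <!-- port selects the VM, matching -n 144 in the manual test -->
            <device name="esxi" port="144"/>
        </method>
    </fence>
</clusternode>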