For some time now I have been trying to set up a Red Hat cluster (to test GFS capabilities) on RHEL5. I've gotten to the point where I am following this document:
http://sources.redhat.com/git/?p=clu...db3;hb=STABLE2 because the rest of the documentation that I have been able to find relates mostly to RHEL4.1.
I am following this procedure:
Quote:
Startup procedure
-----------------

To use the init script, run "service cman start" on all nodes. When getting a
cluster set up initially, it can be helpful to do all the steps manually as
follows.

Run these commands on each cluster node:

> mount -t configfs none /sys/kernel/config
> ccsd
> cman_tool join
> groupd
> fenced
> fence_tool join
> dlm_controld
> gfs_controld
> clvmd (only necessary if using clvm volumes for gfs)
> mkfs -t gfs -p lock_dlm -t <clustername>:<fsname> -j <#journals> <blockdev>
> mount -t gfs [-v] <blockdev> <mountpoint>
|
(emphasis added.)
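To keep the order straight while testing, the manual steps above can be wrapped in a small script. This is only a sketch: run_startup and DRY_RUN are names I made up, the command list is copied from the quoted procedure, and clvmd, mkfs and mount are left out because they depend on the storage layout. By default it only prints what it would do.

```shell
#!/bin/sh
# Manual startup order, copied from the quoted procedure.
STARTUP_STEPS='mount -t configfs none /sys/kernel/config
ccsd
cman_tool join
groupd
fenced
fence_tool join
dlm_controld
gfs_controld'

# run_startup is a hypothetical helper, not part of the cluster suite.
# DRY_RUN=1 (the default) prints each step; DRY_RUN=0 runs the steps
# in order and stops at the first one that fails.
run_startup() {
    while IFS= read -r step; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: $step"
        else
            echo "running: $step"
            # Unquoted on purpose so the step splits into command + arguments;
            # stdin is redirected so the step cannot eat the remaining list.
            $step </dev/null || { echo "failed: $step" >&2; return 1; }
        fi
    done <<EOF
$STARTUP_STEPS
EOF
}

run_startup
```

Running it with DRY_RUN=0 as root on each node would execute the same sequence as typing the commands by hand.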
Initially I had a two-node cluster with 2 expected votes, 1 vote per node. After the two nodes, node4008a4 and node4003a6, connected to each other, I ran groupd on both. After a little while they connected again, but I got this message:
Quote:
openais[3136]: [MAIN ] Node node4008a4.localdomain not joined to cman because it has rejoined an inquorate cluster
|
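For context on "inquorate": as far as I understand it, cman takes the quorum threshold to be expected_votes / 2 + 1 (integer division). That formula is my assumption and should be checked against the cman documentation. The sketch below just does the arithmetic; quorum_for and is_quorate are hypothetical helper names, not cluster commands.

```shell
#!/bin/sh
# quorum_for EXPECTED_VOTES
# Prints the assumed quorum threshold: floor(expected_votes / 2) + 1.
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

# is_quorate PRESENT_VOTES EXPECTED_VOTES
# Succeeds when the votes that are present reach the threshold.
is_quorate() {
    [ "$1" -ge "$(quorum_for "$2")" ]
}

# 1 vote per node, expected_votes=2: the threshold is 2,
# so a node that can only see itself (1 vote) is not quorate.
if is_quorate 1 2; then echo "1 of 2 votes: quorate"; else echo "1 of 2 votes: inquorate"; fi

# Weighted variant, 2 votes + 1 vote, expected_votes=3: the threshold is
# still 2, so only the 2-vote node is quorate on its own.
if is_quorate 2 3; then echo "2 of 3 votes: quorate"; else echo "2 of 3 votes: inquorate"; fi
if is_quorate 1 3; then echo "1 of 3 votes: quorate"; else echo "1 of 3 votes: inquorate"; fi
```

If the formula holds, then with equal votes on two nodes neither node can be quorate by itself, which would fit the message above.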
cman_tool status on both machines showed them as connected, but with nodes disallowed. So I changed the votes for node4008a4 to 2 and the expected votes for the cluster to 3. I rebooted and tried again. This time, node4008a4 showed the following:
Quote:
Aug 11 13:15:08 node4008a4 openais[4332]: [TOTEM] Retransmit List: 15
Aug 11 13:15:15 node4008a4 last message repeated 49 times
Aug 11 13:15:25 node4008a4 openais[4332]: [TOTEM] Retransmit List: 15
Aug 11 13:15:25 node4008a4 openais[4332]: [TOTEM] FAILED TO RECEIVE
Aug 11 13:15:25 node4008a4 openais[4332]: [TOTEM] entering GATHER state from 6.
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] entering GATHER state from 0.
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] Creating commit token because I am the rep.
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] Saving state aru 15 high seq received 16
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] entering COMMIT state.
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] entering RECOVERY state.
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] position [0] member 10.125.8.4:
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] previous ring seq 8 rep 10.125.3.6
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] aru 15 high delivered 15 received flag 0
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] copying all old ring messages from 16-16.
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] Originated 0 messages in RECOVERY.
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] Originated for recovery:
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] Not Originated for recovery: 16
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] Storing new sequence id for ring c
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] Sending initial ORF token
Aug 11 13:15:29 node4008a4 openais[4332]: [CLM ] CLM CONFIGURATION CHANGE
Aug 11 13:15:29 node4008a4 openais[4332]: [CLM ] New Configuration:
Aug 11 13:15:29 node4008a4 openais[4332]: [CLM ] r(0) ip(10.125.8.4)
Aug 11 13:15:29 node4008a4 openais[4332]: [CLM ] Members Left:
Aug 11 13:15:29 node4008a4 openais[4332]: [CLM ] r(0) ip(10.125.3.6)
Aug 11 13:15:29 node4008a4 openais[4332]: [CLM ] Members Joined:
Aug 11 13:15:29 node4008a4 openais[4332]: [SYNC ] This node is within the primary component and will provide service.
Aug 11 13:15:29 node4008a4 openais[4332]: [CLM ] CLM CONFIGURATION CHANGE
Aug 11 13:15:29 node4008a4 openais[4332]: [CLM ] New Configuration:
Aug 11 13:15:29 node4008a4 openais[4332]: [CLM ] r(0) ip(10.125.8.4)
Aug 11 13:15:29 node4008a4 openais[4332]: [CLM ] Members Left:
Aug 11 13:15:29 node4008a4 openais[4332]: [CLM ] Members Joined:
Aug 11 13:15:29 node4008a4 openais[4332]: [SYNC ] This node is within the primary component and will provide service.
Aug 11 13:15:29 node4008a4 openais[4332]: [TOTEM] entering OPERATIONAL state.
Aug 11 13:15:30 node4008a4 openais[4332]: [CLM ] got nodejoin message 10.125.8.4
Aug 11 13:15:30 node4008a4 openais[4332]: [CPG ] got joinlist message from node 1
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] entering GATHER state from 11.
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] Saving state aru c high seq received c
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] entering COMMIT state.
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] entering RECOVERY state.
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] position [0] member 10.125.3.6:
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] previous ring seq 12 rep 10.125.3.6
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] aru d high delivered d received flag 0
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] position [1] member 10.125.8.4:
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] previous ring seq 12 rep 10.125.8.4
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] aru c high delivered c received flag 0
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] Did not need to originate any messages in recovery.
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] Storing new sequence id for ring 10
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] CLM CONFIGURATION CHANGE
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] New Configuration:
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] r(0) ip(10.125.8.4)
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] Members Left:
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] Members Joined:
Aug 11 13:19:45 node4008a4 openais[4332]: [SYNC ] This node is within the primary component and will provide service.
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] CLM CONFIGURATION CHANGE
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] New Configuration:
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] r(0) ip(10.125.3.6)
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] r(0) ip(10.125.8.4)
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] Members Left:
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] Members Joined:
Aug 11 13:19:45 node4008a4 openais[4332]: [CLM ] r(0) ip(10.125.3.6)
Aug 11 13:19:45 node4008a4 openais[4332]: [SYNC ] This node is within the primary component and will provide service.
Aug 11 13:19:45 node4008a4 openais[4332]: [TOTEM] entering OPERATIONAL state.
Aug 11 13:19:45 node4008a4 openais[4332]: [MAIN ] Killing node node4003a6.localdomain because it has rejoined the cluster without cman_tool join
|
and node4003a6 showed the following:
Quote:
Aug 10 13:38:15 node4003a6 ccsd[3953]: Cluster is not quorate. Refusing connection.
Aug 10 13:38:15 node4003a6 ccsd[3953]: Error while processing connect: Connection refused
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] entering GATHER state from 9.
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] Creating commit token because I am the rep.
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] Saving state aru d high seq received d
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] entering COMMIT state.
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] entering RECOVERY state.
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] position [0] member 10.125.3.6:
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] previous ring seq 12 rep 10.125.3.6
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] aru d high delivered d received flag 0
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] position [1] member 10.125.8.4:
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] previous ring seq 12 rep 10.125.8.4
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] aru c high delivered c received flag 0
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] Did not need to originate any messages in recovery.
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] Storing new sequence id for ring 10
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] Sending initial ORF token
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] CLM CONFIGURATION CHANGE
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] New Configuration:
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] r(0) ip(10.125.3.6)
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] Members Left:
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] Members Joined:
Aug 10 13:38:16 node4003a6 openais[3983]: [SYNC ] This node is within the primary component and will provide service.
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] CLM CONFIGURATION CHANGE
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] New Configuration:
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] r(0) ip(10.125.3.6)
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] r(0) ip(10.125.8.4)
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] Members Left:
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] Members Joined:
Aug 10 13:38:16 node4003a6 openais[3983]: [CLM ] r(0) ip(10.125.8.4)
Aug 10 13:38:16 node4003a6 openais[3983]: [SYNC ] This node is within the primary component and will provide service.
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] entering OPERATIONAL state.
Aug 10 13:38:16 node4003a6 openais[3983]: [MAIN ] Node node4008a4.localdomain not joined to cman because it has rejoined an inquorate cluster
Aug 10 13:38:16 node4003a6 openais[3983]: [CMAN ] cman killed by node 1 for reason 3
Aug 10 13:38:21 node4003a6 ccsd[3953]: Unable to connect to cluster infrastructure after 30 seconds.
|
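Most of the lines above are TOTEM and CLM membership chatter. To pull out just the lines that report a decision (the [MAIN ] and [CMAN ] messages plus anything from ccsd), a filter along these lines can help. cluster_events is a name I made up, and the pattern is only based on the log format shown in this post.

```shell
#!/bin/sh
# cluster_events: hypothetical filter that reads syslog text on stdin and
# keeps only openais [MAIN ] / [CMAN ] lines and ccsd lines.
cluster_events() {
    grep -E '\[(MAIN|CMAN) *\]|ccsd\[[0-9]+\]'
}

# Demo on three lines taken from the node4003a6 log above;
# the TOTEM line is dropped, the other two are printed.
cluster_events <<'EOF'
Aug 10 13:38:16 node4003a6 openais[3983]: [TOTEM] entering OPERATIONAL state.
Aug 10 13:38:16 node4003a6 openais[3983]: [CMAN ] cman killed by node 1 for reason 3
Aug 10 13:38:21 node4003a6 ccsd[3953]: Unable to connect to cluster infrastructure after 30 seconds.
EOF
```

On a real node the input would be the syslog file instead of the here-document, for example cluster_events < /var/log/messages, assuming that is where syslog writes on the machine.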
My question is: why does starting groupd cause node4003a6 to disconnect at all? groupd is not mentioned in any of the docs I have seen except for this one (although, again, most of the other docs were for RHEL4.1 or earlier). I'm at a loss as to how to combat this problem, other than what I already tried. Any ideas?