LinuxQuestions.org
Old 02-21-2007, 05:33 PM   #1
Baerek
LQ Newbie
 
Registered: Feb 2007
Distribution: Mandriva
Posts: 5

Rep: Reputation: 0
Linux cluster - master node can't connect to slave nodes anymore


Greetings!
So I have a small Linux cluster of 8 computers in total (1 master node, 7 slave nodes). The master node has 3 Ethernet cards: 1 for accessing the outside world (bypassing the router), 1 for accessing the LAN, and 1 that connects to an 8-port switch with the 7 slave nodes. Everything was working just great until last weekend, when I decided to change the outside-world connection from DHCP to static so I could bypass the router and not have to worry about port forwarding. This actually worked great and I was excited to see it working again (this is how it was set up a couple of months ago, but after moving the cluster across the building I switched it to DHCP). A couple of hours later I noticed that the slave nodes could no longer contact the master node. I'm not sure if this was related to changing from DHCP to static or if someone else had SSH'd into the cluster and changed settings (there is one other person I know of who could have done this).
So I've been racking my brain the past few days trying to figure out what exactly happened so any help would be very much appreciated.
Here are my symptoms:
The master node can ping itself but none of the slave nodes can ping it. The master node also still has internet access and can access the LAN (so the other 2 ethernet cards seem to be working perfectly fine). The lights on the actual ethernet card are blinking and turn off when I unplug the cable from the switch, which to me indicates it's working.
Here's the kicker: all 7 slave nodes can ping each other, but none can ping the master node. I've tried changing which ports are being used on the switch, but regardless of which cable is plugged into which port, the symptoms never change.
I tried switching it back to DHCP and it still didn't work, which suggests that another setting may have been changed. Also, when I switched from DHCP to static, the prompt changed from
"airlinux:~/Desktop" to "master:~/Desktop", and vice versa when I switched back.
Anyway, I'm not the person who set this thing up, but somehow I'm in charge of making sure it keeps working (lucky me), so I was hoping someone could give me a little more direction as to what I can check or how to debug this thing.

Thanks!
-Baerek
 
Old 02-21-2007, 08:35 PM   #2
Micro420
Senior Member
 
Registered: Aug 2003
Location: Berkeley, CA
Distribution: Mac OS X Leopard 10.6.2, Windows 2003 Server/Vista/7/XP/2000/NT/98, Ubuntux64, CentOS4.8/5.4
Posts: 2,986

Rep: Reputation: 45
Code:
route
copy/paste

Code:
ifconfig
copy/paste

While you're at it, do a
Code:
tracepath
copy/paste

Also need to see the configurations for one of the slaves. Same thing
Code:
route
and
Code:
ifconfig
Did you configure your NICs through the MCC (assuming you are using Mandriva)? I just don't trust Mandriva ...

Last edited by Micro420; 02-21-2007 at 08:40 PM.
 
Old 02-23-2007, 11:48 AM   #3
Baerek
LQ Newbie
 
Registered: Feb 2007
Distribution: Mandriva
Posts: 5

Original Poster
Rep: Reputation: 0
Wow! Thanks for the quick response!
Anyway, here ya go. Since eth5 is my connection to the outside world (which I SSH into), I replaced the leading part of its IP address with **.***.***. The masked numbers are the same throughout this entire post.

For the master node:

Code:
route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
**.***.***.128  *               255.255.255.224 U     10     0        0 eth5
192.168.1.0     *               255.255.255.0   U     10     0        0 eth1
192.168.10.0    *               255.255.255.0   U     10     0        0 eth0
default         **.***.***.129. 0.0.0.0         UG    10     0        0 eth5
Code:
master:/sbin ) --> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:13:D3:E4:DF:44
          inet addr:192.168.10.100  Bcast:192.168.10.255  Mask:255.255.255.0
          inet6 addr: fe80::213:d3ff:fee4:df44/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:139385 errors:17 dropped:0 overruns:0 frame:17
          TX packets:0 errors:0 dropped:7041 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:9046168 (8.6 MiB)  TX bytes:0 (0.0 b)
          Interrupt:217 Base address:0xa000

eth1      Link encap:Ethernet  HWaddr 00:04:E2:FC:F0:E8
          inet addr:192.168.1.60  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::204:e2ff:fefc:f0e8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:579785 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:68846018 (65.6 MiB)  TX bytes:1056 (1.0 KiB)
          Interrupt:74 Memory:fcff8000-0

eth5      Link encap:Ethernet  HWaddr 00:13:D3:E4:E0:E3
          inet addr:**.***.***.150  Bcast:**.***.***.159  Mask:255.255.255.224
          inet6 addr: fe80::213:d3ff:fee4:e0e3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:484193 errors:0 dropped:0 overruns:0 frame:0
          TX packets:392 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:611732180 (583.3 MiB)  TX bytes:152168 (148.6 KiB)
          Interrupt:74 Memory:fddfc000-0

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1297 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1297 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:117484 (114.7 KiB)  TX bytes:117484 (114.7 KiB)
And for the slave node:
Code:
Route:
Kernel IP routing table
Destination           Gateway         Genmask         Flags Metric Ref    Use  Iface
192.168.10.0             *               255.255.255.0      U       0        0      0      eth1
Code:
                                        
eth0	Link encap: Ethernet HWaddr 00:16:17:1A:5E:9F
	inet6 addr: fe80::216:17ff:fe1a:5e9f/64 Scope:Link
	UP BROADCAST MULTICAST	MTU:1500   Metric:1
	RX packets:0  errors:0  dropped:0 overruns:0 frame:0
	TX packets:0  errors:0  dropped:0 overruns:0 carriers:0
	collisions:0  txqueuelen:1000
	RX bytes:0 (0.0 b)   TX bytes:0 (0.0 b)
	Interrupt:225  Base address:0x8000

eth1	Link encap:Ethernet   HWaddr  00:16:17:1A:5E:A0
	inet addr:192.168.10.102  Bcast:192.168.10.255  Mask:255.255.255.0
	inet6 addr: fe80::216:17ff:fe1a:5ea0/64  Scope:Link
	UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
	RX packets:142813 errors:0  dropped:0 overruns:0 frame:0
	TX packets:9752  errors:0 dropped:0  overruns:0  carrier:0
	collisions:0  txqueuelen:1000
	RX bytes:9740387 (9.2 MiB)  TX bytes: 693044 (676.8 KiB)
	Interrupt:217 memory:fddfc000-0

eth2	Link encap:UNSPEC  HWaddr 00-10-DC-00-00-CE-02-E9-00-00-00-00-00-00-00-00
	UP BROADCAST RUNNING MULTICAST MTU:1500  Metric:1
	RX packets: 0  errors:0  dropped:0 overruns:0 frame:0
	TX packets:0   errors:6  dropped:6 overruns:0 carrier:0
	collisions:0   txqueuelen:1000
	RX bytes:0 (0.0 b)	TX bytes:0 (0.0 b)

lo	Link encap:Local Loopback
	inet addr:127.0.0.1 Mask:255.0.0.0
	inet6 addr: ::1/128 Scope:Host
	UP LOOPBACK RUNNING MTU:16436 Metric:1
	RX packets:2748  errors:0  dropped:0  overruns:0  frame:0
	TX packets:2748  errors:0  dropped:0  overruns:0  carrier:0
	collisions:0  txqueuelen:0
	RX bytes:269508 (263.1 KiB)	TX bytes: 269508 (263.1 KiB)
Also, I've never used tracepath before, so I wasn't sure which arguments would be most useful. Here are the results of a tracepath from the master node to the slave node:
Code:
[root@master usr]# tracepath 192.168.10.102
 1:  node0.localdomain (192.168.10.100)                     0.129ms pmtu 1500
 1:  no reply
 1:  node0.localdomain (192.168.10.100)                   2000.818ms !H
     Resume: pmtu 1500
And yep I am using Mandriva and the NICs were indeed configured via the MCC.
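
One detail in the master's ifconfig output above may be worth a closer look: eth0 (the 192.168.10.x interface) reports TX packets:0 with dropped:7041, while it has received 139385 packets. That could mean the interface is not actually transmitting anything, which would fit slaves that never get a reply; it does not prove it. Below is a minimal Python sketch that pulls these counters out of classic ifconfig text and flags such interfaces. The function names and the trimmed excerpt are only for illustration; the counter values are copied from the output above.

```python
import re

# Excerpt of the master's ifconfig output quoted above (eth0 and eth1 only).
IFCONFIG_TEXT = """\
eth0      Link encap:Ethernet  HWaddr 00:13:D3:E4:DF:44
          RX packets:139385 errors:17 dropped:0 overruns:0 frame:17
          TX packets:0 errors:0 dropped:7041 overruns:0 carrier:0

eth1      Link encap:Ethernet  HWaddr 00:04:E2:FC:F0:E8
          RX packets:579785 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15 errors:0 dropped:0 overruns:0 carrier:0
"""

COUNTER_RE = re.compile(
    r"\s+(RX|TX) packets:\s*(\d+)\s+errors:\s*(\d+)\s+dropped:\s*(\d+)")


def parse_counters(text):
    """Return {iface: {"rx": {...}, "tx": {...}}} from classic ifconfig output."""
    stats = {}
    iface = None
    for line in text.splitlines():
        # An interface stanza starts at column 0 with the interface name.
        if line and not line[0].isspace():
            iface = line.split()[0]
            stats[iface] = {}
            continue
        match = COUNTER_RE.match(line)
        if match and iface:
            direction, packets, errors, dropped = match.groups()
            stats[iface][direction.lower()] = {
                "packets": int(packets),
                "errors": int(errors),
                "dropped": int(dropped),
            }
    return stats


def silent_transmitters(stats):
    """Interfaces that have sent nothing but have dropped outgoing packets."""
    return [name for name, s in stats.items()
            if "tx" in s and s["tx"]["packets"] == 0 and s["tx"]["dropped"] > 0]


if __name__ == "__main__":
    print(silent_transmitters(parse_counters(IFCONFIG_TEXT)))
```

With the excerpt above, only eth0 is flagged. Re-running ifconfig after a few pings from the master and comparing the TX counters would show whether anything leaves that card at all.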
 
Old 02-28-2007, 02:05 PM   #4
Baerek
LQ Newbie
 
Registered: Feb 2007
Distribution: Mandriva
Posts: 5

Original Poster
Rep: Reputation: 0
<bump>

I'm still a little clueless as to what the problem may be. Does anyone else have any words of wisdom to share? Or any thoughts on how to begin troubleshooting this thing?
 
Old 02-28-2007, 11:05 PM   #5
Micro420
Senior Member
 
Registered: Aug 2003
Location: Berkeley, CA
Distribution: Mac OS X Leopard 10.6.2, Windows 2003 Server/Vista/7/XP/2000/NT/98, Ubuntux64, CentOS4.8/5.4
Posts: 2,986

Rep: Reputation: 45
I see the problem (I think)

Code:
Route:
Kernel IP routing table
Destination           Gateway         Genmask         Flags Metric Ref    Use  Iface
192.168.10.0             *               255.255.255.0      U       0        0      0      eth1
Your slave node has no contact with the master. You need to add a default gateway pointing to the master on eth1.

Let's see if my memory serves me correctly:
On your slave nodes (as root)
Code:
route add default gw 192.168.10.100/24
Now on the slave node, try:
Code:
ping 192.168.10.100
They should then be able to hit the master and get a response.
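
As a cross-check on the routing side, here is a rough Python sketch of how a route is chosen by longest prefix match, applied to the slave's posted table. It is a simplification of what the kernel does, and the `lookup` helper and the off-subnet test address 10.0.0.1 are made up for illustration. Under these assumptions the existing 192.168.10.0 entry already covers the master's address 192.168.10.100, so a default gateway would mainly matter for destinations outside that subnet.

```python
import ipaddress

# Slave routing table as posted: (destination, netmask, gateway, interface).
# A gateway of None stands for the "*" shown by route (directly connected).
SLAVE_ROUTES = [
    ("192.168.10.0", "255.255.255.0", None, "eth1"),
]


def lookup(routes, destination):
    """Pick the most specific matching route (simplified longest prefix match)."""
    dest = ipaddress.ip_address(destination)
    best = None
    for net, mask, gateway, iface in routes:
        network = ipaddress.ip_network(f"{net}/{mask}")
        if dest in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, gateway, iface)
    return best


if __name__ == "__main__":
    # The master's address is reachable through the connected route on eth1.
    print(lookup(SLAVE_ROUTES, "192.168.10.100"))
    # An arbitrary off-subnet address (hypothetical) has no route at all.
    print(lookup(SLAVE_ROUTES, "10.0.0.1"))
```

If the first lookup finds a route but ping still fails, the problem is more likely below the routing layer (cabling, driver, ARP, firewall) or on the master's side.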

Last edited by Micro420; 02-28-2007 at 11:10 PM.
 
Old 03-02-2007, 02:40 PM   #6
Baerek
LQ Newbie
 
Registered: Feb 2007
Distribution: Mandriva
Posts: 5

Original Poster
Rep: Reputation: 0
Hey, thanks again for responding!
I tried your command:
Code:
route add default gw 192.168.10.100/24
and received
Quote:
192.168.10.100/24: Host name lookup failure
What does the /24 do in that command? If I leave the /24 off, the route is added to the routing table, but the slave still cannot connect.
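
On the /24 question: it is CIDR notation for a 24-bit network prefix, which is another way of writing the netmask 255.255.255.0. As far as I can tell, the gw argument of route expects a plain host address or name, which would explain why "192.168.10.100/24" was treated as a host name and failed to resolve. A small sketch with Python's standard ipaddress module shows what the notation expands to; the `describe` helper is just for illustration.

```python
import ipaddress

# "192.168.10.100/24" describes an address plus a 24-bit network prefix.
iface = ipaddress.ip_interface("192.168.10.100/24")


def describe(interface):
    """Return (host address, netmask, network) as strings."""
    return (str(interface.ip), str(interface.netmask), str(interface.network))


if __name__ == "__main__":
    print(describe(iface))
    # The slave's address from the ifconfig output falls in the same network.
    print(ipaddress.ip_address("192.168.10.102") in iface.network)
```

The master (192.168.10.100) and the slave (192.168.10.102) sit in the same /24, matching the Mask:255.255.255.0 shown by ifconfig on both machines.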

Also today I remembered that I had made a backup of the master node's hard drive about a month ago and I believe it was working at that time. When I plugged it all in, I found that I only had access to the LAN (not to the internet or to the slave nodes). I haven't checked it too extensively but I checked the routing table and found that it was listed as:

Code:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
**.***.***.128  *               255.255.255.224 U     10     0        0 eth5
192.168.1.0     *               255.255.255.0   U     10     0        0 eth1
192.168.10.0    *               255.255.255.0   U     10     0        0 eth0
default         *               0.0.0.0         U     10     0        0 eth0
At the moment the current routing table for the current hard drive is listed as:

Code:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
**.***.***.128  *               255.255.255.224 U     10     0        0 eth5
192.168.1.0     *               255.255.255.0   U     10     0        0 eth1
192.168.10.0    *               255.255.255.0   U     10     0        0 eth0
192.168.10.0    *               255.255.255.0   U     10     0        0 eth3
default         **.***.***.129. 0.0.0.0         UG    10     0        0 eth5
Is there any reason why the default route would have been listed on eth0 before and on eth5 now? Also, the gateway was originally listed as * instead of an IP address; is that potentially important?
I'm not sure why eth3 now exists in the list of interfaces, since to my knowledge there is no Ethernet card set to eth3. I've tried removing it, but it added itself again. I've also switched the default gateway to match the one from the previous hard drive, but after reconnecting the Ethernet cards it came back as the **.***.***.129 address. Anyway, I think you're onto something with the routing table, but I'm guessing the problem is in the master node's routing table.
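
To make the oddity in the current table concrete, here is a minimal Python sketch that scans route output for destinations claimed by more than one interface. The `duplicate_destinations` name is made up for illustration, and the table is copied from the current listing above with the public addresses left masked.

```python
from collections import defaultdict

# Current master routing table as posted (public addresses stay masked).
ROUTE_OUTPUT = """\
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
**.***.***.128  *               255.255.255.224 U     10     0        0 eth5
192.168.1.0     *               255.255.255.0   U     10     0        0 eth1
192.168.10.0    *               255.255.255.0   U     10     0        0 eth0
192.168.10.0    *               255.255.255.0   U     10     0        0 eth3
default         **.***.***.129. 0.0.0.0         UG    10     0        0 eth5
"""


def duplicate_destinations(text):
    """Map (destination, genmask) -> interfaces when more than one claims it."""
    seen = defaultdict(list)
    for line in text.splitlines()[1:]:  # skip the column header
        fields = line.split()
        if len(fields) == 8:
            dest, _gateway, mask, *_rest, iface = fields
            seen[(dest, mask)].append(iface)
    return {key: ifaces for key, ifaces in seen.items() if len(ifaces) > 1}


if __name__ == "__main__":
    print(duplicate_destinations(ROUTE_OUTPUT))
```

Here 192.168.10.0/255.255.255.0 is claimed by both eth0 and eth3 at the same metric, so it may be worth checking which physical card each name refers to and which one the cluster switch is really plugged into.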

Thanks again!
Baerek
 
Old 03-30-2007, 02:02 PM   #7
Baerek
LQ Newbie
 
Registered: Feb 2007
Distribution: Mandriva
Posts: 5

Original Poster
Rep: Reputation: 0
*bump* me again. I'm still having the same problems.
 
  





All times are GMT -5.
