Hi!
I have a small cluster (OSCAR, Fedora 8) and I was able to run some application software on it. Then lightning struck very close to the building. Fortunately I had unplugged all the power cables (because the cluster has not yet been moved to where the power lines are protected), but it seems that the institution didn't have any protection on its LAN cables, and so the public network cards throughout the building are damaged. A costly lesson.
Anyway, when I try to run the application software in parallel across the cluster (using the private network, which is unscathed) I get the error message given in the subject line. I contacted the application software's help department as I thought I had perhaps forgotten to set something, but according to them it is a normal network problem. They gave some suggestions as to what the problem may be, but I have checked them all and none of them cures the problem.
I have included their suggestions here so that you don't waste time by suggesting the same things.
Quote:
Check the /etc/hosts file and make sure that the nodes all have a
single definition and you don't have lines like
127.0.0.1 localhost normnode3
and that normnode3 has the same address both on the master and on the
node.
You can try
ping normnode3
from the master and see what address comes back
64 bytes from 164.190.57.105: icmp_seq=1 ttl=64 time=0.306 ms
or is it 127.0.0.1. Then do the reverse.
Also double check that you can ssh between nodes without password
but I would expect a different error then.
The command "hostname" returns gnlserv01, which is the name assigned to the public NIC.
After the lightning strike I had trouble getting the nodes to communicate "automatically" with each other, but that can be cured by starting the xinetd service and disabling the firewall on the master node. (This is not too dangerous, since I don't have a public interface at present and I'm sitting behind the institution's firewall as well.) By the way, I would think that there should be a file somewhere in which I could specify that those two commands be run when the master node is switched on. Could you perhaps enlighten me as to where and how I could specify it?
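My untested guess so far is that the usual SysV tools on Fedora (chkconfig and service) are the way to make the two settings stick across reboots. Below is a minimal sketch of what I mean. It assumes the firewall on the master node is the stock iptables service, which I have not confirmed, and the helper names run and persist_services are my own. It only prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run by default: prints each command instead of executing it.
# Set APPLY=1 (and run as root) to actually execute the commands.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Assumption: the firewall is the "iptables" init service.
persist_services() {
    run chkconfig xinetd on      # start xinetd automatically at boot
    run chkconfig iptables off   # do not start the firewall at boot
    run service xinetd start     # also start xinetd right now
    run service iptables stop    # also stop the firewall right now
}

persist_services
```

As far as I can tell, the alternative would be to append the two "service" lines to /etc/rc.d/rc.local, which is run at the end of the boot sequence.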
I was wondering whether I would need to explicitly start a bind-type service or something like that? (Since I had to explicitly start xinetd)
I'm rather clueless really. I googled around and found that there is a named service, so I tried to start it, but I don't think it's installed on the computer. Therefore, since I have managed to run the application software in parallel previously, the named service is probably not the problem. Hm?
Here is a copy of my /etc/hosts file:
Code:
# Do not remove the following line, or various programs
# that require network functionality will fail.
# These entries are managed by SIS, please don't modify them.
127.0.0.1 localhost.localdomain localhost
192.168.1.254 snode0.oscardomain.za snode0 oscar_server nfs_oscar pbs_oscar
abc.xyz.104.218 gnlserv01.ab.cx.yz gnlserv01
192.168.1.1 normnode1.ab.cx.yz normnode1
192.168.1.2 normnode2.ab.cx.yz normnode2
192.168.1.3 normnode3.ab.cx.yz normnode3
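To rule out the /etc/hosts problems mentioned in the quoted advice without relying on my eyes alone, something like the following could be used. It is only a sketch: check_hosts is a helper name I made up, and it looks for exactly two things, a non-localhost name sitting on a 127.x.x.x line and a name that is mapped to two different addresses. As far as I can see, it should report nothing for the file above.

```shell
#!/bin/sh
# check_hosts FILE: scan a hosts-format file and report
#   - names (other than localhost*) that share a line with a 127.x address
#   - names that are mapped to more than one address
# check_hosts is a hypothetical helper, not part of OSCAR or SIS.
check_hosts() {
    awk '
        /^[ \t]*#/ || NF < 2 { next }      # skip comment and blank lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break       # stop at a trailing comment
                if ($1 ~ /^127\./ && $i !~ /^localhost/)
                    print "loopback: " $i " is on the " $1 " line"
                if ((($i) in seen) && seen[$i] != $1)
                    print "duplicate: " $i " maps to " seen[$i] " and " $1
                seen[$i] = $1              # remember the latest address
            }
        }
    ' "$1"
}

# Example: check_hosts /etc/hosts
```

The same function would have to be run on each node as well, since the advice is that the master and the nodes must agree.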
Here is the output of ifconfig -a:
Code:
[compchem@gnlserv01 /root]$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:1C:C0:AF:10:18
inet addr:192.168.1.254 Bcast:192.168.255.255 Mask:255.255.0.0
inet6 addr: fe80::21c:c0ff:feaf:1018/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2587 errors:0 dropped:0 overruns:0 frame:0
TX packets:3109 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:332943 (325.1 KiB) TX bytes:409521 (399.9 KiB)
Base address:0x20c0 Memory:e0300000-e0320000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:5009 errors:0 dropped:0 overruns:0 frame:0
TX packets:5009 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3813184 (3.6 MiB) TX bytes:3813184 (3.6 MiB)
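One thing that can be cross-checked from the two listings above is whether the address that the hostname maps to in /etc/hosts is still assigned to any interface. In my case gnlserv01 maps to the public address, while ifconfig -a lists only 192.168.1.254 and 127.0.0.1. I don't know whether that matters, but here is a sketch of the check. The helper names are my own, and the match assumes the older "inet addr:" format that my ifconfig prints.

```shell
#!/bin/sh
# All three helpers are hypothetical names of my own.

# lookup_addr NAME FILE: print the address on the first non-comment line
# of a hosts-format FILE that lists NAME.
lookup_addr() {
    awk -v name="$1" '
        /^[ \t]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$2"
}

# addr_is_configured ADDR: read "ifconfig -a" output on stdin and succeed
# if ADDR appears in an "inet addr:" field. The trailing space stops
# 192.168.1.25 from matching 192.168.1.254.
addr_is_configured() {
    grep -qF "inet addr:$1 "
}

# report NAME FILE: combine the two checks; ifconfig output comes on stdin.
report() {
    addr=$(lookup_addr "$1" "$2")
    if [ -z "$addr" ]; then
        echo "$1 is not listed in $2"
    elif addr_is_configured "$addr"; then
        echo "$1 -> $addr (assigned to an interface)"
    else
        echo "$1 -> $addr (NOT assigned to any interface)"
    fi
}

# Example: ifconfig -a | report "$(hostname)" /etc/hosts
```

Running the same report for snode0 instead of the hostname would show whether the private name fares any better.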
I would really appreciate suggestions and comments!
;-)