First a brief description...
I have been slowly building and configuring a small computer cluster to run chemistry simulations for my research. As problems arise, I am typically able to find a solution through Google and/or these forums. However, for this problem I have yet to find anything useful (although I may not know what I should be searching for).
I have somewhat successfully installed ganglia (a cluster monitoring package), torque (PBS batch system), and maui (a scheduler that can integrate with torque). However, their init scripts do not always run correctly on a reboot.
Looking at the error messages in /var/log/daemon.log after a reboot, I sometimes see the following:
Mar 4 13:29:56 lithium /usr/sbin/gmond: Error creating multicast server mcast_join=18.104.22.168 port=8649 mcast_if=NULL family='inet4'. Exiting.#012
Mar 4 13:29:57 lithium pbs_mom: LOG_ERROR::mom_server_add, host mendeleev.chemcluster.loc not found
Followed closely by
Mar 4 13:30:04 lithium dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 14
Mar 4 13:30:04 lithium dhclient: DHCPOFFER of 192.168.100.4 from 192.168.100.1
Mar 4 13:30:04 lithium dhclient: DHCPREQUEST of 192.168.100.4 on eth0 to 255.255.255.255 port 67
Mar 4 13:30:04 lithium dhclient: DHCPACK of 192.168.100.4 from 192.168.100.1
Mar 4 13:30:04 lithium dhclient: bound to 192.168.100.4 -- renewal in 35587 seconds.
The scripts are supposed to depend on networking, according to their init headers, but they do not appear to wait for it:
# Required-Start: $network $named $remote_fs $syslog
To give a brief description of my cluster topology:
I have a head node, mendeleev, acting as the DHCP server, DNS server, etc., while every other node is essentially identity-less: each is told who it is based on information stored in the head node's dhcpd.conf file. I think the problem is that these scripts depend not only on networking being up (which seems to be a bit of a nebulous concept) but also on the node having already been assigned a name, IP address, etc. The log above fits this: gmond and pbs_mom fail at 13:29:56 and 13:29:57, while dhclient does not bind the address until 13:30:04. If I run the scripts later, everything works fine. Additionally, I have tried a very simple hack of adding a 5 second sleep command to both of these init scripts (torque and ganglia), and that also seems to work.
I guess my question is: does this seem like a valid conclusion to draw, and what is the proper way to solve this type of init script network dependency problem? I feel a bit unclean throwing a sleep statement into an init script like this, and I really thought that "# Required-Start: $network" should have prevented these problems.
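The closest thing to a cleaner workaround I can think of is to poll until the head node's name actually resolves, rather than sleeping for a fixed time. Below is a minimal, untested-on-my-cluster sketch of that idea. The helper name wait_for and the 30 second limit are my own inventions for illustration, not anything torque or ganglia provide, and I am assuming getent is available on the nodes.

```shell
#!/bin/sh
# Hypothetical helper (not part of torque or ganglia): retry a command
# once per second until it succeeds or the timeout in seconds runs out.
# Usage: wait_for TIMEOUT COMMAND [ARGS...]
wait_for() {
    timeout=$1
    shift
    elapsed=0
    # Run the command quietly; loop only while it keeps failing.
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # gave up: the condition never became true
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Assumed usage near the top of the init script's start branch, before
# the daemon is launched (30 is an arbitrary limit picked for this sketch):
#
#   wait_for 30 getent hosts mendeleev.chemcluster.loc || exit 1
```

This is still a workaround inside the init script rather than a real dependency fix, but at least it waits only as long as needed and gives up with an error instead of starting the daemon blind.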
My apologies if this feels long or unwieldy; I was attempting to accurately describe my problem and what I have tried so far.