LinuxQuestions.org

pimanlives 03-05-2010 03:52 PM

/etc/init.d network dependent scripts running/failing at boot [on ubuntu server]
 
First a brief description...
I have been slowly building and configuring a small computer cluster to run chemistry simulations for my research. When problems arise, I can usually find a solution through Google and/or these forums. For this problem, however, I have yet to find anything useful (although I may not know what I should be searching for).

I have somewhat successfully installed ganglia (a cluster monitoring package), torque (a PBS batch system), and maui (a scheduler that can integrate with torque). However, their init scripts do not always run correctly on a reboot.

Looking at the error messages in /var/log/daemon.log after a reboot, I sometimes see the following:

Quote:

Mar 4 13:29:56 lithium /usr/sbin/gmond[924]: Error creating multicast server mcast_join=239.2.11.71 port=8649 mcast_if=NULL family='inet4'. Exiting.#012
Mar 4 13:29:57 lithium pbs_mom: LOG_ERROR::mom_server_add, host mendeleev.chemcluster.loc not found

Followed closely by:

Quote:

Mar 4 13:30:04 lithium dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 14
Mar 4 13:30:04 lithium dhclient: DHCPOFFER of 192.168.100.4 from 192.168.100.1
Mar 4 13:30:04 lithium dhclient: DHCPREQUEST of 192.168.100.4 on eth0 to 255.255.255.255 port 67
Mar 4 13:30:04 lithium dhclient: DHCPACK of 192.168.100.4 from 192.168.100.1
Mar 4 13:30:04 lithium dhclient: bound to 192.168.100.4 -- renewal in 35587 seconds.

The scripts are supposed to be dependent on networking, but they do not appear to be:

Quote:

# Required-Start: $network $named $remote_fs $syslog

To give a brief description of my cluster topology:
I have a head node, mendeleev, acting as the DHCP server, DNS server, etc., while every other node is essentially identity-less: each is told who it is based on information stored in the head node's dhcpd.conf file. I think the problem is that these scripts depend not only on networking being up (which seems to be a bit of a nebulous concept) but also on the node having already been assigned a name, IP address, etc., because if I run the scripts later, everything works fine. Additionally, I have tried a very simple hack of adding a 5-second sleep command to both of these init scripts (torque and ganglia), and that also seems to work.

I guess my question is: does this seem like a valid conclusion to draw, and what is the proper way to solve this type of init script network dependency problem? I feel a bit unclean throwing a sleep statement into an init script like this :) and I really thought that "# Required-Start: $network" should have prevented these problems.

My apologies if this feels long or unwieldy; I was trying to describe my problem and my attempts accurately, and it seems to have gotten a bit long.

Joseph Michalka

jstephens84 03-05-2010 04:29 PM

What happens if you restart the scripts once you get a new IP? From my quick observations, it looks like the services go dead after your adapter fails.

pimanlives 03-05-2010 04:32 PM

If I restart the scripts after the node has come completely up (it both sees and is seen by the network), they run correctly. I only have this problem on boot/reboot.

catkin 03-05-2010 10:35 PM

Quote:

Originally Posted by pimanlives (Post 3887487)
... what is the proper way to solve this type of init script network dependency problem? I feel a bit unclean throwing a sleep statement into an init script like this :) and I really thought that #Required-Start: $network, should have prevented these problems.

AFAIK there is no "proper way".

As you comment, "network up" is a nebulous concept, so there's no test for it. It's a recurring problem when doing NFS mounts at boot time. The usual (imperfect) solution is to run the network-dependent boot scripts late in the boot sequence by giving them links beginning S<big number>, where <big number> is up to 99, depending on which other boot scripts have to run after them.

If robustness is more important to you than boot time, then you could devise a "network up" test and run it at the beginning of your script in a loop with a delay between attempts (and a maximum limit so it doesn't sit there forever).
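
A minimal sketch of such a bounded wait loop, written as a POSIX sh helper that could be pasted near the top of an init script. The function name wait_for and the choice of probe (resolving the head node's name with getent hosts, since the log above shows pbs_mom failing on exactly that lookup) are assumptions for illustration, not something the packaged scripts provide.

```shell
#!/bin/sh
# Hypothetical helper: retry a "network up" probe a bounded number of times.
# Usage: wait_for MAX_TRIES DELAY_SECONDS command [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_for() {
    max_tries=$1
    delay=$2
    shift 2
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # Run the probe quietly; success means the network is "up enough".
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$delay"
    done
    return 1
}

# Example use inside an init script (assumed probe: the head node resolves):
#   wait_for 30 1 getent hosts mendeleev.chemcluster.loc || exit 1
```

With the example values the script gives up after roughly 30 seconds instead of hanging the boot. Which probe counts as "up" is a judgment call: a name lookup suits pbs_mom, while gmond may only need the interface to have an address.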

grail 03-06-2010 02:01 AM

Hi pimanlives

Are your startup scripts upstart, init or some other version?

Assuming upstart, it may be as simple as changing the current 'start on' stanza to:

start on started networking (may need to check which script starts your cards up and put its name here)

pimanlives 03-15-2010 07:59 PM

Further forays
 
catkin

I appreciate the confirmation that my assumption about "networking" not being fully up is the probable cause of my problem.

grail
The scripts were initially init.d type scripts and I was trying to use the directive
Code:

# Required-Start:    $network $named $remote_fs $syslog
to force them to wait for networking to be up. I did not really know what upstart was before you posted, but after looking into it and trying to find some examples I ran across another solution that appears to be working and seems to make sense for my current system setup.

I found out about the directory /etc/network/if-up.d and how the scripts in it are run whenever a network device is fully up. I ended up writing a simple script in this directory that calls the init.d scripts I already have. I then removed those scripts from the normal boot sequence with "update-rc.d -f $name remove". As catkin recommended, and as I was trying to fake with a sleep command, these scripts should now run only when a network interface is "up".
Code:

#!/bin/sh
/etc/init.d/ganglia-monitor restart
/etc/init.d/pbs_mom restart

Thank you all for the suggestions; I am glad my first question on these forums has been resolved so quickly.
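
For anyone reusing the hook above, here is a slightly more defensive variant, offered only as a sketch. It assumes ifupdown's documented behaviour of exporting the interface name to hook scripts in the IFACE variable; the helper name should_handle_iface is made up for illustration. Some ifupdown versions also call hooks once with IFACE set to --all, so that value is skipped along with the loopback interface.

```shell
#!/bin/sh
# Sketch of an /etc/network/if-up.d hook.
# ifupdown exports IFACE (the interface being configured) to hook scripts.

# Hypothetical helper: succeed only for interfaces worth reacting to.
should_handle_iface() {
    case "$1" in
        ""|lo|--all) return 1 ;;   # unset, loopback, or the catch-all pass
        *)           return 0 ;;
    esac
}

if should_handle_iface "${IFACE:-}"; then
    for svc in ganglia-monitor pbs_mom; do
        # Skip quietly if a service's init script is not installed.
        if [ -x "/etc/init.d/$svc" ]; then
            "/etc/init.d/$svc" restart
        fi
    done
fi
```

Without the check, the services would also be restarted when lo comes up, which happens before eth0 has an address. The hook file must be executable, and because the directory is processed by run-parts, a file name containing a dot (such as a .sh suffix) is likely to be ignored.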

