Automount fails and doesn't retry
Our system contains about 10 disk servers and about 20 compute nodes. We use
NIS with automount to configure disk sharing. The system works fine except
when the load on a disk server is high. When this is the case, it is possible
for a mount request (from automount on a compute node) to time out. Automount
reports in /var/log/messages that the "mount failed".
The problem is that the process that requested the disk to be mounted dies
as it doesn't have the data it requires to run. We use torque as a batch system
for production jobs, so when a job dies, torque sends the next job in the queue
to the compute node, and it promptly dies. The process goes on and on until
all the jobs waiting on the queue have been submitted and have died.
The problem with automount is twofold:
1) Under high load, when a mount request times out, automount does not
resubmit the request. There does not seem to be a way to lengthen the
timeout or increase the number of attempts. Note that I am not talking
about the idle-time unmount option "--timeout".
2) If automount fails to mount a disk, subsequent attempts to mount the disk
fail instantly. This is big time bad news for me, because it causes my
jobs to die by the hundreds when one compute node goes south.
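
To make clear what kind of retry I am after in point 1, here is a rough
sketch of the same idea done at the job level instead of inside automount.
It is only an illustration: the function name wait_for_mount, the path, and
the attempt and delay values are all made up, and it would not help with
point 2 if automount keeps failing instantly after the first failure.

```python
import os
import time


def wait_for_mount(path, attempts=5, delay=2.0,
                   probe=os.listdir, sleep=time.sleep):
    """Return True once `path` can be listed, False if every attempt fails.

    Hypothetical helper: listing the directory is what asks automount to
    mount it, so each failed listing is treated as a failed mount request.
    """
    for attempt in range(attempts):
        try:
            probe(path)
            return True
        except OSError:
            # Wait a little longer after each failure, but do not sleep
            # after the final attempt.
            if attempt < attempts - 1:
                sleep(delay * (attempt + 1))
    return False


if __name__ == "__main__":
    # "/hypothetical/data" stands in for an automounted directory.
    ok = wait_for_mount("/hypothetical/data", attempts=3, delay=1.0)
    print("mounted" if ok else "mount failed")
```

A job wrapper could call something like this before starting the real work
and exit cleanly if it returns False, rather than letting the job die
partway through.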
OS: RHEL-3
Any ideas? -Andy