File System Locked Up when NFS mount went offline
Aight, first the basics:
ServerA: Dell PowerEdge 2850 running RHEL 3 (2.4 kernel), 4GB RAM, 2TB EMC RAID.
Solaris Sun system on ServerB.
I do not control ServerB.
ServerB publishes an NFS share that I have mounted on ServerA as /mnt/serverb.
ServerA has the 2TB RAID mounted as /mnt/raid.
On the RAID is a Samba-shared directory we'll call public, located at /mnt/raid/public.
Inside this Samba-shared directory is a symlink, /mnt/raid/public/serverb, that points to /mnt/serverb.
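To make the layout concrete, here is a rough sketch that rebuilds it under a scratch directory so it is safe to run. ROOT is a throwaway temp directory and serverb:/export/path is a placeholder, not the real export; the actual NFS mount needs root, so it is left commented out.

```shell
# Sketch of the ServerA layout, rebuilt under a scratch directory.
# ROOT and the export path below are placeholders, not the real values.
ROOT=$(mktemp -d)

# Stand-ins for the NFS mount point and the Samba-shared directory on the RAID.
mkdir -p "$ROOT/mnt/serverb" "$ROOT/mnt/raid/public"

# On the real box the NFS share is mounted here (needs root, so commented out):
# mount -t nfs serverb:/export/path /mnt/serverb

# The symlink inside the shared directory points at the NFS mount point.
ln -s "$ROOT/mnt/serverb" "$ROOT/mnt/raid/public/serverb"

# Show the resulting directory entry.
ls -l "$ROOT/mnt/raid/public"
```

On the real server, anything that follows the serverb entry ends up talking to the NFS mount.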
Yesterday, ServerB went into Single User Mode (as it tends to do from time to time for whatever reason) for approximately 45 minutes. 15 minutes into the outage, ServerA noticed that ServerB was missing and threw an error into the system log:
(ServerA kernel: nfs: server {ip address} not responding, still trying)
About this time, the directory /mnt/raid/public stopped responding. Completely. Any attempt to ls the directory froze, and kill -9 had no effect on the hung processes. Windows clients could not connect to the share at all (except an old Win98 box, but it couldn't list the directory contents). I even attempted to use a Java-based file manager (Webmin) to look at the file system, but it locked up as well. Oddly enough, I could still tab-complete in the directory with no problems.
No other shares were affected. Server utilization was no stranger than normal. Only this one share/directory seemed to be affected.
When serverb came back up, all of my frozen console sessions began responding again and the Windows clients could connect. The Samba logs show no errors during this time period.
So, the short question is this: Should a symlink be able to bring an entire directory to a standstill?
The longer question is: How do I keep the symlink but avoid being held hostage by ServerB's erratic uptime?
Thanks in advance and sorry for the long explanation. I've never seen anything like this before.
--zigg