df -h hangs

cpseed · 05-29-2015, 05:15 AM

We have a number of Solaris 10 servers that are paired, i.e. 1 middleware server (MW) with 1 database server (DB). The MW server mounts 3 NFS shares exported by the DB server. We have 7 identical pairs of MW/DB servers. Although the databases hold different data, 6 pairs are clones of the original 'Master' set.

All the servers are virtuals running under VMware.

What is happening is that some (not all) of the MW servers are having a problem with the mounted NFS shares:

If you run df -h, the command hangs.

If you umount the share and then attempt to mount it again, you get an error.
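Because a hung df blocks whatever runs it, a check script can wrap the command in a watchdog instead of calling it directly. A minimal bash sketch, assuming no ready-made timeout utility is installed on the box; run_with_timeout is a made-up helper name, not a Solaris command:

```shell
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND in the background, kills it if it is
# still running after SECONDS, and returns COMMAND's exit status
# (non-zero if it had to be killed). Output of COMMAND is discarded,
# because the point is only to learn whether it completes.
run_with_timeout() {
    local secs=$1
    shift
    "$@" >/dev/null 2>&1 &
    local cmd_pid=$!
    # Watchdog: kill the command once the time limit has passed.
    ( sleep "$secs"; kill -9 "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    local watchdog_pid=$!
    local status=0
    wait "$cmd_pid" 2>/dev/null || status=$?
    # The command is done (or was killed), so the watchdog is no longer needed.
    kill "$watchdog_pid" 2>/dev/null
    wait "$watchdog_pid" 2>/dev/null || true
    return "$status"
}
```

Something like `run_with_timeout 10 df -h /share || echo "/share is not answering"` would then report the problem rather than hang (the /share mount point is an assumption). A process blocked inside NFS I/O may not die straight away, so treat this as a detection aid, not a fix.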

I changed the NIC driver from E1000 to VMXNET3 and rebooted all 7 pairs of servers. Now a different group of servers is experiencing the problem, but still not all of them.

I have tried using NFSv3 instead of NFSv4 but that lasted 2 days and has failed as well.

So back to square one...

Q) Does anyone know why this is happening?
Q) Does anyone know how to stop it from happening (and I don't mean the df -h hanging)?

I have spent several fruitless days on Google researching this problem but none of the answers actually explain what the underlying fault is. Nor do they give an explanation as to how to fix the problem.

Yours very sincerely
Mr Frustrated
Ready To Have A Nervous Breakdown

AlucardZero · 05-29-2015, 09:08 AM

What is the error you get when you attempt to mount it again?
Are the clients using DHCP?
What are the permissions on the share on the NFS server (showmount -e nfsserver)?

cpseed · 06-01-2015, 01:24 AM

1. We don't use DHCP at all - static IP addresses only
2. The shares are mounted for some days before the problem manifests itself. However, once the problem occurs and the share is umounted and a re-mount is attempted, we get:

NFS compound failed for server DB707: error 5 (RPC: Timed out)

3. The result of showmount:

export list for DB707:
/patch (everyone)
/stage (everyone)
/share (everyone)

The client server (MW707) mounts all 3 shares (/share, /patch and /stage) and all 3 can no longer be listed from the MW server (which includes doing a df -h).

4. In anticipation of queries regarding rpc:

On the NFS Server:
bash-3.2# rpcinfo -p DB707
program vers proto port service
100000 4 tcp 111 rpcbind
100000 3 tcp 111 rpcbind
100000 2 tcp 111 rpcbind
100000 4 udp 111 rpcbind
100000 3 udp 111 rpcbind
100000 2 udp 111 rpcbind
100024 1 udp 32772 status
100024 1 tcp 32771 status
100133 1 udp 32772
100133 1 tcp 32771
1073741824 1 tcp 32772
100021 1 udp 4045 nlockmgr
100021 2 udp 4045 nlockmgr
100021 3 udp 4045 nlockmgr
100021 4 udp 4045 nlockmgr
100021 1 tcp 4045 nlockmgr
100021 2 tcp 4045 nlockmgr
100021 3 tcp 4045 nlockmgr
100021 4 tcp 4045 nlockmgr
100011 1 udp 32773 rquotad
100005 1 udp 32774 mountd
100005 1 tcp 32777 mountd
100005 2 udp 32774 mountd
100005 2 tcp 32777 mountd
100005 3 udp 32774 mountd
100005 3 tcp 32777 mountd
100003 2 udp 2049 nfs
100003 3 udp 2049 nfs
100227 2 udp 2049 nfs_acl
100227 3 udp 2049 nfs_acl
100003 2 tcp 2049 nfs
100003 3 tcp 2049 nfs
100003 4 tcp 2049 nfs
100227 2 tcp 2049 nfs_acl
100227 3 tcp 2049 nfs_acl

On the Client (MW707):
bash-3.2# rpcinfo -p
program vers proto port service
100000 4 tcp 111 rpcbind
100000 3 tcp 111 rpcbind
100000 2 tcp 111 rpcbind
100000 4 udp 111 rpcbind
100000 3 udp 111 rpcbind
100000 2 udp 111 rpcbind
100024 1 udp 32772 status
100024 1 tcp 32771 status
100133 1 udp 32772
100133 1 tcp 32771
1073741824 1 tcp 32772
100021 1 udp 4045 nlockmgr
100021 2 udp 4045 nlockmgr
100021 3 udp 4045 nlockmgr
100021 4 udp 4045 nlockmgr
100021 1 tcp 4045 nlockmgr
100021 2 tcp 4045 nlockmgr
100021 3 tcp 4045 nlockmgr
100021 4 tcp 4045 nlockmgr
100011 1 udp 32773 rquotad
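Listings like the two above can be checked for the services an NFS mount depends on without reading every row. A minimal bash sketch; has_rpc_service is a hypothetical helper name, and it only assumes the five-column layout shown above:

```shell
# has_rpc_service SERVICE PROTO
# Hypothetical helper: reads `rpcinfo -p` output on stdin and succeeds
# if any row lists SERVICE registered over PROTO (tcp or udp).
has_rpc_service() {
    local want_svc=$1
    local want_proto=$2
    local prog vers proto port service
    # Columns: program vers proto port service (service may be empty).
    while read -r prog vers proto port service; do
        if [ "$service" = "$want_svc" ] && [ "$proto" = "$want_proto" ]; then
            return 0
        fi
    done
    return 1
}
```

Typical use would be `rpcinfo -p DB707 | has_rpc_service mountd tcp && echo "mountd registered"`, repeated for nfs. In the listings above DB707 registers both; MW707 registers neither, which would be normal for a host that exports nothing.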

Regards

jlliagre · 06-01-2015, 04:12 AM

On the NFS client, when the issue happens, what does this report?

Code:

svcs -xv nfs/client

cpseed · 06-01-2015, 04:50 AM

bash-3.2# svcs -xv nfs/client
svc:/network/nfs/client:default (NFS client)
State: online since Tue May 19 14:01:52 2015
See: man -M /usr/share/man -s 1M mount_nfs
See: /var/svc/log/network-nfs-client:default.log
Impact: None.

MadeInGermany · 06-01-2015, 02:50 PM

Could be a problem with tcp; try to mount with options vers=3,proto=udp
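For reference, those options go after -o on the Solaris mount command line. A small sketch that only prints the command so it can be checked before anyone runs it; print_nfs_mount is a made-up helper, DB707:/share is taken from the thread, and the client mount point /share is an assumption:

```shell
# print_nfs_mount OPTIONS SERVER:/EXPORT MOUNTPOINT
# Hypothetical helper: prints (does not run) a Solaris NFS mount command.
print_nfs_mount() {
    local opts=$1
    local remote=$2
    local mountpoint=$3
    printf 'mount -F nfs -o %s %s %s\n' "$opts" "$remote" "$mountpoint"
}

# The client mount point /share is an assumption; adjust to the real one.
print_nfs_mount vers=3,proto=udp DB707:/share /share
```

The printed line can be pasted into a root shell on the MW server once the share has been umounted.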

cpseed · 06-03-2015, 04:09 AM

I've already tried using NFSv3 and we still get the problem.

The problem with 'udp' is that it's not really suitable when read/write operations are in use: it doesn't report errors, which can result in a corrupt file. 'udp' is best used in read-only 'fire-and-forget' type transactions. However, whilst I was away a colleague tried using 'udp' and it failed as well.

Thanks anyway for the suggestion.

MadeInGermany · 06-06-2015, 10:36 AM

The Sun NFS client has robust error correction in its application layer (which is used on top of udp).
You should at least try it temporarily, in order to rule out a tcp problem.

Axel van Moorsel · 08-29-2015, 05:53 AM

I had a similar problem mounting shares from a virtual server. It turned out that the time on the virtual server was 2 seconds behind the physical system. After correcting the time on the server (which also took some time, because it was a virtual one), I was able to mount them again.

cpseed · 09-01-2015, 07:24 AM

The problem affects several servers.

Since the last post I have managed to download the rolled-up zip file of recommended patches for Solaris 10 1/13 u11 and applied it to 1 set of servers. The fault seems to have disappeared since applying the patches. I will be rolling out the patches to the other servers over the next week or so. Hopefully, that should be the end of this NFS issue.

Regards,

Paul Seed