NFS+GFS issue caused by dlm_sendd

PhillipHuang · 10-26-2007, 03:53 AM

Hello all,

My cluster info:
-----------------------
CentOS 4.4 as the operating system
Red Hat Cluster Suite 4.0/GFS
Back-end storage

After the cluster was created successfully, I tried to mount the shares (5 TB), built on a GFS filesystem across the two nodes, from another client.
1. When mounting the share from node1:

Code:

node1# mount /dev/mapper/test/test /mnt/

It is very quick, about 1 second.
2. Then, when mounting the share from node2:

Code:

node2# mount /dev/mapper/test/test /mnt/

It is very slow; the worst time recorded was 20 minutes. Of course, customers will not accept such a delay.

I repeated the above test in arbitrary order, and there is always one node that mounts very quickly while the other mounts slowly. When I check the slow node with the "top" command, it shows that "dlm_sendd" becomes a great CPU hog while the client is mounting that node's NFS share. I think this process causes the poor performance.
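
For anyone who wants to capture this rather than watch "top" interactively, here is a minimal sketch. The helper name cpu_of is made up for illustration, and it assumes top's default batch column layout (%CPU in column 9, COMMAND in column 12), which may differ on other versions or configurations.

Code:

```shell
# Print the %CPU column for every process whose command name equals $1.
# Reads "top -b" style output on stdin; ASSUMES the default column layout
# (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND).
cpu_of() {
    awk -v name="$1" '$12 == name { print $9 }'
}

# Example: take one sample on the slow node while the client is mounting.
# top -b -n 1 | cpu_of dlm_sendd
```

Running the sample a few times during a slow mount should show whether dlm_sendd stays near the top for the whole duration.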

I googled the keyword "dlm_sendd"; some people said it is a kernel bug that has already been fixed, and others suggested adding numa=off to the boot options. However, I had no success following these hints.
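
When trying the numa=off hint, it is worth confirming that the option really reached the running kernel. A small sketch, where has_boot_param is a hypothetical helper name and the second argument exists only so the check can be pointed at a file other than /proc/cmdline:

Code:

```shell
# Succeed if the given boot parameter appears on the kernel command line.
# $1 = parameter (e.g. numa=off)
# $2 = file to read (optional, defaults to /proc/cmdline)
has_boot_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qx -- "$1"
}

# Example, run on each node after rebooting with the new boot option:
# has_boot_param numa=off && echo "numa=off is active"
```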

Later, I saw that nfsd was waiting for CPU time while dlm_sendd's usage stayed high, so I tried changing how NFS is started so that nfsd runs first and faster.
For example, assume node2 is the slow one.

Code:

node2# service nfs stop
node2# rpc.nfsd
node2# rpc.mountd
node2# rpc.rquotad
node2# exportfs -r

At this point, the client can mount node2's share very quickly. I repeated the mount and umount operations more than 10 times, and every test took about 3-5 seconds.
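
The repeated test can be scripted roughly like this. It is only a sketch: time_runs is a hypothetical helper, the export path and mount point in the usage line are placeholders rather than the real ones from this cluster, and the timing has one-second resolution and covers the whole command (mount plus umount in the example).

Code:

```shell
# Run a command N times and print the wall-clock seconds each run took.
# $1 = number of runs, remaining arguments = the command to time.
time_runs() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        "$@"
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        i=$((i + 1))
    done
}

# Hypothetical client-side usage; node2:/export and /mnt/test are placeholders.
# time_runs 10 sh -c 'mount -t nfs node2:/export /mnt/test && umount /mnt/test'
```

Running the same loop against each node, before and after the manual NFS restart, gives numbers that can be compared directly.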

This seems to have resolved the issue. But I'm not sure whether this solution is stable or whether there will be potential problems.

What do you think about this issue? Any suggestions?
Thanks in advance.

Phillip

PhillipHuang · 10-30-2007, 10:38 PM

Update.

David Teigland from Red Hat provided the following info:
https://www.redhat.com/archives/linu.../msg00268.html

I'll verify it on 2.6.23-rc.

-Phillip