This is a SOLUTION, not a question.
BACKGROUND: We are using samba3-3.0.31 (from ftp://ftp.sernet.de) in combination with nss_ldap and nss_updatedb from PADL (http://padl.com) on CentOS 5.2, running on a Dell PowerEdge 1950 server (mentioned for completeness, though I believe the hardware to be irrelevant).
The samba server is a member server of a Windows 2003 ActiveDirectory environment. nss_ldap and nss_updatedb allow the system to directly query the ActiveDirectory LDAP tree for Linux/Unix user and group attributes, and for the LDAP data to be synchronised to a local database, respectively. Our ActiveDirectory LDAP schema was updated with Microsoft's Services for Unix (sfu). nss_updatedb is run every 10 minutes to ensure that newly created/modified user/group data is synchronised to the samba server.
PROBLEM: Many, many samba daemons were caught in CLOSE_WAIT, as shown by 'netstat -tap | grep CLOSE_WAIT'. All were talking ldap (port 389) to one of our Windows Domain Controllers. The large number of stalled smbd processes was consuming lots of system resources.
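For anyone who wants to count these rather than eyeball them, here is a minimal Python sketch. It is not what we ran; it assumes the usual column layout of 'netstat -tanp' (or 'netstat -tap') output on Linux, and the function name and the sample text are made up purely for illustration.

```python
from collections import Counter

# Port 389 shows up as "389" with 'netstat -n' and as "ldap" without it.
LDAP_PORTS = {"389", "ldap"}


def count_close_wait(netstat_output, ports=LDAP_PORTS):
    """Count CLOSE_WAIT sockets to an LDAP port, grouped by program name.

    Assumes each TCP line has the columns:
    Proto Recv-Q Send-Q Local Foreign State PID/Program
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue  # header lines, non-TCP lines, truncated lines
        foreign, state, owner = fields[4], fields[5], fields[6]
        if state != "CLOSE_WAIT":
            continue
        port = foreign.rsplit(":", 1)[-1]
        if port not in ports:
            continue
        # owner looks like "1234/smbd"; it is "-" when the PID is not visible
        program = owner.split("/", 1)[1] if "/" in owner else owner
        counts[program] += 1
    return counts


# Hypothetical sample text, NOT real output from our server.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address    Foreign Address   State       PID/Program name
tcp        1      0 10.0.0.5:45321   10.0.0.10:389     CLOSE_WAIT  1234/smbd
tcp        1      0 10.0.0.5:45322   10.0.0.10:389     CLOSE_WAIT  1240/smbd
tcp        0      0 10.0.0.5:445     10.0.0.77:51515   ESTABLISHED 1250/smbd
tcp        1      0 10.0.0.5:45400   dc1:ldap          CLOSE_WAIT  2001/httpd
tcp        0      0 0.0.0.0:22       0.0.0.0:*         LISTEN      900/sshd
"""

if __name__ == "__main__":
    for program, n in count_close_wait(SAMPLE).most_common():
        print(program, n)
```

Grouping by program name is the useful part: seeing anything other than smbd in that list is a hint that the cause lies outside Samba.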
I had searched the web high and low to find a solution to the problem. There seemed to be quite a few people asking the question but very little in the way of help in solving it.
Hence, I offer this post.
My BIG error was assuming that the problem was with Samba. It was not. The problem was with nsswitch.conf and I feel rather belittled by my failure to investigate the problem with a WIDE enough field of vision and thus solve it much more quickly. But, on the positive side, I am offering this post in the hope that it will help others.
The relevant entries of our nsswitch.conf looked like:
passwd: files db [NOTFOUND=return] ldap
group: files db [NOTFOUND=return] ldap
services: files ldap
protocols: files ldap
The first two entries instruct the system to use the data locally synchronised by the nss_updatedb utility and finally to fall back on ldap (via nss_ldap).
The second two entries were really silly: they sent service and protocol lookups to ldap as well. This was causing things like httpd, ntpd and even rpc.statd to get caught in CLOSE_WAIT. It was these non-samba processes caught in CLOSE_WAIT that prompted me to stop looking at samba and start looking at the system configuration.
The problem was resolved by removing ldap from nsswitch.conf. Thus, the above entries became:
passwd: files db
group: files db
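To check another machine for the same mistake, something like the following Python sketch could be used. It only lists which nsswitch.conf databases name ldap as a source; the function name is hypothetical, and the two sample texts are simply the entries quoted in this post.

```python
import re


def ldap_databases(nsswitch_text, source="ldap"):
    """Return the nsswitch.conf databases that list `source` as a lookup source."""
    flagged = []
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        database, sources = line.split(":", 1)
        # Remove action items such as [NOTFOUND=return]; they are not sources.
        sources = re.sub(r"\[[^\]]*\]", " ", sources)
        if source in sources.split():
            flagged.append(database.strip())
    return flagged


BEFORE = """\
passwd: files db [NOTFOUND=return] ldap
group: files db [NOTFOUND=return] ldap
services: files ldap
protocols: files ldap
"""

AFTER = """\
passwd: files db
group: files db
"""

if __name__ == "__main__":
    print(ldap_databases(BEFORE))
    print(ldap_databases(AFTER))
```

On a live system the text would come from reading /etc/nsswitch.conf. Any database the function reports is one whose lookups can end up opening a connection to the LDAP server.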
Before the change, at any point in time between 1/3 and 2/3 of our samba processes were caught in CLOSE_WAIT talking ldap (port 389) with a domain controller. After the change I have seen none, zero, 0!! This is based on logging the total number of smbd and winbindd processes, and the number of those caught in CLOSE_WAIT, every 20 minutes for more than 18 hours; over the last 4 hours the average number of smbd processes was 30.
Before the change I would have expected to see in excess of 20 CLOSE_WAIT processes when there were 30 in total. Now we have none. :-)
We have the occasional winbindd in CLOSE_WAIT with a DC, but only one at a time, and the number returns to zero after a while. I'll try to fix that too, but it's not a show stopper.
One final note. I used 'deadtime = 1' (as advised by some) to reduce the severity of the symptoms (number of CLOSE_WAIT smbd processes). But this does not address the cause. Fixing the cause (nsswitch.conf) solved the problem.
Please note that this solution is dependent upon the use of nss_updatedb from PADL and the MS Services for Unix (or an equivalent mechanism of synchronising the ActiveDirectory LDAP data to the samba server).
Firstly, an apology to the Samba team from me for assuming that the problem was with Samba.
Secondly, I hope that this 'solution' helps some others.
Here are the relevant global entries from our smb.conf:
realm = OUR.REALM.FOO
workgroup = our
security = ADS
idmap backend = ad
winbind nss info = sfu
winbind use default domain = yes
password server = *
use kerberos keytab = yes
deadtime = 30
PS: I will watch this thread and respond with help where I can for others with a related problem.
PPS: Please excuse me whilst I over-tag this post:
samba samba3 smbd winbindd nsswitch.conf netstat deadtime stalled ldap AD DC ActiveDirectory PADL nss_ldap nss_updatedb SFU CLOSE_WAIT CLOSE_WAIT CLOSE_WAIT CLOSE_WAIT :-D