named timeouts, only kill/restart helps, can't find the reason
Hi All!
I run 2 Dell PE650 servers on 2 independent local networks as mail/DNS/DHCP servers. The OS is Red Hat Linux release 9 (Shrike), Linux **** 2.4.20-31.9 #1, with:

    bind-9.2.1-16
    dhcp-3.0pl1-23
    sendmail-8.12.8-9.90
    mailman-2.1.1-5
    imap-2001a-18
    ipop3d

Each server has one interface with one IP address on it. There are 60 - 80 users per server and 60 - 80 client machines on each local network.

The servers run without errors for months or years. But from time to time, for a few weeks, a strange error appears (not at the same time on the 2 servers): connections time out, named stops serving, I can ssh in only after a very long connection delay, clients cannot fetch their mail, and the network freezes. When this happens I can't use the 'service' command to stop named, so I stop it with kill -9 and then try to restart it. Sometimes I have to repeat this 2-3 times before named starts answering normally. On some days it happens only once, on others 3 - 4 times, with half an hour, a few hours, or half a day in between.

As a temporary measure I tried a script, run from crontab, that kills named and restarts it when it has stopped responding, but I found that cron doesn't work well while the error is occurring. If I run 'crontab -l' at that point, it doesn't respond either.

Maybe it is some resource problem, but where should I look? I collected data while the server was not functioning, but I can't find any cause in it. If anybody has met an error like this, please help, I have no more ideas.

Thanks, Geza

- there is nothing strange in the named or other logs
- load is 0.5 - 5.0; the upper value is very rare

free

                 total       used       free     shared    buffers     cached
    Mem:        255252     251704       3548          0      69012     129780
    -/+ buffers/cache:      52912     202340
    Swap:      2104432      42080    2062352

vmstat

       procs                      memory      swap          io     system      cpu
     r  b  w   swpd   free   buff  cache   si  so    bi    bo   in    cs  us  sy  id
     1  0  0  42080   4784  64616 131696    0   0     0     0  101    16   0   0 100
     0  0  0  42080   4796  64632 131692    0   0     0    41  190    37   0   3  97
     0  0  0  42080   4796  64632 131692    0   0     0     0  103    18   0   0 100

Sometimes cs can go up to 180-200.
How many sockets are in each connection state:

    netstat -a -n|grep -E "^(tcp)"| cut -c 68-|sort|uniq -c|sort -n
          4 LAST_ACK
         16 LISTEN
         33 ESTABLISHED
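The pipeline above depends on the state column starting at character 68, which can shift between netstat versions or with long addresses. A minimal sketch of a column-independent variant follows. The function name `count_tcp_states` is made up for illustration, and it assumes the state is the sixth whitespace-separated field of each tcp line, as in the usual net-tools layout.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): reads netstat output
# on stdin and prints "<count> <STATE>" per TCP state, smallest count first.
# Assumes the state is field 6 of every line whose first field starts with "tcp".
count_tcp_states() {
    awk '$1 ~ /^tcp/ { n[$6]++ }
         END { for (s in n) printf "%d %s\n", n[s], s }' | sort -n
}
```

It would be used as `netstat -a -n | count_tcp_states`. Header lines and non-TCP sockets are skipped because their first field does not start with "tcp".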
If you lose DNS on the server, everything that depends on it slows down while waiting for replies, e.g. mail, ssh and inetd services, which rely on reverse DNS checks or DNS resolution for logins.
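One way to confirm that DNS is the bottleneck during a freeze is to time a single query against the local named with a hard timeout. The following is only a sketch: the function names `dns_answers_within` and `probe_dns` are invented here, and it assumes the installed dig accepts `+time` and `+tries` and returns a non-zero exit status when the server does not reply.

```shell
#!/bin/sh
# DIG may be overridden for testing; by default the BIND dig client is used.
DIG=${DIG:-dig}

# Hypothetical helper: succeeds if server $1 answers a query for name $2
# within $3 seconds (default 3). Relies on dig's exit status (assumption).
dns_answers_within() {
    server=$1
    name=$2
    timeout=${3:-3}
    # +time sets the per-try timeout; +tries=1 means no retries.
    "$DIG" "@$server" "$name" +time="$timeout" +tries=1 >/dev/null 2>&1
}

# Hypothetical wrapper: prints one log line with the result and the
# elapsed wall-clock time in whole seconds.
probe_dns() {
    start=$(date +%s)
    if dns_answers_within "$1" "$2" "$3"; then
        status=ok
    else
        status=FAILED
    fi
    end=$(date +%s)
    echo "$status server=$1 name=$2 elapsed=$((end - start))s"
}
```

For example, `probe_dns 127.0.0.1 localhost 3` could be run by hand during a freeze, or from a long-running `while` loop with `sleep` instead of from cron, since cron itself was reported as unreliable while the fault is active.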
I suggest re-installing bind, or using a DNS proxy rather than a full-blown DNS server. That kernel version is quite old, and there are many well-known, published exploits for that kernel and for ssh. If you are going to maintain a secure server for the long term, I suggest moving to a distro that provides updates and supports each version for a long time, so you don't have to upgrade every year.