Dovecot/postfix slows machine to a crawl, takes 30 minutes to reboot...

ethonbridges · 12-13-2017, 11:50 PM

I have been running a postfix/dovecot mail server for about 5 years now, generally with no issues. The server is located in a data center about half an hour away from my office, so everything is managed remotely.

For the last couple of months, I have been getting user calls about once every 2 or 3 days saying that the mail server is not responding. Logging into the machine via SSH is slow as molasses, and a shutdown -r now command takes about half an hour to reboot the machine. Once it reboots, it's pretty speedy again. Looking at top, the IMAP process is usually the one taking the bulk of the CPU, so it's more likely a dovecot problem than a postfix one.

I have about 100 email accounts (including my own), but a pretty large maildir of about 100G between all the users.

I don't see any evidence of maildir corruption or loss of data, just the slowdown every so often. I'm trying to determine if I actually have something wrong with a server that has been relatively trouble free since I built it, or if it's simply a case of overloading the machine as it has grown over the years.

So I guess my questions are:

1. How can I determine if the slowdown is the result of something malicious?
2. Can the users' maildirs be checked for corruption?
3. How can I determine if it's the usage?
4. Is there some clean-up or maintenance process in dovecot or postfix that might be running that hogs the machine?

I'm not a total Linux noob, but in this particular area, I'm not sure where to begin to troubleshoot something like this.

Ethon

descendant_command · 12-14-2017, 12:08 AM

No 'usual suspect' springs to mind.

Probably closely inspect the logs around the 'slowdown' events.

Maybe crank up the logging verbosity and run some cronjob to dump top|netstat|iotop|lsof|doveadm etc output periodically to try and catch whatever's going on.

No funny dmesg output that might indicate kernel oopses or running out of resources etc?

Do you have any time-based resource monitoring like munin or such, on it?
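
The periodic dump suggested above could look roughly like the sketch below. It is only an illustration: it assumes Python 3 is installed on the box (stock CentOS 6 ships an older Python), and the command list is a guess at what is available there. Commands that are missing or hang are noted in the output rather than aborting the run.

```python
#!/usr/bin/env python3
"""Print a timestamped system snapshot; meant to be run from cron."""
import subprocess
import time

# Hypothetical command list; trim or extend to match what is installed.
COMMANDS = [
    ["uptime"],
    ["free", "-m"],
    ["top", "-b", "-n", "1"],   # batch mode, a single iteration
    ["netstat", "-tn"],
    ["doveadm", "who"],
]


def run(cmd, timeout=20):
    """Run one command and return its output, or a note if it fails or hangs."""
    try:
        done = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, timeout=timeout)
        return done.stdout.decode(errors="replace")
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "[{} failed: {}]\n".format(" ".join(cmd), exc)


def snapshot(commands=COMMANDS):
    """Build one text block with a timestamp header and each command's output."""
    parts = [time.strftime("===== %Y-%m-%d %H:%M:%S =====\n")]
    for cmd in commands:
        parts.append("--- {} ---\n".format(" ".join(cmd)))
        parts.append(run(cmd))
    return "".join(parts)


if __name__ == "__main__":
    print(snapshot())
```

Saved under a hypothetical path such as /usr/local/bin/snapshot.py, it could be run every five minutes with a crontab line like `*/5 * * * * /usr/local/bin/snapshot.py >> /var/log/slowdown-snapshots.log 2>&1`, so that the last few snapshots before a slowdown are on disk afterwards.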

ethonbridges · 12-14-2017, 12:23 AM

Right after I posted the message, it started doing it again. Started seeing:

Dec 14 00:19:22 mail dovecot: master: Error: service(auth-worker): Initial status notification not received in 30 seconds, killing the process
Dec 14 00:19:22 mail dovecot: master: Error: service(auth-worker): kill(32177, SIGKILL) failed: Permission denied
Dec 14 00:20:04 mail dovecot: imap: Error: Internal auth failure (client-pid=32175 client-id=1)
Dec 14 00:20:05 mail dovecot: master: Error: service(ssl-params): Initial status notification not received in 30 seconds, killing the process
Dec 14 00:20:05 mail dovecot: master: Error: service(ssl-params): child 32182 killed with signal 9
Dec 14 00:20:05 mail dovecot: master: Error: service(ssl-params): command startup failed, throttling
Dec 14 00:20:05 mail dovecot: imap-login: Fatal: Corrupted SSL ssl-parameters.dat in state_dir: Truncated file
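
To see which Dovecot services are hitting these startup timeouts over a longer stretch of log, the master error lines can be tallied by service name. This is a minimal sketch that matches only the `master: Error: service(...)` format shown above; the sample input is just those seven lines.

```python
import re
from collections import Counter

# Matches the service name in Dovecot master error lines like the ones above.
SERVICE_RE = re.compile(r"dovecot: master: Error: service\(([^)]+)\)")


def count_master_errors(lines):
    """Count 'master: Error: service(NAME)' lines per service name."""
    counts = Counter()
    for line in lines:
        match = SERVICE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Dec 14 00:19:22 mail dovecot: master: Error: service(auth-worker): Initial status notification not received in 30 seconds, killing the process",
    "Dec 14 00:19:22 mail dovecot: master: Error: service(auth-worker): kill(32177, SIGKILL) failed: Permission denied",
    "Dec 14 00:20:04 mail dovecot: imap: Error: Internal auth failure (client-pid=32175 client-id=1)",
    "Dec 14 00:20:05 mail dovecot: master: Error: service(ssl-params): Initial status notification not received in 30 seconds, killing the process",
    "Dec 14 00:20:05 mail dovecot: master: Error: service(ssl-params): child 32182 killed with signal 9",
    "Dec 14 00:20:05 mail dovecot: master: Error: service(ssl-params): command startup failed, throttling",
    "Dec 14 00:20:05 mail dovecot: imap-login: Fatal: Corrupted SSL ssl-parameters.dat in state_dir: Truncated file",
]

if __name__ == "__main__":
    for service, count in count_master_errors(sample).most_common():
        print(service, count)
    # prints:
    # ssl-params 3
    # auth-worker 2
```

On the server itself, the same function could be fed the lines of the mail log instead of `sample`; the log location depends on the syslog configuration.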

Quote:

Do you have any time-based resource monitoring like munin or such, on it?

I don't know what that is or how to use it. I'll have to research it further.

Incidentally, restarting postfix/dovecot has no effect on the issue. Once it has started to crawl, it's slow. That would seem to indicate a system problem to me...

Ethon

descendant_command · 12-14-2017, 12:49 AM

Any relevant updates recently?
What OS?
What hardware? (real or virtual?)
Disk space?

ethonbridges · 12-14-2017, 01:06 AM

No updates that I am aware of.

CentOS 6.9

Real hardware, dedicated to running only postfix and dovecot. Intel Celeron(R) CPU 420, 1.6GHz, 1 core.

500G drive with 380G free. 1G RAM.

descendant_command · 12-14-2017, 01:44 AM

RAM maybe a little low if load is high - do you have appropriate swap available?
(although dovecot is generally pretty good and not known as a memory hog).

Does 'free' show any clues?

Maybe bad hardware - failing disk or RAM?

If no config changes or software updates between working and acting up, then failing hardware is a prime suspect.

Does smartctl report any disk warnings?

I'd maybe take it down for a bit to run an fsck and a memtest for starters.

ethonbridges · 12-14-2017, 02:06 AM

I noticed that the RAM seems to be maxed out during the slow times, so it's probably swapping.

No smart warnings on the drive.

Since it's relatively cheap and I can always use RAM in other machines, I'm going to bump it up to 8GB (the motherboard's max) tomorrow. Have also ordered a Xeon processor to replace the Celeron, will change that out when it arrives.

The new memory should be telling.

Ethon

descendant_command · 12-14-2017, 02:17 AM

Swapping will be slower but shouldn't cause the process timeouts under normal circumstances, unless your swap maxes out too - how much do you have assigned?

It's easy to add some, to test or buy you some time.
https://www.cyberciti.biz/faq/linux-...ap-file-howto/
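
Before adding swap, it may help to confirm how much is assigned and how full it gets during a slowdown. The sketch below reads the standard `SwapTotal` and `SwapFree` fields from /proc/meminfo; the function names are made up for illustration, and Python 3 is assumed.

```python
def parse_meminfo(text):
    """Turn /proc/meminfo text into a dict of {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def swap_used_percent(info):
    """Return the percentage of swap in use, or None if no swap is configured."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return None
    return 100.0 * (total - info.get("SwapFree", 0)) / total


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            info = parse_meminfo(fh.read())
    except OSError as exc:
        print("could not read /proc/meminfo:", exc)
    else:
        used = swap_used_percent(info)
        if used is None:
            print("no swap configured")
        else:
            print("swap: {} kB total, {:.1f}% used".format(info["SwapTotal"], used))
```

If the percentage sits near 100 while the machine is crawling, swap is exhausted as well as RAM, which would fit the process timeouts in the log.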

ethonbridges · 12-29-2017, 12:41 PM

Upgraded the processor to a Xeon and maxed out the RAM at 8GB. No problems since. I've thrown everything I can at it, and it doesn't bog down any more.