LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Server (http://www.linuxquestions.org/questions/linux-server-73/)
-   -   Postfix / cron / monit problem on CentOS (http://www.linuxquestions.org/questions/linux-server-73/postfix-cron-monit-problem-on-centos-869361/)

Gethyn 03-18-2011 05:05 AM

Postfix / cron / monit problem on CentOS
 
Hi all,

I've run into a strange problem on one of the servers I maintain. It's running CentOS 5, fully up to date. The server has Monit installed via RPMforge, and runs Postfix as its MTA. Every day at about 4:20am, when cron is running, Monit loses its connection to Postfix. As far as I can tell, nothing is actually going wrong with Postfix and it's always restarted just fine. I run a couple of Debian servers with very similar setups, and the problem does not occur on them.

Perhaps related, other daemons on the CentOS box often lose their pid files at the time when cron runs, which results in Monit thinking they're not running. One service in particular (zarafa-dagent) is so bad for this that I've had to completely disable monitoring of it in Monit. Again, this problem doesn't occur on the Debian boxes I maintain.

The problem isn't hugely urgent as everything still runs fine, but it does mean that pretty much every day I get false positive mail alerts from Monit, which is (a) annoying and (b) makes me more likely to miss a real alert.

Any light anyone can shed on this would be greatly appreciated! Thanks in advance.

unSpawn 03-19-2011 05:48 AM

Have a look in /var/log/cron at what cronjobs (logrotate?) are run by which users at what time?

Gethyn 03-23-2011 09:12 AM

Last night the connections failed at 4:10:30 and 4:21:58. Here's the cron log from 4:00 to 4:30...
Code:

Mar 23 04:00:01 localhost crond[11485]: (root) CMD (/usr/lib/sa/sa1 1 1)
Mar 23 04:01:01 localhost crond[11505]: (root) CMD (run-parts /etc/cron.hourly)
Mar 23 04:02:01 localhost crond[11540]: (root) CMD (run-parts /etc/cron.daily)
Mar 23 04:02:01 localhost anacron[11544]: Updated timestamp for job `cron.daily' to 2011-03-23
Mar 23 04:02:12 localhost /usr/bin/crontab[11980]: (root) LIST (nobody)
Mar 23 04:03:01 localhost crond[12645]: (root) CMD (ntpdate ntp0.linx.net > /dev/null)
Mar 23 04:10:02 localhost crond[12807]: (root) CMD (/usr/lib/sa/sa1 1 1)
Mar 23 04:20:01 localhost crond[12988]: (root) CMD (/usr/lib/sa/sa1 1 1)
Mar 23 04:30:01 localhost crond[13611]: (root) CMD (/usr/lib/sa/sa1 1 1)

There is nothing in cron.hourly and according to crontab -l, both root and nobody have empty user crontabs. The contents of cron.daily are:
Code:

0anacron
0logwatch
amavisd
chkrootkit
clamdscan-samba-shares
cups
freshclam
logrotate
makewhatis.cron
mlocate.cron
mysql-backup
prelink
rkhunter
rpm
tmpwatch
train-spamassassin

Of these, chkrootkit and rkhunter are custom scripts based on their Debian equivalents, clamdscan-samba-shares is a custom script, mysql-backups is a shim that calls automysqlbackup and train-spamassassin is based on this sa-learn script, modified to work with amavisd.

I guess I'll have to figure this out the hard way, disabling each custom script one at a time and seeing if they affect the situation. Since it's an intermittent problem this is going to take a long time though. I was hoping someone might have seen a similar problem in the past and have some specific comment... (still hoping, any further suggestions would still be appreciated!)

unSpawn 03-23-2011 11:23 AM

Is there any other system or daemon log that shows activity (or better: errors) between 4:00 and 4:30? How about running monit in verbose ("-v") mode?

Gethyn 03-23-2011 12:57 PM

There are a couple of interesting log snippets:
Code:

/var/log/clamav/clamd.log
...
Wed Mar 23 04:07:15 2011 -> WARNING: Directory recursion limit reached, skipping XXXXX
Wed Mar 23 04:08:26 2011 -> XXXXX: Can't open file or directory ERROR
Wed Mar 23 04:11:34 2011 -> XXXXX: Can't open file or directory ERROR
Wed Mar 23 04:14:35 2011 -> SelfCheck: Database status OK.
Wed Mar 23 04:15:14 2011 -> XXXXX: Can't open file or directory ERROR
Wed Mar 23 04:15:14 2011 -> XXXXX: Can't open file or directory ERROR
Wed Mar 23 04:20:58 2011 -> Reading databases from /var/clamav
Wed Mar 23 04:21:01 2011 -> Database correctly reloaded (921596 signatures)
...

I've edited out the actual file names for privacy reasons, but they're all user data rather than system files. The "Can't open file or directory" ones are files over ClamAV's size limit (>2GB). Unfortunately it doesn't seem be possible to get ClamAV to quietly ignore oversize files. The ClamAV database is getting rebuilt around this time too.

Here are the relevant bits of the maillog (apologies for length):
Code:

/var/log/maillog
...
Mar 23 04:08:07 localhost postfix/smtpd[10498]: connect from 141-42.202-68.tampabay.res.rr.com[68.202.42.141]
Mar 23 04:08:19 localhost postfix/trivial-rewrite[12713]: warning: dict_ldap_lookup: Search error -5: Timed out
Mar 23 04:08:19 localhost postfix/trivial-rewrite[12713]: fatal: ldap:ldap_virtual_alias(0,lock|fold_fix): table lookup problem
Mar 23 04:08:22 localhost postfix/smtpd[10498]: warning: premature end-of-input on private/rewrite socket while reading input attribute name
Mar 23 04:08:22 localhost postfix/smtpd[10498]: warning: problem talking to service rewrite: Success
Mar 23 04:08:22 localhost postfix/master[18132]: warning: process /usr/libexec/postfix/trivial-rewrite pid 12713 exit status 1
Mar 23 04:08:26 localhost postfix/smtpd[10498]: 64A64217810: client=141-42.202-68.tampabay.res.rr.com[68.202.42.141]
Mar 23 04:08:27 localhost postfix/cleanup[12717]: 64A64217810: message-id=<4d897f3c.c0efcd62@ff.com>
Mar 23 04:08:27 localhost postfix/qmgr[18135]: 64A64217810: from=<klianags@ff.com>, size=869, nrcpt=1 (queue active)
Mar 23 04:08:27 localhost postfix/smtpd[10498]: disconnect from 141-42.202-68.tampabay.res.rr.com[68.202.42.141]
Mar 23 04:08:28 localhost amavis[11219]: (11219-15) LMTP::10024 /var/amavis/tmp/amavis-20110323T040154-11219: <klianags@ff.com> -> <user@example.org> SIZE=869 Received: from mail.example.org ([127.0.0.1]) by localhost (localhost.example.org [127.0.0.1]) (amavisd-new, port 10024) with LMTP for <user@example.org>; Wed, 23 Mar 2011 04:08:27 +0000 (GMT)
Mar 23 04:08:28 localhost amavis[11219]: (11219-15) Checking: 3ufmr+7F1HLH [68.202.42.141] <klianags@ff.com> -> <user@example.org>
Mar 23 04:08:43 localhost amavis[11219]: (11219-15) SA info: pyzor: [12722] error: TERMINATED, signal 15 (000f)
Mar 23 04:08:58 localhost postfix/smtpd[10498]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:04 localhost postfix/smtpd[10498]: lost connection after CONNECT from localhost.localdomain[127.0.0.1]
Mar 23 04:09:06 localhost postfix/smtpd[10498]: disconnect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:07 localhost postfix/smtpd[12723]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:08 localhost postfix/smtpd[12723]: 705BB217812: client=localhost.localdomain[127.0.0.1]
Mar 23 04:09:08 localhost postfix/cleanup[12717]: 705BB217812: message-id=<4d897f3c.c0efcd62@ff.com>
Mar 23 04:09:08 localhost postfix/smtpd[12723]: disconnect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:08 localhost amavis[11219]: (11219-15) FWD via SMTP: <klianags@ff.com> -> <user@example.org>,BODY=7BIT 250 2.0.0 Ok, id=11219-15, from MTA([127.0.0.1]:10025): 250 2.0.0 Ok: queued as 705BB217812
Mar 23 04:09:08 localhost postfix/qmgr[18135]: 705BB217812: from=<klianags@ff.com>, size=1994, nrcpt=1 (queue active)
Mar 23 04:09:08 localhost amavis[11219]: (11219-15) Passed SPAMMY, [68.202.42.141] [68.202.42.141] <klianags@ff.com> -> <user@example.org>, Message-ID: <4d897f3c.c0efcd62@ff.com>, mail_id: 3ufmr+7F1HLH, Hits: 42.456, size: 868, queued_as: 705BB217812, 40869 ms
Mar 23 04:09:08 localhost postfix/lmtp[12718]: 64A64217810: to=<user@example.org>, relay=127.0.0.1[127.0.0.1]:10024, delay=45, delays=3.7/0.39/0/41, dsn=2.0.0, status=sent (250 2.0.0 Ok, id=11219-15, from MTA([127.0.0.1]:10025): 250 2.0.0 Ok: queued as 705BB217812)
Mar 23 04:09:08 localhost postfix/qmgr[18135]: 64A64217810: removed
Mar 23 04:09:09 localhost postfix/smtpd[12725]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:09 localhost postfix/smtpd[12725]: lost connection after CONNECT from localhost.localdomain[127.0.0.1]
Mar 23 04:09:09 localhost postfix/smtpd[12725]: disconnect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:12 localhost postfix/postfix-script: stopping the Postfix mail system
Mar 23 04:09:12 localhost postfix/lmtp[12727]: 705BB217812: to=<user@example.org>, relay=127.0.0.1[127.0.0.1]:2003, delay=4.9, delays=0.62/0.44/0.77/3.1, dsn=2.1.5, status=sent (250 2.1.5 user@example.org Ok)
Mar 23 04:09:12 localhost postfix/qmgr[18135]: 705BB217812: removed
Mar 23 04:09:13 localhost postfix/master[18132]: terminating on signal 15
Mar 23 04:09:16 localhost sendmail[12744]: alias database /etc/aliases rebuilt by root
Mar 23 04:09:16 localhost sendmail[12744]: /etc/aliases: 77 aliases, longest 10 bytes, 779 bytes total
Mar 23 04:09:28 localhost postfix/postfix-script: starting the Postfix mail system
...
...
...
Mar 23 04:20:36 localhost postfix/smtpd[12813]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:44 localhost postfix/smtpd[12813]: lost connection after CONNECT from localhost.localdomain[127.0.0.1]
Mar 23 04:20:44 localhost postfix/smtpd[12813]: disconnect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:44 localhost postfix/smtpd[12813]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:48 localhost postfix/smtpd[12813]: DD22B217810: client=localhost.localdomain[127.0.0.1]
Mar 23 04:20:48 localhost postfix/cleanup[13003]: DD22B217810: message-id=<1300854048.1978973175@localhost.example.org>
Mar 23 04:20:48 localhost postfix/qmgr[12795]: DD22B217810: from=<monit@example.org>, size=776, nrcpt=1 (queue active)
Mar 23 04:20:48 localhost amavis[12945]: (12945-03) LMTP::10024 /var/amavis/tmp/amavis-20110323T042048-12945: <monit@example.org> -> <support@example.com> SIZE=776 BODY=8BITMIME Received: from mail.example.org ([127.0.0.1]) by localhost (localhost.example.org [127.0.0.1]) (amavisd-new, port 10024) with LMTP for <support@example.com>; Wed, 23 Mar 2011 04:20:48 +0000 (GMT)
Mar 23 04:20:48 localhost amavis[12945]: (12945-03) Checking: jowVanhCjxUE MYNETS [127.0.0.1] <monit@example.org> -> <support@example.com>
Mar 23 04:20:49 localhost postfix/smtpd[12813]: 2B849217812: client=localhost.localdomain[127.0.0.1]
Mar 23 04:20:49 localhost postfix/cleanup[13003]: 2B849217812: message-id=<1300854049.1531217627@localhost.example.org>
Mar 23 04:20:49 localhost postfix/qmgr[12795]: 2B849217812: from=<monit@example.org>, size=766, nrcpt=1 (queue active)
Mar 23 04:20:49 localhost postfix/smtpd[12813]: disconnect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:49 localhost amavis[11435]: (11435-15) LMTP::10024 /var/amavis/tmp/amavis-20110323T041031-11435: <monit@example.org> -> <support@example.net> SIZE=766 BODY=8BITMIME Received: from mail.example.org ([127.0.0.1]) by localhost (localhost.example.org [127.0.0.1]) (amavisd-new, port 10024) with LMTP for <support@example.net>; Wed, 23 Mar 2011 04:20:49 +0000 (GMT)
Mar 23 04:20:49 localhost amavis[11435]: (11435-15) Checking: pqkSHT7EbonO MYNETS [127.0.0.1] <monit@example.org> -> <support@example.net>
Mar 23 04:20:50 localhost postfix/postfix-script: stopping the Postfix mail system
Mar 23 04:20:50 localhost postfix/master[12792]: terminating on signal 15
Mar 23 04:20:51 localhost sendmail[13025]: alias database /etc/aliases rebuilt by root
Mar 23 04:20:51 localhost sendmail[13025]: /etc/aliases: 77 aliases, longest 10 bytes, 779 bytes total
Mar 23 04:20:56 localhost amavis[11435]: (11435-15) (!)FWD via SMTP: <monit@example.org> -> <support@example.net>, 450 4.4.1 Can't connect to INET4 socket 127.0.0.1: Connection refused, MTA([127.0.0.1]:10025), id=11435-15
Mar 23 04:20:56 localhost amavis[11435]: (11435-15) Blocked MTA-BLOCKED, MYNETS LOCAL [127.0.0.1] [127.0.0.1] <monit@example.org> -> <support@example.net>, Message-ID: <1300854049.1531217627@localhost.example.org>, mail_id: pqkSHT7EbonO, Hits: -2.9, size: 766, 7495 ms
Mar 23 04:20:56 localhost amavis[11435]: (11435-15) (!!)TROUBLE in process_request: Error writing a SMTP response to the socket: Broken pipe at (eval 67) line 987, <GEN50> line 31.
Mar 23 04:20:57 localhost postfix/postfix-script: starting the Postfix mail system
Mar 23 04:20:57 localhost amavis[11435]: (11435-15) (!)Requesting process rundown after fatal error
Mar 23 04:20:57 localhost amavis[12945]: (12945-03) (!)FWD via SMTP: <monit@example.org> -> <support@example.com>, 450 4.4.1 Can't connect to INET4 socket 127.0.0.1: Connection refused, MTA([127.0.0.1]:10025), id=12945-03
Mar 23 04:20:57 localhost amavis[11435]: (11435-15) SMTP shutdown: Error writing a SMTP response to the socket:  at (eval 67) line
 987, <GEN50> line 31.\n
Mar 23 04:20:57 localhost amavis[12945]: (12945-03) Blocked MTA-BLOCKED, MYNETS LOCAL [127.0.0.1] [127.0.0.1] <monit@example.org> -> <support@example.com>, Message-ID: <1300854048.1978973175@localhost.example.org>, mail_id: jowVanhCjxUE, Hits: -2.9, size: 776, 9254 ms
Mar 23 04:20:57 localhost postfix/master[13076]: daemon started -- version 2.3.3, configuration /etc/postfix
Mar 23 04:20:57 localhost amavis[12945]: (12945-03) (!!)TROUBLE in process_request: Error writing a SMTP response to the socket: Broken pipe at (eval 67) line 987, <GEN21> line 31.
Mar 23 04:20:57 localhost amavis[12945]: (12945-03) (!)Requesting process rundown after fatal error
Mar 23 04:20:58 localhost amavis[12945]: (12945-03) SMTP shutdown: Error writing a SMTP response to the socket:  at (eval 67) line 987, <GEN21> line 31.\n
Mar 23 04:20:58 localhost postfix/qmgr[13079]: DD22B217810: from=<monit@example.org>, size=776, nrcpt=1 (queue active)
Mar 23 04:20:58 localhost postfix/qmgr[13079]: 2B849217812: from=<monit@example.org>, size=766, nrcpt=1 (queue active)
Mar 23 04:20:58 localhost amavis[13096]: (13096-01) LMTP::10024 /var/amavis/tmp/amavis-20110323T042058-13096: <monit@example.org> -> <support@example.net> SIZE=766 BODY=8BITMIME Received: from mail.example.org ([127.0.0.1]) by localhost (localhost.example.org [127.0.0.1]) (amavisd-new, port 10024) with LMTP for <support@example.net>; Wed, 23 Mar 2011 04:20:58 +0000 (GMT)
Mar 23 04:20:58 localhost amavis[13097]: (13097-01) LMTP::10024 /var/amavis/tmp/amavis-20110323T042058-13097: <monit@example.org> -> <support@example.com> SIZE=776 BODY=8BITMIME Received: from mail.example.org ([127.0.0.1]) by localhost (localhost.example.org [127.0.0.1]) (amavisd-new, port 10024) with LMTP for <support@example.com>; Wed, 23 Mar 2011 04:20:58 +0000 (GMT)
Mar 23 04:20:58 localhost amavis[13097]: (13097-01) Checking: P8WXuWzzphZ0 MYNETS [127.0.0.1] <monit@example.org> -> <support@example.com>
Mar 23 04:20:58 localhost amavis[13096]: (13096-01) Checking: WoSGo1iYwCAl MYNETS [127.0.0.1] <monit@example.org> -> <support@example.net>
Mar 23 04:20:58 localhost amavis[13096]: (13096-01) cached 3bae5ab78a88d7af26bf16055ca3bd5c from <monit@example.org> (1,1)
Mar 23 04:20:58 localhost amavis[13097]: (13097-01) cached 3bae5ab78a88d7af26bf16055ca3bd5c from <monit@example.org> (1,1)
Mar 23 04:20:59 localhost postfix/smtpd[13100]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:59 localhost postfix/smtpd[13102]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:59 localhost postfix/smtpd[13100]: 0C963217817: client=localhost.localdomain[127.0.0.1]
Mar 23 04:20:59 localhost postfix/cleanup[13104]: 0C963217817: message-id=<1300854048.1978973175@localhost.example.org>
Mar 23 04:20:59 localhost postfix/qmgr[13079]: 0C963217817: from=<monit@example.org>, size=1016, nrcpt=1 (queue active)
Mar 23 04:20:59 localhost amavis[13097]: (13097-01) FWD via SMTP: <monit@example.org> -> <support@example.com>,BODY=7BIT 250 2.0.0 Ok, id=13097-01, from MTA([127.0.0.1]:10025): 250 2.0.0 Ok: queued as 0C963217817
Mar 23 04:20:59 localhost postfix/smtpd[13102]: 0F78C21781C: client=localhost.localdomain[127.0.0.1]
Mar 23 04:20:59 localhost postfix/cleanup[13104]: 0F78C21781C: message-id=<1300854049.1531217627@localhost.example.org>
Mar 23 04:20:59 localhost amavis[13097]: (13097-01) Passed CLEAN, MYNETS LOCAL [127.0.0.1] [127.0.0.1] <monit@example.org> -> <support@example.com>, Message-ID: <1300854048.1978973175@localhost.example.org>, mail_id: P8WXuWzzphZ0, Hits: -2.9, size: 776, queued_as: 0C963217817, 173 ms
...

It's looking as though ClamAV might be overloading the system, at which point postfix chokes. In this instance Amavis appears to be processing a spam email at the same time as the clamdscan, perhaps the two have to happen at the same time for the system to reach critical load. That could explain the intermittent nature of the problem - this server gets enough mail that it's likely to be scanning a message at the time of the monit connectivity check, but it's not guaranteed.

I've enabled monit's -v option in the initscript, increased the postfix connect timeout in monit, and increased clamdscan's niceness. Will see what happens over the next few days and report back.

Thanks for the feedback so far.

Gethyn 03-31-2011 04:32 AM

Just a quick update - I've not had any problems from monit since increasing the nice level of the cron jobs and increasing monit's timeout for the services in question. Evidently the trouble was caused by excessive system load. Thanks for the help!


All times are GMT -5. The time now is 08:56 AM.