LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Server
User Name
Password
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Notices


Reply
  Search this Thread
Old 03-18-2011, 05:05 AM   #1
Gethyn
Member
 
Registered: Aug 2003
Location: UK
Distribution: (X)Ubuntu 10.04/10.10, Debian 5, CentOS 5
Posts: 900

Rep: Reputation: 32
Post Postfix / cron / monit problem on CentOS


Hi all,

I've run into a strange problem on one of the servers I maintain. It's running CentOS 5, fully up to date. The server has Monit installed via RPMforge, and runs Postfix as its MTA. Every day at about 4:20am, when cron is running, Monit loses its connection to Postfix. As far as I can tell, nothing is actually going wrong with Postfix and it's always restarted just fine. I run a couple of Debian servers with very similar setups, and the problem does not occur on them.

Perhaps related, other daemons on the CentOS box often lose their pid files at the time when cron runs, which results in Monit thinking they're not running. One service in particular (zarafa-dagent) is so bad for this that I've had to completely disable monitoring of it in Monit. Again, this problem doesn't occur on the Debian boxes I maintain.

The problem isn't hugely urgent as everything still runs fine, but it does mean that pretty much every day I get false positive mail alerts from Monit, which is (a) annoying and (b) makes me more likely to miss a real alert.

Any light anyone can shed on this would be greatly appreciated! Thanks in advance.
 
Old 03-19-2011, 05:48 AM   #2
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600
Have a look in /var/log/cron at what cronjobs (logrotate?) are run by which users at what time?
 
Old 03-23-2011, 09:12 AM   #3
Gethyn
Member
 
Registered: Aug 2003
Location: UK
Distribution: (X)Ubuntu 10.04/10.10, Debian 5, CentOS 5
Posts: 900

Original Poster
Rep: Reputation: 32
Last night the connections failed at 4:10:30 and 4:21:58. Here's the cron log from 4:00 to 4:30...
Code:
Mar 23 04:00:01 localhost crond[11485]: (root) CMD (/usr/lib/sa/sa1 1 1)
Mar 23 04:01:01 localhost crond[11505]: (root) CMD (run-parts /etc/cron.hourly)
Mar 23 04:02:01 localhost crond[11540]: (root) CMD (run-parts /etc/cron.daily)
Mar 23 04:02:01 localhost anacron[11544]: Updated timestamp for job `cron.daily' to 2011-03-23
Mar 23 04:02:12 localhost /usr/bin/crontab[11980]: (root) LIST (nobody)
Mar 23 04:03:01 localhost crond[12645]: (root) CMD (ntpdate ntp0.linx.net > /dev/null)
Mar 23 04:10:02 localhost crond[12807]: (root) CMD (/usr/lib/sa/sa1 1 1)
Mar 23 04:20:01 localhost crond[12988]: (root) CMD (/usr/lib/sa/sa1 1 1)
Mar 23 04:30:01 localhost crond[13611]: (root) CMD (/usr/lib/sa/sa1 1 1)
There is nothing in cron.hourly and according to crontab -l, both root and nobody have empty user crontabs. The contents of cron.daily are:
Code:
0anacron
0logwatch
amavisd
chkrootkit
clamdscan-samba-shares
cups
freshclam
logrotate
makewhatis.cron
mlocate.cron
mysql-backup
prelink
rkhunter
rpm
tmpwatch
train-spamassassin
Of these, chkrootkit and rkhunter are custom scripts based on their Debian equivalents, clamdscan-samba-shares is a custom script, mysql-backups is a shim that calls automysqlbackup and train-spamassassin is based on this sa-learn script, modified to work with amavisd.

I guess I'll have to figure this out the hard way, disabling each custom script one at a time and seeing if they affect the situation. Since it's an intermittent problem this is going to take a long time though. I was hoping someone might have seen a similar problem in the past and have some specific comment... (still hoping, any further suggestions would still be appreciated!)
 
Old 03-23-2011, 11:23 AM   #4
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600Reputation: 3600
Is there any other system or daemon log that shows activity (or better: errors) between 4:00 and 4:30? How about running monit in verbose ("-v") mode?
 
Old 03-23-2011, 12:57 PM   #5
Gethyn
Member
 
Registered: Aug 2003
Location: UK
Distribution: (X)Ubuntu 10.04/10.10, Debian 5, CentOS 5
Posts: 900

Original Poster
Rep: Reputation: 32
There are a couple of interesting log snippets:
Code:
/var/log/clamav/clamd.log
...
Wed Mar 23 04:07:15 2011 -> WARNING: Directory recursion limit reached, skipping XXXXX
Wed Mar 23 04:08:26 2011 -> XXXXX: Can't open file or directory ERROR
Wed Mar 23 04:11:34 2011 -> XXXXX: Can't open file or directory ERROR
Wed Mar 23 04:14:35 2011 -> SelfCheck: Database status OK.
Wed Mar 23 04:15:14 2011 -> XXXXX: Can't open file or directory ERROR
Wed Mar 23 04:15:14 2011 -> XXXXX: Can't open file or directory ERROR
Wed Mar 23 04:20:58 2011 -> Reading databases from /var/clamav
Wed Mar 23 04:21:01 2011 -> Database correctly reloaded (921596 signatures)
...
I've edited out the actual file names for privacy reasons, but they're all user data rather than system files. The "Can't open file or directory" ones are files over ClamAV's size limit (>2GB). Unfortunately it doesn't seem be possible to get ClamAV to quietly ignore oversize files. The ClamAV database is getting rebuilt around this time too.

Here are the relevant bits of the maillog (apologies for length):
Code:
/var/log/maillog
...
Mar 23 04:08:07 localhost postfix/smtpd[10498]: connect from 141-42.202-68.tampabay.res.rr.com[68.202.42.141]
Mar 23 04:08:19 localhost postfix/trivial-rewrite[12713]: warning: dict_ldap_lookup: Search error -5: Timed out 
Mar 23 04:08:19 localhost postfix/trivial-rewrite[12713]: fatal: ldap:ldap_virtual_alias(0,lock|fold_fix): table lookup problem
Mar 23 04:08:22 localhost postfix/smtpd[10498]: warning: premature end-of-input on private/rewrite socket while reading input attribute name
Mar 23 04:08:22 localhost postfix/smtpd[10498]: warning: problem talking to service rewrite: Success
Mar 23 04:08:22 localhost postfix/master[18132]: warning: process /usr/libexec/postfix/trivial-rewrite pid 12713 exit status 1
Mar 23 04:08:26 localhost postfix/smtpd[10498]: 64A64217810: client=141-42.202-68.tampabay.res.rr.com[68.202.42.141]
Mar 23 04:08:27 localhost postfix/cleanup[12717]: 64A64217810: message-id=<4d897f3c.c0efcd62@ff.com>
Mar 23 04:08:27 localhost postfix/qmgr[18135]: 64A64217810: from=<klianags@ff.com>, size=869, nrcpt=1 (queue active)
Mar 23 04:08:27 localhost postfix/smtpd[10498]: disconnect from 141-42.202-68.tampabay.res.rr.com[68.202.42.141]
Mar 23 04:08:28 localhost amavis[11219]: (11219-15) LMTP::10024 /var/amavis/tmp/amavis-20110323T040154-11219: <klianags@ff.com> -> <user@example.org> SIZE=869 Received: from mail.example.org ([127.0.0.1]) by localhost (localhost.example.org [127.0.0.1]) (amavisd-new, port 10024) with LMTP for <user@example.org>; Wed, 23 Mar 2011 04:08:27 +0000 (GMT)
Mar 23 04:08:28 localhost amavis[11219]: (11219-15) Checking: 3ufmr+7F1HLH [68.202.42.141] <klianags@ff.com> -> <user@example.org>
Mar 23 04:08:43 localhost amavis[11219]: (11219-15) SA info: pyzor: [12722] error: TERMINATED, signal 15 (000f)
Mar 23 04:08:58 localhost postfix/smtpd[10498]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:04 localhost postfix/smtpd[10498]: lost connection after CONNECT from localhost.localdomain[127.0.0.1]
Mar 23 04:09:06 localhost postfix/smtpd[10498]: disconnect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:07 localhost postfix/smtpd[12723]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:08 localhost postfix/smtpd[12723]: 705BB217812: client=localhost.localdomain[127.0.0.1]
Mar 23 04:09:08 localhost postfix/cleanup[12717]: 705BB217812: message-id=<4d897f3c.c0efcd62@ff.com>
Mar 23 04:09:08 localhost postfix/smtpd[12723]: disconnect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:08 localhost amavis[11219]: (11219-15) FWD via SMTP: <klianags@ff.com> -> <user@example.org>,BODY=7BIT 250 2.0.0 Ok, id=11219-15, from MTA([127.0.0.1]:10025): 250 2.0.0 Ok: queued as 705BB217812
Mar 23 04:09:08 localhost postfix/qmgr[18135]: 705BB217812: from=<klianags@ff.com>, size=1994, nrcpt=1 (queue active)
Mar 23 04:09:08 localhost amavis[11219]: (11219-15) Passed SPAMMY, [68.202.42.141] [68.202.42.141] <klianags@ff.com> -> <user@example.org>, Message-ID: <4d897f3c.c0efcd62@ff.com>, mail_id: 3ufmr+7F1HLH, Hits: 42.456, size: 868, queued_as: 705BB217812, 40869 ms
Mar 23 04:09:08 localhost postfix/lmtp[12718]: 64A64217810: to=<user@example.org>, relay=127.0.0.1[127.0.0.1]:10024, delay=45, delays=3.7/0.39/0/41, dsn=2.0.0, status=sent (250 2.0.0 Ok, id=11219-15, from MTA([127.0.0.1]:10025): 250 2.0.0 Ok: queued as 705BB217812)
Mar 23 04:09:08 localhost postfix/qmgr[18135]: 64A64217810: removed
Mar 23 04:09:09 localhost postfix/smtpd[12725]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:09 localhost postfix/smtpd[12725]: lost connection after CONNECT from localhost.localdomain[127.0.0.1]
Mar 23 04:09:09 localhost postfix/smtpd[12725]: disconnect from localhost.localdomain[127.0.0.1]
Mar 23 04:09:12 localhost postfix/postfix-script: stopping the Postfix mail system
Mar 23 04:09:12 localhost postfix/lmtp[12727]: 705BB217812: to=<user@example.org>, relay=127.0.0.1[127.0.0.1]:2003, delay=4.9, delays=0.62/0.44/0.77/3.1, dsn=2.1.5, status=sent (250 2.1.5 user@example.org Ok)
Mar 23 04:09:12 localhost postfix/qmgr[18135]: 705BB217812: removed
Mar 23 04:09:13 localhost postfix/master[18132]: terminating on signal 15
Mar 23 04:09:16 localhost sendmail[12744]: alias database /etc/aliases rebuilt by root
Mar 23 04:09:16 localhost sendmail[12744]: /etc/aliases: 77 aliases, longest 10 bytes, 779 bytes total
Mar 23 04:09:28 localhost postfix/postfix-script: starting the Postfix mail system
...
...
...
Mar 23 04:20:36 localhost postfix/smtpd[12813]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:44 localhost postfix/smtpd[12813]: lost connection after CONNECT from localhost.localdomain[127.0.0.1]
Mar 23 04:20:44 localhost postfix/smtpd[12813]: disconnect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:44 localhost postfix/smtpd[12813]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:48 localhost postfix/smtpd[12813]: DD22B217810: client=localhost.localdomain[127.0.0.1]
Mar 23 04:20:48 localhost postfix/cleanup[13003]: DD22B217810: message-id=<1300854048.1978973175@localhost.example.org>
Mar 23 04:20:48 localhost postfix/qmgr[12795]: DD22B217810: from=<monit@example.org>, size=776, nrcpt=1 (queue active)
Mar 23 04:20:48 localhost amavis[12945]: (12945-03) LMTP::10024 /var/amavis/tmp/amavis-20110323T042048-12945: <monit@example.org> -> <support@example.com> SIZE=776 BODY=8BITMIME Received: from mail.example.org ([127.0.0.1]) by localhost (localhost.example.org [127.0.0.1]) (amavisd-new, port 10024) with LMTP for <support@example.com>; Wed, 23 Mar 2011 04:20:48 +0000 (GMT)
Mar 23 04:20:48 localhost amavis[12945]: (12945-03) Checking: jowVanhCjxUE MYNETS [127.0.0.1] <monit@example.org> -> <support@example.com>
Mar 23 04:20:49 localhost postfix/smtpd[12813]: 2B849217812: client=localhost.localdomain[127.0.0.1]
Mar 23 04:20:49 localhost postfix/cleanup[13003]: 2B849217812: message-id=<1300854049.1531217627@localhost.example.org>
Mar 23 04:20:49 localhost postfix/qmgr[12795]: 2B849217812: from=<monit@example.org>, size=766, nrcpt=1 (queue active)
Mar 23 04:20:49 localhost postfix/smtpd[12813]: disconnect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:49 localhost amavis[11435]: (11435-15) LMTP::10024 /var/amavis/tmp/amavis-20110323T041031-11435: <monit@example.org> -> <support@example.net> SIZE=766 BODY=8BITMIME Received: from mail.example.org ([127.0.0.1]) by localhost (localhost.example.org [127.0.0.1]) (amavisd-new, port 10024) with LMTP for <support@example.net>; Wed, 23 Mar 2011 04:20:49 +0000 (GMT)
Mar 23 04:20:49 localhost amavis[11435]: (11435-15) Checking: pqkSHT7EbonO MYNETS [127.0.0.1] <monit@example.org> -> <support@example.net>
Mar 23 04:20:50 localhost postfix/postfix-script: stopping the Postfix mail system
Mar 23 04:20:50 localhost postfix/master[12792]: terminating on signal 15
Mar 23 04:20:51 localhost sendmail[13025]: alias database /etc/aliases rebuilt by root
Mar 23 04:20:51 localhost sendmail[13025]: /etc/aliases: 77 aliases, longest 10 bytes, 779 bytes total
Mar 23 04:20:56 localhost amavis[11435]: (11435-15) (!)FWD via SMTP: <monit@example.org> -> <support@example.net>, 450 4.4.1 Can't connect to INET4 socket 127.0.0.1: Connection refused, MTA([127.0.0.1]:10025), id=11435-15
Mar 23 04:20:56 localhost amavis[11435]: (11435-15) Blocked MTA-BLOCKED, MYNETS LOCAL [127.0.0.1] [127.0.0.1] <monit@example.org> -> <support@example.net>, Message-ID: <1300854049.1531217627@localhost.example.org>, mail_id: pqkSHT7EbonO, Hits: -2.9, size: 766, 7495 ms
Mar 23 04:20:56 localhost amavis[11435]: (11435-15) (!!)TROUBLE in process_request: Error writing a SMTP response to the socket: Broken pipe at (eval 67) line 987, <GEN50> line 31.
Mar 23 04:20:57 localhost postfix/postfix-script: starting the Postfix mail system
Mar 23 04:20:57 localhost amavis[11435]: (11435-15) (!)Requesting process rundown after fatal error
Mar 23 04:20:57 localhost amavis[12945]: (12945-03) (!)FWD via SMTP: <monit@example.org> -> <support@example.com>, 450 4.4.1 Can't connect to INET4 socket 127.0.0.1: Connection refused, MTA([127.0.0.1]:10025), id=12945-03
Mar 23 04:20:57 localhost amavis[11435]: (11435-15) SMTP shutdown: Error writing a SMTP response to the socket:  at (eval 67) line
 987, <GEN50> line 31.\n
Mar 23 04:20:57 localhost amavis[12945]: (12945-03) Blocked MTA-BLOCKED, MYNETS LOCAL [127.0.0.1] [127.0.0.1] <monit@example.org> -> <support@example.com>, Message-ID: <1300854048.1978973175@localhost.example.org>, mail_id: jowVanhCjxUE, Hits: -2.9, size: 776, 9254 ms
Mar 23 04:20:57 localhost postfix/master[13076]: daemon started -- version 2.3.3, configuration /etc/postfix
Mar 23 04:20:57 localhost amavis[12945]: (12945-03) (!!)TROUBLE in process_request: Error writing a SMTP response to the socket: Broken pipe at (eval 67) line 987, <GEN21> line 31.
Mar 23 04:20:57 localhost amavis[12945]: (12945-03) (!)Requesting process rundown after fatal error
Mar 23 04:20:58 localhost amavis[12945]: (12945-03) SMTP shutdown: Error writing a SMTP response to the socket:  at (eval 67) line 987, <GEN21> line 31.\n
Mar 23 04:20:58 localhost postfix/qmgr[13079]: DD22B217810: from=<monit@example.org>, size=776, nrcpt=1 (queue active)
Mar 23 04:20:58 localhost postfix/qmgr[13079]: 2B849217812: from=<monit@example.org>, size=766, nrcpt=1 (queue active)
Mar 23 04:20:58 localhost amavis[13096]: (13096-01) LMTP::10024 /var/amavis/tmp/amavis-20110323T042058-13096: <monit@example.org> -> <support@example.net> SIZE=766 BODY=8BITMIME Received: from mail.example.org ([127.0.0.1]) by localhost (localhost.example.org [127.0.0.1]) (amavisd-new, port 10024) with LMTP for <support@example.net>; Wed, 23 Mar 2011 04:20:58 +0000 (GMT)
Mar 23 04:20:58 localhost amavis[13097]: (13097-01) LMTP::10024 /var/amavis/tmp/amavis-20110323T042058-13097: <monit@example.org> -> <support@example.com> SIZE=776 BODY=8BITMIME Received: from mail.example.org ([127.0.0.1]) by localhost (localhost.example.org [127.0.0.1]) (amavisd-new, port 10024) with LMTP for <support@example.com>; Wed, 23 Mar 2011 04:20:58 +0000 (GMT)
Mar 23 04:20:58 localhost amavis[13097]: (13097-01) Checking: P8WXuWzzphZ0 MYNETS [127.0.0.1] <monit@example.org> -> <support@example.com>
Mar 23 04:20:58 localhost amavis[13096]: (13096-01) Checking: WoSGo1iYwCAl MYNETS [127.0.0.1] <monit@example.org> -> <support@example.net>
Mar 23 04:20:58 localhost amavis[13096]: (13096-01) cached 3bae5ab78a88d7af26bf16055ca3bd5c from <monit@example.org> (1,1)
Mar 23 04:20:58 localhost amavis[13097]: (13097-01) cached 3bae5ab78a88d7af26bf16055ca3bd5c from <monit@example.org> (1,1)
Mar 23 04:20:59 localhost postfix/smtpd[13100]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:59 localhost postfix/smtpd[13102]: connect from localhost.localdomain[127.0.0.1]
Mar 23 04:20:59 localhost postfix/smtpd[13100]: 0C963217817: client=localhost.localdomain[127.0.0.1]
Mar 23 04:20:59 localhost postfix/cleanup[13104]: 0C963217817: message-id=<1300854048.1978973175@localhost.example.org>
Mar 23 04:20:59 localhost postfix/qmgr[13079]: 0C963217817: from=<monit@example.org>, size=1016, nrcpt=1 (queue active)
Mar 23 04:20:59 localhost amavis[13097]: (13097-01) FWD via SMTP: <monit@example.org> -> <support@example.com>,BODY=7BIT 250 2.0.0 Ok, id=13097-01, from MTA([127.0.0.1]:10025): 250 2.0.0 Ok: queued as 0C963217817
Mar 23 04:20:59 localhost postfix/smtpd[13102]: 0F78C21781C: client=localhost.localdomain[127.0.0.1]
Mar 23 04:20:59 localhost postfix/cleanup[13104]: 0F78C21781C: message-id=<1300854049.1531217627@localhost.example.org>
Mar 23 04:20:59 localhost amavis[13097]: (13097-01) Passed CLEAN, MYNETS LOCAL [127.0.0.1] [127.0.0.1] <monit@example.org> -> <support@example.com>, Message-ID: <1300854048.1978973175@localhost.example.org>, mail_id: P8WXuWzzphZ0, Hits: -2.9, size: 776, queued_as: 0C963217817, 173 ms
...
It's looking as though ClamAV might be overloading the system, at which point postfix chokes. In this instance Amavis appears to be processing a spam email at the same time as the clamdscan, perhaps the two have to happen at the same time for the system to reach critical load. That could explain the intermittent nature of the problem - this server gets enough mail that it's likely to be scanning a message at the time of the monit connectivity check, but it's not guaranteed.

I've enabled monit's -v option in the initscript, increased the postfix connect timeout in monit, and increased clamdscan's niceness. Will see what happens over the next few days and report back.

Thanks for the feedback so far.
 
Old 03-31-2011, 04:32 AM   #6
Gethyn
Member
 
Registered: Aug 2003
Location: UK
Distribution: (X)Ubuntu 10.04/10.10, Debian 5, CentOS 5
Posts: 900

Original Poster
Rep: Reputation: 32
Just a quick update - I've not had any problems from monit since increasing the nice level of the cron jobs and increasing monit's timeout for the services in question. Evidently the trouble was caused by excessive system load. Thanks for the help!
 
  


Reply

Tags
centos, cron, monit, postfix, rpmforge


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Cron, DDclient and CentOS 5.1 metallica1973 Linux - Server 5 12-10-2008 11:01 PM
LXer: Server Monitoring With munin And monit On CentOS 5.2 LXer Syndicated Linux News 0 12-03-2008 10:30 PM
Postfix TLS problem on CentOS 5.2 robertjinx Linux - Server 10 11-02-2008 07:50 AM
Problem with Setting up monit helptonewbie Linux - Software 2 07-29-2007 04:24 PM
postfix/ldap/monit : where does the problem lie topcat Linux - Software 1 11-23-2004 10:28 PM

LinuxQuestions.org > Forums > Linux Forums > Linux - Server

All times are GMT -5. The time now is 04:34 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration