LinuxQuestions.org
Old 07-18-2004, 11:52 AM   #1
Hans Zarkoff
Member
 
Registered: Apr 2002
Location: Canada
Distribution: Slackware
Posts: 58

Rep: Reputation: 15
AWK script is hanging our server


In short, I run an awk script that extracts IPs from our /var/log/messages file every day. The messages file is *only* for each day, as I rotate it to a new file, messages.MMDDYYYY. It sorts the IPs into an array using the IP as the index and the count of each IP as the value. It then builds a tcp.smtp file to be used by qmail to deny IPs that have sent us too much mail.

The messages file can be over 800000 lines.

I've included the script below but I'm not sure it's the actual problem. The script hangs in the array construction. I've tried these things:

1. Nice'ing it.

2. Splitting the messages file to only 500000 lines. That didn't work. Reducing it further to say, a tiny file of 100000 lines, does work, but that's impractical.

3. As a test, instead of building an array, I just spewed the IPs to a file. That also hung the machine.

If I run top while it's running, it does consume 99% of a processor (We have a dual PII-500 with 256MB) but only a tiny amount of memory. That leaves another processor able to handle everything else. These processors have a pretty easy life--This machine is used mostly as a mail server so, apart from a little bit of memory for the outgoing mail queue, the machine isn't taxed too hard, except for this script which should run at 11:55 every night.

It's Slackware 9.1 with gawk 3.1.3, the latest according to gnu.org.

What's odd is that it worked fine this winter. I'm almost wondering if it's a heat problem. I live in central Canada, so it's either -40F or 104F outside (-40C or 40C). Inside, we keep it a balmy 68F during the winter, but it can get up to 80+F in the summer. If the processors don't usually have much to do, they won't overheat, but perhaps this script pushes them a little too hard and they melt.

I'm hoping someone has a better idea. Any suggestions?

#!/bin/bash

/usr/bin/nice -n 5 /usr/bin/awk '

BEGIN {
    FS=" "
    system("echo \"Starting awk script...\" > logrotate.log")
}

/.*smtpd.*from / {
    if ( $11 in entry )
        entry[$11] += 1
    else
        entry[$11] = 1
}

END {
    system("echo \"Done filling array...\" >> logrotate.log")
    printf("# This file is automatically generated from /root/rootscripts/logrotate.\n\n127.:allow,RELAYCLIENT=\"\"\n192.168.:allow,RELAYCLIENT=\"\"\n\n")

    for ( ent in entry )
        if ( entry[ent] > 100 && ent != "66.246.137.251" )
            printf("%s:deny\n# Hits for %s: %s\n\n", ent,ent,entry[ent])
    printf(":allow\n")
    system("echo \"Script finished\" >> logrotate.log")
}' /var/log/messages.07162004 > /var/log/tcp.smtp.autogen
 
Old 07-18-2004, 12:46 PM   #2
peter_robb
Senior Member
 
Registered: Feb 2002
Location: Szczecin, Poland
Distribution: Gentoo, Debian
Posts: 2,458

Rep: Reputation: 48
Most of that script is awk code, apart from the nice call.

Try changing the #! line to #!/bin/awk -f
and use awk as the interpreter rather than calling it from bash.
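
For anyone following along, here is a minimal sketch of what a standalone awk script can look like. The interpreter line needs -f so that awk reads its program from the file itself. The file name count_ips.awk, the temporary directory and the sample log lines are made up for illustration; only field 11 and the smtpd pattern are taken from the script in the first post.

```shell
#!/bin/sh
# Sketch only: count_ips.awk, the temp directory and the sample log lines
# are hypothetical; field 11 and the smtpd pattern come from the first post.
workdir=$(mktemp -d)

# A standalone awk program. The -f on the interpreter line tells awk to
# read its program from this file when the file is executed directly.
cat > "$workdir/count_ips.awk" <<'EOF'
#!/usr/bin/awk -f
/smtpd.*from / { entry[$11]++ }
END { for (ip in entry) printf("%s %d\n", ip, entry[ip]) }
EOF
chmod +x "$workdir/count_ips.awk"

# Three made-up log lines with the IP in field 11.
cat > "$workdir/sample.log" <<'EOF'
Jul 16 23:55:01 mail smtpd: 1089 tcpserver: ok pid from 10.0.0.1
Jul 16 23:55:02 mail smtpd: 1090 tcpserver: ok pid from 10.0.0.2
Jul 16 23:55:03 mail smtpd: 1091 tcpserver: ok pid from 10.0.0.1
EOF

# Equivalent to running "$workdir/count_ips.awk" "$workdir/sample.log"
# directly, but does not depend on where awk is installed.
result=$(awk -f "$workdir/count_ips.awk" "$workdir/sample.log" | sort)
printf '%s\n' "$result"
rm -rf "$workdir"
```

The output is piped through sort because awk does not promise any particular order for "for (ip in entry)".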
 
Old 07-18-2004, 01:51 PM   #3
Hans Zarkoff
Member
 
Registered: Apr 2002
Location: Canada
Distribution: Slackware
Posts: 58

Original Poster
Rep: Reputation: 15
Buh... whu... How did you know how to do that? Let me be the first to bow before thee.......

Thank you, that seems to have fixed it. I'll post again if I hit a snag.
 
Old 07-18-2004, 04:54 PM   #4
peter_robb
Senior Member
 
Registered: Feb 2002
Location: Szczecin, Poland
Distribution: Gentoo, Debian
Posts: 2,458

Rep: Reputation: 48


I have been reading the O'Reilly book "Effective Awk Programming"..

Nice book, lots to absorb.. highly recommended. It's online and in the gawk sources..

Regards,
Peter
 
Old 07-21-2004, 08:17 PM   #5
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Posts: 952

Rep: Reputation: 217
In addition, associative arrays don't need an "if" check, so you could write it as below

{entry[$11]++}

instead of:

if ( $11 in entry )
    entry[$11] += 1
else
    entry[$11] = 1

It might also speed it up.

End.
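
A quick way to check that the two forms agree is to run both over the same input and compare the results. This is only a sketch: the sample input and the variable names are invented, and it uses field 1 rather than field 11 to keep the input short.

```shell
#!/bin/sh
# Sketch only: the sample input and variable names are made up.
sample='10.0.0.1
10.0.0.2
10.0.0.1'

# Explicit membership test, as in the original script.
explicit=$(printf '%s\n' "$sample" | awk '
    { if ($1 in entry) entry[$1] += 1; else entry[$1] = 1 }
    END { for (ip in entry) print ip, entry[ip] }' | sort)

# Bare increment: a missing element counts as 0 before the ++.
short=$(printf '%s\n' "$sample" | awk '
    { entry[$1]++ }
    END { for (ip in entry) print ip, entry[ip] }' | sort)

printf '%s\n' "$short"
[ "$explicit" = "$short" ] && echo "both forms give the same counts"
```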
 
Old 07-22-2004, 06:02 PM   #6
Hans Zarkoff
Member
 
Registered: Apr 2002
Location: Canada
Distribution: Slackware
Posts: 58

Original Poster
Rep: Reputation: 15
Even if that does nothing, it's much more elegant! I will change it immediately.

One thing though. If the entry is not yet in the array, will its value initialize to 0? Does awk guarantee it initializes to 0? I don't come from an interpreter background so *assuming* such things makes me a little nervous...
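
On the initialization question: POSIX awk specifies that an uninitialized variable or array element has the numeric value 0 and the string value "", so the bare increment is safe. A small sketch to see this on your own awk (the keys and the variable name zero_demo are arbitrary):

```shell
#!/bin/sh
# Sketch only: the array keys and the variable name are arbitrary.
zero_demo=$(awk 'BEGIN {
    # A never-assigned element used as a number reads as 0 ...
    print "numeric value:", entry["10.0.0.1"] + 0
    # ... and used as a string reads as the empty string.
    print "string length:", length(entry["10.0.0.2"])
    # Merely referencing an element creates it, so the "in" operator
    # is the way to test membership without side effects.
    print "created by reference:", ("10.0.0.1" in entry)
    print "never referenced:", ("10.0.0.3" in entry)
}')
printf '%s\n' "$zero_demo"
```

The last two lines show the one catch: reading entry[key] quietly adds the key to the array, whereas "key in entry" does not.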

It still hangs the entire machine. One point (a very important one) I forgot to mention in my first message is it does not hang every time. It only started in the hotter months, but not every time. Just most times.

I'm moving my office to an air conditioned office at the beginning of next month. I shall try it there. I'll bug you smart people if it keeps happening.

Thank you all for your help.
 
Old 07-24-2004, 03:19 PM   #7
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Posts: 952

Rep: Reputation: 217
I'm not really sure about the weather thing. If it melted, that's it. I would think that the "sometimes hangs" behaviour could be due to the number of entries generated in "ent" varying.

I think the key is to reduce the size of the entries.

To lessen the size of "ent", maybe you could put a more effective pattern matching filter in the main module rather than at end of file (END), thus:

{if ($11 > 100 && $11 != "66.246.137.251")
    entry[$11]++
}

Alternatively, you could try fancy `sed`ding or a shell script / small C program that acts as a filter, looking for numbers > 100 and for four dot-separated numbers (an IP address), and then passes its output on to awk...

Also, I'm not sure whether it's OK to write to var/log/tcp... as you do (redirect all output) in the last line of your shell script (which runs awk).

HTH

End.
 
Old 07-24-2004, 08:59 PM   #8
Hans Zarkoff
Member
 
Registered: Apr 2002
Location: Canada
Distribution: Slackware
Posts: 58

Original Poster
Rep: Reputation: 15
Errr.... Your idea to insert what I have in END in the main module won't work. I have to get a count of each IP in the entire file before I can determine what should be written to the output file.

You wrote "if($11 > 100". I think you mean, "if(entry[$11] > 100". Since entry[$11] starts at 0, this will never be true as there's no way to increment it. Unless I'm missing something.

As for the weather issue, you're probably right. I would imagine someone from India knows far more about heat than a Canadian does.

Thanks anyway though.
 
Old 07-25-2004, 01:51 AM   #9
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Posts: 952

Rep: Reputation: 217
I'm sorry for trying to second guess you. I thought > 100 was for checking that somehow what you get is a valid IP number. So please ignore it.

What I was hoping to say was, if the log file contains messages not related to IP addresses, you might filter them out and lessen the search since I thought /var/log/messages.07162004 might contain other entries too.

Very interesting thread, and this data size will probably test the limits of many builds and distros. Please do post the final resolution, even if it's a workaround.

End
 
Old 07-27-2004, 08:59 AM   #10
Hans Zarkoff
Member
 
Registered: Apr 2002
Location: Canada
Distribution: Slackware
Posts: 58

Original Poster
Rep: Reputation: 15
I will do so.

It seems to work more consistently when I call awk -f... from the command line than when it's called from crontab.

Very odd....
 
Old 08-20-2004, 01:32 PM   #11
Hans Zarkoff
Member
 
Registered: Apr 2002
Location: Canada
Distribution: Slackware
Posts: 58

Original Poster
Rep: Reputation: 15
It is an interesting thread. I have found a solution.

I rewrote the script in C. The differences are incredible.

In awk, it took about five minutes before it finished, or hung.

In C, the entire process is done in under ten seconds. At first, I didn't think it was working because it finished too fast. Imagine my surprise (and joy!) when I discovered that it was working, and absurdly fast at that.

As a side note, the first version in C choked on large log files as well. It didn't hang the machine, but I kept getting segmentation faults. I stored the IPs and counts in a linked list. I switched to a binary tree and that fixed it. I thought this was worth mentioning.

I posted a little web page here: http://www.armchair.mb.ca/~richard/nospam for those who want to take a look.

Thank you to everyone for their help.
 
  

