Programming
This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
In short, I run an awk script that extracts IPs from our /var/log/messages file every day. The messages file only ever covers a single day, because I rotate it to a new file, messages.MMDDYYYY. The script collects the IPs into an array, using the IP as the index and the count of each IP as the value. It then builds a tcp.smtp file to be used by qmail to deny IPs that have sent us too much mail.
The messages file can be over 800000 lines.
I've included the script below but I'm not sure it's the actual problem. The script hangs in the array construction. I've tried these things:
1. Nice'ing it.
2. Splitting the messages file to only 500000 lines. That didn't work. Reducing it further to say, a tiny file of 100000 lines, does work, but that's impractical.
3. As a test, instead of building an array, I just spewed the IPs to a file. That also hung the machine.
If I run top while it's running, it does consume 99% of a processor (We have a dual PII-500 with 256MB) but only a tiny amount of memory. That leaves another processor able to handle everything else. These processors have a pretty easy life--This machine is used mostly as a mail server so, apart from a little bit of memory for the outgoing mail queue, the machine isn't taxed too hard, except for this script which should run at 11:55 every night.
It's Slackware 9.1 with gawk 3.1.3, the latest according to gnu.org.
What's odd is it worked fine this winter. I'm almost wondering if it's a heat problem. I live in central Canada so it's either -40F or 104F outside. (-40C or 40C) Inside, we keep it a balmy 68F during the winter but it can get up to 80+F in the summer. It's possible it's a heat problem. If the processors don't usually have much to do, they won't overheat, but perhaps when this script runs, it pushes the processors a little hard and they melt.
I'm hoping someone has a better idea. Any suggestions?
awk '
# Count one hit per smtpd connection line; $11 is the IP used as the index.
/.*smtpd.*from / {
    if ( $11 in entry )
        entry[$11] += 1
    else
        entry[$11] = 1
}
# Once the whole file has been read, write the tcp.smtp rules.
END {
    system("echo \"Done filling array...\" >> logrotate.log")
    printf("# This file is automatically generated from /root/rootscripts/logrotate.\n\n127.:allow,RELAYCLIENT=\"\"\n192.168.:allow,RELAYCLIENT=\"\"\n\n")
    # Deny every IP with more than 100 hits, except the one exempted address.
    for ( ent in entry )
        if ( entry[ent] > 100 && ent != "66.246.137.251" )
            printf("%s:deny\n# Hits for %s: %s\n\n", ent,ent,entry[ent])
    printf(":allow\n")
    system("echo \"Script finished\" >> logrotate.log")
}' /var/log/messages.07162004 > /var/log/tcp.smtp.autogen
Even if that does nothing, it's much more elegant! I will change it immediately.
One thing though. If the entry is not yet in the array, will its value initialize to 0? Does awk guarantee it initializes to 0? I don't come from an interpreter background so *assuming* such things makes me a little nervous...
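For what it's worth, awk does guarantee this: a variable or array element that has never been assigned has the string value "" and the numeric value 0, so incrementing it works the first time an IP is seen. A minimal sketch, assuming a made-up three-line input in place of the real log, with the IP in the first field:

```sh
# Count occurrences of the first field. The sample addresses are invented.
counts=$(printf '10.0.0.1\n10.0.0.2\n10.0.0.1\n' |
  awk '{ entry[$1]++ }   # an unset element is 0 in numeric context
       END { for (ip in entry) print ip, entry[ip] }' |
  sort)
echo "$counts"
```

The sort is only there because the order of a for-in loop over an awk array is unspecified.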
It still hangs the entire machine. One point (a very important one) I forgot to mention in my first message is it does not hang every time. It only started in the hotter months, but not every time. Just most times.
I'm moving my office to an air conditioned office at the beginning of next month. I shall try it there. I'll bug you smart people if it keeps happening.
I'm not really sure about the weather thing. If it melted, that's it. I would think that the "sometimes hangs" could be due to the number of entries generated in "ent" varying.
I think the key is to reduce the size of the entries.
To lessen the size of "ent", maybe you could put a more effective pattern matching filter in the main module rather than at end-of-file (END), thus:
Or else you could try fancy `sed`ding, or a shell script / small C program that acts as a filter, looking for numbers > 100 and for IP addresses made of four dot-separated numbers, and then passes its output on to awk...
Also, I'm not sure whether it's OK to write to var/log/tcp... as you do (redirect all output) in the last line of your shell script (which runs awk).
Errr.... Your idea to insert what I have in END in the main module won't work. I have to get a count of each IP in the entire file before I can determine what should be written to the output file.
You wrote "if($11 > 100". I think you mean, "if(entry[$11] > 100". Since entry[$11] starts at 0, this will never be true as there's no way to increment it. Unless I'm missing something.
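To make the point about ordering concrete, here is a small sketch of the two phases: the main rule only counts, and the threshold test has to wait for END, when the totals are final. The sample lines, the use of field 2 instead of $11, and the `limit` variable (standing in for the 100 in my script) are all made up for illustration.

```sh
# Sample lines are invented; field 2 stands in for $11 of the real log.
report=$(printf 'smtpd 10.0.0.1\nsmtpd 10.0.0.2\nsmtpd 10.0.0.1\ncron x\n' |
  awk -v limit=1 '
    /smtpd/ { entry[$2]++ }              # main rule: only count
    END {                                # totals are final only here
      for (ip in entry)
        if (entry[ip] > limit) printf("%s:deny\n", ip)
    }')
echo "$report"
```

Testing the count inside the main rule instead would look at a partial total, so it cannot replace the check in END.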
As for the weather issue, you're probably right. I would imagine someone from India knows far more about heat than a Canadian does.
I'm sorry for trying to second guess you. I thought > 100 was for checking that somehow what you get is a valid IP number. So please ignore it.
What I was hoping to say was, if the log file contains messages not related to IP addresses, you might filter them out and lessen the search since I thought /var/log/messages.07162004 might contain other entries too.
Very interesting thread, and this data size will probably test the limits of many builds and distros. Please do post the final resolution even if it's a workaround.
It is an interesting thread. I have found a solution.
I rewrote the script in C. The differences are incredible.
In awk, it took about five minutes before it finished, or hung.
In C, the entire process is done in under ten seconds. At first, I didn't think it was working because it finished too fast. Imagine my surprise (and joy!) when I discovered that it was working, and absurdly fast at that.
As a side note, the first version in C choked on large log files as well. It didn't hang the machine, but I kept getting segmentation faults. I stored the IPs and counts in a linked list. I switched to a binary tree and that fixed it. I thought this was worth mentioning.