10-20-2022, 04:51 AM
|
#1
|
Member
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 383
Rep:
|
Modular awk code
Dear AWK lovers,
I received a spam today.
I had to run two different awk scripts to :
- get sender info (e-mail + sender's server)
- get detailed spam score per rule
Both are written in AWK.
I'd like to run a third script that would do both operations.
I thought about writing a shell script that calls both awk scripts, but that would read the file two times.
I'd like to find a solution that scans the file only once.
Is it possible while keeping the two original scripts?
|
|
|
10-20-2022, 05:46 AM
|
#2
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,018
|
You are already reading the file 2 times, how does this differ if you then use a script to do it?
The other solution would be to merge the 2 scripts so that awk only reads the file once.
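One way to merge without editing either file, assuming both are plain pattern/action awk programs, is to pass both to a single awk process with repeated -f options. The sketch below uses hypothetical stand-in scripts (sender.awk, score.awk) and a made-up message, not the originals from this thread:

```shell
#!/bin/sh
dir=$(mktemp -d)

# Stand-ins for the two existing scripts (hypothetical names and contents).
cat > "$dir/sender.awk" <<'EOF'
/^From:/     { from = $0 }
/^Received:/ { recvd = $0 }
END { print from; print recvd }
EOF

cat > "$dir/score.awk" <<'EOF'
/^X-Spam-Score:/ { score = $2 }
END { print "score: " score }
EOF

printf '%s\n' 'Received: from mail.example' 'X-Spam-Score: 3.698' \
    'From: someone@example.com' > "$dir/mail.txt"

# One awk process, one pass over the file. Rules and END blocks from both
# files run in the order the -f options are given.
out=$(awk -f "$dir/sender.awk" -f "$dir/score.awk" "$dir/mail.txt")
printf '%s\n' "$out"
rm -rf "$dir"
```

The caveat is that both files share one global namespace and one $0: a variable name used in both, or a next, exit, or in-place edit of $0 in the first file, will affect the second.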
|
|
|
10-20-2022, 05:51 AM
|
#3
|
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,716
|
Quote:
Originally Posted by ychaouche
I'd like to find a solution that scans the file only once.
|
That is practically not possible. You are looking for two different things and that means you need to run both checks independently. What you can do is to read the file only once and run the two scanners (line by line?).
|
|
|
10-20-2022, 06:18 AM
|
#4
|
Member
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 383
Original Poster
Rep:
|
Quote:
You are already reading the file 2 times, how does this differ if you then use a script to do it?
|
Yes, this is suboptimal.
Ideally, I'd like to read the file just one time.
If I had to do it in, say, python,
I'd read the file one time,
store it in a buffer,
import first script as a module (with a single function in it),
import second script similarly,
call first function with lines stored in the buffer,
call second function with lines stored in the buffer.
This leaves me with 3 scripts:
2 specialized standalone scripts I can call independently on different occasions
1 script that uses code from the 2 specialized scripts on other occasions
Quote:
What you can do is to read the file only once and run the two scanners (line by line?).
|
Can you explain further?
Last edited by ychaouche; 10-20-2022 at 06:28 AM.
|
|
|
10-20-2022, 08:00 AM
|
#5
|
Senior Member
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,722
|
Quote:
Originally Posted by ychaouche
I received a spam today.
I had to run two different awk scripts to :
- get sender info (e-mail + sender's server)
- get detailed spam score per rule
|
Huh - are those not just different headers, so why do they need two different scripts in the first place?
Quote:
Ideally, I'd like to
...
import first script as a module (with a single function in it),
import second script similarly,
...
|
So did you look at either "awk --help" or man awk yet?
So long as you're using GNU Awk, you can include source files with functions or load extension libraries, both mentioned in the help and manpage and documented further in the GNU Awk User's Guide.
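As a rough sketch of the "functions in separate source files" idea: each file below defines functions only, and a small main file calls them once per line. All file and function names are hypothetical. The example loads the files with repeated -f options so that it also runs on non-GNU awk; with gawk, @include lines in main.awk could serve the same purpose.

```shell
#!/bin/sh
dir=$(mktemp -d)

cat > "$dir/sender.awk" <<'EOF'
# Remember the From: header and the most recently seen Received: header.
function sender_line(line) {
    if (line ~ /^From:/)     from  = line
    if (line ~ /^Received:/) recvd = line
}
function sender_report() { print from; print recvd }
EOF

cat > "$dir/spam.awk" <<'EOF'
# Remember the X-Spam-Score header.
function spam_line(line) { if (line ~ /^X-Spam-Score:/) score = line }
function spam_report()   { print score }
EOF

cat > "$dir/main.awk" <<'EOF'
# Feed every input line to both modules, then report in a fixed order.
{ sender_line($0); spam_line($0) }
END { sender_report(); spam_report() }
EOF

printf '%s\n' 'Received: from a.example' 'X-Spam-Score: 3.698' \
    'From: someone@example.com' 'Received: from b.example' > "$dir/mail.txt"

out=$(awk -f "$dir/sender.awk" -f "$dir/spam.awk" -f "$dir/main.awk" "$dir/mail.txt")
printf '%s\n' "$out"
rm -rf "$dir"
```

Each module can still be used on its own by pairing it with a two-line main file of its own, which keeps the "standalone plus combined" layout described earlier in the thread.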
|
|
|
10-20-2022, 08:13 AM
|
#6
|
Member
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 383
Original Poster
Rep:
|
Quote:
Huh - are those not just different headers, so why do they need two different scripts in the first place?
|
Indeed, it's the same email, so same headers.
They are different scripts because the problem evolved that way.
I wrote the first script some time ago because that was my only immediate need,
and the second script after it,
so they were two scripts.
The problem I am tackling now is how do I combine those two?
Quote:
So long as you're using GNU Awk, you can include source files with functions or load extension libraries
|
What do you suggest?
convert script1 and script2 into functions written in a single file?
|
|
|
10-20-2022, 08:34 AM
|
#7
|
Senior Member
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,722
|
Quote:
Originally Posted by ychaouche
The problem I am tackling now is how do I combine those two?
What do you suggest?
convert script1 and script2 into functions written in a single file?
|
The description so far is too abstract to say, and depends on how you are splitting headers and values into records and fields.
If you're actually doing processing with the header values then maybe an email-funcs.awk would be tidier, but if all you're doing is printing then a single file might be sufficient.
|
|
|
10-20-2022, 08:48 AM
|
#8
|
Member
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 383
Original Poster
Rep:
|
Ok, here's some background
|
|
|
10-20-2022, 08:53 AM
|
#9
|
Member
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 383
Original Poster
Rep:
|
|
|
|
10-20-2022, 09:23 AM
|
#10
|
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,517
|
Please use [code] [/code] tags and post your work here so that it may be viewed safely. Thanks. Use one set of tags per script and then extras for any sample messages. Also consider that AWK might not be the best scripting language for this task since there are modules for Python and Perl which extract this information for you in a consistent manner.
|
|
|
10-20-2022, 09:29 AM
|
#11
|
Member
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 383
Original Poster
Rep:
|
So, what was suggested on #bash by emanuele6 and vuipuh is the use of tee,
Code:
tee >(script1) >(script2) < input
It seems to work, with the exception of output from script1 and script2 being written in parallel,
possibly causing the intermixing of the two.
So now the next problem I'd like to tackle is how to guarantee that the output comes in order,
first from script1, then from script2?
maybe capture the output from both scripts,
wait until they finish,
then write output1, followed by output2
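A minimal sketch of that "capture, wait, then print in order" idea, using named pipes instead of process substitution so that a plain wait is enough to know both consumers have finished. The two inline awk commands are stand-ins for script1 and script2, and all file names are made up for the example:

```shell
#!/bin/sh
dir=$(mktemp -d)
mkfifo "$dir/in1" "$dir/in2"

printf '%s\n' 'From: someone@example.com' 'X-Spam-Score: 3.698' > "$dir/mail.txt"

# Stand-ins for script1 and script2: each reads its own FIFO and
# writes to its own capture file.
awk '/^From:/ { print "sender: " $2 }'        < "$dir/in1" > "$dir/out1" &
awk '/^X-Spam-Score:/ { print "score: " $2 }' < "$dir/in2" > "$dir/out2" &

# Read the message once and feed both consumers.
tee "$dir/in1" < "$dir/mail.txt" > "$dir/in2"

# Both background jobs have seen EOF once tee exits; wait for them to finish.
wait

# Emit the captured outputs in a fixed order.
out=$(cat "$dir/out1" "$dir/out2")
printf '%s\n' "$out"
rm -rf "$dir"
```

The input is still read only once; the cost is two small temporary files for the captured output.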
|
|
|
10-20-2022, 09:32 AM
|
#12
|
Member
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 383
Original Poster
Rep:
|
Quote:
Please use [code] [/code] tags and post your work here so that it may be viewed safely. Thanks.
|
Ok, I'll copy/paste here.
I received a spam today at 09:55
I had to run two different awk scripts:
- one to get sender info (e-mail + sender's server)
- one to get detailed spam score per rule
Both are written in AWK.
I'd like to run a single script that would do both operations.
A shell script that calls both awk scripts would read the file two times.
I'm thinking of a solution that scans the file a single time.
Is it possible while having two separate awk files?
first script :
Code:
#!/usr/bin/gawk -f
# extract sender's e-mail, IP and original domain of the sending host, if any.
/^From:/ {from=$0}
/Received:/ {recvd=$0}
END {
    print from "\n" recvd
}
second script
Code:
#!/usr/bin/gawk -f
/tests/ {
    tests=1;
    sub(/tests=\[/,"");
}
/Received:/ {tests=0}
{
    if (tests) {
        # each test in its own line
        gsub(/, /,"\n");
        # remove all spaces and tabs
        gsub(/[ \t]/,"");
        # remove autolearn=disabled after last rule.
        gsub(/\].+/,"");
        # append modified line
        lines = lines $0 "\n"
    }
}
END {
    print lines;
}
Here's what I tried :
Code:
14:56:32 ~ -2- $ tee >(/home/ychaouche/SYNCHRO/mail.headers.sender.info) >(mail.headers.spam.rules.pretty)
[start paste]
Return-Path: <info2@krodaer.bar>
Delivered-To: <a.chaouche@algerian-radio.dz>
Received: from messagerie.algerian-radio.dz
by messagerie.algerian-radio.dz (Dovecot) with LMTP id SFqoOvsMUWNf7gAArJM0yg
for <a.chaouche@algerian-radio.dz>; Thu, 20 OReturn-Path: <info2@krodaer.bar>
ct 2022 09:55:45 +0100
Received: from localhost (localhost [127.0.0.1])
by messagerie.algerian-radio.dz (Postfix) with ESMTP id BA3E23A8009F
for <a.chaouche@algerian-radio.dz>; Thu, 20 Oct 2022 09:55:45 +0100 (CET)
X-Virus-Scanned: Debian amavisdDelivered-To: <a.chaouche@algerian-radio.dz>-new at messagerie.algerian-radio.dz
X-Spam-Flag: NO
X-Spam-Score: 3.698
X-Spam-Level: ***
X-Spam-Status: No, score=3.698 tagged_above=-999 required=5
tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.1, SPF_HELO_NONE=0.001,
SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URI_PHISH=3.696]
autolearn=disabled
Received: from messagerie.algerian-radio.dz ([127.0.0.1])
by localhost (messagerie.algerian-radio.dz. [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id yqj7THlbuj7y for <a.chaouche@algerian-radio.dz>;
Thu, 20 Oct 2022 09:55:45 +0100 (CET)
Received: from mail0.krodaer.bar (mail0.krodaer.bar [137.184.33.43])
by messagerie.algerian-radio.dz (Postfix) with ESMTPS id E5DFF3A80097
for <a.chaouche@algerian-radio.dz>; Thu, 20 Oct 2022 09:55:44 +0100 (CET)
Authentication-Results: messagerie.algerian-radio.dz; dkim=pass
reason="1024-bit key; unprotected key"
header.d=krodaer.bar header.i=info2@krodaer.bar header.b=iYVKw8pZ;
dk
im-adsp=pass; dkim-atps=neutral
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=default; d=krodaer.bar;
h=From:To:Subject:Date:Message-ID:MIME-Version:Content-Type:
Content-Transfer-Encoding; i=info2@krodaer.bar;
bh=5cwpj0W1P6lQ1Y3J8/8IUq62NYReceived: from messagerie.algerian-radio.dz
1T2EF4V17aPnVkk+o=;
b=iYVKw8pZXDuKwCEHRcZQSk0Pq8geeBYrIjFmJNIFX/8Nr/ObvIPLluUnHB3YLXFC8O1VyhxN+4Rh
GAcghKY2mDy8uClhpWVuXK279GW7sB98JwQhm1ZWH7CEVeKwYu/LiQevcJ28WuPAU3xQ/gv43vbO
by messagerie.algerian-radio.dz (Dovecot) with LMTP id SFqoOvsMUWNf7gAArJM0yg
xoF30mTtohkOvGu0mZs=
From: algerian-radio.dz Cpanel<info2@krodaer.b for <a.chaouche@algerian-radio.dz>; Thu, 20 Oct 2022 09:55:45 +0100ar>
To: a.chaouche@algerian-radio.dz
Subject: Verify Your a.chaouche@algerian-radio.dz To Recover (9) Pending Emails`
Date: 20 Oct 2022 01:55:42 -0700
Message-ID: <20221020015542.55AFC8B0048AA646@krodaer.bar>
MIME-Version: 1.0
Content-Type: tex
t/html
Content-Transfer-Encoding: quoted-printableReceived: from localhost (localhost [127.0.0.1])
by messagerie.algerian-radio.dz (Postfix) with ESMTP id BA3E23A8009F
for <a.chaouche@algerian-radio.dz>; Thu, 20 Oct 2022 09:55:45 +0100 (CET)
X-Virus-Scanned: Debian amavisd-new at messagerie.algerian-radio.dz
X-Spam-Flag: NO
X-Spam-Score: 3.698
X-Spam-Level: ***
X-Spam-Status: No, score=3.698 tagged_above=-999 required=5
tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.1, SPF_HELO_NONE=0.001,
SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URI_PHISH=3.696]
autolearn=disabled
Received: from messagerie.algerian-radio.dz ([127.0.0.1])
by localhost (messagerie.algerian-radio.dz. [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id yqj7THlbuj7y for <a.chaouche@algerian-radio.dz>;
Thu, 20 Oct 2022 09:55:45 +0100 (CET)
Received: from mail0.krodaer.bar (mail0.krodaer.bar [137.184.33.43])
by messagerie.algerian-radio.dz (Postfix) with ESMTPS id E5DFF3A80097
for <a.chaouche@algerian-radio.dz>; Thu, 20 Oct 2022 09:55:44 +0100 (CET)
Authentication-Results: messagerie.algerian-radio.dz; dkim=pass
reason="1024-bit key; unprotected key"
header.d=krodaer.bar header.i=info2@krodaer.bar header.b=iYVKw8pZ;
dkim-adsp=pass; dkim-atps=neutral
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=default; d=krodaer.bar;
h=From:To:Subject:Date:Message-ID:MIME-Version:Content-Type:
Content-Transfer-Encoding; i=info2@krodaer.bar;
bh=5cwpj0W1P6lQ1Y3J8/8IUq62NY1T2EF4V17aPnVkk+o=;
b=iYVKw8pZXDuKwCEHRcZQSk0Pq8geeBYrIjFmJNIFX/8Nr/ObvIPLluUnHB3YLXFC8O1VyhxN+4Rh
GAcghKY2mDy8uClhpWVuXK279GW7sB98JwQhm1ZWH7CEVeKwYu/LiQevcJ28WuPAU3xQ/gv43vbO
xoF30mTtohkOvGu0mZs=
From: algerian-radio.dz Cpanel<info2@krodaer.bar>
To: a.chaouche@algerian-radio.dz
Subject: Verify Your a.chaouche@algerian-radio.dz To Recover (9) Pending Emails`
Date: 20 Oct 2022 01:55:42 -0700
Message-ID: <20221020015542.55AFC8B0048AA646@krodaer.bar>
MIME-Version: 1.0
Content-Type: text/html
Content-Transfer-Encoding: quoted-printable
[end paste]
DKIM_SIGNED=0.1
DKIM_VALID=-0.1
DKIM_VALID_AU=-0.1,
HTML_MESSAGE=0.001
MIME_HTML_ONLY=0.1
SPF_HELO_NONE=0.001,
SPF_PASS=-0.001
URIBL_BLOCKED=0.001
URI_PHISH=3.696]
autolearn=disabled
From: algerian-radio.dz Cpanel<info2@krodaer.bar>
Received: from mail0.krodaer.bar (mail0.krodaer.bar [137.184.33.43])
14:56:50 ~ -2- $
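The stray trailing commas, the leftover "]" and the autolearn=disabled line in the output above come from editing each physical line separately: ", " never matches a comma at the end of a line, and /\].+/ needs at least one character after the bracket. A possible alternative, sketched here on made-up sample data, is to join the folded X-Spam-Status header first and split it afterwards:

```shell
#!/bin/sh
out=$(printf 'X-Spam-Status: No, score=3.698\n\ttests=[A=0.1, B=-0.1,\n\tC=3.696]\n\tautolearn=disabled\nReceived: from x\n' |
awk '
# Start collecting at the header line, keep appending continuation lines.
/^X-Spam-Status:/  { inside = 1; buf = $0; next }
inside && /^[ \t]/ { buf = buf " " $0; next }
                   { inside = 0 }
END {
    # keep only the text between "tests=[" and "]"
    sub(/.*tests=\[/, "", buf)
    sub(/\].*/, "", buf)
    # split on commas, swallowing any whitespace that follows them
    n = split(buf, t, /,[ \t]*/)
    for (i = 1; i <= n; i++) print t[i]
}')
printf '%s\n' "$out"
```

This only uses POSIX awk features, so it should behave the same under gawk and other awk implementations.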
|
|
|
10-20-2022, 10:04 AM
|
#13
|
LQ Guru
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,517
|
Thanks. I would guess, something like this, per single message:
Code:
#!/usr/bin/gawk -f
# extract sender's e-mail, IP and original domain of the
# sending host, if any.
/^X-Spam/ {
    test=1;
    xspam=xspam "\n" $0;
    next;
}
/^[[:alpha:]]/ {
    test=0;
}
test {
    xspam=xspam "\n" $0
}
/^From:/ {
    from=$0;
}
/^Received:/ {
    recvd=$0;
}
/^$/ {
    exit;
}
END {
    sub(/^\n/, "", xspam);
    print from;
    print recvd;
    print xspam;
}
However, why AWK and why not use Python's email.parser or CPAN's Mail::Box::Parser::Perl instead?
Last edited by Turbocapitalist; 10-20-2022 at 10:05 AM.
|
|
|
10-20-2022, 10:09 AM
|
#14
|
Senior Member
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,722
|
I agree entirely with Turbocapitalist, using a language with an existing rfc5322-compliant library is the best way to approach this.
However, I was intrigued since it's not quite as simple as splitting on newlines and colon, and decided to write a script which works as follows:
Code:
$ awk -f email-headers.awk email-headers.txt
Spam Test Scores
DKIM_SIGNED=0.1
DKIM_VALID=-0.1
DKIM_VALID_AU=-0.1
HTML_MESSAGE=0.001
MIME_HTML_ONLY=0.1
SPF_HELO_NONE=0.001
SPF_PASS=-0.001
URIBL_BLOCKED=0.001
URI_PHISH=3.696
From
algerian-radio.dz Cpanel<info2@krodaer.bar>
Received
from mail0.krodaer.bar (mail0.krodaer.bar [137.184.33.43]) by messagerie.algerian-radio.dz (Postfix) with ESMTPS id E5DFF3A80097 for <user@example.com>; Thu, 20 Oct 2022 09:55:44 +0100 (CET)
(I've replaced what I assume is a real non-spammer email address in that - you might want to edit your post to do the same, otherwise you might get more spam to deal with.)
The script itself:
Code:
BEGIN {
    # Message header values can contain newline-whitespace,
    # so to handle this, split records via "newline-name-colon"
    # then extract the header name from RT variable
    # The ^ ensures the first row is blank for simpler logic
    RS = "(^|\r?\n)[A-Z][A-Za-z0-9\\-]+: ?"
    Header = ""
    if (Debug) print "DEBUG: Debug mode enabled"
}
function unfold(Value,WS)
{
    if ( WS == 0 )
    {
        # only remove newlines - as per RFC
        # (gsub has no backreferences, so use gawk's gensub)
        Value = gensub(/\r?\n([ \t])/,"\\1","g",Value)
    }
    else
    {
        # replace extra whitespace with single space
        gsub(/\r?\n[ \t]+/," ",Value)
    }
    return Value
}
Header != "" && Debug {
    print "DEBUG: header name ["Header"] value ["$0"]"
}
Header == "From" {
    print Header
    print $0
    print ""
}
Header == "Received" {
    LastReceived = unfold($0,1)
}
END {
    print "Received"
    print LastReceived
    print ""
}
Header == "X-Spam-Status" {
    print "Spam Test Scores"
    Value = unfold($0,1)
    if (Debug) print "DEBUG:" Value
    match(Value,/tests=\[([^\]]+)/,Matched)
    # loop by index: "for (x in array)" does not guarantee order
    Count = split(Matched[1],Scores,/, /)
    for (I = 1; I <= Count; I++)
        print Scores[I]
    print ""
}
# the following rule must always be executed last
# (so if "next" is used, this must also go before it)
{ Header=RT; gsub(/[\r\n: ]/,"",Header) }
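The script above relies on gawk-only features (a regex RS together with RT, and the array form of match). For comparison, here is a rough sketch of the unfolding step alone in POSIX awk, reading line by line and gluing continuation lines onto the previous header. The sample message is made up, and the sketch assumes LF line endings:

```shell
#!/bin/sh
out=$(printf 'X-Spam-Status: No, score=3.698\n\ttests=[A=0.1, B=-0.1]\nFrom: someone@example.com\n\nbody\n' |
awk '
# A blank line ends the header section; END still runs after exit.
/^$/     { exit }
# A line starting with whitespace continues the previous header.
/^[ \t]/ { sub(/^[ \t]+/, " "); cur = cur $0; next }
# Any other line starts a new header: flush the previous one first.
         { if (cur != "") print cur; cur = $0 }
END      { if (cur != "") print cur }
')
printf '%s\n' "$out"
```

Each output line is then one complete "Name: value" header, which simple per-line rules such as /^From:/ can process safely.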
|
|
|
All times are GMT -5.