LinuxQuestions.org
Old 10-20-2022, 05:51 AM   #1
ychaouche
Member
 
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 261
Blog Entries: 1

Rep: Reputation: 22
Modular awk code


Dear AWK lovers,

I received a spam today.
I had to run two different awk scripts to :
- get sender info (e-mail + sender's server)
- get detailed spam score per rule

Both are written in AWK.
I'd like to run a third script that would do both operations.
I thought about writing a shell script that calls both awk scripts, but that would read the file two times.
I'd like to find a solution that scans the file only once.
Is it possible while keeping the two original scripts?
 
Old 10-20-2022, 06:46 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,958

Rep: Reputation: 3173
You are already reading the file 2 times, how does this differ if you then use a script to do it?

The other solution would be to merge the 2 scripts so awk then only reads it once
 
Old 10-20-2022, 06:51 AM   #3
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 19,248

Rep: Reputation: 6525
Quote:
Originally Posted by ychaouche
I'd like to find a solution that scans the file only once.
That is practically not possible. You are looking for two different things and that means you need to run both checks independently. What you can do is to read the file only once and run the two scanners (line by line?).
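As a rough sketch of "read once, run both scanners line by line": POSIX awk accepts several -f options and treats the files as one program, so every input line is read once and offered to the rules of each file in turn, and the END blocks run in file order. The file names sender.awk and spam.awk and the sample message below are made up for illustration, not the scripts from this thread. It assumes the two files do not reuse each other's variable names, and a file that modifies $0 should be listed last.

```shell
#!/bin/sh
# Sketch: run two independent awk rule files over one pass of the input.
# sender.awk, spam.awk and the sample message are hypothetical stand-ins.
tmp=$(mktemp -d)

cat > "$tmp/sender.awk" <<'EOF'
/^From:/     { from = $0 }
/^Received:/ { recvd = $0 }
END          { print from; print recvd }
EOF

cat > "$tmp/spam.awk" <<'EOF'
/^X-Spam-Score:/ { score = $2 }
END              { print "score=" score }
EOF

cat > "$tmp/mail.txt" <<'EOF'
Received: from mail0.example.org
X-Spam-Score: 3.698
From: someone@example.org
EOF

# Both files form a single program: the input is scanned only once.
combined=$(awk -f "$tmp/sender.awk" -f "$tmp/spam.awk" "$tmp/mail.txt")
printf '%s\n' "$combined"
rm -rf "$tmp"
```

Each file can still be run on its own with a single -f, so the two specialized scripts stay usable independently.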
 
Old 10-20-2022, 07:18 AM   #4
ychaouche
Member
 
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 261

Original Poster
Blog Entries: 1

Rep: Reputation: 22
Quote:
You are already reading the file 2 times, how does this differ if you then use a script to do it?
Yes, this is suboptimal.
Ideally, I'd like to read the file just one time.
If I had to do it in, say, python,
I'd read the file one time,
store it in a buffer,
import first script as a module (with a single function in it),
import second script similarly,
call first function with lines stored in the buffer,
call second function with lines stored in the buffer.
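The same buffer-then-call structure can be approximated in awk itself. This is only a sketch: the function names sender_info and spam_score and the sample lines are invented for illustration, and each function simply prints the lines it cares about.

```shell
#!/bin/sh
# Sketch: read the input once into an array, then hand the buffer to
# two functions in END. Function names and sample data are made up.
report=$(printf '%s\n' \
    'Received: from mail0.example.org' \
    'X-Spam-Score: 3.698' \
    'From: someone@example.org' |
awk '
function sender_info(buf, n,    i) {
    for (i = 1; i <= n; i++)
        if (buf[i] ~ /^(From|Received):/) print buf[i]
}
function spam_score(buf, n,    i) {
    for (i = 1; i <= n; i++)
        if (buf[i] ~ /^X-Spam-Score:/) print buf[i]
}
{ buf[NR] = $0 }                 # single pass: store every line
END {
    sender_info(buf, NR)         # first "module"
    spam_score(buf, NR)          # second "module"
}')
printf '%s\n' "$report"
```

Because all output happens in END, the first function's output always precedes the second's.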

This leaves me with 3 scripts:
2 specialized standalone scripts I can call independently on different occasions
1 script that uses code from the 2 specialized scripts on other occasions

Quote:
What you can do is to read the file only once and run the two scanners (line by line?).
Can you explain further?

Last edited by ychaouche; 10-20-2022 at 07:28 AM.
 
Old 10-20-2022, 09:00 AM   #5
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 2,983

Rep: Reputation: 2120
Quote:
Originally Posted by ychaouche
I received a spam today.
I had to run two different awk scripts to :
- get sender info (e-mail + sender's server)
- get detailed spam score per rule
Huh - are those not just different headers, so why do they need two different scripts in the first place?


Quote:
Ideally, I'd like to
...
import first script as a module (with a single function in it),
import second script similarly,
...
So did you look at either "awk --help" or man awk yet?

So long as you're using GNU Awk, you can include source files with functions or load extension libraries, both mentioned in the help and manpage and documented further in the GNU Awk User's Guide.
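A minimal sketch of the "source file with functions" route, under some assumptions: funcs.awk, main.awk and header_value() are illustrative names, not anything from the original scripts. The sketch loads the library with a second -f, which works in any POSIX awk; with GNU Awk the main script could instead pull it in with an @include "funcs.awk" line.

```shell
#!/bin/sh
# Sketch: keep shared functions in one file and load it next to a main
# script. funcs.awk, main.awk and header_value() are illustrative names.
dir=$(mktemp -d)

cat > "$dir/funcs.awk" <<'EOF'
# Return the text after "Name: " for a header line.
function header_value(line) {
    sub(/^[^:]*:[ \t]*/, "", line)
    return line
}
EOF

cat > "$dir/main.awk" <<'EOF'
/^From:/ { print "sender=" header_value($0) }
EOF

result=$(printf 'From: someone@example.org\n' |
         awk -f "$dir/funcs.awk" -f "$dir/main.awk")
printf '%s\n' "$result"
rm -rf "$dir"
```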

 
Old 10-20-2022, 09:13 AM   #6
ychaouche
Member
 
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 261

Original Poster
Blog Entries: 1

Rep: Reputation: 22
Quote:
Huh - are those not just different headers, so why do they need two different scripts in the first place?
Indeed, it's the same email, so same headers.
They are different scripts because the problem evolved that way.
I wrote first script some time ago because that was my only immediate need,
and second script after it,
so they were two scripts.

The problem I am tackling now is how do I combine those two?

Quote:
So long as you're using GNU Awk, you can include source files with functions or load extension libraries
What do you suggest?
convert script1 and script2 into functions written in a single file?
 
Old 10-20-2022, 09:34 AM   #7
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 2,983

Rep: Reputation: 2120
Quote:
Originally Posted by ychaouche
The problem I am tackling now is how do I combine those two?


What do you suggest?
convert script1 and script2 into functions written in a single file?
The description so far is too abstract to say, and depends on how you are splitting headers and values into records and fields.

If you're actually doing processing with the header values then maybe an email-funcs.awk would be tidier, but if all you're doing is printing then a single file might be sufficient.

 
Old 10-20-2022, 09:48 AM   #8
ychaouche
Member
 
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 261

Original Poster
Blog Entries: 1

Rep: Reputation: 22
Ok, here's some background
 
Old 10-20-2022, 09:53 AM   #9
ychaouche
Member
 
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 261

Original Poster
Blog Entries: 1

Rep: Reputation: 22
Woops! Here
 
Old 10-20-2022, 10:23 AM   #10
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 6,540
Blog Entries: 3

Rep: Reputation: 3410
Please use [code] [/code] tags and post your work here so that it may be viewed safely. Thanks. Use one set of tags per script and then extras for any sample messages. Also consider that AWK might not be the best scripting language for this task since there are modules for Python and Perl which extract this information for you in a consistent manner.
 
Old 10-20-2022, 10:29 AM   #11
ychaouche
Member
 
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 261

Original Poster
Blog Entries: 1

Rep: Reputation: 22
So, what was suggested on #bash by emanuele6 and vuipuh is the use of tee,
Code:
tee >(script1) >(script2) < input
It seems to work, with the exception of output from script1 and script2 being written in parallel,
possibly causing the intermixing of the two.

So now the next problem I'd like to tackle is how to guarantee that the output comes in order,
first from script1, then from script2?

maybe capture the output from both scripts,
wait until they finish,
then write output1, followed by output2
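One possible way to realise that capture-then-print idea, sketched with named pipes instead of process substitution so that a plain wait covers both readers. Here script1 and script2 are placeholder shell functions standing in for the two real awk scripts, and the sample input is invented.

```shell
#!/bin/sh
# Sketch: feed one input stream to two readers, capture each output,
# and print the captures in a fixed order once both have finished.
# script1/script2 are placeholders for the real awk scripts.
script1() { awk '/^From:/ { print "1: " $0 }'; }
script2() { awk '/^X-Spam-Score:/ { print "2: " $0 }'; }

work=$(mktemp -d)
mkfifo "$work/in1" "$work/in2"

# Start both readers in the background, each writing to its own file.
script1 < "$work/in1" > "$work/out1" &
script2 < "$work/in2" > "$work/out2" &

# tee reads the input once and copies it to both pipes.
printf '%s\n' 'X-Spam-Score: 3.698' 'From: someone@example.org' |
    tee "$work/in1" > "$work/in2"

wait                                   # both readers are done here
ordered=$(cat "$work/out1" "$work/out2")
printf '%s\n' "$ordered"
rm -rf "$work"
```

The input is still read only once; the cost is two small temporary files holding the captured output.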
 
Old 10-20-2022, 10:32 AM   #12
ychaouche
Member
 
Registered: Mar 2017
Distribution: Mint, Debian, Q4OS, Mageia, KDE Neon
Posts: 261

Original Poster
Blog Entries: 1

Rep: Reputation: 22
Quote:
Please use tags and post your work here so that it may be viewed safely. Thanks.
Ok, I'll copy/paste here.



I received a spam today at 09:55
I had to run two different awk scripts:
- one to get sender info (e-mail + sender's server)
- one to get detailed spam score per rule

Both are written in AWK.
I'd like to run a single script that would do both operations.
A shell script that calls both awk scripts would read the file two times.
I'm thinking of a solution that scans the file a single time.
Is it possible while having two separate awk files?

first script :

Code:
#!/usr/bin/gawk -f
# extract sender's e-mail, IP and original domain of the sending host, if any.

/^From:/ {from=$0} 
/Received:/ {recvd=$0}

END {
    print from "\n" recvd
}

second script
Code:
#!/usr/bin/gawk -f

/tests/ {
    tests=1; 
    sub(/tests=\[/,"");
} 

/Received:/ {tests=0} 

{
    if (tests) { 
	    # each test in its own line
	    gsub(/, /,"\n"); 
	    # remove preceding spaces and tabs
	    gsub(/[ \t]/,"");
	    # # remove autolearn=disabled after last rule.
	    gsub(/\].+/,"");
	    # print modified line
	    lines = lines $0 "\n"
	}
}


END {
    print lines;
}
Here's what I tried :
Code:
14:56:32 ~ -2- $ tee >(/home/ychaouche/SYNCHRO/mail.headers.sender.info) >(mail.headers.spam.rules.pretty)
[start paste]

Return-Path: <info2@krodaer.bar>                                      
Delivered-To: <a.chaouche@algerian-radio.dz>
Received: from messagerie.algerian-radio.dz
        by messagerie.algerian-radio.dz (Dovecot) with LMTP id SFqoOvsMUWNf7gAArJM0yg
        for <a.chaouche@algerian-radio.dz>; Thu, 20 OReturn-Path: <info2@krodaer.bar>
ct 2022 09:55:45 +0100
Received: from localhost (localhost [127.0.0.1])
        by messagerie.algerian-radio.dz (Postfix) with ESMTP id BA3E23A8009F
        for <a.chaouche@algerian-radio.dz>; Thu, 20 Oct 2022 09:55:45 +0100 (CET)
X-Virus-Scanned: Debian amavisdDelivered-To: <a.chaouche@algerian-radio.dz>-new at messagerie.algerian-radio.dz
X-Spam-Flag: NO
X-Spam-Score: 3.698
X-Spam-Level: ***
X-Spam-Status: No, score=3.698 tagged_above=-999 required=5
        tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
        HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.1, SPF_HELO_NONE=0.001,
        SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URI_PHISH=3.696]
        autolearn=disabled
Received: from messagerie.algerian-radio.dz ([127.0.0.1])
        by localhost (messagerie.algerian-radio.dz. [127.0.0.1]) (amavisd-new, port 10024)
        with ESMTP id yqj7THlbuj7y for <a.chaouche@algerian-radio.dz>;
        Thu, 20 Oct 2022 09:55:45 +0100 (CET)
Received: from mail0.krodaer.bar (mail0.krodaer.bar [137.184.33.43])
        by messagerie.algerian-radio.dz (Postfix) with ESMTPS id E5DFF3A80097
        for <a.chaouche@algerian-radio.dz>; Thu, 20 Oct 2022 09:55:44 +0100 (CET)
Authentication-Results: messagerie.algerian-radio.dz; dkim=pass
        reason="1024-bit key; unprotected key"
        header.d=krodaer.bar header.i=info2@krodaer.bar header.b=iYVKw8pZ;
        dk
im-adsp=pass; dkim-atps=neutral
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=default; d=krodaer.bar;
 h=From:To:Subject:Date:Message-ID:MIME-Version:Content-Type:
 Content-Transfer-Encoding; i=info2@krodaer.bar;
 bh=5cwpj0W1P6lQ1Y3J8/8IUq62NYReceived: from messagerie.algerian-radio.dz
1T2EF4V17aPnVkk+o=;
 b=iYVKw8pZXDuKwCEHRcZQSk0Pq8geeBYrIjFmJNIFX/8Nr/ObvIPLluUnHB3YLXFC8O1VyhxN+4Rh
   GAcghKY2mDy8uClhpWVuXK279GW7sB98JwQhm1ZWH7CEVeKwYu/LiQevcJ28WuPAU3xQ/gv43vbO
        by messagerie.algerian-radio.dz (Dovecot) with LMTP id SFqoOvsMUWNf7gAArJM0yg
  xoF30mTtohkOvGu0mZs=
From: algerian-radio.dz Cpanel<info2@krodaer.b  for <a.chaouche@algerian-radio.dz>; Thu, 20 Oct 2022 09:55:45 +0100ar>
To: a.chaouche@algerian-radio.dz
Subject: Verify Your a.chaouche@algerian-radio.dz To Recover (9) Pending Emails`
Date: 20 Oct 2022 01:55:42 -0700
Message-ID: <20221020015542.55AFC8B0048AA646@krodaer.bar>
MIME-Version: 1.0
Content-Type: tex
t/html
Content-Transfer-Encoding: quoted-printableReceived: from localhost (localhost [127.0.0.1])
        by messagerie.algerian-radio.dz (Postfix) with ESMTP id BA3E23A8009F
        for <a.chaouche@algerian-radio.dz>; Thu, 20 Oct 2022 09:55:45 +0100 (CET)
X-Virus-Scanned: Debian amavisd-new at messagerie.algerian-radio.dz
X-Spam-Flag: NO
X-Spam-Score: 3.698
X-Spam-Level: ***
X-Spam-Status: No, score=3.698 tagged_above=-999 required=5
        tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
        HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.1, SPF_HELO_NONE=0.001,
        SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URI_PHISH=3.696]
        autolearn=disabled
Received: from messagerie.algerian-radio.dz ([127.0.0.1])
        by localhost (messagerie.algerian-radio.dz. [127.0.0.1]) (amavisd-new, port 10024)
        with ESMTP id yqj7THlbuj7y for <a.chaouche@algerian-radio.dz>;
        Thu, 20 Oct 2022 09:55:45 +0100 (CET)
Received: from mail0.krodaer.bar (mail0.krodaer.bar [137.184.33.43])
        by messagerie.algerian-radio.dz (Postfix) with ESMTPS id E5DFF3A80097
        for <a.chaouche@algerian-radio.dz>; Thu, 20 Oct 2022 09:55:44 +0100 (CET)
Authentication-Results: messagerie.algerian-radio.dz; dkim=pass
        reason="1024-bit key; unprotected key"
        header.d=krodaer.bar header.i=info2@krodaer.bar header.b=iYVKw8pZ;
        dkim-adsp=pass; dkim-atps=neutral
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=default; d=krodaer.bar;
 h=From:To:Subject:Date:Message-ID:MIME-Version:Content-Type:
 Content-Transfer-Encoding; i=info2@krodaer.bar;
 bh=5cwpj0W1P6lQ1Y3J8/8IUq62NY1T2EF4V17aPnVkk+o=;
 b=iYVKw8pZXDuKwCEHRcZQSk0Pq8geeBYrIjFmJNIFX/8Nr/ObvIPLluUnHB3YLXFC8O1VyhxN+4Rh
   GAcghKY2mDy8uClhpWVuXK279GW7sB98JwQhm1ZWH7CEVeKwYu/LiQevcJ28WuPAU3xQ/gv43vbO
   xoF30mTtohkOvGu0mZs=
From: algerian-radio.dz Cpanel<info2@krodaer.bar>
To: a.chaouche@algerian-radio.dz
Subject: Verify Your a.chaouche@algerian-radio.dz To Recover (9) Pending Emails`
Date: 20 Oct 2022 01:55:42 -0700
Message-ID: <20221020015542.55AFC8B0048AA646@krodaer.bar>
MIME-Version: 1.0
Content-Type: text/html

Content-Transfer-Encoding: quoted-printable



[end paste]




DKIM_SIGNED=0.1
DKIM_VALID=-0.1
DKIM_VALID_AU=-0.1,
HTML_MESSAGE=0.001
MIME_HTML_ONLY=0.1
SPF_HELO_NONE=0.001,
SPF_PASS=-0.001
URIBL_BLOCKED=0.001
URI_PHISH=3.696]
autolearn=disabled

From: algerian-radio.dz Cpanel<info2@krodaer.bar>
Received: from mail0.krodaer.bar (mail0.krodaer.bar [137.184.33.43])
14:56:50 ~ -2- $
 
Old 10-20-2022, 11:04 AM   #13
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 6,540
Blog Entries: 3

Rep: Reputation: 3410
Thanks. I would guess, something like this, per single message:

Code:
#!/usr/bin/gawk -f

# extract sender's e-mail, IP and original domain of the  
# sending host, if any. 

/^X-Spam/ {
    test=1; 
    xspam=xspam "\n" $0;
    next;   
}

/^[[:alpha:]]/ {
    test=0; 
}

test {
    xspam=xspam "\n" $0 
}

/^From:/ {
    from=$0;
} 

/^Received:/ {
    recvd=$0;
}

/^$/ {
    exit;
}

END {
    sub(/^\n/, "", xspam); 
    print from;
    print recvd;
    print xspam;
}
However, why AWK and why not use Python's email.parser or CPAN's Mail::Box::Parser::Perl instead?

Last edited by Turbocapitalist; 10-20-2022 at 11:05 AM.
 
Old 10-20-2022, 11:09 AM   #14
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 2,983

Rep: Reputation: 2120

I agree entirely with Turbocapitalist: using a language with an existing RFC 5322-compliant library is the best way to approach this.

However, I was intrigued since it's not quite as simple as splitting on newlines and colon, and decided to write a script which works as follows:
Code:
$ awk -f email-headers.awk email-headers.txt
Spam Test Scores
DKIM_SIGNED=0.1
DKIM_VALID=-0.1
DKIM_VALID_AU=-0.1
HTML_MESSAGE=0.001
MIME_HTML_ONLY=0.1
SPF_HELO_NONE=0.001
SPF_PASS=-0.001
URIBL_BLOCKED=0.001
URI_PHISH=3.696

From
algerian-radio.dz Cpanel<info2@krodaer.bar>

Received
from mail0.krodaer.bar (mail0.krodaer.bar [137.184.33.43]) by messagerie.algerian-radio.dz (Postfix) with ESMTPS id E5DFF3A80097 for <user@example.com>; Thu, 20 Oct 2022 09:55:44 +0100 (CET)
(I've replaced what I assume is a real non-spammer email address in that - you might want to edit your post to do the same, otherwise you might get more spam to deal with.)

The script itself:
Code:
BEGIN {
   # Message header values can contain newline-whitespace,
   # so to handle this, split records via "newline-name-colon"
   # then extract the header name from RT variable
   # The ^ ensures the first row is blank for simpler logic

   RS = "(^|\r?\n)[A-Z][A-Za-z0-9\\-]+: ?"
   Header = ""

   if (Debug) print "DEBUG: Debug mode enabled"
}

function unfold(Value,WS)
{
   if ( WS == 0 )
   {
      # only remove newlines - as per RFC
      # gsub() has no backreferences, so use gawk's gensub() with "\\1"
      Value = gensub(/\r?\n([ \t])/,"\\1","g",Value)
   }
   else
   {
      # replace extra whitespace with single space
      gsub(/\r?\n[ \t]+/," ",Value)
   }

   return Value
}

Header != "" && Debug {
   print "DEBUG: header name ["Header"] value ["$0"]"
}

Header == "From" {
   print Header
   print $0
   print ""
}

Header == "Received" {
   LastReceived = unfold($0,1)
}
END {
   print "Received"
   print LastReceived
   print ""
}

Header == "X-Spam-Status" {
   print "Spam Test Scores"
   Value = unfold($0,1)
   if (Debug) print "DEBUG:" Value

   match(Value,/tests=\[([^\]]+)/,Matched)
   Count = split(Matched[1],Scores,/, /)
   # index loop keeps header order ("for ... in" order is unspecified)
   for (I = 1; I <= Count; I++)
      print Scores[I]
   print ""
}


# the following rule must always be executed last
# (so if "next" is used, this must also go before it)
{ Header=RT; gsub("[\r\n: ]","",Header) }
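The script above relies on GNU Awk features (a regex RS and the RT variable). Where only a POSIX awk is available, the unfolding step alone might be sketched as below; this is an assumption-laden illustration with invented sample data, not a replacement for the script, and it collapses the folding whitespace to a single space.

```shell
#!/bin/sh
# Sketch: unfold continuation lines with POSIX awk only (no RT, no
# regex RS), printing one "Name: value" line per header.
unfolded=$(printf '%s\n' \
    'X-Spam-Status: No, score=3.698' \
    '        tests=[A=0.1, B=-0.1]' \
    'From: someone@example.org' \
    '' \
    'body text' |
awk '
/^$/     { exit }                           # blank line ends the headers
/^[ \t]/ { sub(/^[ \t]+/, " "); hdr = hdr $0; next }   # continuation
         { if (hdr != "") print hdr; hdr = $0 }        # new header starts
END      { if (hdr != "") print hdr }')
printf '%s\n' "$unfolded"
```

Once each header sits on one line, simple per-line rules such as /^From:/ can be applied to the result.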
 
  

