Need help with Awk code explanation

noob555 · 04-22-2022, 05:59 PM

Hello,

I found this awk code on a forum, but the author offered no explanation of how it works. The code prints lines from b.txt that match lines in a.txt. For example:

a.txt
one two three
cat dog bird
a b c

b.txt
one two three
cat dog bird
c b a
123 456 789

Executing the script, I get:

./awktest
one two three
cat dog bird
c b a

Code:

#!/bin/bash

awk '
NR==FNR         {a[$0]
                 next
                }
                {for (i in a)   {m=split (i, b, " ")
                                 for (j=1; ($0 ~ b[j]) && j<=m; j++);
                                 if (j > m)     {print
                                                 next
                                                }
                                }
                }
' a.txt b.txt

What do the lines in red mean? More importantly, what does the purple line do? I appreciate any feedback. Thanks.

astrogeek · 04-22-2022, 07:06 PM

Welcome to LQ and the Programming forum!

The first thing I would ask is what are you expecting it to do?

To answer your specific questions, the lines in red test each line in b.txt to see if it contains each word of some line in a.txt (all of which have been entered into an array named a[]).

The part in purple:

Code:

($0 ~ b[j]) && j<=m;

... is the loop test which returns true (1) if both conditions are true:

Code:

($0 ~ b[j]) Tests whether b[j] (i.e. some word from a line in a[])
            matches anywhere in the current line from b.txt (i.e. $0)
j<=m        Tests that j has not run past the last word; m is the number
            of words that split() found in the line from a.txt
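
A minimal sketch of that loop test in isolation may help. It hard-codes the word list "a b c" from a.txt and feeds three sample lines on standard input; the line "a b x", the verdict variable and the printed labels are illustrative assumptions, not part of the original script.

```shell
#!/bin/sh
# Sketch: the empty-bodied "for" loop does all of its work in the loop test.
# After the loop, j > m only if every word matched the current line.
out=$(printf '%s\n' 'c b a' 'a b x' '123 456 789' |
awk '
{
    m = split("a b c", b, " ")     # m = 3 words: b[1]="a", b[2]="b", b[3]="c"
    for (j = 1; ($0 ~ b[j]) && j <= m; j++);   # empty body: j advances while words match
    verdict = (j > m) ? "all words found" : "stopped early"
    printf "%s: j=%d m=%d -> %s\n", $0, j, m, verdict
}')
printf '%s\n' "$out"
```

On the last pass j is m+1, b[j] is an empty string that matches anything, and the j<=m part is what ends the loop.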

Skaperen · 04-22-2022, 07:54 PM

hmmm. i see no purple. the code that @astrogeek quoted shows up as very slightly brighter red.

noob555 · 04-22-2022, 10:00 PM

@astrogeek

Hi, astrogeek. Thanks for the warm welcome.

I understand it a little better now, but I need to do more research on it.

Which awk search terms should I use to understand the author's code better?

Thanks in advance.

syg00 · 04-22-2022, 10:55 PM

Not all awks are created equal - I prefer GNU awk (gawk) for the extensions it implements. This may lead to writing non-POSIX code - but for me that's not a concern. I find the gawk guide, "Effective AWK Programming", very handy - download it here.

Very expansive - not just a reference manual for the language.

grail · 04-22-2022, 11:31 PM

The (g)awk specific items to look up would be:

NR
FNR
next
split

for loops and ifs behave the same as in most other languages
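
As a rough illustration of those four items, here is a small sketch that prints NR and FNR while reading two files. The temporary file contents, the array name w and the output labels are made up for the example, and it assumes the first file is not empty.

```shell
#!/bin/sh
# Sketch: how NR, FNR, next and split() behave with two input files.
tmp=$(mktemp -d)
printf '%s\n' 'one two three' 'a b c' > "$tmp/a.txt"
printf '%s\n' 'c b a' > "$tmp/b.txt"

out=$(awk '
NR == FNR {                      # true only while the first file is being read
    print "file 1: NR=" NR " FNR=" FNR
    next                         # skip the rest of the program for this line
}
{                                # reached only for lines of the second file
    n = split($0, w, " ")        # w[1..n] hold the words, n is the word count
    print "file 2: NR=" NR " FNR=" FNR " words=" n " last=" w[n]
}' "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$out"
rm -rf "$tmp"
```

NR keeps counting across files while FNR restarts at 1 for each file, which is why NR==FNR singles out the first file.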

astrogeek · 04-23-2022, 12:38 AM

You are welcome!

As already noted by others, terms for further search would include NR, FNR, next and split. You should add regular expressions and awk operators to that list.

The GNU gawk programming guide linked above is an excellent and authoritative reference!

Although not the best tutorial to start with, once you understand the basics of how awk iterates over a file, the man and info pages, man awk and info awk, provide a very complete quick reference.

For example, to understand the ~ operator (regular expression match), in man awk you will find this:

Code:

   Operators
       The operators in AWK, in order of decreasing precedence, are:

       ...

       ~ !~        Regular expression match, negated match.  NOTE: Do not use a constant regular
                   expression (/foo/) on the left-hand side of a ~ or !~.  Only use one on the
                   right-hand side.  The expression /foo/ ~ exp has the same meaning as
                   (($0 ~ /foo/) ~ exp).  This is usually not what you want.
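
To try the ~ operator by hand, something like the following sketch can be run as-is. The test strings are chosen only for illustration; each print shows 1 for true and 0 for false.

```shell
#!/bin/sh
# Sketch: the right-hand side of ~ is treated as a regular expression
# and may match anywhere in the left-hand string.
out=$(awk 'BEGIN {
    print ("c bx a" ~ "b")        # 1: "b" is found inside "bx"
    print ("c bx a" == "b")       # 0: == compares the whole string
    print ("abc" ~ "a.c")         # 1: "." matches any single character
    print ("abc" ~ "^b")          # 0: "^" anchors the match at the start
    print ("abc" !~ "^b")         # 1: !~ is the negated match
}')
printf '%s\n' "$out"
```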

Hope that helps!

MadeInGermany · 04-24-2022, 08:06 AM

awk loops over the lines of the given input files; the awk code runs on each input line.

NR==FNR
    true while the first file (a.txt) is being read
a[$0]
    stores the input line (from a.txt) in array a as an index (no value)
next
    skips the following code (continues with the next input cycle)

The following code runs for each line from the remaining file (b.txt):

for (i in a)
    loops through the indexes of array a
    each i is one line from a.txt
m=split (i, b, " ")
    splits string i into elements that are stored in array b (as values)
    m is the number of array members
for (j=1; ($0 ~ b[j]) && j<=m; j++)
    loops through array b, but stops early if the first condition is not met
    j runs from 1 to m (3 in the example); after a full pass j is m+1
    b[j] is the value
($0 ~ b[j])
    true if the value is matched somewhere in the current line
if (j > m)
    true if every item matched
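
The "index with no value" idea can be tried on its own. This is only a sketch: the array name seen, the sample lines and the messages are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: using array indexes as a set of lines, as a[$0] does in the script.
out=$(printf '%s\n' 'a b c' 'a b c' 'cat dog bird' |
awk '
{ seen[$0] }                       # naming an element creates it; no value is assigned
END {
    n = 0
    for (line in seen)             # visits each stored index once, in unspecified order
        n++
    print n " distinct lines"
    if ("a b c" in seen)           # membership test on the index
        print "a b c was seen"
}')
printf '%s\n' "$out"
```

Because the lines are stored as indexes, a repeated line in a.txt is kept only once.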

Note that the ~ operator matches anywhere in the line and treats b[j] as a regular expression. For example,
"b" matches in "c bx a".
If you want an exact field match, add another loop that cycles through the current input fields $1 ... $NF and compares each one with the == operator.
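
One possible way to write that exact-match variant is sketched below. It is not the original author's code: the variables matched, found and k are introduced here, and the line "c bx a" was added to b.txt to show the difference.

```shell
#!/bin/sh
# Sketch: compare whole fields with == instead of matching with ~.
tmp=$(mktemp -d)
printf '%s\n' 'one two three' 'cat dog bird' 'a b c' > "$tmp/a.txt"
printf '%s\n' 'one two three' 'c bx a' 'c b a' '123 456 789' > "$tmp/b.txt"

out=$(awk '
NR == FNR { a[$0]; next }               # remember every line of the first file
{
    for (i in a) {
        m = split(i, b, " ")
        matched = 0
        for (j = 1; j <= m; j++) {      # every word of the remembered line ...
            found = 0
            for (k = 1; k <= NF; k++)   # ... must equal some whole field of $0
                if ($k == b[j]) { found = 1; break }
            if (!found) break           # one missing word rules this line out
            matched++
        }
        if (matched == m) { print; next }
    }
}' "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$out"
rm -rf "$tmp"
```

With the ~ version, "c bx a" would be printed because "b" matches inside "bx"; with == it is skipped and only "one two three" and "c b a" come out.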

noob555 · 04-24-2022, 10:50 PM

Thank you all. It's much clearer now. Everyone who replied is awesome.