LinuxQuestions.org


bop-a-nator 01-06-2013 09:50 PM

Trying to understand how awk uses two input files
 
Hi,

I am trying to understand the logic behind how using two input files for awk works, and I cannot find a good/simple explanation.

I created a simple example with two input files and an expected output.

Input file: filex
A,bb,111,xxx,nnn
A,cc,112,yyy,nnn
A,dd,113,zzz,ppp

Input file: filez
111,111
112,114
113,113

Output file: filey
A,bb,111,xxx,nnn
A,cc,114,yyy,nnn
A,dd,113,zzz,ppp

I wanted to look up the value in column 3 of filex in column 1 of filez and, where column 2 of filez holds a different value, put that column 2 value into column 3 of the output file, filey.

This is the command I was playing with to try to understand how it worked, but I just kept getting back one or the other of my input files in the output file, filey:

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$3]=$3;next}split(a[$1$2],b){$3=b[2]}1' filex filez > filey

I would be grateful for an explanation of how this works/what it is doing to “read” the two files. I have found some complex examples online but none that really say what this is doing. I understand the BEGIN and OFS, FS, NR, FNR, and I know what next does.

Appreciate your time to help a newbie up the learning curve.
bop-a-nator

pan64 01-07-2013 05:01 AM

BEGIN {OFS=FS=","} will set OFS and FS without reading input files
NR==FNR is a condition that is true only while NR (the total number of input records seen so far) equals FNR (the record number within the current input file). It is therefore true for every line of the first file and false for every line of the second.
{ a[$3]=$3; next } is a block, will be executed when NR==FNR (so for the first file), the meaning: store $3 in the array a and jump to the next line, therefore the following parts of the script will not be executed for the first input file but for the second.
split(a[$1$2],b) is a condition again: $1 and $2 are concatenated, a[$1$2] is split on FS, and the pieces are stored in array b. split returns the number of pieces, so the condition is true only if a[$1$2] exists and is not empty.
{ $3=b[2] } is a block again executed when the condition above is true, and will overwrite $3.
1 is a condition again, and it is always true.
<empty block> is the (missing) block that follows that condition. When a condition has no block, awk runs the default action, which is a plain print.

see man awk for further details.
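
A minimal sketch of the NR/FNR behaviour described above, assuming two small throwaway files in a temporary directory (the directory, the shortened file contents and the variable name nr_fnr are illustrative, not from the thread). It only prints the counters, so you can see where NR==FNR stops being true:

```shell
#!/bin/sh
# Sketch: show how NR and FNR advance when awk reads two files in sequence.
# The temporary directory and the shortened file contents are illustrative only.
tmp=$(mktemp -d)
printf 'A,cc,112,yyy,nnn\nA,dd,113,zzz,ppp\n' > "$tmp/filex"
printf '112,114\n113,113\n' > "$tmp/filez"

# NR counts records across all files; FNR restarts at 1 for each new file.
nr_fnr=$(cd "$tmp" && awk '{
    print FILENAME, "NR=" NR, "FNR=" FNR, (NR == FNR ? "first" : "later")
}' filex filez)
printf '%s\n' "$nr_fnr"
rm -rf "$tmp"
```

The lines from filex are tagged "first" and the lines from filez are tagged "later", which is why a block guarded by NR==FNR and ending in next only ever sees the first file named on the command line.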

grail 01-07-2013 07:22 AM

And in case you are not sure, it is pan64's explanation of "split" that you need to look into first :)

bop-a-nator 01-07-2013 10:42 AM

Thanks for breaking down the statement, that cleared a few things up.

Let me walk through how I “think” it works based on what you told me and maybe it will be more clear what I am not understanding.

Input file: filex
A,cc,112,yyy,nnn

Input file: filez
112,114

Expected/wanted output file: filey
A,cc,114,yyy,nnn

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$3]=$3;next}split(a[$1$2],b){$3=b[2]}1' filex filez > filey

a[$3]=$3 this contains the value from filex: 112

split(a[$1$2],b) based on the split b contains value from filez: 114

Then $3=b[2] puts 114 into $3

So I don’t understand why when I run the awk command above I get this in filey:
112,114

Instead of what I think I should get,
A,cc,114,yyy,nnn

I am not sure whether I am not following what this is doing, or whether I am missing some piece of code or not defining something correctly.

Thanks again,
bop-a-nator

grail 01-07-2013 12:06 PM

Quote:

split(a[$1$2],b) based on the split b contains value from filez: 114
No, this is the incorrect assumption (which I was cryptically trying to point out before).

Your first assumption is correct that if our filex contains a single line then the array being set will look like:
Code:

a[112] = 112
You may also notice that this line doesn't really help us much, but I will come back to that(*).

Then once you finish processing filex and move to filez, the split will look like:
Code:

split(a[112114], b)
You can probably see from this that the 'a' array has no such index. Also, what you thought was happening would be moot anyway, as FS is set to a comma, hence $1 = 112 and $2 = 114.

Hope that helps clear up why you are not getting the expected results.

(*) - try reading the filez first and using something like:
Code:

NR == FNR{a[$1] = $2;next}
I will let you write the rest ;)
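
The failed lookup described above can be reproduced in isolation. This is only a sketch: the array is filled by hand in BEGIN to mimic what the filex pass stored, a single filez line is fed on stdin, and the names key, n and lookup are made up for the illustration.

```shell
#!/bin/sh
# Sketch: why split(a[$1$2], b) finds nothing for the filez line 112,114.
lookup=$(printf '112,114\n' | awk -F, '
BEGIN { a["112"] = "112" }          # what the filex pass stored: a[$3] = $3
{
    key = $1 $2                     # concatenation, so key is "112114"
    n = split(a[key], b)            # a["112114"] is empty, so n is 0
    print "key=" key, "pieces=" n
}')
printf '%s\n' "$lookup"
```

Because split returns 0 here, the condition is false, the { $3=b[2] } block never runs, and the trailing 1 simply prints the filez line unchanged.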

bop-a-nator 01-07-2013 02:35 PM

I can see what the split does:

awk 'BEGIN{OFS=FS=","}NR==FNR{split(a[$1$2],b)}{print $0}' filez
112,114


awk 'BEGIN{OFS=FS=","}NR==FNR{split(a[$1$2],b)}{print $2}' filez
114


My thought on how this works is that, before the next, I am building something to match on between the two files, right?


So with what you explained that would give me this part of the statement:

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next}

This seems like it would set a[$1] = 112 from filez

So that is the field/value on what I am matching on, I follow that.

So in the block following I am working on filex - I think this is where I am getting confused about how it works.

My thought was that the next piece pulls what matches between the two input files, filex b[$3] (112) to filez $3 (112): {b[$3]=$3}

Then the next piece reassigns what goes into variable $3: {$3=a[2]} which is 114

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next}{b[$3]=$3}{$3=a[2]}1' filez filex > filey

then I got this:

cat filey
A,cc,,yyy,nnn

This seems like it should not be difficult, though something about it I am simply not "getting"

grail 01-08-2013 01:11 AM

You are correct it is not difficult, but we all start somewhere :)

Awk uses the keyword 'in' to let you test the indexes of an array, e.g.:
Code:

5 in array{...}
This will perform the task in the braces should the number 5 be an index of the array.

Also, just to backtrack on one of your assumptions:
Quote:

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next}

This seems like it would set a[$1] = 112 from filez
What it will actually set in array 'a' is:
Code:

a[112] = 114
So here we have $1 = 112 which is the index of the array and $2 = 114 which is the value stored at this index.

Again I will leave you to it, but with the hint of using the 'in' keyword on array 'a' and checking the index against part of filex.
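
To see the index/value distinction for yourself, here is a small sketch that loads one filez line and then asks awk what is in the array. The variable name table is made up for the illustration; the assignment is the one quoted above.

```shell
#!/bin/sh
# Sketch: inspect array a after a[$1] = $2 has run on the line 112,114.
table=$(printf '112,114\n' | awk -F, '
{ a[$1] = $2 }                      # index is 112, stored value is 114
END {
    print "112 in a:", ((112 in a) ? "yes, value " a[112] : "no")
    print "2 in a:", ((2 in a) ? "yes" : "no")
}')
printf '%s\n' "$table"
```

There is no element with index 2, which is why assigning a[2] to $3 in the earlier attempt produced an empty third field.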

rknichols 01-08-2013 10:22 AM

Quote:

Originally Posted by grail (Post 4865192)
Awk uses the keyword 'in' to let you test the indexes of an array, e.g.:
Code:

5 in array{...}
This will perform the task in the braces should the number 5 be an index of the array.

You have to be a bit careful when using the "in" keyword. Simply referencing a variable or array element is sufficient to create it, so:
Code:

$ awk 'BEGIN {
    if(112 in aa) print "oops"; else print "not there yet"
    if(aa[112] == 114) print "is equal"; else print "not equal"
    if(112 in aa) print "now it is there, aa[112]=\"" aa[112] "\""
    exit 0
}'
not there yet
not equal
now it is there, aa[112]=""

Just testing the value of element "112" in the array caused that element to be created with a null value.

bop-a-nator 01-09-2013 01:13 PM

I think I got it:

input filez
111,111
112,114
113,113

input file filex
A,bb,111,xxx,nnn
A,cc,112,yyy,nnn
A,dd,113,zzz,ppp

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next} $3 in a {print $1,$2,a[$3],$4,$5}' filez filex > filey

output file: filey
A,bb,111,xxx,nnn
A,cc,114,yyy,nnn
A,dd,113,zzz,ppp


grail Thank You! Your hints sent me looking and researching to learn more!

I will share that a co-worker looked at my data files and said "that's in dos format" and had me run a dos2unix on the files and that solved some of the problem I was having too.
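
For anyone who cannot run dos2unix, the same cleanup can be sketched inside awk itself. In DOS-format files every line ends in carriage return plus newline, so the last field of each record carries an invisible carriage return; with filez that means the replacement value would be 114 followed by a carriage return. The example below fakes such files in a temporary directory (the directory and the variable name fixed are illustrative) and strips the carriage return before any field is used:

```shell
#!/bin/sh
# Sketch: cope with DOS line endings without dos2unix.
tmp=$(mktemp -d)
printf '112,114\r\n' > "$tmp/filez"            # simulated DOS-format input
printf 'A,cc,112,yyy,nnn\r\n' > "$tmp/filex"

fixed=$(awk 'BEGIN { OFS = FS = "," }
{ sub(/\r$/, "") }                  # drop a trailing carriage return, if any
NR == FNR { a[$1] = $2; next }      # first file: build the lookup table
$3 in a { $3 = a[$3] }              # second file: swap in the mapped value
1' "$tmp/filez" "$tmp/filex")
printf '%s\n' "$fixed"
rm -rf "$tmp"
```

The sub call modifies $0, so awk re-splits the fields afterwards and the rules that follow see clean values.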

Thanks again, I certainly see how this works on the command line. Now to figure out how to write it into a script!
bop-a-nator
I will mark this as solved!
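
On writing it into a script: one common approach is to put the awk program in its own file and run it with awk -f. The sketch below assumes a hypothetical file name, remap.awk, and rebuilds the sample data in a temporary directory so it can be run as is; the awk rules are the ones from the command above.

```shell
#!/bin/sh
# Sketch: the same program kept in a file and run with awk -f.
tmp=$(mktemp -d)
printf '111,111\n112,114\n113,113\n' > "$tmp/filez"
printf 'A,bb,111,xxx,nnn\nA,cc,112,yyy,nnn\nA,dd,113,zzz,ppp\n' > "$tmp/filex"

# remap.awk is a made-up name; any file name works with -f.
cat > "$tmp/remap.awk" <<'EOF'
BEGIN { OFS = FS = "," }
NR == FNR { a[$1] = $2; next }              # filez: remember old -> new
$3 in a { print $1, $2, a[$3], $4, $5 }     # filex: print with the new value
EOF

awk -f "$tmp/remap.awk" "$tmp/filez" "$tmp/filex" > "$tmp/filey"
scripted=$(cat "$tmp/filey")
printf '%s\n' "$scripted"
rm -rf "$tmp"
```

Keeping the program in a file also avoids the shell quoting problems that long one-liners tend to run into.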

grail 01-10-2013 01:52 AM

Here is an alternative for you to consider:
Code:

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next}$3 in a && $3 = a[$3]' filez filex > filey
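
One thing to keep in mind with both versions above: a line of filex whose third column has no entry in filez is not printed at all. If such lines should pass through unchanged, a possible variation (a sketch, not something posted in the thread) is to rewrite $3 only when a mapping exists and then print every line. The 999 row below is made-up test data and kept is an illustrative variable name.

```shell
#!/bin/sh
# Sketch: replace column 3 when a mapping exists, otherwise keep the line as is.
tmp=$(mktemp -d)
printf '112,114\n' > "$tmp/filez"
printf 'A,cc,112,yyy,nnn\nA,ee,999,www,qqq\n' > "$tmp/filex"   # 999 has no mapping

kept=$(awk 'BEGIN { OFS = FS = "," }
NR == FNR { a[$1] = $2; next }
$3 in a { $3 = a[$3] }              # rewrite only when a mapping exists
1' "$tmp/filez" "$tmp/filex")
printf '%s\n' "$kept"
rm -rf "$tmp"
```

Which behaviour is right depends on whether filez is meant to list every value that can appear in filex.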


All times are GMT -5.