Trying to understand how awk uses two input files
Hi,
I am trying to understand the logic behind how awk works with two input files, and I cannot find a good/simple explanation. I created a simple example with two input files and an expected output.
Input file: filex
A,bb,111,xxx,nnn
A,cc,112,yyy,nnn
A,dd,113,zzz,ppp
Input file: filez
111,111
112,114
113,113
Output file: filey
A,bb,111,xxx,nnn
A,cc,114,yyy,nnn
A,dd,113,zzz,ppp
I wanted to check the filex value in column 3 against filez column 1 and, where column 1 was different, put the value from column 2 of filez into column 3 of filey.
This is the command I was playing with to try to understand how it worked, but I just kept getting back one or the other of my input files in the output file, filey:
awk 'BEGIN{OFS=FS=","}NR==FNR{a[$3]=$3;next}split(a[$1$2],b){$3=b[2]}1' filex filez > filey
I would be grateful for an explanation of how this works/what it is doing to “read” the two files. I have found some complex examples online, but none that really say what this is doing. I understand the BEGIN and OFS, FS, NR, FNR, and I know what next does.
Appreciate your time to help a newbie up the learning curve.
bop-a-nator |
BEGIN {OFS=FS=","} runs once, before any input is read, and sets both FS and OFS to a comma.
NR==FNR is a condition. It is true only if NR (the total number of input records seen so far) is equal to FNR (the record number within the current input file). That is true for every line of the first file and false for every line of the second (as long as the first file is not empty).
{ a[$3]=$3; next } is a block, executed when NR==FNR (so for the first file). It stores $3 in the array a, indexed by $3, and jumps to the next line, so the remaining parts of the script are not executed for the first input file, only for the second.
split(a[$1$2],b) is a condition again. $1 and $2 are concatenated, and a[$1$2] is split on FS, with the pieces stored in array b. split returns the number of pieces, so the condition is true only if a[$1$2] holds a non-empty string; if there is no such element, the result is 0 (false).
{ $3=b[2] } is a block again, executed when the condition above is true; it overwrites $3.
1 is a condition again, and it is always true. It is followed by an empty block, and the default action for a condition without a block is to print the current line.
See man awk for further details. |
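A small sketch of the NR/FNR behaviour described above, assuming two tiny sample files created in a scratch directory (the `printf` setup and the `nr_fnr` variable are only there for illustration):

```shell
# Work in a scratch directory so the sample files do not clobber anything.
cd "$(mktemp -d)" || exit 1

printf 'A,bb,111,xxx,nnn\nA,cc,112,yyy,nnn\n' > filex
printf '111,111\n112,114\n' > filez

# For every record of both files, print the file name, NR, FNR and
# whether NR == FNR holds (it only holds while reading the first file).
nr_fnr=$(awk '{ print FILENAME, NR, FNR, (NR == FNR ? "first" : "later") }' filex filez)
printf '%s\n' "$nr_fnr"
```

With these two-line files, the filex records show NR and FNR as 1 1 and 2 2, while the filez records show 3 1 and 4 2, which is why NR==FNR selects only the first file named on the command line.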
And in case you are not sure, it is pan64's explanation of "split" that you need to look into first :)
|
Thanks for breaking down the statement, that cleared a few things up.
Let me walk through how I “think” it works based on what you told me, and maybe it will be clearer what I am not understanding.
Input file: filex
A,cc,112,yyy,nnn
Input file: filez
112,114
Expected/wanted output file: filey
A,cc,114,yyy,nnn
awk 'BEGIN{OFS=FS=","}NR==FNR{a[$3]=$3;next}split(a[$1$2],b){$3=b[2]}1' filex filez > filey
a[$3]=$3: this contains the value from filex: 112
split(a[$1$2],b): based on the split, b contains the value from filez: 114
Then $3=b[2] puts 114 into $3
So I don’t understand why, when I run the awk command above, I get this in filey:
112,114
instead of what I think I should get:
A,cc,114,yyy,nnn
I am not sure if I am not following what this is doing, or if I am missing some piece of code or not defining something correctly.
Thanks again, bop-a-nator |
Quote:
Your first assumption is correct: if filex contains a single line, the array will be set like this:
Code:
a[112] = 112
Then, once awk finishes processing filex and moves on to filez, $1 = 112 and $2 = 114, so the split looks like this:
Code:
split(a[112114], b)
There is no element a[112114], so split has nothing to split and returns 0, the block that assigns $3 never runs, and the trailing 1 prints the filez line unchanged. Hope that helps clear up why you are not getting the expected results.
(*) - try reading filez first and using something like:
Code:
NR == FNR{a[$1] = $2;next} |
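To see the failed lookup directly, here is a rough trace, assuming the single-line sample files from the post above. The variable names `key`, `known`, `n` and `trace` are made up for this sketch; the first block is the one from the original command.

```shell
# Work in a scratch directory with the single-line sample files.
cd "$(mktemp -d)" || exit 1
printf 'A,cc,112,yyy,nnn\n' > filex
printf '112,114\n' > filez

# For the filez record, report the key that is looked up, whether it was
# an index of a beforehand, and what split() returns.
trace=$(awk 'BEGIN { OFS = FS = "," }
NR == FNR { a[$3] = $3; next }
{
    key = $1 $2
    known = (key in a) ? "yes" : "no"   # tested before a[key] is referenced
    n = split(a[key], b)
    print "key=" key, "in_a=" known, "split=" n
}' filex filez)
printf '%s\n' "$trace"
```

This prints `key=112114,in_a=no,split=0`: the only index stored from filex is 112, so the concatenated key never matches and the condition is false.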
I can see what the split does:
awk 'BEGIN{OFS=FS=","}NR==FNR{split(a[$1$2],b)}{print $0}' filez
112,114
awk 'BEGIN{OFS=FS=","}NR==FNR{split(a[$1$2],b)}{print $2}' filez
114
My thought on how this works is that before the next I am coming up with something to match on between the two files, right? So with what you explained, that would give me this part of the statement:
awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next}
This seems like it would set a[$1] = 112 from filez. So that is the field/value that I am matching on; I follow that.
So in the block following, I am working on filex. I think this is where I am getting confused about how it works. My thought was that the next piece pulls what matches between the two input files, filex b[$3] (112) to filez $3 (112):
{b[$3]=$3}
Then the next piece reassigns what goes into $3:
{$3=a[2]}
which is 114.
awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next}{b[$3]=$3}{$3=a[2]}1' filez filex > filey
Then I got this:
cat filey
A,cc,,yyy,nnn
This seems like it should not be difficult, though something about it I am simply not "getting" |
You are correct it is not difficult, but we all start somewhere :)
Awk uses the keyword 'in' to let you test whether an index exists in an array, e.g.:
Code:
5 in array{...}
Also, just to backtrack on one of your assumptions: with NR == FNR{a[$1] = $2;next} reading filez, 112 is the index and 114 is the stored value:
Code:
a[112] = 114
Again I will leave you to it, but with the hint of using the 'in' keyword on array 'a' and checking the index against part of filex |
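A minimal illustration of the 'in' test, assuming the three-line filez from the first post. The index 999 and the `in_demo` variable are made up for the example.

```shell
# Work in a scratch directory with the sample filez.
cd "$(mktemp -d)" || exit 1
printf '111,111\n112,114\n113,113\n' > filez

# Load filez into a (index = column 1, value = column 2), then test two
# indexes in the END block. "in" only checks whether the index exists.
in_demo=$(awk -F, '{ a[$1] = $2 }
END {
    if (112 in a) print "112 maps to " a[112]
    if (!(999 in a)) print "999 is not an index"
}' filez)
printf '%s\n' "$in_demo"
```

Unlike referencing a[999] directly, the test `999 in a` does not create an empty element as a side effect.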
Quote:
Code:
$ awk 'BEGIN { |
I think I got it:
input file: filez
111,111
112,114
113,113
input file: filex
A,bb,111,xxx,nnn
A,cc,112,yyy,nnn
A,dd,113,zzz,ppp
awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next} $3 in a {print $1,$2,a[$3],$4,$5}' filez filex > filey
output file: filey
A,bb,111,xxx,nnn
A,cc,114,yyy,nnn
A,dd,113,zzz,ppp
grail, thank you! Your hints sent me looking and researching to learn more!
I will share that a co-worker looked at my data files and said "that's in dos format" and had me run dos2unix on the files, and that solved some of the problem I was having too.
Thanks again, I certainly see how this works on the command line. Now to figure out how to write it into a script!
bop-a-nator
I will mark this as solved! |
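One possible way to move the command into a script, as a sketch only: the awk program is saved in a file and run with `awk -f`. The file name `merge.awk` is hypothetical, and the `sub(/\r$/, "")` line is an assumption added here so the program tolerates DOS line endings even if dos2unix has not been run; it is not part of the command above.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)" || exit 1

# Sample data with DOS (CRLF) line endings, to mimic files in dos format.
printf '111,111\r\n112,114\r\n113,113\r\n' > filez
printf 'A,bb,111,xxx,nnn\r\nA,cc,112,yyy,nnn\r\nA,dd,113,zzz,ppp\r\n' > filex

# merge.awk is a hypothetical file name for the awk program.
cat > merge.awk <<'EOF'
BEGIN { OFS = FS = "," }
{ sub(/\r$/, "") }                 # drop a trailing carriage return, if any
NR == FNR { a[$1] = $2; next }     # first file (filez): column 1 -> column 2
$3 in a { print $1, $2, a[$3], $4, $5 }
EOF

awk -f merge.awk filez filex > filey
cat filey
```

The file order on the command line still matters: filez has to come first so that the array is filled before any filex line is checked.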
Here is an alternative for you to consider:
Code:
awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next}$3 in a && $3 = a[$3]' filez filex > filey |
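Both this alternative and the earlier solution print only the filex lines whose third field is an index of a. If unmatched lines should be kept instead, one possible variant is sketched below. The extra `A,ee,999,qqq,rrr` row is made up to show the difference, the variable names are illustrative, and the assignment is parenthesised here for readability.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '111,111\n112,114\n113,113\n' > filez
# The last row is a made-up extra record whose third field is not in filez.
printf 'A,bb,111,xxx,nnn\nA,cc,112,yyy,nnn\nA,dd,113,zzz,ppp\nA,ee,999,qqq,rrr\n' > filex

# Pattern-only form: a record is printed only when its third field is an
# index of a and the value assigned to it is non-empty and non-zero.
only_matched=$(awk 'BEGIN { OFS = FS = "," }
NR == FNR { a[$1] = $2; next }
$3 in a && ($3 = a[$3])' filez filex)

# Variant: replace the third field when there is a mapping, then print
# every record, matched or not.
keep_all=$(awk 'BEGIN { OFS = FS = "," }
NR == FNR { a[$1] = $2; next }
$3 in a { $3 = a[$3] }
1' filez filex)

printf 'pattern-only form:\n%s\n' "$only_matched"
printf 'keep-all variant:\n%s\n' "$keep_all"
```

The pattern-only form drops the 999 row, while the variant prints it unchanged after the three updated rows. Note that the pattern-only form would also drop a matched line if the replacement value in filez were empty or 0, because the assigned value is what the pattern tests.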