LinuxQuestions.org
Old 01-06-2013, 09:50 PM   #1
bop-a-nator
LQ Newbie
 
Registered: Sep 2012
Location: North East USA
Distribution: at work: Red Hat Enterprise Linux Server release 5.8 (Tikanga); at home: what do you recommend?
Posts: 24

Rep: Reputation: Disabled
Trying to understand how awk uses two input files


Hi,

I am trying to understand the logic behind how using two input files for awk works, and I cannot find a good/simple explanation.

I created a simple example with two input files and an expected output.

Input file: filex
A,bb,111,xxx,nnn
A,cc,112,yyy,nnn
A,dd,113,zzz,ppp

Input file: filez
111,111
112,114
113,113

Output file: filey
A,bb,111,xxx,nnn
A,cc,114,yyy,nnn
A,dd,113,zzz,ppp

I want to look up the column 3 value of filex in column 1 of filez and, where there is a match, put the column 2 value from filez into column 3 of filey.

This is the command I was playing with to try to understand how it worked, but I just kept getting back one or the other of my input files in the output file: filey

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$3]=$3;next}split(a[$1$2],b){$3=b[2]}1' filex filez > filey

I would be grateful for an explanation of how this works/what it is doing to “read” the two files. I have found some complex examples online but none that really say what this is doing. I understand the BEGIN and OFS, FS, NR, FNR, and I know what next does.

Appreciate your time to help a newbie up the learning curve.
bop-a-nator
 
Old 01-07-2013, 05:01 AM   #2
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 9,345

Rep: Reputation: 2747
BEGIN {OFS=FS=","} sets OFS and FS before any input is read.
NR==FNR is a condition. It is true only while NR (the total number of input records seen so far) equals FNR (the record number within the current input file), so it is true for every line of the first file and false for the second (provided the first file is not empty).
{ a[$3]=$3; next } is a block executed when NR==FNR, that is, for the first file. It stores $3 in the array a under the index $3 and jumps to the next line, so the rest of the script is skipped for the first input file and runs only for the second.
split(a[$1$2],b) is a condition again. $1 and $2 are concatenated into one index, and a[$1$2] is split into the array b, using FS as the separator because no third argument is given. split returns the number of pieces, so the condition is true only if a[$1$2] holds a non-empty string.
{ $3=b[2] } is a block again, executed when the condition above is true; it overwrites $3.
1 is a condition again; it is always true.
<empty block> is the block that follows that condition. A condition with no block runs the default action, which prints the current line.

see man awk for further details.
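To watch NR and FNR side by side, here is a minimal sketch. It assumes the filex and filez contents from the first post; the scratch directory and the shell variable name counters are only there for illustration.

```shell
# Recreate the two sample files from the first post in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'A,bb,111,xxx,nnn' 'A,cc,112,yyy,nnn' 'A,dd,113,zzz,ppp' > filex
printf '%s\n' '111,111' '112,114' '113,113' > filez

# Print the file name and both counters for every record.
# NR keeps counting across files; FNR restarts at 1 for each new file.
counters=$(awk '{ print FILENAME, "NR=" NR, "FNR=" FNR, (NR == FNR ? "first file" : "later file") }' filex filez)
printf '%s\n' "$counters"
```

The fourth line of output is the first record of filez, where NR is 4 but FNR is back to 1, so NR==FNR stops being true from that point on.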
 
Old 01-07-2013, 07:22 AM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,505

Rep: Reputation: 2890
And in case you are not sure, it is pan64's explanation of "split" that you need to look into first
 
Old 01-07-2013, 10:42 AM   #4
bop-a-nator
LQ Newbie
 
Registered: Sep 2012
Location: North East USA
Distribution: at work: Red Hat Enterprise Linux Server release 5.8 (Tikanga); at home: what do you recommend?
Posts: 24

Original Poster
Rep: Reputation: Disabled
Thanks for breaking down the statement, that cleared a few things up.

Let me walk through how I “think” it works based on what you told me and maybe it will be more clear what I am not understanding.

Input file: filex
A,cc,112,yyy,nnn

Input file: filez
112,114

Expected/wanted output file: filey
A,cc,114,yyy,nnn

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$3]=$3;next}split(a[$1$2],b){$3=b[2]}1' filex filez > filey

a[$3]=$3 this contains the value from filex: 112

split(a[$1$2],b) based on the split b contains value from filez: 114

Then $3=b[2] puts 114 into $3

So I don’t understand why when I run the awk command above I get this in filey:
112,114

Instead of what I think I should get,
A,cc,114,yyy,nnn

I am not sure if I am not following what this is doing, or if I am missing some piece of code or not defining something correctly.

Thanks again,
bop-a-nator
 
Old 01-07-2013, 12:06 PM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,505

Rep: Reputation: 2890
Quote:
split(a[$1$2],b) based on the split b contains value from filez: 114
No, this is the incorrect assumption (which I was cryptically trying to point out before).

Your first assumption is correct: if filex contains a single line, then the array being set will look like:
Code:
a[112] = 112
You may also notice that this line doesn't really help us much, but I will come back to that(*).

Then once you finish processing filex and move to filez, the split will look like:
Code:
split(a[112114], b)
You can probably see from this that the 'a' array has no such index. Also, what you thought was happening would be moot anyway, because FS is set to a comma, hence $1 = 112 and $2 = 114.

Hope that helps clear up why you are not getting the expected results.
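A minimal sketch that makes the missing index visible, assuming the single-line filex and filez from post #4. The variable names key and probe are made up for this illustration; the first rule is the one from the original command.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'A,cc,112,yyy,nnn' > filex
printf '%s\n' '112,114' > filez

# First file: store $3 exactly as the original command did.
# Second file: show the key built by $1$2 and whether array a has it.
probe=$(awk 'BEGIN { FS = "," }
NR == FNR { a[$3] = $3; next }
{
    key = $1 $2                      # plain concatenation, no separator
    print "key=" key, ((key in a) ? "found" : "missing")
}' filex filez)
printf '%s\n' "$probe"
```

Because the key is 112114 and the only index stored was 112, the lookup misses, split has nothing to work on, and the line from filez is printed unchanged by the trailing 1.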

(*) - try reading filez first and using something like:
Code:
NR == FNR{a[$1] = $2;next}
I will let you write the rest
 
Old 01-07-2013, 02:35 PM   #6
bop-a-nator
LQ Newbie
 
Registered: Sep 2012
Location: North East USA
Distribution: at work: Red Hat Enterprise Linux Server release 5.8 (Tikanga); at home: what do you recommend?
Posts: 24

Original Poster
Rep: Reputation: Disabled
I can see what the split does:

awk 'BEGIN{OFS=FS=","}NR==FNR{split(a[$1$2],b)}{print $0}' filez
112,114


awk 'BEGIN{OFS=FS=","}NR==FNR{split(a[$1$2],b)}{print $2}' filez
114


My thought on how this works is that the block before the next builds something to match on between the two files, right?


So with what you explained that would give me this part of the statement:

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next}

This seems like it would set a[$1] = 112 from filez

So that is the field/value on what I am matching on, I follow that.

So in the following block I am working on filex; I think this is where I am getting confused about how it works.

My thought was that the next piece pulls what matches between the two input files, filex b[$3] (112) to filez $3 (112): {b[$3]=$3}

Then the next piece reassigns what goes into $3: {$3=a[2]}, which is 114

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next}{b[$3]=$3}{$3=a[2]}1' filez filex > filey

then I got this:

cat filey
A,cc,,yyy,nnn

This seems like it should not be difficult, though something about it I am simply not "getting"
 
Old 01-08-2013, 01:11 AM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,505

Rep: Reputation: 2890
You are correct, it is not difficult, but we all start somewhere.

Awk uses the keyword 'in' to let you test whether a value is an index of an array, e.g.:
Code:
5 in array{...}
This will perform the task in the braces should the number 5 be an index of the array.

Also, just to backtrack on one of your assumptions:
Quote:
awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next}

This seems like it would set a[$1] = 112 from filez
What it will actually set in array 'a' is:
Code:
a[112] = 114
So here we have $1 = 112 which is the index of the array and $2 = 114 which is the value stored at this index.
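To check what the array holds after filez has been read, a small sketch that dumps it in an END block. It assumes the three-line filez from the first post; the variable name table is only for illustration.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' '111,111' '112,114' '113,113' > filez

# Build the lookup table, then list it in END.
# for (k in a) visits keys in an unspecified order, hence the sort.
table=$(awk 'BEGIN { FS = "," }
{ a[$1] = $2 }
END { for (k in a) print "a[" k "] = " a[k] }' filez | sort)
printf '%s\n' "$table"
```

Each line of filez becomes one index/value pair, so the array works as a lookup table from column 1 to column 2.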

Again I will leave you to it, but with the hint of using the 'in' keyword on array 'a' and checking the index against part of filex
 
Old 01-08-2013, 10:22 AM   #8
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: CentOS
Posts: 3,434

Rep: Reputation: 1506
Quote:
Originally Posted by grail
Awk uses the keyword 'in' to let you test whether a value is an index of an array, e.g.:
Code:
5 in array{...}
This will perform the task in the braces should the number 5 be an index of the array.
You have to be a bit careful when using the "in" keyword. Simply referencing a variable or array element is sufficient to create it, so:
Code:
$ awk 'BEGIN {
    if(112 in aa) print "oops"; else print "not there yet"
    if(aa[112] == 114) print "is equal"; else print "not equal"
    if(112 in aa) print "now it is there, aa[112]=\"" aa[112] "\""
    exit 0
}'
not there yet
not equal
now it is there, aa[112]=""
Just testing the value of element "112" in the array caused that element to be created with a null value.
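One way to avoid that side effect, sketched under the assumption that you only want to compare values for keys that already exist: put the 'in' test first and let && short-circuit. The key 999 and the variable names here are made up for the example.

```shell
# Guard the lookup with "in" so a miss does not create an element.
guarded=$(awk 'BEGIN {
    aa[112] = 114
    key = 999                               # hypothetical key that was never stored
    if ((key in aa) && aa[key] == 114) print "equal"; else print "no match"
    n = 0
    for (k in aa) n++                       # count the elements that exist now
    print "elements: " n
}')
printf '%s\n' "$guarded"
```

Since the right-hand side of && is never evaluated when the key is absent, aa[999] is never referenced and the array still has its single original element.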
 
Old 01-09-2013, 01:13 PM   #9
bop-a-nator
LQ Newbie
 
Registered: Sep 2012
Location: North East USA
Distribution: at work: Red Hat Enterprise Linux Server release 5.8 (Tikanga); at home: what do you recommend?
Posts: 24

Original Poster
Rep: Reputation: Disabled
I think I got it:

input filez
111,111
112,114
113,113

input file filex
A,bb,111,xxx,nnn
A,cc,112,yyy,nnn
A,dd,113,zzz,ppp

awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next} $3 in a {print $1,$2,a[$3],$4,$5}' filez filex > filey

output file: filey
A,bb,111,xxx,nnn
A,cc,114,yyy,nnn
A,dd,113,zzz,ppp


grail Thank You! Your hints sent me looking and researching to learn more!

I will share that a co-worker looked at my data files and said "that's in dos format" and had me run a dos2unix on the files and that solved some of the problem I was having too.
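The post does not say exactly what the DOS line endings broke, so the following is only a sketch of one likely effect and a possible workaround. With CRLF endings the last field of every line carries a carriage return, so values taken from column 2 of filez would end in an invisible extra character. If dos2unix is not at hand, stripping the carriage return inside awk is one option; the variable name cleaned is just for illustration.

```shell
cd "$(mktemp -d)" || exit 1
# Write both sample files with DOS (CRLF) line endings on purpose.
printf '%s\r\n' '111,111' '112,114' '113,113' > filez
printf '%s\r\n' 'A,bb,111,xxx,nnn' 'A,cc,112,yyy,nnn' 'A,dd,113,zzz,ppp' > filex

# Strip a trailing carriage return from every record before any other rule runs.
# Changing $0 with sub() makes awk split the fields again.
cleaned=$(awk 'BEGIN { OFS = FS = "," }
{ sub(/\r$/, "") }
NR == FNR { a[$1] = $2; next }
$3 in a { print $1, $2, a[$3], $4, $5 }' filez filex)
printf '%s\n' "$cleaned"
```

The output then matches the expected filey with plain Unix line endings.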

Thanks again, I certainly see how this works on the command line. Now to figure out how to write it into a script!
bop-a-nator
I will mark this as solved!
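For the script part, one possible approach (a sketch, not the only way) is to save the awk program in its own file and run it with awk -f. The file name remap.awk is made up for this example; the program is the same one-liner as above, spread over several lines with comments.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' '111,111' '112,114' '113,113' > filez
printf '%s\n' 'A,bb,111,xxx,nnn' 'A,cc,112,yyy,nnn' 'A,dd,113,zzz,ppp' > filex

# Save the awk program in its own file (remap.awk is a made-up name).
cat > remap.awk <<'EOF'
BEGIN { OFS = FS = "," }
# First file on the command line (filez): remember column 1 -> column 2.
NR == FNR { a[$1] = $2; next }
# Remaining file (filex): print only rows whose column 3 has a mapping.
$3 in a { print $1, $2, a[$3], $4, $5 }
EOF

# Run it with -f instead of quoting the whole program on the command line.
awk -f remap.awk filez filex > filey
cat filey
```

The order of the file arguments still matters: filez has to come first so the lookup table is complete before filex is read.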
 
Old 01-10-2013, 01:52 AM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,505

Rep: Reputation: 2890
Here is an alternative for you to consider:
Code:
awk 'BEGIN{OFS=FS=","}NR==FNR{a[$1]=$2;next}$3 in a && $3 = a[$3]' filez filex > filey
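One thing to keep in mind with the compact form: the line is printed only when the key is found and the assigned value is itself true, so as far as I can tell a row with no mapping, or a mapping to 0 or an empty string, would be dropped. If such rows should pass through unchanged, a variant along these lines may be closer to what is wanted. This is a sketch; the shortened filez without an entry for 113 is an assumption made to show the difference.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' '111,111' '112,114' > filez          # note: no entry for 113 here
printf '%s\n' 'A,bb,111,xxx,nnn' 'A,cc,112,yyy,nnn' 'A,dd,113,zzz,ppp' > filex

# Replace $3 when a mapping exists, then let the trailing 1 print every line.
kept=$(awk 'BEGIN { OFS = FS = "," }
NR == FNR { a[$1] = $2; next }
$3 in a { $3 = a[$3] }
1' filez filex)
printf '%s\n' "$kept"
```

Here the third row has no mapping and is still printed as it was read.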
 
  

