awk merging two files with different columns based on condition
All awks read the files you've specified on the command line one at a time, in order, one line at a time. You are trying to read two files at once, which requires a bit of trickery: you can either read one of the files into memory first, a technique that only works as long as that file fits comfortably in memory, or you can use getline.
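As a rough sketch of the getline route: awk reads one file as its normal input and pulls a line from the other file for each record. The file names left.txt and right.txt, the variable name other, and the data are all made up for illustration; they are not the files from this thread.

```shell
# Hypothetical sample data, not the poster's files.
dir=$(mktemp -d)
printf 'a 1\nb 2\nc 3\n' > "$dir/left.txt"
printf 'x 10\ny 20\nz 30\n' > "$dir/right.txt"

# left.txt is the main input; getline fetches the matching
# line from right.txt without touching $0 or NR.
out=$(awk -v other="$dir/right.txt" '{
    if ((getline extra < other) > 0)
        print $0, extra      # both files still have lines
    else
        print $0             # right.txt ran out first
}' "$dir/left.txt")

printf '%s\n' "$out"
rm -r "$dir"
```

The parentheses around the getline expression matter: without them the `<` redirection and the `> 0` test are easy to misparse.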
Trying to parse this code you posted, it looks like you might be attempting the first method.
FNR only equals NR when you are reading the first file, 1.txt. So, while you are reading that file, the first action block is executed, and the "next" statement causes the second action block to be skipped.
When you are reading the second file, 2.txt, FNR does not equal NR, so the first action block is skipped and the second is executed.
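If it helps to see that behaviour, here is a small sketch that only reports which block runs for which line. The two files are throwaway samples I made up, and the labels block1/block2 are just for the demonstration.

```shell
# Hypothetical sample files: two lines, then three lines.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/1.txt"
printf 'c\nd\ne\n' > "$dir/2.txt"

out=$(cd "$dir" && awk '
FNR == NR { print "block1", FILENAME, "NR=" NR, "FNR=" FNR; next }
          { print "block2", FILENAME, "NR=" NR, "FNR=" FNR }
' 1.txt 2.txt)

printf '%s\n' "$out"
rm -r "$dir"
```

One caveat with this idiom: if 1.txt is empty, FNR == NR is also true while reading 2.txt, so the first block runs on the wrong file.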
The space is a string concatenation operator in all awks. So for every line in the first file, you are comparing an undefined variable a[$2] with a string built from $2 FS $3. It's a meaningless comparison and does nothing at all that I can see.
Then for each line in the second file you print out the line followed by an undefined variable a[$1].
Since you've never defined any element of the array a[] at any point, the code makes no sense...
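For contrast, this is roughly what the read-into-memory method looks like when the array is actually filled in. It assumes both files share a key in column 1, which may not be true of your data; the sample contents are invented.

```shell
# Hypothetical data with a shared key in column 1.
dir=$(mktemp -d)
printf 'g1 10\ng2 20\n' > "$dir/1.txt"
printf 'g1 red\ng2 blue\ng3 green\n' > "$dir/2.txt"

out=$(cd "$dir" && awk '
FNR == NR { a[$1] = $2; next }   # first file: store column 2 under the column 1 key
($1 in a) { print $0, a[$1] }    # second file: append the stored value if the key is known
' 1.txt 2.txt)

printf '%s\n' "$out"
rm -r "$dir"
```

Testing with `($1 in a)` rather than reading a[$1] directly avoids creating empty array elements for keys that only exist in the second file.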
If you can describe the problem better, someone can probably help. But I don't understand what you mean by "paste the line underneath that row". Which line? What row? Can you provide some examples of what the output should look like?
You have been offered snippets of awk in your previous threads - you would be better off using structure, similar to but not as verbose as what you would write in Python, rather than trying to write "minimalist" code.
Some of us enjoy the latter, but it won't help you learn awk.
Thanks. There aren't any matching columns to key on however. So some interpretation needs to happen in the background. Is it ok to consider g_1 in file 2 equivalent to g1 in file 1 and g_2 in file 2 equivalent to g2 in file 1 for the sake of comparison?
Maybe just using the default collating sequence won't work for you.
If not, I would look into the possibility of normalizing the data somehow so that it will work.
If you recode the keys carefully, you could translate them back after the merge.
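A minimal sketch of that kind of normalizing, assuming it really is acceptable to treat g_1 as g1 (that is still an open question in this thread). The normalized key is added as an extra leading column, so the original key stays on the line and nothing has to be translated back. The variable name norm and the sample rows are made up.

```shell
out=$(printf 'g_1 5\ng_2 7\n' | awk '{
    norm = $1              # work on a copy so the original field is untouched
    gsub(/_/, "", norm)    # g_1 becomes g1
    print norm, $0         # normalized key first, original line after it
}')

printf '%s\n' "$out"
```

After the merge, dropping the first column gives back the original rows.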
And, why does g_3 ...
come after g4 ...
in your sample?
In any case, it looks like a standard merge with the addition of a user supplied key comparison function.
I don't know of a utility that offers such an option.
Last edited by josephj; 08-16-2017 at 04:01 AM.
Reason: refined answer
OK, the first lines of your input files seem to be headers which must be skipped. Now this makes slightly more sense.
But the output you posted does not match the algorithm you proposed... you said "By comparing the second column, if the value in second file is greater than the value in first file, then, paste the line underneath that row."
Your output sample shows what you meant by line and row, but that algorithm would produce this output, not the output you posted:
IFF these files are small, so that you can safely read in 2.txt without overflowing memory, here is a verbose, brute force way to get the output I've just shown.
Code:
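gawk 'BEGIN {
    # Read all of 2.txt into memory; the header lands in line[0].
    count = 0
    while ((getline row < "2.txt") > 0) {
        line[count] = row
        split(row, dummy)
        key[count] = dummy[2]   # second column, used for the comparison
        count++
    }
    close("2.txt")
}
NR != 1 {
    # Print each data line of 1.txt, then every row of 2.txt
    # whose second column is greater.
    print $0
    for (i = 1; i < count; i++) {
        if ($2 < key[i]) {
            print line[i]
        }
    }
}' 1.txt
In the BEGIN rule we read the entire file 2.txt into the array line. count is set to 0 explicitly, because an uninitialized variable used as an array subscript is the empty string rather than 0. The header line therefore ends up in line[0], which makes it easy to skip later: the loop in the main rule starts at index 1 and stops before count, the number of lines read. While we are building that array, we split out the values of the second column in each line into a second array key, for comparison later.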
Then we read each line of the file 1.txt, and if it's not the first line then print it, and print all rows from the array that match the criteria you previously stated.
This code hasn't been executed, it's out of my head, but it should work.
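One way to check the idea is to run it on a tiny pair of files. The sketch below is a compact variant of the same brute-force approach on data I invented, since the real files are not shown in full here; it forces a numeric comparison with + 0 so the result does not depend on how awk classifies the fields.

```shell
# Hypothetical sample files, each with a header line.
dir=$(mktemp -d)
printf 'id val\ng1 10\ng2 30\n' > "$dir/1.txt"
printf 'id val\ng_1 20\ng_2 40\n' > "$dir/2.txt"

out=$(cd "$dir" && awk '
BEGIN {
    n = 0
    while ((getline row < "2.txt") > 0) {
        if (++n == 1) continue          # skip the header of 2.txt
        split(row, f)
        line[n] = row; key[n] = f[2]
    }
    close("2.txt")
}
NR > 1 {                                # skip the header of 1.txt
    print
    for (i = 2; i <= n; i++)
        if ($2 + 0 < key[i] + 0) print line[i]
}' 1.txt)

printf '%s\n' "$out"
rm -r "$dir"
```

With this data, the row g1 10 is followed by both rows of 2.txt (20 and 40 are greater than 10), and g2 30 is followed only by g_2 40.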