LinuxQuestions.org - awk merging two files with different columns based on condition

- Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)

- - awk merging two files with different columns based on condition (https://www.linuxquestions.org/questions/linux-newbie-8/awk-merging-two-files-with-different-columns-based-on-condition-4175611994/)

awk merging two files with different columns based on condition

Greetings!

I am trying to merge two files. My first file looks like this:

Quote:

gene1 start1 end1 gene2 start2 end2
abc 1 2 efg 2 3
xyz 3 5 por 3 5

Second file looks like this:

Quote:

gene start end
def 2 9

By comparing the second column, if the value in second file is greater than the value in first file, then, paste the line underneath that row.

First file contain 6 columns and second file contain 3 columns.

I have tried the following code but it's not working:

Code:

awk 'FNR==NR{a[$2]<$2 FS $3;next}{ print $0, a[$1]}' 1.txt 2.txt

Thanks!

Can you please add a few more lines of data for each file and show what the final results for that data should look like?

I agree that more detail is required. Based on what has been presented, then

Code:

tail -n +2 1.txt > 3.txt; tail -n +2 2.txt > 4.txt; sort -mk2 3.txt 4.txt

would produce

Quote:

abc 1 2 efg 2 3
def 2 9
xyz 3 5 por 3 5

which seems to meet your specification, but is likely to fail with a more complex data set.

All the awks read the files you've specified on the command line one at a time, in order, one line at a time. You are trying to read two files at once which requires a bit of trickery - you can either read one of the files into memory first, a technique which usually will not scale to any real world task, or you can use getline.

Trying to parse this code you posted, it looks like you might be attempting the first method.

Quote:

awk 'FNR==NR{a[$2]<$2 FS $3;next}{ print $0, a[$1]}' 1.txt 2.txt

FNR only equals NR when you are reading the first file, 1.txt. So, while you are reading that file, the first action block is executed, and the "next" statement causes the second action block to be skipped.

When you are reading the second file, 2.txt, FNR does not equal NR, so the first action block is skipped and the second is executed.

The space is a string concatenation operator in all awks. So for every line in the first file, you are comparing an undefined variable a[$2] with a string built from $2 FS $3. It's a meaningless comparison and does nothing at all that I can see.

Then for each line in the second file you print out the line followed by an undefined variable a[$1].

Since you've never defined any element of the array a[] at any point, the code makes no sense...

If you can describe the problem better, someone can probably help. But I don't understand what you mean by "paste the line underneath that row". Which line? What row? Can you provide some examples of what the output should look like?

You have been offered snippets of awk in your previous threads - you would be better of using structure, similar to but not as verbose, as you would in python rather than trying to write "minimalist" code.
Some of us enjoy the latter, but it won't help you learn awk.

Quote:

Originally Posted by Turbocapitalist (Post 5748505)

Can you please add a few more lines of data for each file and show what the final results for that data should look like?

Here are the few more lines:

1.txt:

Code:

gene1 start1 end1 gene2 start2 end2

g1 1 2 gg1 2 3

g2 3 5 gg2 3 5

g3 10 12 gg3 5 9

g4 4 9 gg4 8 9

2.txt:

Code:

gene start end

g_1 2 9

g_2 3 10

g_3 5 9

Output should look like this:

Code:

g1 1 2 gg1 2 3

g_1 2 9

g_2 3 10

g2 3 5 gg2 3 5

g3 10 12 gg3 5 9

g4 4 9 gg4 8 9

g_3 5 9

Thanks. There aren't any matching columns to key on however. So some interpretation needs to happen in the background. Is it ok to consider g_1 in file 2 equivalent to g1 in file 1 and g_2 in file 2 equivalent to g2 in file 1 for the sake of comparison?

Why isn't this just a vanilla (sort)/merge?

Code:

sort --merge {1,2}.txt

yields

Code:

g1 1 2 gg1 2 3

g_1 2 9

g_2 3 10

g2 3 5 gg2 3 5

g3 10 12 gg3 5 9

g_3 5 9

g4 4 9 gg4 8 9

Maybe just using the default collating sequence won't work for you.
If not, I would look into the possibility of normalizing the data somehow so that it will work.
If you recode the keys carefully, you could translate them back after the merge.

And, why does g_3 ...
come after g4 ...
in your sample?

In any case, it looks like a standard merge with the addition of a user supplied key comparison function.
I don't know of a utility that offers such an option.

OK, the first lines of your input files seem to be headers which must be skipped. Now this makes slightly more sense.

But the output you posted does not match the algorithm you proposed... you said "By comparing the second column, if the value in second file is greater than the value in first file, then, paste the line underneath that row."

You output sample shows what you meant by line and row, but that algorithm would cause this output, not the output you posted:

Code:

g1 1 2 gg1 2 3

g_1 2 9

g_2 3 10

g_3 5 9

g2 3 5 gg2 3 5

g_3 5 9

g3 10 12 gg3 5 9

g4 4 9 gg4 8 9

g_3 5 9

IFF these files are small, so that you can safely read in 2.txt without overflowing memory, here is a verbose, brute force way to get the output I've just shown.

Code:

gawk 'BEGIN {

        while(( getline line[count]<"2.txt") > 0 ) {

          split(line[count], dummy)

          key[count]=dummy[2]

          count++

        }

      }

      (NR!=1) {

        print $0

        for (i = 1; i <= count; i++) {

          if ($2 < key[i]) {

            print line[i]

          }

        }

      }' 1.txt

In the BEGIN rule we read the entire file 2.txt into the array line. Since uninitialized awk variables equate numerically to zero the header line will be in line[0] which makes it easy to skip later. While we are building that array, we split out the values of the second column in each line into a second array key, for comparison later.

Then we read each line of the file 1.txt, and if it's not the first line then print it, and print all rows from the array that match the criteria you previously stated.

This code hasn't been executed, it's out of my head, but it should work.