LinuxQuestions.org
Download your favorite Linux distribution at LQ ISO.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 08-15-2017, 09:13 AM   #1
Asoo
LQ Newbie
 
Registered: Apr 2017
Posts: 29

Rep: Reputation: Disabled
awk merging two files with different columns based on condition


Greetings!

I am trying to merge two files. My first file looks like this:

Quote:
gene1 start1 end1 gene2 start2 end2
abc 1 2 efg 2 3
xyz 3 5 por 3 5
Second file looks like this:

Quote:
gene start end
def 2 9
By comparing the second column, if the value in second file is greater than the value in first file, then, paste the line underneath that row.

First file contain 6 columns and second file contain 3 columns.

I have tried the following code but it's not working:

Code:
awk 'FNR==NR{a[$2]<$2 FS $3;next}{ print $0, a[$1]}' 1.txt 2.txt
Thanks!
 
Old 08-15-2017, 09:50 AM   #2
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 5,607
Blog Entries: 3

Rep: Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851
Can you please add a few more lines of data for each file and show what the final results for that data should look like?
 
Old 08-15-2017, 10:24 AM   #3
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 5,625

Rep: Reputation: 2197Reputation: 2197Reputation: 2197Reputation: 2197Reputation: 2197Reputation: 2197Reputation: 2197Reputation: 2197Reputation: 2197Reputation: 2197Reputation: 2197
I agree that more detail is required. Based on what has been presented, then
Code:
tail -n +2 1.txt > 3.txt; tail -n +2 2.txt > 4.txt; sort -mk2 3.txt 4.txt
would produce
Quote:
abc 1 2 efg 2 3
def 2 9
xyz 3 5 por 3 5
which seems to meet your specification, but is likely to fail with a more complex data set.
 
Old 08-15-2017, 11:05 AM   #4
Medievalist
Member
 
Registered: Aug 2003
Distribution: Dead Rat
Posts: 191

Rep: Reputation: 55
All the awks read the files you've specified on the command line one at a time, in order, one line at a time. You are trying to read two files at once which requires a bit of trickery - you can either read one of the files into memory first, a technique which usually will not scale to any real world task, or you can use getline.

Trying to parse this code you posted, it looks like you might be attempting the first method.

Quote:
awk 'FNR==NR{a[$2]<$2 FS $3;next}{ print $0, a[$1]}' 1.txt 2.txt
FNR only equals NR when you are reading the first file, 1.txt. So, while you are reading that file, the first action block is executed, and the "next" statement causes the second action block to be skipped.

When you are reading the second file, 2.txt, FNR does not equal NR, so the first action block is skipped and the second is executed.

The space is a string concatenation operator in all awks. So for every line in the first file, you are comparing an undefined variable a[$2] with a string built from $2 FS $3. It's a meaningless comparison and does nothing at all that I can see.

Then for each line in the second file you print out the line followed by an undefined variable a[$1].

Since you've never defined any element of the array a[] at any point, the code makes no sense...

If you can describe the problem better, someone can probably help. But I don't understand what you mean by "paste the line underneath that row". Which line? What row? Can you provide some examples of what the output should look like?
 
1 members found this post helpful.
Old 08-15-2017, 05:56 PM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 19,695

Rep: Reputation: 3547Reputation: 3547Reputation: 3547Reputation: 3547Reputation: 3547Reputation: 3547Reputation: 3547Reputation: 3547Reputation: 3547Reputation: 3547Reputation: 3547
You have been offered snippets of awk in your previous threads - you would be better of using structure, similar to but not as verbose, as you would in python rather than trying to write "minimalist" code.
Some of us enjoy the latter, but it won't help you learn awk.
 
Old 08-16-2017, 01:35 AM   #6
Asoo
LQ Newbie
 
Registered: Apr 2017
Posts: 29

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Turbocapitalist View Post
Can you please add a few more lines of data for each file and show what the final results for that data should look like?
Here are the few more lines:

1.txt:
Code:
gene1 start1 end1 gene2 start2 end2
g1 1 2 gg1 2 3
g2 3 5 gg2 3 5
g3 10 12 gg3 5 9
g4 4 9 gg4 8 9
2.txt:
Code:
gene start end
g_1 2 9
g_2 3 10
g_3 5 9
Output should look like this:
Code:
g1 1 2 gg1 2 3
g_1 2 9
g_2 3 10
g2 3 5 gg2 3 5
g3 10 12 gg3 5 9
g4 4 9 gg4 8 9
g_3 5 9
 
Old 08-16-2017, 01:49 AM   #7
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 5,607
Blog Entries: 3

Rep: Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851Reputation: 2851
Thanks. There aren't any matching columns to key on however. So some interpretation needs to happen in the background. Is it ok to consider g_1 in file 2 equivalent to g1 in file 1 and g_2 in file 2 equivalent to g2 in file 1 for the sake of comparison?
 
Old 08-16-2017, 03:39 AM   #8
josephj
Member
 
Registered: Nov 2007
Location: Northeastern USA
Distribution: kubuntu
Posts: 204

Rep: Reputation: 106Reputation: 106
Wrong tool?

Why isn't this just a vanilla (sort)/merge?

Code:
sort --merge {1,2}.txt
yields

Code:
g1 1 2 gg1 2 3
g_1 2 9
g_2 3 10
g2 3 5 gg2 3 5
g3 10 12 gg3 5 9
g_3 5 9
g4 4 9 gg4 8 9

Maybe just using the default collating sequence won't work for you.
If not, I would look into the possibility of normalizing the data somehow so that it will work.
If you recode the keys carefully, you could translate them back after the merge.

And, why does g_3 ...
come after g4 ...
in your sample?

In any case, it looks like a standard merge with the addition of a user supplied key comparison function.
I don't know of a utility that offers such an option.

Last edited by josephj; 08-16-2017 at 04:01 AM. Reason: refined answer
 
Old 08-16-2017, 12:27 PM   #9
Medievalist
Member
 
Registered: Aug 2003
Distribution: Dead Rat
Posts: 191

Rep: Reputation: 55
OK, the first lines of your input files seem to be headers which must be skipped. Now this makes slightly more sense.

But the output you posted does not match the algorithm you proposed... you said "By comparing the second column, if the value in second file is greater than the value in first file, then, paste the line underneath that row."

You output sample shows what you meant by line and row, but that algorithm would cause this output, not the output you posted:

Code:
g1 1 2 gg1 2 3
g_1 2 9
g_2 3 10
g_3 5 9
g2 3 5 gg2 3 5
g_3 5 9
g3 10 12 gg3 5 9
g4 4 9 gg4 8 9
g_3 5 9
IFF these files are small, so that you can safely read in 2.txt without overflowing memory, here is a verbose, brute force way to get the output I've just shown.

Code:
gawk 'BEGIN {
        while(( getline line[count]<"2.txt") > 0 ) {
          split(line[count], dummy)
          key[count]=dummy[2]
          count++
        }
      }
      (NR!=1) {
        print $0
        for (i = 1; i <= count; i++) {
          if ($2 < key[i]) {
            print line[i]
          }
        }
      }' 1.txt
In the BEGIN rule we read the entire file 2.txt into the array line. Since uninitialized awk variables equate numerically to zero the header line will be in line[0] which makes it easy to skip later. While we are building that array, we split out the values of the second column in each line into a second array key, for comparison later.

Then we read each line of the file 1.txt, and if it's not the first line then print it, and print all rows from the array that match the criteria you previously stated.

This code hasn't been executed, it's out of my head, but it should work.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] Merging columns from different files and renaming columns lharrisl Linux - Newbie 7 05-23-2016 07:36 AM
merge columns from multiple files in a directory based on match of two columns prasanthi yanamala Linux - Newbie 2 11-12-2015 10:11 AM
Merging 2 files with awk crazyspice Linux - Newbie 1 02-24-2014 06:02 PM
awk split single column into multiple columns based on RS wolverene13 Programming 11 11-01-2012 05:07 PM
[SOLVED] merging columns from different files leonardo2887 Linux - Newbie 4 10-11-2010 02:28 AM

LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie

All times are GMT -5. The time now is 04:50 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration