LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 04-19-2013, 08:33 PM   #1
ClaraW
LQ Newbie
 
Registered: Apr 2013
Location: Vancouver, Canada
Posts: 4

Rep: Reputation: Disabled
merge two files with different lengths


Hi there,

I have two very long files like:

file1: two fields

Code:
1 123 
1 125
1 234
2 123
2 234
2 300
2 312
3 10
3 215
4 56
...
file2: 6 fields

Code:
1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1
...
The numbers of lines in file1 and file2 are not equal.

I want to get an output file like

file3: 6 fields, where the first two fields are exactly the same as the first two fields in file1. For example, the line whose first two fields are "1 123" has a match in file2, "1 123 0 1 0 0", so the whole file2 line "1 123 0 1 0 0" is printed to file3. If a line in file1 does not have a match in file2, e.g. "1 125", then "1 125 0 0 0 0" is printed to file3.

Code:
1 123 0 1 0 0
1 125 0 0 0 0 
1 234 0 0 0 0
2 123 0 1 0 1
2 234 0 0 0 0
2 300 0 1 2 3
2 312 0 0 0 0
3 10  0 0 0 0
3 215 1 1 2 1
4 56  0 0 0 0
...
I am wondering if this can be done using awk, join, or any other tool in Linux? Since the files are very large, I really want it to be fast. Thanks a lot~~~

Note: Field 2 in both file1 and file2 contains only numeric values, but field 1 in both files may contain characters too. Both files are sorted on these two fields. Neither file contains duplicate keys, so this kind of situation will not happen:

Code:
1 123
1 123
...
Also, lines in file2 that have no match in file1 can be ignored. For example, "1 126 2 1 0 0" has no match in file1, so it should not be added to file3.

Last edited by ClaraW; 04-26-2013 at 06:36 PM.
 
Old 04-19-2013, 08:57 PM   #2
suicidaleggroll
LQ Guru
 
Registered: Nov 2010
Location: Colorado
Distribution: OpenSUSE, CentOS
Posts: 5,573

Rep: Reputation: 2142
If it were me, I'd write a script to leapfrog through the files. Since you know the two files are sorted and free of duplicates, this would be ideal in my opinion.

The basic process would be:

0) Read a line from file1
1) Read through file2 until you match or exceed the last read value from file1
2) If the first two fields in the two lines match, then copy the line from file2 into file3
3) Read through file1 until you match or exceed the last read value from file2 (assuming #2 failed)
4) If the first two fields in the two lines match, then copy the line from file2 into file3
5) Go back to #1
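
The leapfrog steps above could be sketched in awk roughly as follows. This is only a sketch under assumptions not confirmed in the thread: the function name merge_join is made up for illustration, field 1 is assumed to be sorted in plain string order and field 2 in numeric order, and unmatched file1 lines are padded with zeros as the original question asks. Only one line from each file is held in memory at a time.

```shell
# merge_join FILE1 FILE2
# Hypothetical helper: streams both sorted files once and prints the merged result.
merge_join() {
  awk -v f2="$2" '
    # Load the next file2 record into k1 (string), k2 (number) and rec.
    function advance(   line, parts) {
      if ((getline line < f2) > 0) {
        split(line, parts, " ")
        k1 = parts[1]; k2 = parts[2] + 0; rec = line
      } else {
        eof2 = 1
      }
    }
    # True while the current file2 key sorts before the file1 key (a1, a2).
    # Assumes string order on field 1 and numeric order on field 2.
    function behind(a1, a2) {
      return ((k1 "") < (a1 "")) || ((k1 "") == (a1 "") && k2 < a2 + 0)
    }
    BEGIN { advance() }
    {
      # Skip file2 records that can no longer match any file1 line.
      while (!eof2 && behind($1, $2)) advance()
      if (!eof2 && (k1 "") == ($1 "") && k2 == $2 + 0) print rec
      else print $1, $2, 0, 0, 0, 0
    }
  ' "$1"
}
```

It would be called as `merge_join file1 file2 > file3`. If field 1 is actually sorted in some other order (for example numerically or by locale rules), the comparison in behind() would need to be changed to match.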
 
Old 04-21-2013, 03:08 PM   #3
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
If the files aren't too large to be held in RAM, then you can simply load one file into an array, using fields one and two as the index values, then use that to print out the matching lines from the other file.

Code:
awk 'NR==FNR { a[$1,$2]=$0; next } { print ((($1,$2) in a) ? a[$1,$2] : $1 " " $2 " 0 0 0 0") }' file2.txt file1.txt
If they're too large to handle all at once, then you'll have to go with the approach suicidaleggroll described above.
 
Old 04-24-2013, 01:17 PM   #4
ClaraW
LQ Newbie
 
Registered: Apr 2013
Location: Vancouver, Canada
Posts: 4

Original Poster
Rep: Reputation: Disabled
Hi David the H.,

Thanks a lot for your help! But what if my data is really, really large? Can this still be used?
 
Old 04-25-2013, 12:11 PM   #5
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
No, probably not, since it has to store the whole file in memory first. You'll have to use an algorithm like suicidaleggroll posted.

Unfortunately though that's getting a bit beyond my ability in awk. I'm not that experienced in multi-file processing with it. Perhaps grail will come along soon and show us how it's done.

It would probably be even better to use a language like Perl, but again that's something I don't know much of. I'm mostly just a bash person at this stage. I could write it up as a shell script, but that would be dog slow, and probably take hours to process.
 
Old 04-26-2013, 07:06 AM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Consider this ...
Code:
sed 's/$/ 0 0 0 0/' $InFile1               \
|sort -m - $InFile2                        \
|awk -F " " '{t=$0; t1=$1; t2=$2; getline;
  if (t1 t2!=$1 $2) print t; else print}
  END {print}' >$OutFile
Daniel B. Martin

Last edited by danielbmartin; 04-26-2013 at 10:43 AM. Reason: Improved code
 
Old 04-26-2013, 01:42 PM   #7
ClaraW
LQ Newbie
 
Registered: Apr 2013
Location: Vancouver, Canada
Posts: 4

Original Poster
Rep: Reputation: Disabled
Hi danielbmartin,

Thank you for the script, but it adds all the lines from file1 and file2 together. I only want those in file1. I am wondering if this can be done.
 
Old 04-26-2013, 05:23 PM   #8
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by ClaraW
Thank you for the script, but it adds all the lines from file1 and file2 together. I only want those in file1. I am wondering if this can be done.
Please do this ...
1) Carefully proofread your problem statement. Did you use the word field in any place where you meant file?
2) Extend your sample input files and corresponding output file.

These steps will aid comprehension.

Daniel B. Martin
 
Old 04-26-2013, 06:35 PM   #9
ClaraW
LQ Newbie
 
Registered: Apr 2013
Location: Vancouver, Canada
Posts: 4

Original Poster
Rep: Reputation: Disabled
Hi Daniel,

I changed the statement. Is it clearer now?

Thanks,

Clara
 
Old 04-26-2013, 10:16 PM   #10
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,367

Rep: Reputation: 2748
This is my bash solution that seems to do what you want. It will not be the fastest solution, but would complete in less time than this thread has been running.
Code:
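#!/bin/bash

if1="tf1.txt"
if2="tf2.txt"
of1="tf3.txt"

# For each key in file1, take the file2 line that starts with exactly that key;
# otherwise pad the key with zeros. Anchoring with ^ and a trailing space keeps
# "1 12" from matching "1 123 ..." or "11 12 ...".
while read -r line1 ; do
  line2=$(grep -m 1 "^$line1 " "$if2")
  if [[ $line2 ]] ; then
    echo "$line2" >> "$of1"
  else
    echo "$line1 0 0 0 0" >> "$of1"
  fi
done < "$if1"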
 
Old 04-28-2013, 11:52 AM   #11
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by ClaraW
I changed the statement. Is it clearer now?
Yes.

Try this ...
Code:
sort -m "$InFile2" "$InFile1"              \
|awk '{key=$1 " " $2
       if (key==prevKey)   print $0
       else if (prevNF==2) print prevKey,"0 0 0 0"
       prevKey=key; prevNF=NF}
      END {if (prevNF==2)  print prevKey,"0 0 0 0"}' \
>"$OutFile"
Daniel B. Martin
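
One thing to watch with the sort -m approach: the merge only lines up matching keys if both inputs are already in the same order that sort itself uses. A hedged variant that spells the keys out (field 1 compared as text, field 2 as a number) might look like the sketch below. The function name merge_fill, the LC_ALL=C setting, and the assumption that field 1 is in byte order are assumptions for illustration, not something stated in the thread.

```shell
# merge_fill FILE1 FILE2
# Hypothetical helper: merge both sorted files on (field 1, field 2), then keep
# the file2 line where a pair exists and zero-fill file1 lines that stand alone.
merge_fill() {
  # For equal keys sort falls back to comparing whole lines, so the shorter
  # file1 line comes out just before its file2 partner.
  LC_ALL=C sort -m -k1,1 -k2,2n "$1" "$2" |
  awk '{ key = $1 " " $2
         if (key == prevKey)   print                      # file2 partner of the previous line
         else if (prevNF == 2) print prevKey, "0 0 0 0"   # previous file1 line had no partner
         prevKey = key; prevNF = NF }
       END { if (prevNF == 2)  print prevKey, "0 0 0 0" }'
}
```

Usage would be `merge_fill file1 file2 > file3`. With the numeric key on field 2, a line such as "3 9" is merged ahead of "3 10", which a plain whole-line comparison would not do.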
 
  

