merge two files with different lengths

ClaraW · 04-19-2013, 08:33 PM

Hi there,

I have two very long files like:

file1: two fields

Code:

1 123
1 125
1 234
2 123
2 234
2 300
2 312
3 10
3 215
4 56
...

file2: 6 fields

Code:

1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1
...

The numbers of lines in file1 and file2 are not equal.

I want to get an output file like

file3: 6 fields, where the first two fields are exactly the same as the first two fields in file1. For example, the file1 line with the first two fields "1 123" has a match in file2, "1 123 0 1 0 0", so the whole file2 line "1 123 0 1 0 0" is printed to file3. If a line in file1 has no match in file2, e.g. "1 125", then "1 125 0 0 0 0" is printed to file3.

Code:

1 123 0 1 0 0
1 125 0 0 0 0
1 234 0 0 0 0
2 123 0 1 0 1
2 234 0 0 0 0
2 300 0 1 2 3
2 312 0 0 0 0
3 10  0 0 0 0
3 215 1 1 2 1
4 56  0 0 0 0
...

I am wondering if this can be done using awk, join, or any other tool in Linux? Since the files are very large, I really want it to be fast. Thanks a lot~~~

Note: Field 2 in both file1 and file2 contains only numeric values, but field 1 in both files may contain characters too. Both files are sorted on these two fields, and neither file contains duplicate keys, so a situation like this will not happen:

Code:

1 123
1 123
...

Also, lines in file2 that have no match in file1 can be ignored. For example, "1 126 2 1 0 0" has no match in file1, so it should not be added to file3.

suicidaleggroll · 04-19-2013, 08:57 PM

If it were me, I'd write a script to leapfrog through the files. Since you know the two files are sorted and free of duplicates, this would be ideal in my opinion.

The basic process would be:

0) Read a line from file1
1) Read through file2 until you match or exceed the last read value from file1
2) If the first two fields in the two lines match, then copy the line from file2 into file3
3) Otherwise, read through file1 until you match or exceed the last read value from file2, writing each unmatched file1 line to file3 padded with "0 0 0 0"
4) If the first two fields in the two lines match, then copy the line from file2 into file3
5) Go back to #1
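
The steps above can be sketched in awk roughly as follows. This is only a sketch under assumptions that are not stated in the thread: the function name leapfrog_merge is made up for illustration, field 1 is assumed to be sorted in the order awk's string comparison uses (for example byte order under LC_ALL=C), and field 2 is assumed to be sorted numerically. It keeps only one line of each file in memory at a time.

```shell
# leapfrog_merge KEYFILE DATAFILE   (hypothetical helper name)
# KEYFILE holds the two-field keys (file1), DATAFILE the six-field lines (file2).
leapfrog_merge() {
  awk -v data="$2" '
    # cmp(): negative, zero, or positive when key (a1,a2) sorts
    # before, equal to, or after key (b1,b2).
    # Assumption: field 1 compares as a string, field 2 as a number.
    function cmp(a1, a2, b1, b2) {
      if ((a1 "") < (b1 "")) return -1
      if ((a1 "") > (b1 "")) return 1
      return (a2 + 0) - (b2 + 0)
    }
    # advance(): read the next DATAFILE line into d0 and its key into d1, d2.
    function advance(   f) {
      if ((getline d0 < data) > 0) { split(d0, f, " "); d1 = f[1]; d2 = f[2] }
      else eof = 1
    }
    BEGIN { advance() }
    NF < 2 { next }                      # skip blank or malformed key lines
    {
      # Skip DATAFILE lines whose key sorts before the current KEYFILE key.
      while (!eof && cmp(d1, d2, $1, $2) < 0) advance()
      if (!eof && cmp(d1, d2, $1, $2) == 0) print d0   # match: copy the data line
      else print $1, $2, 0, 0, 0, 0                    # no match: pad with zeros
    }
  ' "$1"
}
```

It would be called as: leapfrog_merge file1.txt file2.txt > file3.txt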

David the H. · 04-21-2013, 03:08 PM

If the files aren't too large to be held in RAM, then you can simply load one file into an array, using fields one and two as the index values, then use that to print out the matching lines from the other file.

Code:

awk 'NR==FNR { a[$1,$2]=$0; next } { print ((($1,$2) in a) ? a[$1,$2] : $0 " 0 0 0 0") }' file2.txt file1.txt

If they're too large to handle all at once, then you'll have to go with the approach suicidaleggroll described above.
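
A note on building an array index from two fields in awk: plain concatenation such as $1$2 maps "1 23" and "12 3" to the same key, while the comma form $1,$2 joins the fields with awk's SUBSEP character and keeps them apart. A small demonstration (the function name key_demo is made up for illustration):

```shell
# key_demo: count distinct keys for two lines that differ only in where
# the field boundary falls ("1 23" versus "12 3").
key_demo() {
  printf '1 23\n12 3\n' | awk '
    { plain[$1 $2]++; safe[$1, $2]++ }   # concatenated key vs. SUBSEP key
    END {
      for (k in plain) np++
      for (k in safe)  ns++
      print "concatenated keys:", np     # both lines collapse into "123"
      print "SUBSEP keys:", ns           # the two lines stay distinct
    }'
}
```

Running key_demo prints 1 for the concatenated keys and 2 for the SUBSEP keys.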

ClaraW · 04-24-2013, 01:17 PM

Hi David the H.,

Thanks a lot for your help! But what if my data is really really large? Can this be used still?

David the H. · 04-25-2013, 12:11 PM

No, probably not, since it has to store the whole file in memory first. You'll have to use an algorithm like the one suicidaleggroll posted.

Unfortunately though that's getting a bit beyond my ability in awk. I'm not that experienced in multi-file processing with it. Perhaps grail will come along soon and show us how it's done.

It would probably be even better to use a language like Perl, but again that's something I don't know much of. I'm mostly just a bash person at this stage. I could write it up as a shell script, but that would be dog slow, and probably take hours to process.

danielbmartin · 04-26-2013, 07:06 AM

Consider this ...

Code:

sed 's/$/ 0 0 0 0/' $InFile1               \
|sort -m - $InFile2                        \
|awk -F " " '{t=$0; t1=$1; t2=$2; getline;
  if (t1 t2!=$1 $2) print t; else print}
  END {print}' >$OutFile

Daniel B. Martin

ClaraW · 04-26-2013, 01:42 PM

Hi danielbmartin,

Thank you for the scripts, but it adds all the lines in file1 and file2 together. I only want those in file1. I am wondering if this can be done.

danielbmartin · 04-26-2013, 05:23 PM

Quote:

Originally Posted by ClaraW

Thank you for the scripts, but it adds all the lines in file1 and file2 together. I only want those in file1. I am wondering if this can be done.

Please do this ...
1) Carefully proofread your problem statement. Did you use the word field in any place where you meant file?
2) Extend your sample input files and corresponding output file.

These steps will aid comprehension.

Daniel B. Martin

ClaraW · 04-26-2013, 06:35 PM

Hi Daniel,

I changed the statement. Is it clearer now?

Thanks,

Clara

allend · 04-26-2013, 10:16 PM

This is my bash solution that seems to do what you want. It will not be the fastest solution, but would complete in less time than this thread has been running.

Code:

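#!/bin/bash

if1="tf1.txt"
if2="tf2.txt"
of1="tf3.txt"

# For each key in file1, fetch the file2 line that starts with the same two fields.
# Assumes the key fields contain no regex metacharacters.
while read -r key1 key2 ; do
  # Anchor the pattern so "1 12" cannot match "1 123" or a later field.
  line2=$(grep -m 1 -E "^${key1}[[:space:]]+${key2}([[:space:]]|\$)" "$if2")
  if [[ $line2 ]] ; then
    echo "$line2"
  else
    echo "$key1 $key2 0 0 0 0"
  fi
done < "$if1" > "$of1"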

danielbmartin · 04-28-2013, 11:52 AM

Quote:

Originally Posted by ClaraW

I changed the statement. Is it clearer now?

Yes.

Try this ...

Code:

 sort -m "$InFile2" "$InFile1"              \
|awk -F " " '{key=$1 FS $2
         if (prevKey==key)      print $0
         else if (prevNF==2)    print prev,"0 0 0 0"
         prevKey=key; prev=$0; prevNF=NF}
     END {if (prevNF==2)        print prev,"0 0 0 0"}'  \
>"$OutFile"

Daniel B. Martin