LinuxQuestions.org - Merging files and removing near-duplicates

Yes, yes there is :)

Rub a little Perl on it. Perl is designed to handle just this sort of "difficult" problem: doing funky things with files of text.

I am assuming that the following output would be acceptable. The files are combined as you describe, but the output isn't sorted exactly as in your post: the ids are sorted as plain text, so you get 1,10,2,3 instead of 1,2,3,10. It should still be close enough to import into a database, etc.

Code:

zidane@greymist:~/dev/lq$ ./combine.pl file1.txt file2.txt
#1 John Smith
#10 Greg
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus
zidane@greymist:~/dev/lq$

If this is ok, then you're in luck.

The Perl script is as follows (don't panic, the actual script is much smaller than it looks here; I put lots of whitespace and comments into my code so people can see what is happening):

Code:

#!/usr/bin/perl

use strict;
use warnings;

#get the filenames
my $filename1 = shift;
my $filename2 = shift;

#bail out early if either filename is missing
die "usage: $0 file1 file2\n" unless defined $filename1 and defined $filename2;

#create a hash table to store the lines
my %temphash;

#load up the second file, so later we can overwrite this with
#the details in the first file.

#open the second file
open(my $file2, '<', $filename2) or die "could not open $filename2 for reading: $!\n";

#load the second file
while (<$file2>) {

        #regex out the field id (everything up to the first whitespace).
        #skip lines with no id, e.g. blank lines, where the match fails
        #and $1 would be stale or undefined
        next unless /^(\S+)/;

        #and put it in the hash
        $temphash{$1} = $_;

}

#close the second file
close($file2);


#now, overlay the data with the first file

#open the first file
open(my $file1, '<', $filename1) or die "could not open $filename1 for reading: $!\n";

#load the first file
while (<$file1>) {

        #regex out the field id, skipping lines that have none
        next unless /^(\S+)/;

        #and put it into the hash, overwriting any existing lines
        #from the second file
        $temphash{$1} = $_;

}

#and now close the first file
close($file1);


#now, for each row we have (ids sorted as plain text)
foreach my $row ( sort(keys(%temphash)) ) {

        #echo the line to screen
        print $temphash{$row};

}

#all done, yay!

Since this post is in the newbie section, I'll assume you've never used Perl before and break down what you need to do to get this script running on your machine. If you're experienced with Perl, feel free to skip ahead.

1. Make sure you actually have Perl installed (you should, it's a standard tool). To verify, run the following command on the command line:

Code:

perl -v

If you see Perl telling you about itself, you're good to go. If you see anything resembling "command not found", stop right here and install Perl.

2. Copy the above code into a plain text file, everything from "#!/usr/bin/perl" right down to "#all done, yay!". Save the file as "combine.pl".

3. chmod the file to make it executable, so you can run the program, using the following command:

Code:

chmod +x ./combine.pl

4. Now you have combine.pl ready to run. Run the script as follows:

Code:

./combine.pl file1.txt file2.txt

The script will read in both files and output a combined list of entries as described in your post.

Important Note: The script loads both files into memory, so be cautious if your files are enormous, as it may run out of memory. Make sure your files are comfortably smaller than the available memory of your machine (e.g. if you have 2GB of memory, make sure your files are smaller than 1.5GB). Perl's hash adds some overhead on top of the raw file size, so leave yourself some headroom.
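
If memory does turn out to be a problem, here is a rough sketch of one way around it (not part of combine.pl, just an idea): keep only the first file in a hash, then stream the second file and pass through the lines whose id the first file doesn't have. The sub name merge_streaming and the sample data are made up for illustration. Two caveats: the output is not sorted, and duplicate ids inside the second file are not collapsed the way combine.pl collapses them.

Code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical helper (not part of combine.pl): only the lines from
# $in1 are ever held in memory; $in2 is read one line at a time
sub merge_streaming {
    my ($in1, $in2, $out) = @_;

    # load the first file into a hash keyed by id
    my %first;
    while (my $line = <$in1>) {
        next unless $line =~ /^(\S+)/;
        $first{$1} = $line;
    }

    # stream the second file, printing only ids the first file lacks
    while (my $line = <$in2>) {
        next unless $line =~ /^(\S+)/;
        print {$out} $line unless exists $first{$1};
    }

    # finally print everything from the first file (in no particular order)
    print {$out} values %first;
    return;
}

# made-up sample data, held in strings so the sketch runs on its own;
# with real files you would open($in1, '<', $filename1) etc. instead
my $data1 = "#1 John Smith\n#2 Jane Doe\n";
my $data2 = "#2 Jane D.\n#3 Lord Fnord\n";

open(my $in1, '<', \$data1) or die "could not open in-memory data: $!\n";
open(my $in2, '<', \$data2) or die "could not open in-memory data: $!\n";

merge_streaming($in1, $in2, \*STDOUT);
```

Since the lines come out unsorted, you would pipe the result through the standard sort command (or sort it some other way) if the order matters.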

Hope this solves your need. If you need to tweak it a bit, feel free.
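
One tweak you might want: the 1,10,2,3 ordering comes from sort() comparing the ids as plain text. Below is a minimal sketch of a number-aware sort. I'm assuming here that ids of the form "#" followed by digits should come first in numeric order, and that anything else (like J-4) goes after them in plain text order; adjust to taste if your ids work differently. The sub names id_key and sort_ids are made up for this example.

Code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical helper: turn an id into a sort key.
# "#<digits>" ids get [0, number, id]; anything else gets [1, 0, id]
# (assumption: non-numeric ids sort last, in plain text order)
sub id_key {
    my ($id) = @_;
    return $id =~ /^#(\d+)$/ ? [0, $1, $id] : [1, 0, $id];
}

# return the ids sorted by group, then by number, then as text
sub sort_ids {
    my @ids = @_;
    return map  { $_->[2] }
           sort {    $a->[0] <=> $b->[0]
                  or $a->[1] <=> $b->[1]
                  or $a->[2] cmp $b->[2] }
           map  { id_key($_) } @ids;
}

my @sorted = sort_ids('#1', '#10', '#2', '#3', 'J-4');
print join(' ', @sorted), "\n";   # prints: #1 #2 #3 #10 J-4
```

To use it in combine.pl, you would paste the two subs into the script and change the final loop to read: foreach my $row ( sort_ids(keys(%temphash)) )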

Have fun :)