Merging files and removing near-duplicates

TheBigH · 12-01-2009, 09:05 PM

I have two files that I would like to merge, removing duplicate entries. The files are text files with columns of data separated by an arbitrary number of spaces, and the first column is the identity of the record. That is, if two rows have the same first entry they refer to the same thing, even if the rest of the row is different (and it generally will be). What I would like to do is append the entries of the second file onto the first, but only if an entry with the same identity does not already occur in the first file.

So for instance, if my files were

#1 John Smith
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus

and

#1 Purple Monkey Dishwasher
#10 Greg
J-4 Kris Kringel

I would want the combined file to be

#1 John Smith
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus
#10 Greg

Is there a quick and painless way to do this?

Thanks,
The Big H

zidane_tribal · 12-02-2009, 06:01 AM

Yes, yes there is

rub a little perl on it. perl is designed to handle just this sort of "difficult" problem, doing funky things with files of text.

i am assuming that the following output would be acceptable. the files are combined as you describe, but the ids are sorted as plain text, so the order isn't exactly as in your post (1,10,2,3 instead of 1,2,3,10). it should be close enough to import into a database, etc.

Code:

zidane@greymist:~/dev/lq$ ./combine.pl file1.txt file2.txt
#1 John Smith
#10 Greg
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus
zidane@greymist:~/dev/lq$

if this is ok, then you're in luck.

the perl script is as follows (don't panic, the actual script is much smaller than it looks here; i put lots of whitespace and comments into my code so people can see what is happening)

Code:

#!/usr/bin/perl
use strict;
use warnings;

#get the filenames
my $filename1 = shift;
my $filename2 = shift;

#create a hash table to store the lines
my %temphash;

#load up the second file, so later we can overwrite this with
#the details in the first file.

#open the second file
open(FILE2, '<', $filename2) or die "could not open $filename2 for reading: $!\n";

#load the second file
while (<FILE2>) {
	
	#regex out the field id (ignoring leading spaces), skipping blank lines
	next unless /^\s*(\S+)/;
	
	#and put it in the hash
	$temphash{$1} = $_;
	
}

#close the second file
close(FILE2);



#now, overlay the data with the first file

#open the first file
open(FILE1, '<', $filename1) or die "could not open $filename1 for reading: $!\n";

#load the first file
while (<FILE1>) {
	
	#regex out the field id (ignoring leading spaces), skipping blank lines
	next unless /^\s*(\S+)/;
	
	#and put it into the hash, overwriting any line with the
	#same id that came from the second file
	$temphash{$1} = $_;
	
}

#and now close the first file
close(FILE1);


#now, for each row we have
foreach my $row ( sort(keys(%temphash)) ) {
	
	#echo the line to screen
	print $temphash{$row};
	
}

#all done, yay!

Since this post is in the newbie section, i'll assume you've never used perl before and break down what you need to do to get this script running on your machine. If you're experienced with perl, feel free to skip ahead.

1. make sure you actually have perl installed (you should, it's a standard tool). To verify you have perl installed, on the command line, run the following command:

Code:

perl -v

if you see perl telling you about itself, you're good to go. if you see anything resembling "command not found", stop right here and install perl.

2. copy the above code into a plain text file, everything from "#!/usr/bin/perl" right down to "#all done, yay!". save the file as "combine.pl"

3. chmod the file to make it executable, so you can run the program, using the following command:

Code:

chmod +x ./combine.pl

4. now you have combine.pl ready to run. you should run the script as follows

Code:

./combine.pl file1.txt file2.txt

the script will read in both files, and output a combined list of entries as described in your post.
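if you want to double-check the result, something like the following should do. it is only a sketch: it assumes you saved the output to a file called combined.txt (a made-up name, here filled with the sample data from the first post) and lists any id that shows up more than once.

```shell
#!/bin/sh
# Stand-in for the combined output (sample data from the first post).
# The scratch directory and the name combined.txt are assumptions for the demo.
workdir=$(mktemp -d)
cat > "$workdir/combined.txt" <<'EOF'
#1 John Smith
#10 Greg
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus
EOF

# Pull out the first column, then list any id that occurs more than once.
dupes=$(awk '{ print $1 }' "$workdir/combined.txt" | sort | uniq -d)

if [ -z "$dupes" ]; then
    echo "no duplicate ids"
else
    echo "duplicate ids found:"
    echo "$dupes"
fi
```

an empty list of duplicates means every id appears exactly once in the combined file.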

Important Note: the script keeps one line per unique id in memory, so in the worst case it holds nearly all of both files at once. if your files are enormous you should be cautious, as it may run out of memory. as a rough rule, make sure the combined size of your files is comfortably smaller than the available memory of your machine (e.g. if you have 2GB of memory, keep the files under about 1.5GB).
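if memory is a worry, one lighter option is to remember only the ids of the first file rather than whole lines. the following is a minimal sketch using awk instead of perl (a different technique, not the script above). the scratch directory and the name merged.txt are assumptions for the demo, and the sample data is taken from the first post.

```shell
#!/bin/sh
# Sample inputs from the first post, written to a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/file1.txt" <<'EOF'
#1 John Smith
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus
EOF
cat > "$workdir/file2.txt" <<'EOF'
#1 Purple Monkey Dishwasher
#10 Greg
J-4 Kris Kringel
EOF

# While reading the first file (NR == FNR): remember its id, print the line.
# While reading the second file: print only lines whose id was not seen.
awk 'NR == FNR { seen[$1] = 1; print; next }
     !($1 in seen)' "$workdir/file1.txt" "$workdir/file2.txt" > "$workdir/merged.txt"

cat "$workdir/merged.txt"
```

this keeps the first file in its original order and appends the new entries in the order they appear in the second file. awk splits on runs of whitespace and ignores leading blanks by default, so leading spaces in the input do not change the id.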

Hope this solves your need. If you need to tweak it a bit, feel free.

Have fun

berbae · 12-02-2009, 08:31 AM

In bash, if file1 is the first file and file2 the second:

Code:

sort file1 >filesorted1
sort file2 >filesorted2
cat filesorted1 >file3 && join -v 2 filesorted1 filesorted2 >>file3
cat file3
#1 John Smith
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus
#10 Greg

The join command looks in the sorted second file for lines whose first field has no match in the sorted first file, and outputs those lines.
Both files have to be sorted beforehand.

To me it seems a quick and painless way to do this.
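A slightly more defensive variant of the same idea, as a sketch only: it sorts on the first field alone and pins the locale, on the assumption that sort and join should agree on the ordering of the ids. It also starts file3 from the unsorted first file so the original order is kept. The scratch directory is an assumption for the demo; the file names follow the commands above.

```shell
#!/bin/sh
# Sample inputs from the first post, written to a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/file1" <<'EOF'
#1 John Smith
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus
EOF
cat > "$workdir/file2" <<'EOF'
#1 Purple Monkey Dishwasher
#10 Greg
J-4 Kris Kringel
EOF

# join expects both inputs sorted on the join field; sort on field 1 only,
# ignoring leading blanks, and use the same locale for sort and join.
LC_ALL=C sort -k 1b,1 "$workdir/file1" > "$workdir/filesorted1"
LC_ALL=C sort -k 1b,1 "$workdir/file2" > "$workdir/filesorted2"

# Start from the first file in its original order, then append the lines of
# the second file whose first field has no partner in the first file.
cat "$workdir/file1" > "$workdir/file3"
LC_ALL=C join -v 2 "$workdir/filesorted1" "$workdir/filesorted2" >> "$workdir/file3"

cat "$workdir/file3"
```

One thing to keep in mind: join writes the appended lines with a single space between fields, so any column alignment in the second file is not preserved.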

TheBigH · 12-02-2009, 04:24 PM

Quote:

Originally Posted by zidane_tribal

Hope this solves your need. If you need to tweak it a bit, feel free.

Have fun

Woohoo! It works perfectly; the only thing I need to do is remove any leading spaces in the input files.

Thank you very much.

H.
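For anyone finding this thread later, the leading-space cleanup mentioned above can be sketched with sed. This is only one way to do it; the names raw.txt and clean.txt and the sample lines are made up for the demo.

```shell
#!/bin/sh
# A small input with leading spaces and a leading tab (hypothetical data).
workdir=$(mktemp -d)
printf '  #1 John Smith\n#2 Jane Doe\n\t#3 Lord Fnord\n' > "$workdir/raw.txt"

# Delete any run of whitespace at the start of each line.
sed 's/^[[:space:]]*//' "$workdir/raw.txt" > "$workdir/clean.txt"

cat "$workdir/clean.txt"
```

The cleaned file can then be passed to the merge step in place of the original.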