LinuxQuestions.org
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Old 12-01-2009, 09:05 PM   #1
TheBigH
LQ Newbie
 
Registered: Dec 2009
Posts: 14

Rep: Reputation: 0
Merging files and removing near-duplicates


I have two files that I would like to merge while removing duplicate entries. The files are text files of columns of data separated by an arbitrary number of spaces, and the first column is the identity of the record. That is, if two rows have the same first entry they refer to the same thing, even if the rest of the row is different (and it generally will be). What I would like to do is append the entries of the second file onto the first, but only if the same entry does not already occur in the first file.

So for instance, if my files were

#1 John Smith
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus

and

#1 Purple Monkey Dishwasher
#10 Greg
J-4 Kris Kringel

I would want the combined file to be

#1 John Smith
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus
#10 Greg

Is there a quick and painless way to do this?

Thanks,
The Big H
 
Old 12-02-2009, 06:01 AM   #2
zidane_tribal
Member
 
Registered: Apr 2005
Location: chained to my console.
Distribution: LFS 6.1
Posts: 143

Rep: Reputation: 18
Rub a little perl on it....

Yes, yes there is

rub a little perl on it. perl is designed to handle just this sort of "difficult" problem, doing funky things with files of text.

i am assuming that the following output would be acceptable. the files are combined as you describe, but the output order isn't exactly as in your post (1,10,2,3 instead of 1,2,3,10) because the ids are sorted as plain text. it should be close enough to import into a database, etc.

Code:
zidane@greymist:~/dev/lq$ ./combine.pl file1.txt file2.txt
#1 John Smith
#10 Greg
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus
zidane@greymist:~/dev/lq$
if this is ok, then you're in luck.

the perl script is as follows (don't panic, the actual script is much smaller than it looks here. i put lots of whitespace and comments into my code so people can see what is happening)

Code:
#!/usr/bin/perl
use strict;
use warnings;

#get the filenames
my $filename1 = shift;
my $filename2 = shift;

#create a hash table to store the lines
my %temphash;

#load up the second file, so later we can overwrite this with
#the details in the first file.

#open the second file
open(my $file2, '<', $filename2) or die "could not open $filename2 for reading: $!\n";

#load the second file
while (<$file2>) {
	
	#regex out the field id, skipping any line that does not start with one
	next unless /^(\S+)/;
	
	#and put it in the hash
	$temphash{$1} = $_;
	
}

#close the second file
close($file2);



#now, overlay the data with the first file

#open the first file
open(my $file1, '<', $filename1) or die "could not open $filename1 for reading: $!\n";

#load the first file
while (<$file1>) {
	
	#regex out the field id, skipping any line that does not start with one
	next unless /^(\S+)/;
	
	#and put it into the hash, overwriting any existing lines
	#from the second file
	$temphash{$1} = $_;
	
}

#and now close the first file
close($file1);


#now, for each row we have
foreach my $row ( sort(keys(%temphash)) ) {
	
	#echo the line to screen
	print $temphash{$row};
	
}

#all done, yay!
Since this post is in the newbie section, i'll assume you've never used perl before and break down what you need to do to get this script running on your machine. If you're experienced with perl, feel free to skip ahead.

1. make sure you actually have perl installed (you should, it's a standard tool). To verify you have perl installed, on the command line, run the following command:
Code:
perl -v
if you see perl telling you about itself, you're good to go. if you see anything resembling "command not found", stop right here and install perl.


2. copy the above code into a plain text file, everything from "#!/usr/bin/perl" right down to "#all done, yay!". save the file as "combine.pl"


3. chmod the file to make it executable, so you can run the program, using the following command:
Code:
chmod +x ./combine.pl

4. now you have combine.pl ready to run. you should run the script as follows
Code:
./combine.pl file1.txt file2.txt
the script will read in both files, and output a combined list of entries as described in your post.


Important Note: The script loads the lines of both files into memory. if your files are enormous, you should be cautious, as it may run out of memory. make sure your files are comfortably smaller than the available memory of your machine (e.g. if you have 2GB of memory, make sure your files are smaller than 1.5GB).
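If memory or the output order is a concern, one alternative is to remember only the ids rather than whole lines. The following is a minimal sketch using awk, not the script above. It assumes the id is the first whitespace-separated field, and `merge_by_id` is a hypothetical helper name. The sample data is copied from the first post.

```shell
#!/bin/sh
# Sketch: keep every row of the first file, then append only those rows of
# the second file whose first field has not been seen yet.
# merge_by_id is a hypothetical helper name, not something from the thread.
merge_by_id() {
    awk '
        # NR == FNR holds only while awk is reading the first file named
        NR == FNR { seen[$1] = 1; print; next }
        # second file: print a row only if its id is new, then remember it
        !($1 in seen) { seen[$1] = 1; print }
    ' "$1" "$2"
}

# Sample data from the first post, written to a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' '#1 John Smith' '#2 Jane Doe' '#3 Lord Fnord' 'J-4 Santa Claus' > "$tmpdir/file1.txt"
printf '%s\n' '#1 Purple Monkey Dishwasher' '#10 Greg' 'J-4 Kris Kringel' > "$tmpdir/file2.txt"

merged=$(merge_by_id "$tmpdir/file1.txt" "$tmpdir/file2.txt")
printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

Because rows are printed as they are read, the first file keeps its original order and the new rows from the second file follow it, so "#10 Greg" ends up last as in the original request.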

Hope this solves your need. If you need to tweak it a bit, feel free.

Have fun
 
Old 12-02-2009, 08:31 AM   #3
berbae
Member
 
Registered: Jul 2005
Location: France
Distribution: Arch Linux
Posts: 540

Rep: Reputation: Disabled
In bash, if file1 is the first file and file2 the second:
Code:
sort file1 >filesorted1
sort file2 >filesorted2
cat filesorted1 >file3 && join -v 2 filesorted1 filesorted2 >>file3
cat file3
#1 John Smith
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus
#10 Greg
The join command looks in file2 for lines whose first field has no match in file1 and outputs those lines.
file1 and file2 have to be sorted beforehand.

To me it seems a quick and painless way to do this.
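join expects both inputs to be sorted on the join field in the same order that join itself uses for comparison. The sketch below makes that explicit by pinning the locale and sorting on the first field only. The `LC_ALL=C` setting and the scratch file names are assumptions for illustration; the thread does not mention locale. Note also that join writes the fields it prints separated by single spaces, so any wider column spacing in those lines is not preserved.

```shell
#!/bin/sh
# Pin the collation so that sort and join agree on ordering.
# (Assumption: byte-wise C ordering is acceptable for these ids.)
LC_ALL=C
export LC_ALL

# Sample data from the first post, written to a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' '#1 John Smith' '#2 Jane Doe' '#3 Lord Fnord' 'J-4 Santa Claus' > "$tmpdir/file1.txt"
printf '%s\n' '#1 Purple Monkey Dishwasher' '#10 Greg' 'J-4 Kris Kringel' > "$tmpdir/file2.txt"

# Sort each input on its first field, which is the field join compares.
sort -k1,1 "$tmpdir/file1.txt" > "$tmpdir/sorted1"
sort -k1,1 "$tmpdir/file2.txt" > "$tmpdir/sorted2"

# -v 2 prints only the lines of the second input that have no partner
# in the first input.
only_in_second=$(join -v 2 "$tmpdir/sorted1" "$tmpdir/sorted2")
printf '%s\n' "$only_in_second"
rm -rf "$tmpdir"
```

With the sample data this prints the single line "#10 Greg", which is the only record that would be appended to the first file.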

Last edited by berbae; 12-02-2009 at 03:55 PM.
 
Old 12-02-2009, 04:24 PM   #4
TheBigH
LQ Newbie
 
Registered: Dec 2009
Posts: 14

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by zidane_tribal

Hope this solves your need. If you need to tweak it a bit, feel free.

Have fun
Woohoo! It works perfectly; the only thing I need to do is remove any leading spaces in the input files.
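Removing leading whitespace can be done before running the script. This is a minimal sketch using sed; `strip_leading_space` is a hypothetical helper name and the sample rows are made up for illustration.

```shell
#!/bin/sh
# Sketch: print a file with leading spaces and tabs removed from every line.
# strip_leading_space is a hypothetical helper name.
strip_leading_space() {
    sed 's/^[[:space:]]*//' "$1"
}

# Illustrative input: one row indented with spaces, one with a tab.
tmpdir=$(mktemp -d)
printf '  #1 John Smith\n#2 Jane Doe\n\t#3 Lord Fnord\n' > "$tmpdir/file1.txt"

cleaned=$(strip_leading_space "$tmpdir/file1.txt")
printf '%s\n' "$cleaned"
rm -rf "$tmpdir"
```

Redirecting the output to a new file (rather than overwriting the input in place) keeps the original data available in case something goes wrong.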

Thank you very much.

H.
 
  

