Merging files and removing near-duplicates
I have two files that I would like to merge, removing duplicate entries. The files are text files of columns of data separated by an arbitrary number of spaces, and the first column of data is the identity of the record. That is, if two rows have the same first entry they refer to the same thing even if the rest of the row is different (and it generally will be). What I would like to do is append the entries of the second file onto the first, but only if the same entry does not already occur in the first file.
So for instance, if my files were

#1 John Smith
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus

and

#1 Purple Monkey Dishwasher
#10 Greg
J-4 Kris Kringel

I would want the combined file to be

#1 John Smith
#2 Jane Doe
#3 Lord Fnord
J-4 Santa Claus
#10 Greg

Is there a quick and painless way to do this?

Thanks,
The Big H
Rub a little perl on it....
Yes, yes there is :)
Rub a little perl on it. Perl is designed to handle just this sort of "difficult" problem, doing funky things with files of text. I am assuming that the following output would be acceptable: the files are combined as you describe, but the output sorting isn't exactly as described in your post (1,10,2,3 instead of 1,2,3,10). It should be close enough to import into a database, etc.
Code:
zidane@greymist:~/dev/lq$ ./combine.pl file1.txt file2.txt

The perl script is as follows (don't panic, the actual script is much smaller than it looks here; I put lots of whitespace and comments into my code so people can see what is happening).
Code:
#!/usr/bin/perl

1. Make sure you actually have perl installed (you should, it's a standard tool). To verify, run the following command on the command line:
Code:
perl -v

2. Copy the above code into a plain text file, everything from "#!/usr/bin/perl" right down to "#all done, yay!". Save the file as "combine.pl".

3. chmod the file to make it executable, so you can run the program, using the following command:
Code:
chmod +x ./combine.pl

4. Now you have combine.pl ready to run. Run the script as follows:
Code:
./combine.pl file1.txt file2.txt

Important Note: The script loads both files into memory. If your files are enormous, be cautious, as it may run out of memory. Make sure your files are smaller than the available memory of your machine (e.g. if you have 2GB of memory, make sure your files are smaller than 1.5GB).

Hope this solves your need. If you need to tweak it a bit, feel free. Have fun :)
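For comparison, the "first file wins" rule from the question can also be sketched as an awk one-liner. This is not the perl script above, just a minimal illustration under the assumption that the record identity is the first whitespace-separated column; the function name merge_first_wins and the scratch directory are made up for the example. Since awk only remembers the keys it has seen, not whole lines, it may also be gentler on memory for large files.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the lines of the given
# files in order, skipping any line whose first column was already seen.
merge_first_wins() {
    # NF skips blank lines. seen[$1]++ evaluates to 0 only the first time
    # a key appears, so !seen[$1]++ is true for first occurrences only.
    awk 'NF && !seen[$1]++' "$@"
}

# Sample data from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '#1 John Smith' '#2 Jane Doe' '#3 Lord Fnord' 'J-4 Santa Claus' > "$tmp/file1.txt"
printf '%s\n' '#1 Purple Monkey Dishwasher' '#10 Greg' 'J-4 Kris Kringel' > "$tmp/file2.txt"

# file1 goes first so that its version of a duplicated record is kept.
merge_first_wins "$tmp/file1.txt" "$tmp/file2.txt" > "$tmp/combined.txt"
cat "$tmp/combined.txt"
```

Because nothing is sorted, the records come out in the order asked for in the question, with #10 Greg appended at the end.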
In bash, if file1 is the first file and file2 the second:
Code:
sort file1 >filesorted1

file1 and file2 have to be sorted first. To me this seems a quick and painless way to do it.
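Only the first sort command of that post is shown, so the following is a different, sort-only sketch rather than a reconstruction of it. It assumes GNU sort, where -u keeps the first line of each run of equal keys and -s keeps input order among equal keys; POSIX does not promise which duplicate survives, so check this on other systems. The function name merge_sorted_first_wins is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): sort all input lines on the
# first column and keep one line per key.
merge_sorted_first_wins() {
    # -k1,1  compare on the first whitespace-separated field only
    # -s     stable sort: equal keys stay in input order (file1 before file2)
    # -u     assumption: GNU sort outputs the first line of each equal run
    # LC_ALL=C gives plain byte ordering, independent of the user's locale
    LC_ALL=C sort -s -u -k1,1 "$@"
}

# Sample data from the question, written to a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' '#1 John Smith' '#2 Jane Doe' '#3 Lord Fnord' 'J-4 Santa Claus' > "$tmpdir/file1"
printf '%s\n' '#1 Purple Monkey Dishwasher' '#10 Greg' 'J-4 Kris Kringel' > "$tmpdir/file2"

# file1 must be listed first so its records win over file2's duplicates.
merge_sorted_first_wins "$tmpdir/file1" "$tmpdir/file2" > "$tmpdir/merged"
cat "$tmpdir/merged"
```

The output is ordered by key as text, so #10 lands between #1 and #2 rather than at the end.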
Quote:
Thank you very much.
H.