BASH out duplicates from multiple text files

smudge|lala · 09-24-2008, 02:52 PM

Hi all, I am forever dealing with vast lists typically exported to text files with records split by carriage return.

I use a Python script to create the files I need, but because of the sizes produced I have to split them into manageable files; otherwise the text editor crashes or I run out of memory (or both).

I have been reading about the awk, sort and uniq commands as a way to filter out any duplicate lines that may have been produced across the files, but I do not know how to apply them, because I want to work with the file set as if it were one file, not several.

Can anyone assist me with my crunching problem with a few handy commands or a small script?

Many thanks in advance.

Matir · 09-24-2008, 04:59 PM

How big are these files? If they're HUGE, you may be best off with a dedicated app to handle this (or, more likely, a real database engine). Why open records like this in a text editor?

The following would work, if you have a reasonable amount of memory (assumes your files are named set#.txt):

Code:

sort -u set*.txt

That will remove duplicate lines across all of the files. Note that the output comes back sorted, not in the original order.
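
Since the original poster already works in Python, the same idea can be sketched there as well. This is only a minimal sketch, not anyone's posted method: the function name unique_lines is made up for illustration, the latin-1 encoding is an assumption chosen because it accepts any 8-bit byte, and set*.txt is the naming assumed in the post above. It keeps every distinct line in memory, so it only suits file sets whose unique lines fit in RAM, but unlike the sort approach it preserves first-seen order.

```python
import glob


def unique_lines(paths, encoding="latin-1"):
    """Yield each distinct line once, in first-seen order, across all files.

    Assumption: latin-1 is used because it maps every 8-bit byte to a character.
    """
    seen = set()  # every distinct record is held in memory
    for path in paths:
        with open(path, encoding=encoding) as handle:
            for line in handle:
                record = line.rstrip("\n")
                if record not in seen:
                    seen.add(record)
                    yield record


if __name__ == "__main__":
    # set*.txt is the file naming assumed earlier in the thread.
    for record in unique_lines(sorted(glob.glob("set*.txt"))):
        print(record)
```

Redirect the printed output to a new file to get a single de-duplicated list.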

smudge|lala · 09-24-2008, 05:55 PM

Thank you Matir, I shall try this and see if it works. I would like to use a database system and have toyed with Postgres, but given the number of records and how often they change, I'm not sure of the best way to handle them.

The records are typically ASCII or extended 8-bit, from 1 character to any length, but usually no more than 32. I am sticking with text files while I get a better understanding of databases, compression and speed of access, and while I look for good tutorials on Postgres, which seem to be few. Don't get me wrong, I like MySQL, but for this and future projects I lean heavily towards Postgres. I also have no idea how to load files of 1 MB to 2 TB on the fly (with relative ease), so you see my boggle!

I have looked at SQLite and the Firefox plugin to work with the data, but again, I can't find any good tutorials on using it!
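
As a starting point for the SQLite route, here is a minimal sketch using Python's built-in sqlite3 module. The table name records, the column name line, the function name load_unique and the latin-1 encoding are all assumptions made for illustration. The idea is to make the line text the primary key and let INSERT OR IGNORE skip anything already stored, so the de-duplication happens on disk instead of in memory.

```python
import sqlite3


def load_unique(db_path, paths, encoding="latin-1"):
    """Load every line of every file into SQLite and return the distinct count.

    Assumptions: table "records", column "line", latin-1 input encoding.
    """
    conn = sqlite3.connect(db_path)
    try:
        # The primary key makes each stored line unique.
        conn.execute("CREATE TABLE IF NOT EXISTS records (line TEXT PRIMARY KEY)")
        for path in paths:
            with open(path, encoding=encoding) as handle:
                rows = ((line.rstrip("\n"),) for line in handle)
                # INSERT OR IGNORE silently skips lines that are already present.
                conn.executemany(
                    "INSERT OR IGNORE INTO records (line) VALUES (?)", rows
                )
            conn.commit()  # one transaction per input file
        (count,) = conn.execute("SELECT COUNT(*) FROM records").fetchone()
        return count
    finally:
        conn.close()
```

Because the table persists in the database file, later exports can be loaded into the same database and only the new lines will be added.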

chrism01 · 09-24-2008, 07:51 PM

If you're serious about 2TB files, I'd use Python to do the file manipulation, since you already know it (from your OP), or learn Perl if you're open to suggestions. It's good at that stuff.
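
For file sets too large to hold in memory, one possible Python approach is an external merge: sort small chunks in memory, write each to a temporary file, then merge the sorted chunks and drop adjacent duplicates. The sketch below is illustrative only; the function names, the chunk_lines default and the latin-1 encoding are assumptions, and it opens every chunk file at once, so a very large input would need a bigger chunk size or a multi-pass merge to stay under the open-file limit.

```python
import heapq
import itertools
import os
import tempfile


def _write_sorted_chunk(lines, directory, encoding):
    """Sort one chunk in memory (dropping its own duplicates) and save it."""
    fd, path = tempfile.mkstemp(suffix=".chunk", dir=directory)
    with os.fdopen(fd, "w", encoding=encoding) as handle:
        handle.writelines(sorted(set(lines)))
    return path


def dedupe_external(paths, out_path, chunk_lines=1_000_000, encoding="latin-1"):
    """Write the distinct lines of all input files to out_path, in sorted order.

    Assumptions: chunk_lines lines fit comfortably in memory; latin-1 input.
    """
    with tempfile.TemporaryDirectory() as workdir:
        chunks = []
        for path in paths:
            with open(path, encoding=encoding) as handle:
                # Give every record a trailing newline so comparisons are uniform.
                normalized = (line.rstrip("\n") + "\n" for line in handle)
                while True:
                    batch = list(itertools.islice(normalized, chunk_lines))
                    if not batch:
                        break
                    chunks.append(_write_sorted_chunk(batch, workdir, encoding))
        handles = [open(chunk, encoding=encoding) for chunk in chunks]
        try:
            with open(out_path, "w", encoding=encoding) as out:
                previous = None
                # heapq.merge reads the sorted chunks lazily, line by line.
                for line in heapq.merge(*handles):
                    if line != previous:  # duplicates are adjacent once sorted
                        out.write(line)
                        previous = line
        finally:
            for handle in handles:
                handle.close()
```

Memory use is bounded by chunk_lines rather than by the total size of the input, at the cost of writing the data to disk twice.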