09-24-2008, 03:52 PM   #1
BASH out duplicates from multiple text files

Hi all, I am forever dealing with vast lists typically exported to text files with records split by carriage return.

I use a Python script to create the files I need, but because of the sizes produced I have to split them into manageable files, or the text editor crashes or I run out of memory (or both).

I have been reading about the awk, sort and uniq commands as a way to filter out any duplicate lines that may have been produced across the files, but I do not know how to apply them, as I wish to work with the file set as if it were one file, not several.

Can anyone assist me with my crunching problem with a few handy commands or a small script?

Many thanks in advance.
09-24-2008, 05:59 PM   #2
How big are these files? If they're HUGE, you may be best off with a dedicated app to handle this (or, more likely, a real database engine). Why open records like this in a text editor?

The following would work, if you have a reasonable amount of memory (assumes your files are named set#.txt):
cat set*.txt | sort | uniq
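That will remove duplicate lines across all the files. The sort step is needed because uniq only drops duplicates that sit on adjacent lines.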
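
As a rough sketch of two common variants of the same idea: the sample files, the temporary directory and the output names unique_sorted.txt and unique_ordered.txt below are made up for illustration and are not from this thread. sort -u does the sort and uniq steps in one command, and GNU sort spills to temporary files on disk, so it can usually cope with input larger than RAM. The awk one-liner keeps lines in their original order but holds every distinct line in memory, so it only suits sets whose unique lines fit in RAM.

```shell
# Hypothetical sample data standing in for the real set#.txt files.
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/set1.txt"
printf 'beta\ndelta\nalpha\n' > "$workdir/set2.txt"

# Variant 1: treat all files as one input and keep one copy of each line.
# Output is sorted; LC_ALL=C gives plain byte-order sorting.
LC_ALL=C sort -u "$workdir"/set*.txt > "$workdir/unique_sorted.txt"

# Variant 2: keep the first occurrence of each line, in original order.
# seen[] is an in-memory table of every distinct line read so far.
awk '!seen[$0]++' "$workdir"/set*.txt > "$workdir/unique_ordered.txt"
```

To try it on real data, replace "$workdir"/set*.txt with the actual set*.txt files. If /tmp is small, GNU sort's -T option points its temporary files at another directory.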
09-24-2008, 06:55 PM   #3
Original Poster

Thank you Matir, I shall try this and see if it works. I would like to use a database system and have toyed with Postgres, but given the number of records and how often they change, I'm not sure of the best way to handle them.

The records are typically ASCII or extended 8-bit text, from 1 character up to any length but usually no more than 32. I am using text files while I get a better understanding of databases, compression and speed of access, and look for good tutorials on Postgres, which seem to be few. Don't get me wrong, I like MySQL, but for this and future projects I lean heavily towards Postgres. I also have no idea how to load files of 1 MB to 2 TB on the fly (with relative ease), so you see my boggle!

I have looked at SQLite and the Firefox plugin to work with the data, but again, I can't find any good tutorials on using it!

Last edited by smudge|lala; 09-24-2008 at 07:00 PM.
09-24-2008, 08:51 PM   #4
If you're serious about 2 TB files, I'd use Python for the file manipulation, since you already know it (from your OP), or learn Perl if you're open to suggestions. It's good at that stuff.

