LinuxQuestions.org
Old 09-24-2008, 03:52 PM   #1
smudge|lala
Member
 
Registered: Jan 2004
Location: UK & Canada
Distribution: Sabayon | Mint | Ulteo
Posts: 145
Thanked: 0
BASH out duplicates from multiple text files


Hi all, I am forever dealing with vast lists, typically exported to text files with one record per line (records separated by carriage returns).

I use a Python script to create the files I need, but because of the sizes produced I have to split them into manageable files; otherwise the text editor crashes or I run out of memory (or both).

I have been reading about the awk, sort and uniq commands as a way to filter out any duplicate lines that may have been produced across the files, but I do not know how to apply them, because I want to work with the file set as if it were one file, not several.

Can anyone assist me with my crunching problem with a few handy commands or a small script?

Many thanks in advance.
Old 09-24-2008, 05:59 PM   #2
Matir
Moderator
 
Registered: Nov 2004
Location: Atlanta, GA
Distribution: Ubuntu
Posts: 8,347
Thanked: 13
How big are these files? If they're HUGE, you may be best off with a dedicated app to handle this (or, more likely, a real database engine). Why open records like this in a text editor?

The following would work, if you have a reasonable amount of memory (assumes your files are named set#.txt):
Code:
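# sort groups identical lines together; uniq then drops the adjacent repeats
cat set*.txt | sort | uniq
That prints the sorted contents of all the files with duplicate lines removed. The input files are not changed, so redirect the output to a new file if you want to keep it.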
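
For reference, here is a minimal sketch of two common variants of the same idea. The file names and sample records are made up for the demo, and it assumes the records are plain ASCII or 8-bit text with one record per line. `sort -u` does the sort and the de-duplication in one step, while the awk one-liner keeps the first copy of each line in its original order, at the cost of holding every distinct line in memory.

```shell
#!/bin/sh
# Demo data standing in for the set#.txt files (contents are made up).
workdir=$(mktemp -d)
printf 'pear\napple\npear\n' > "$workdir/set1.txt"
printf 'apple\ncherry\n' > "$workdir/set2.txt"

# Option 1: sort reads every file as one stream; -u keeps one copy of each line.
# LC_ALL=C compares raw bytes, which suits ASCII / 8-bit records.
LC_ALL=C sort -u "$workdir"/set*.txt > "$workdir/unique_sorted.txt"

# Option 2: awk keeps the first occurrence of each line in its original order.
# It holds every distinct line in memory, so it only suits sets that fit in RAM.
awk '!seen[$0]++' "$workdir"/set*.txt > "$workdir/unique_ordered.txt"

cat "$workdir/unique_sorted.txt"
cat "$workdir/unique_ordered.txt"
```

If the files came from a system that ends lines with a carriage return plus newline, two otherwise identical records can differ by that trailing character, so it may be worth normalising line endings first.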
Old 09-24-2008, 06:55 PM   #3
smudge|lala
Member
 
Registered: Jan 2004
Location: UK & Canada
Distribution: Sabayon | Mint | Ulteo
Posts: 145
Thanked: 0

Original Poster

Thank you Matir, I shall try this and see if it works. I would like to use a database system and have toyed with Postgres, but given the number of records and how often they change, I'm not sure of the best way to handle them.

The records are typically ASCII or extended 8-bit text, from 1 character to any length, but usually no more than 32. I am using text files while I get a better understanding of databases, compression and speed of access, and look for good tutorials on Postgres, which seem to be few. Don't get me wrong, I like MySQL, but for this and future projects I lean heavily towards Postgres. I also have no idea how to load files of 1 MB to 2 TB on the fly (with relative ease), so you see my boggle!

I have looked at SQLite and the Firefox plugin to work with the data, but again, I can't find any good tutorials on using it!

Last edited by smudge|lala; 09-24-2008 at 07:00 PM..
Old 09-24-2008, 08:51 PM   #4
chrism01
Guru
 
Registered: Aug 2004
Location: Brisbane
Distribution: Centos 5.4
Posts: 7,419
Thanked: 325
If you're serious about 2 TB files, I'd use Python for the file manipulation, since you already know it (from your OP), or learn Perl if you're open to suggestions. It's good at that stuff.
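
As a shell-only alternative for sets too large to sort in one go, one possible sketch is to sort each chunk on its own and then merge the sorted chunks. The file names and sample records below are made up, and this is only an illustration of the sort-then-merge idea, not a tuned solution for terabyte-scale data.

```shell
#!/bin/sh
# Demo data standing in for the real chunks (names and contents are made up).
workdir=$(mktemp -d)
printf 'delta\nalpha\ndelta\n' > "$workdir/set1.txt"
printf 'charlie\nalpha\nbravo\n' > "$workdir/set2.txt"

# Step 1: sort and de-duplicate each chunk on its own, one file at a time.
for f in "$workdir"/set*.txt; do
    LC_ALL=C sort -u "$f" > "$f.sorted"
done

# Step 2: -m merges inputs that are already sorted without re-sorting them;
# -u drops lines that appear in more than one chunk.
LC_ALL=C sort -m -u "$workdir"/set*.txt.sorted > "$workdir/merged.txt"

cat "$workdir/merged.txt"
```

The merge step only needs to look at the current line of each input, so it should stay light on memory however large the chunks are. The merged file can then be cut back into editor-sized pieces with `split -l`.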

All times are GMT -5.
