LinuxQuestions.org
LinuxQuestions.org > Forums > Linux Forums > Linux - General

09-24-2008, 02:52 PM   #1
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Rep: Reputation: 16
BASH out duplicates from multiple text files


Hi all, I am forever dealing with vast lists typically exported to text files with records split by carriage return.

I use a Python script to create the files I need, but because of the sizes produced I have to split them into manageable files, or the text editor crashes or I run out of memory (or both).

I have been reading about the awk, sort and uniq commands as a way to filter out any duplicate lines that may have been produced across the files, but I do not know how to apply them, because I wish to work with the file set as if it were one file, not several.

Can anyone assist me with my crunching problem with a few handy commands or a small script?

Many thanks in advance.
 
09-24-2008, 04:59 PM   #2
Matir
LQ Guru
 
Registered: Nov 2004
Location: San Jose, CA
Distribution: Debian, Arch
Posts: 8,507

Rep: Reputation: 128
How big are these files? If they're HUGE, you may be best off with a dedicated app to handle this (or, more likely, a real database engine). Why open records like this in a text editor?

The following would work, if you have a reasonable amount of memory (assumes your files are named set#.txt):
Code:
cat set*.txt | sort | uniq
That will remove duplicate lines.
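
As a rough sketch of the same idea: sort can read the files directly, and its -u option does the work of the separate uniq step. The awk one-liner is a common alternative that keeps the first occurrence of each line in its original order, at the cost of holding every distinct line in memory. The function names are made up for illustration, and the file names follow the set#.txt assumption above.

```shell
#!/usr/bin/env bash
# Sketch: de-duplicate lines across several files as if they were one file.
# Function names are hypothetical; file names assume the set#.txt pattern.

# Sorted output with duplicates removed.
# LC_ALL=C compares raw bytes, so 8-bit data is not affected by the locale.
dedupe_sorted() {
    LC_ALL=C sort -u -- "$@"
}

# Original order preserved, first occurrence of each line kept.
# awk keeps every distinct line in memory, so this suits smaller sets.
dedupe_keep_order() {
    awk '!seen[$0]++' "$@"
}
```

Usage would look like `dedupe_sorted set*.txt > unique.txt`. Writing to a new file rather than back onto one of the inputs avoids truncating data that is still being read.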
 
09-24-2008, 05:55 PM   #3
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Original Poster
Rep: Reputation: 16

Thank you Matir, I shall try this and see if it works. I would like to use a database system and have toyed with Postgres, but given the number of records and how often they change, I'm not sure of the best way to handle them.

The records are typically ASCII or extended 8-bit text, from 1 character to any length but typically no more than 32. I am sticking with text files while I get a better understanding of databases, compression and speed of access, and while I look for good tutorials on Postgres, which seem to be few. Don't get me wrong, I like MySQL, but for this and future projects I lean heavily towards Postgres. I also have no idea how to load files of anywhere from 1 MB to 2 TB on the fly (with relative ease), so you see my boggle!

I have looked at SQLite and the Firefox plugin to work with the data, but again, I can't find any good tutorials on using it!

Last edited by smudge|lala; 09-24-2008 at 06:00 PM.
 
09-24-2008, 07:51 PM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,356

Rep: Reputation: 2751
If you're serious about 2 TB files, I'd use Python to do the file manipulation if you already know it (as your OP suggests), or learn Perl if you're open to suggestions. It's good at that stuff.
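
For what it's worth, sort is not limited by RAM: GNU sort spills to temporary files and merges them, so it can handle inputs much larger than memory given enough scratch disk. A hedged sketch, assuming GNU sort (-S and -T are GNU options) and using a hypothetical function name and hypothetical file names, that removes duplicates across the whole set and writes the result back out in manageable pieces:

```shell
#!/usr/bin/env bash
# Sketch: de-duplicate a file set that may be larger than RAM, then re-split
# the unique lines into smaller files. Assumes GNU sort for -S and -T.
# dedupe_and_split and all argument names are hypothetical.

# Usage: dedupe_and_split OUT_PREFIX LINES_PER_FILE TMP_DIR FILE...
dedupe_and_split() {
    out_prefix=$1
    lines_per_file=$2
    tmp_dir=$3
    shift 3
    # -S caps sort's in-memory buffer; -T says where its temporary merge
    # files go (that directory needs scratch space roughly the size of the
    # input). LC_ALL=C compares raw bytes, which suits 8-bit data.
    LC_ALL=C sort -u -S 512M -T "$tmp_dir" -- "$@" |
        # split names the pieces OUT_PREFIXaa, OUT_PREFIXab, and so on.
        split -l "$lines_per_file" - "$out_prefix"
}
```

For example, `dedupe_and_split unique_ 1000000 /var/tmp set*.txt` would leave files named unique_aa, unique_ab, and so on, each holding at most a million unique lines. The 512M buffer and the line count are placeholders to tune for the machine.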
 
  



Tags
bash, command, duplicate, files, script, text



All times are GMT -5.