LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Go Back   LinuxQuestions.org > Forums > Other *NIX Forums > AIX
AIX This forum is for the discussion of IBM AIX.
eserver and other IBM related questions are also on topic.

Old 07-05-2006, 09:45 AM   #1
tmaxx
LQ Newbie
 
Registered: Jun 2006
Posts: 21

Rep: Reputation: 15
Sorting large text files


hi all...

i need to sort a 200MB text file containing 20 million mobile numbers, say:

9192204783
9192204766
9192204783
9192204778
9192204783
9192204766
9192204767

I've noticed that "sort" has its limitations. Can anyone suggest something using sed or awk instead? The goal is to sort the file so that duplicate entries end up next to each other.

Or can anyone suggest a script that gets rid of duplicate entries using sed?

thanks in advance
 
Old 07-05-2006, 11:54 AM   #2
macemoneta
Senior Member
 
Registered: Jan 2005
Location: Manalapan, NJ
Distribution: Fedora x86 and x86_64, Debian PPC and ARM, Android
Posts: 4,593
Blog Entries: 2

Rep: Reputation: 335
The sort command does exactly what you want. Read the man page for details on the "--unique" option.
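A minimal sketch of that suggestion, assuming GNU coreutils sort (AIX's native sort supports -u but not the long --unique spelling) and placeholder filenames:

```shell
# Build a small sample with duplicates (stand-in for the 200MB file).
printf '9192204783\n9192204766\n9192204783\n9192204778\n' > numbers.txt

# Sort and collapse duplicates in one pass; --unique keeps the
# first line of each run of equal lines.
sort --unique numbers.txt > deduped.txt
```

Dropping --unique gives a plain sort where duplicates simply end up adjacent, which is enough if you only need them "stuck together".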
 
Old 07-05-2006, 12:20 PM   #3
Blinker_Fluid
Member
 
Registered: Jul 2003
Location: Clinging to my guns and religion.
Posts: 683

Rep: Reputation: 63
You could also split the file so you're dealing with smaller chunks (see `man split' for syntax).
If it were me, I would split the file into about 4 chunks, run sort -u on each chunk, cat the files back together, split again at a different size, and repeat until you're happy.
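A sketch of that split/sort approach, with one tweak: since each chunk comes out already sorted, GNU sort's -m can merge them directly instead of cat-ing and re-sorting (filenames and chunk size are illustrative):

```shell
# Toy stand-in for the big file: unsorted, with duplicates.
printf '5\n3\n5\n1\n4\n2\n4\n' > big.txt

# Split into 3-line chunks named chunk_aa, chunk_ab, ...
split -l 3 big.txt chunk_

# Sort each chunk in place, dropping duplicates within a chunk
# (-o with the same file is safe: sort reads all input first).
for f in chunk_*; do sort -u -o "$f" "$f"; done

# Merge the pre-sorted chunks; -m skips the full re-sort and
# -u drops duplicates that straddle chunk boundaries.
sort -m -u chunk_* > merged.txt
```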
 
Old 07-05-2006, 12:56 PM   #4
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 77
Quote:
Originally Posted by tmaxx
hi all...

i need to sort a 200MB text file containing 20 million mobile numbers, say:

9192204783
9192204766
9192204783
9192204778
9192204783
9192204766
9192204767

I've noticed that "sort" has its limitations. Can anyone suggest something using sed or awk instead? The goal is to sort the file so that duplicate entries end up next to each other.

Or can anyone suggest a script that gets rid of duplicate entries using sed?

thanks in advance
What limitations of `sort' did you run into? Plain sort will make the duplicates stick together; you can then pipe that through `uniq' to remove them.
 
Old 07-06-2006, 10:20 AM   #5
tmaxx
LQ Newbie
 
Registered: Jun 2006
Posts: 21

Original Poster
Rep: Reputation: 15
My machine is an IBM p690 and I have a test partition on it. The limit seems to be around 200K records for sort. Anyway, thanks for the suggestion; I'll try it out again.
 
Old 07-06-2006, 12:22 PM   #6
macemoneta
Senior Member
 
Registered: Jan 2005
Location: Manalapan, NJ
Distribution: Fedora x86 and x86_64, Debian PPC and ARM, Android
Posts: 4,593
Blog Entries: 2

Rep: Reputation: 335
Quote:
Originally Posted by tmaxx
The limit seems to be around 200K records for sort.
There is no such limit for sort. I'm currently sorting a file with 94+ million 10-byte records on my laptop.

Quote:
Originally Posted by Blinker_Fluid
You could split the file so you are dealing with smaller chunks also.
There is no reason to do that. Sort automatically splits large files into sorted subsets (in the /tmp directory) and then merges them. Otherwise the entire file would have to fit in memory, which would not be practical except for very small files.
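That automatic split-and-merge can be steered when the input is huge: GNU sort's -T picks the directory for the intermediate runs (useful when /tmp is small) and -S sets the in-memory buffer size before it spills to disk. A small runnable sketch with illustrative sizes and filenames:

```shell
# Toy input standing in for the 200MB number list.
printf '10\n2\n10\n33\n2\n' > numbers.txt

# -T: put intermediate sorted runs under /tmp (point this at a
#     filesystem with room to spare for a really big input).
# -S: use up to a 64M in-memory buffer before spilling to disk.
sort --numeric-sort --unique -T /tmp -S 64M -o sorted.txt numbers.txt
```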
 
Old 07-08-2006, 12:12 PM   #7
tmaxx
LQ Newbie
 
Registered: Jun 2006
Posts: 21

Original Poster
Rep: Reputation: 15
Can anyone give me the syntax to do it using sort?
 
Old 07-08-2006, 12:18 PM   #8
macemoneta
Senior Member
 
Registered: Jan 2005
Location: Manalapan, NJ
Distribution: Fedora x86 and x86_64, Debian PPC and ARM, Android
Posts: 4,593
Blog Entries: 2

Rep: Reputation: 335
The 'man sort' command gives you the documentation. For example:

sort --numeric-sort --unique --output=outputfile.txt inputfile.txt
 
Old 07-12-2006, 02:44 AM   #9
AbrahamJose
Member
 
Registered: Feb 2006
Location: India
Posts: 167

Rep: Reputation: 31
sort -u

sort -u will get rid of the duplicates.
 
Old 07-12-2006, 07:51 AM   #10
macemoneta
Senior Member
 
Registered: Jan 2005
Location: Manalapan, NJ
Distribution: Fedora x86 and x86_64, Debian PPC and ARM, Android
Posts: 4,593
Blog Entries: 2

Rep: Reputation: 335
The '-u' and the '--unique' options are the same. From 'man sort':

-u, --unique

with -c, check for strict ordering; without -c, output only the first of an equal run
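Both halves of that description can be checked directly; in this sketch the duplicate 'b' is collapsed by plain -u but makes the -c check fail (GNU sort reports the offending line and exits non-zero):

```shell
printf 'a\nb\nb\nc\n' > run.txt

# Without -c: only the first of each equal run is output.
sort -u run.txt

# With -c, -u demands *strict* ordering; the repeated 'b' fails it.
if sort -c -u run.txt 2>/dev/null; then
    echo "strictly ordered"
else
    echo "duplicate or disorder found"
fi
```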
 
Old 10-01-2007, 10:11 AM   #11
xushi
Senior Member
 
Registered: Jun 2003
Location: UK
Distribution: Gentoo
Posts: 1,288

Rep: Reputation: 45
Guys, you just saved me a lot of time!

I have a file with 14,000 email addresses, one per line. I had to remove the duplicates and was starting to sweat when they told me I had to use awk..

The following command did it in just 3 seconds.

Code:
sort -f --unique --output=outputfile.txt emails.txt
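The -f in that command folds case during comparison, so combined with --unique, addresses differing only in capitalization collapse into one line (which capitalization survives is not guaranteed). A sketch with made-up addresses:

```shell
printf 'Alice@example.com\nalice@example.com\nbob@example.com\n' > emails.txt

# -f compares case-insensitively; --unique then treats the two
# Alice variants as duplicates and keeps just one of them.
sort -f --unique emails.txt > deduped.txt
```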
 
Old 10-01-2007, 07:56 PM   #12
DukeSSD
Member
 
Registered: Sep 2007
Posts: 90

Rep: Reputation: 20
macemoneta,
"There is no such limit for sort. I'm currently sorting a file with 94+ million 10-byte records on my laptop."
You run AIX on your laptop, HOW?

Last edited by DukeSSD; 02-19-2009 at 08:06 PM.
 
Old 10-01-2007, 08:06 PM   #13
macemoneta
Senior Member
 
Registered: Jan 2005
Location: Manalapan, NJ
Distribution: Fedora x86 and x86_64, Debian PPC and ARM, Android
Posts: 4,593
Blog Entries: 2

Rep: Reputation: 335
I don't run AIX on my laptop, but sort is a GNU utility. You can run AIX on a laptop, if you are interested in doing so.
 
Old 10-04-2007, 03:44 AM   #14
AbrahamJose
Member
 
Registered: Feb 2006
Location: India
Posts: 167

Rep: Reputation: 31
no confusion

sort -u -n filename
will do the job
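Worth noting why the -n is there: plain sort compares text, so once numbers have different lengths, "9" sorts after "10". For the fixed-length 10-digit numbers in this thread the two orders happen to coincide, but -n is the safer habit. A quick illustration (LC_ALL=C pins down the text ordering):

```shell
export LC_ALL=C
printf '9\n10\n2\n10\n' > nums.txt

sort -u nums.txt       # text order: 10, 2, 9
sort -u -n nums.txt    # numeric order: 2, 9, 10
```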
 
Old 02-19-2009, 06:32 PM   #15
virtualx
LQ Newbie
 
Registered: Oct 2005
Posts: 22

Rep: Reputation: 15
Thank you!

Wow, that was easy and fast! Thank you!

I had a file that was too long for Excel or OpenOffice.org Calc, so it wouldn't open.

Spotfire would open and sort the file, but any text file it exported put the rows back in the original order.

I just wanted to sort, not remove entries, so I dropped the --unique option.
 
  

