LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 09-09-2008, 11:12 AM   #1
bioinformatics_guy
Member
 
Registered: Aug 2008
Posts: 54

Rep: Reputation: 15
Note for the faint of heart -- sort command


I have a text file of form:

Strep_excerpt_1900_100_15_164_145/1 Strep_excerpt_1900_100 15 + 150 18 99 81 63 0 0 1 0 35 GaGGAGTGCTGGAACGCATAGAAGtGAgAatGCcG ?2??????????????????????4?<5?'9??/?
Strep_excerpt_1900_100_18_167_365/2 Strep_excerpt_1900_100 18 + 150 18 99 57 57 1 10 0 1 35 GAGTGCTGGAACGcaTAGAAGTGtgAAtgCCggta ??????<??????.+????????+7??&0<?7;**
Strep_excerpt_1900_100_21_170_3bb/1 Strep_excerpt_1900_100 21 + 150 18 99 61 61 1 5 0 1 35 TGCTGGAACGCATAGAAGtGAtAATGCCggtAtgA ????????<??????<?<9<?&??????:$9?/;?
Strep_excerpt_1900_100_22_171_3d0/1 Strep_excerpt_1900_100 22 + 150 18 99 84 78 0 0 1 0 35 GCTGGAACGCATAGAAgTGAGaAtGCCGGTaTGAG ????????????????7?<??:?:<?????:??<?
Strep_excerpt_1900_100_23_172_8b/1 Strep_excerpt_1900_100 23 + 150 18 99 81 81 0 0 1 0 35 CTGGAAcGCATAGAAGTGAGAATGCCGGTAtgaGT ??????7<???<??????>>???>?????</85??
Strep_excerpt_1900_100_25_174_11c/2 Strep_excerpt_1900_100 25 + 150 18 99 78 78 0 0 1 0 35 GGAACGCATAGAAGTGagAATGCCGGTatgagTAG ?????????<?????<0.?????>>??+,9/;???
Strep_excerpt_1900_100_26_175_15b/1 Strep_excerpt_1900_100 26 + 150 18 99 78 78 0 0 1 0 35 GaACGCaTAGaAGTGAGAAtGCCGGtaTGAGtAGc ?:????9???:???????<.?>???27>?<<;?=&

etc. -- all on the same line, tab delimited. What I want to do is sort the file by the first field (e.g. Strep_excerpt_1900_100_15_164_145/1). I am sorting because each entry, like the ones just posted, has a matching pair with the same identifier but a 1 or 2 at the end (*/[1,2]), and I'd like them to go together. I was thinking of doing this in perl but the memory requirement would be ridiculous and it would run ever so slowly. Is there a quick sort command that will do something like this?
 
Old 09-09-2008, 11:15 AM   #2
bioinformatics_guy
Member
 
Registered: Aug 2008
Posts: 54

Original Poster
Rep: Reputation: 15
I also cannot spell; it's "Not for the faint of heart".

But I wanted to clarify: note that the whole string is on a single line, right up until the next identifier, so you basically have 16 columns per line, delimited by tabs.

So each line starts with Strep_excerpt_1900_100_* and goes on from there.
 
Old 09-09-2008, 11:37 AM   #3
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
Here am I, faint of heart, expecting a helpful note. Now I'm scared! :-)

The file doesn't easily fit in memory then? Hmm.

Well, you could use a script to split the file by TAB, write the result to a temporary file, sort that file with the sort command, and then re-construct the tab-delimited structure.

sort will use a lot of disk space, but it should do better than a Perl script sucking it all up into memory if the data set is too large.
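
A minimal sketch of that idea, assuming the whole file really were one physical line with every field separated by a TAB and a fixed number of fields per record (16 according to post #2). The file names and the three-field toy input are made up for illustration: tr turns each TAB into a newline, and paste glues every N lines back into one tab-delimited record before sort sees them.

```shell
# Toy input: two records of 3 fields each, all on ONE line, TAB-separated.
# (Real data would have 16 fields per record; use 16 "-" arguments below.)
workdir=$(mktemp -d)
printf 'id_b/1\tref\tGGG\tid_a/1\tref\tAAA\n' > "$workdir/oneline.txt"

# 1. tr:    every TAB becomes a newline, so each field sits on its own line.
# 2. paste: each "-" reads the next line of stdin, so three of them rebuild
#           one 3-field record per output line, joined by TABs (the default).
# 3. sort:  now sees ordinary newline-terminated records.
tr '\t' '\n' < "$workdir/oneline.txt" | paste - - - | LC_ALL=C sort > "$workdir/sorted.txt"

cat "$workdir/sorted.txt"
```

Because this is a pipeline, nothing has to hold the whole file in memory except sort itself, which can fall back on temporary files.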
 
Old 09-09-2008, 11:38 AM   #4
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
Maybe I'm missing something, but why not simply...?
Code:
sort file > newfile
Edit: ok, I got the point looking at the post by matthewg!

Last edited by colucix; 09-09-2008 at 11:40 AM.
 
Old 09-09-2008, 11:39 AM   #5
CRC123
Member
 
Registered: Aug 2008
Distribution: opensuse, RHEL
Posts: 374
Blog Entries: 1

Rep: Reputation: 32
Quote:
Originally Posted by bioinformatics_guy
I was thinking of doing this in perl but the memory requirement would be ridiculous and it would run ever so slowly. Is there a quick sort command that will do something like this?
Perl is interpreted, as is bash, and depending on how large your file is, bash would take a similar amount of memory to Perl, possibly more.

For clarification: is the entire file all on one line, or just each entry starting with Strep and ending with /? Your post would lead us to believe that there are multiple lines. Try using the QUOTE or CODE formatting buttons on the LQ post window.
 
Old 09-09-2008, 09:35 PM   #6
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,240

Rep: Reputation: 2324
Actually, Perl is NOT interpreted the way bash is. The actual perl prog (not the one you write) reads your prog and basically compiles it on the fly and runs the 'compiled' version.
For the gory details :
http://www.perl.com/doc/FMTEYEWTK/comp-vs-interp.html
The net effect is about 80-90% speed of C.

OP qn: I agree, it's not clear where the end of record is.
Does each rec start with 'Strep_excerp..' and end with a newline, or have you got multiple 'logical' (Strep_excerp) recs on each physical line?
The other cols are tab separated, I understand.
As mentioned, please use the CODE tags option.
 
Old 09-10-2008, 06:28 AM   #7
bioinformatics_guy
Member
 
Registered: Aug 2008
Posts: 54

Original Poster
Rep: Reputation: 15
That is correct, each line starts with "Strep_excerpt...." and ends with a newline, so there are hundreds of lines basically.

In essence:

Strep_12234/1 Strep 35 4545 2356 AAAAAAAA
Strep_13434/2 Strep 34 4535 3456 ATATATAT
Strep_96849/1 Strep 45 9595 8372 AAAATAAA
...

The first Strep tag comes in pairs, with a /1 or /2 at the end marking the two members of the pair. I want to sort so that the /1 is followed by the /2 of the same Strep tag.

Any thoughts?

Also where is the code button?
 
Old 09-10-2008, 07:48 AM   #8
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
Ah! I thought there were no new lines in the file. In that case, why not just use sort?
Code:
sort infile > outfile
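
Plain sort compares whole lines, which already puts the /1 and /2 records of a pair next to each other because they share everything up to the slash. A slightly more explicit sketch, assuming tab-delimited input and made-up sample data in the shape shown in post #7, restricts the key to the first column and pins the locale so the byte order is predictable:

```shell
workdir=$(mktemp -d)
TAB=$(printf '\t')

# Hypothetical sample records (tab-delimited), deliberately out of order.
printf 'Strep_96849/1\tStrep\t45\nStrep_12234/2\tStrep\t34\nStrep_96849/2\tStrep\t46\nStrep_12234/1\tStrep\t35\n' > "$workdir/infile"

# -t sets the field separator to TAB; -k1,1 sorts on the first field only.
# LC_ALL=C forces plain byte ordering, so "/1" always sorts before "/2".
LC_ALL=C sort -t "$TAB" -k1,1 "$workdir/infile" > "$workdir/outfile"

cat "$workdir/outfile"
```

Note that this is a text sort, so numeric parts of the identifier are compared character by character; that is fine for bringing pairs together, but it is not a numeric ordering.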
 
Old 09-10-2008, 07:56 AM   #9
bioinformatics_guy
Member
 
Registered: Aug 2008
Posts: 54

Original Poster
Rep: Reputation: 15
Holy..... how is it that easy? For some reason I thought it was going to be much harder -- like a regexp or something
 
Old 09-10-2008, 09:19 AM   #10
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1976
You made me think it was a memory problem due to the huge file size. I suggested the same solution in post #4.
 
Old 09-10-2008, 11:34 AM   #11
bioinformatics_guy
Member
 
Registered: Aug 2008
Posts: 54

Original Poster
Rep: Reputation: 15
The problem is, it will be a memory problem when I take it to production, but right now it's fine.

In production... I'd have to sort a few hundred million lines, which might take a while. I'm looking at some merge sort algorithms now.
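
For what it's worth, GNU sort is itself an external merge sort: when the input outgrows its buffer it writes sorted runs to temporary files and merges them, so disk space in the temp directory tends to be the limit rather than RAM. On a GNU system, -S sets the buffer size and -T the temp directory. The same thing can be spelled out by hand with split and sort -m; the sketch below uses an invented 6-line file and a chunk size of 2 lines purely for illustration.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'id_5/1' 'id_2/2' 'id_9/1' 'id_2/1' 'id_5/2' 'id_9/2' > "$workdir/big.txt"

# 1. Split into chunks small enough to sort comfortably (2 lines here;
#    millions of lines per chunk would be more realistic).
split -l 2 "$workdir/big.txt" "$workdir/chunk_"

# 2. Sort each chunk on its own.
for f in "$workdir"/chunk_*; do
    LC_ALL=C sort "$f" > "$f.sorted"
done

# 3. Merge the sorted chunks. -m only merges, it never re-sorts, so it
#    only needs a small part of each input in memory at a time.
LC_ALL=C sort -m "$workdir"/chunk_*.sorted > "$workdir/merged.txt"

cat "$workdir/merged.txt"
```

The same locale has to be used for the per-chunk sorts and the merge, otherwise the chunks may not be in the order the merge expects.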
 
Old 09-10-2008, 11:55 AM   #12
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 727

Rep: Reputation: 74
Hi.
Quote:
Originally Posted by bioinformatics_guy
... Also where is the code button?
It's the # just above the editing window. So you select the text, then click the symbol, and you get:
Code:
something like this
cheers, makyo
 
Old 09-10-2008, 12:40 PM   #13
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 670
Quote:
Strep_12234/1 Strep 35 4545 2356 AAAAAAAA
Strep_13434/2 Strep 34 4535 3456 ATATATAT
Strep_96849/1 Strep 45 9595 8372 AAAATAAA
Using just sort on millions of lines may cause an out-of-memory error, or caching to the swap partition.

What does the number before the /1 or /2 signify, and do they appear in random locations? You may be able to exploit the fact that there are two of them in some way. Another possibility is splitting up the original file into separate files based on the number before the slash and then sorting the temporary slices before cat'ing them together.

Looking at your first example, the entries look presorted. If all /1 lines are presorted and all /2 lines are presorted, but the two types of lines together aren't, you could filter all /1 lines into one stream and all /2 lines into another, and then merge the streams together with "sort -m". I don't think you would have the same memory problem. Each of the two inputs needs to be presorted, so the program doesn't need to hold all of the lines in memory. As soon as a line from both streams is greater than the previous line read in, the previous line can be printed out and the memory reclaimed.
Code:
sed -n '/[^/]*\/1/w excerpt_1' testfile &
[2] 19150
jschiwal@hpmedia:~> sed -n '/[^/]*\/2/w excerpt_2' testfile &
[3] 19192
jschiwal@hpmedia:~> sort -m excerpt_1 excerpt_2
Strep_excerpt_1900_100_15_164_145/1 Strep_excerpt_1900_100 15 + 150 18 99 81 63 0 0 1 0 35 GaGGAGTGCTGGAACGCATAGAAGtGAgAatGCcG ?2??????????????????????4?<5?'9??/?
Strep_excerpt_1900_100_18_167_365/2 Strep_excerpt_1900_100 18 + 150 18 99 57 57 1 10 0 1 35 GAGTGCTGGAACGcaTAGAAGTGtgAAtgCCggta ??????<??????.+????????+7??&0<?7;**
Strep_excerpt_1900_100_21_170_3bb/1 Strep_excerpt_1900_100 21 + 150 18 99 61 61 1 5 0 1 35 TGCTGGAACGCATAGAAGtGAtAATGCCggtAtgA ????????<??????<?<9<?&??????:$9?/;?
Strep_excerpt_1900_100_22_171_3d0/1 Strep_excerpt_1900_100 22 + 150 18 99 84 78 0 0 1 0 35 GCTGGAACGCATAGAAgTGAGaAtGCCGGTaTGAG ????????????????7?<??:?:<?????:??<?
Strep_excerpt_1900_100_23_172_8b/1 Strep_excerpt_1900_100 23 + 150 18 99 81 81 0 0 1 0 35 CTGGAAcGCATAGAAGTGAGAATGCCGGTAtgaGT ??????7<???<??????>>???>?????</85??
Strep_excerpt_1900_100_25_174_11c/2 Strep_excerpt_1900_100 25 + 150 18 99 78 78 0 0 1 0 35 GGAACGCATAGAAGTGagAATGCCGGTatgagTAG ?????????<?????<0.?????>>??+,9/;???
Strep_excerpt_1900_100_26_175_15b/1 Strep_excerpt_1900_100 26 + 150 18 99 78 78 0 0 1 0 35 GaACGCaTAGaAGTGAGAAtGCCGGtaTGAGtAGc ?:????9???:???????<.?>???27>?<<;?=&
[2]-  Done                    sed -n '/[^/]*\/1/w excerpt_1' testfile
[3]+  Done                    sed -n '/[^/]*\/2/w excerpt_2' testfile
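
A variant of the same idea written as a script rather than an interactive session. It assumes tab-delimited input, and the sample file is invented. Two differences from the session above: the /1 or /2 test is applied to the first field only (the quality string in the last column can contain a slash too, so an unanchored pattern could in principle match there), and each half is checked with sort -c before merging, since sort -m does not re-sort inputs that are out of order.

```shell
workdir=$(mktemp -d)
TAB=$(printf '\t')

# Invented sample: the /1 lines are in order and the /2 lines are in order,
# but the file as a whole is not.
printf 'id_a/1\tx\nid_b/1\tx\nid_a/2\ty\nid_c/1\tx\nid_b/2\ty\nid_c/2\ty\n' > "$workdir/testfile"

# Route each line by the suffix of its FIRST field only.
awk -F '\t' -v d="$workdir" '
    $1 ~ /\/1$/ { print > (d "/excerpt_1"); next }
    $1 ~ /\/2$/ { print > (d "/excerpt_2") }
' "$workdir/testfile"

# sort -c exits non-zero if a file is not already sorted; stop if so.
LC_ALL=C sort -c -t "$TAB" -k1,1 "$workdir/excerpt_1" || exit 1
LC_ALL=C sort -c -t "$TAB" -k1,1 "$workdir/excerpt_2" || exit 1

# Merge the two presorted streams on the first field.
LC_ALL=C sort -m -t "$TAB" -k1,1 "$workdir/excerpt_1" "$workdir/excerpt_2" > "$workdir/paired.txt"

cat "$workdir/paired.txt"
```

If either sort -c check fails, the halves are not presorted in this locale's text order (for example, if the real file is ordered numerically by position), and a full sort is needed instead of a merge.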
 
  


All times are GMT -5.
