LinuxQuestions.org
Old 06-02-2010, 01:19 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 98

Trying to write a script to sort pairs of lines by the length of the second line in each pair


Hey all,

I have a set of files whose contents look like this:
Code:
>GGWCJUO02HEXBU
CGACATCGCATTGTAACATCGCCACATCGAATTGTGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATATCGCCACATCGCATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGCGCCACATTGCATCGTTACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGAC
>GGWCJUO02DRGID_reverse_compliment
TGTAACATCGCCACATCGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATCGTAACATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGACATCGCCACATTGCATCGTTACATCACATCGCGACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGACCTCGCATCGTAACATTGCCACATCACATCGTAGCATCGCCGCATTGCATCAATAACAT
>GGWCJUO02ITXOV_some_annotation_goes_here
AGACTCTCATCTCACCATAACACAGTATACAACACACTGAGCTCAGACTCTCAATCTCAC
>GGWCJUO02H1N1E there may be spaces in headers
GTGGAAGCGTAGTCGATGAATTACTGGTTTATCGCTGTTATACTCGTGGGTTGAATGCAGATACACGGGAATGTCGTCGCATAATTATGTG
These are DNA sequences. Each line beginning with a greater-than sign is a header, and the line that follows it is the sequence associated with that header. The sequences never contain line breaks, so every record is exactly two lines.
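
Since everything that follows relies on that strict two-line layout, it may be worth checking it first. The function below is only a sketch: the name check_pairs is made up for illustration, and it assumes that no sequence line ever starts with a greater-than sign.

```shell
#!/usr/bin/env bash
# Sketch: verify that a file alternates header lines (starting with ">")
# and sequence lines. check_pairs is a hypothetical helper name.

check_pairs() {
    awk 'NR % 2 == 1 && !/^>/ { printf "line %d: expected a header\n", NR; bad = 1 }
         NR % 2 == 0 &&  /^>/ { printf "line %d: expected a sequence\n", NR; bad = 1 }
         END {
             # An odd line count means the last header has no sequence.
             if (NR % 2) { print "odd number of lines"; bad = 1 }
             exit bad + 0
         }' "$@"
}
```

The function reads the files named as arguments, or standard input if none are given, and returns a non-zero status if any line is out of place.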

I'm trying to write a script to sort these files by descending sequence length. Does anyone have any ideas how I could do this?

Thanks!
Kevin
 
Old 06-02-2010, 01:30 PM   #2
yooy
Senior Member
 
Registered: Dec 2009
Posts: 1,116

http://www.linuxscrew.com/2009/04/14...y-line-length/
This may help.

There are plenty of sorting algorithms to choose from in case you have a lot of lines.
Use awk to get the first string in each line; to remove the header, use a substring command.
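
One way to read the suggestion above is as a decorate-sort-undecorate pipeline: prefix each pair with the sequence length, sort on that prefix, then strip it again. The following is a minimal sketch of that idea, not the method from the linked article. The function name sort_pairs_by_length is made up, and the sketch assumes strict two-line records and no tab characters anywhere in the data.

```shell
#!/usr/bin/env bash
# Sketch: sort header/sequence pairs by descending sequence length.
# Assumes strict two-line records and no tab characters in the data.

sort_pairs_by_length() {
    # Decorate: emit "length<TAB>header<TAB>sequence" for each pair.
    awk 'NR % 2 == 1 { header = $0; next }
         { printf "%d\t%s\t%s\n", length($0), header, $0 }' "$@" |
    # Sort numerically on the length column, longest first.
    sort -t "$(printf '\t')" -k1,1nr |
    # Undecorate: drop the length, then put each field back on its own line.
    cut -f2- |
    tr '\t' '\n'
}
```

With a hypothetical input file named reads.fa, it could be called as `sort_pairs_by_length reads.fa > reads.sorted.fa`. Joining each pair onto one line is what lets the ordinary line-oriented sort keep the header and its sequence together.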
 
Old 06-02-2010, 09:07 PM   #3
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,442

What should happen when two sequences have the same length?
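
The thread does not say which rule is wanted for ties. One possible answer, sketched below under that caveat, is to keep equal-length pairs in their original input order by using the record number as a secondary sort key. The name sort_pairs_stable is made up, and the sketch again assumes two-line records with no tab characters.

```shell
#!/usr/bin/env bash
# Sketch: descending sort by sequence length, ties kept in input order.
# Assumes strict two-line records and no tab characters in the data.

sort_pairs_stable() {
    # Decorate: "length<TAB>record number<TAB>header<TAB>sequence".
    awk 'NR % 2 == 1 { header = $0; next }
         { printf "%d\t%d\t%s\t%s\n", length($0), NR / 2, header, $0 }' "$@" |
    # Primary key: length, descending. Secondary key: record number, ascending.
    sort -t "$(printf '\t')" -k1,1nr -k2,2n |
    # Undecorate: drop both keys and restore the two-line layout.
    cut -f3- |
    tr '\t' '\n'
}
```

Using an explicit secondary key avoids depending on whether the local sort offers a stable mode.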
 
Old 06-02-2010, 10:40 PM   #4
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Hi.

The msort command is a flexible sorting program. The core of the script below shows the approach: each pair is made into a block, the sort key is the second line of the pair, the comparison method is the length of the key field, and the empty lines are removed from the sorted result:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate block sort, key length of field, msort.
# See: http://www.billposer.org/Software/msort.html

# Infrastructure details, environment, commands for forum posts. 
# Uncomment export command to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe ; pe "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
pe "(Versions displayed with local utility \"version\")"
c=$( ps | grep $$ | awk '{print $NF}' )
version >/dev/null 2>&1 && s=$(_eat $0 $1) || s=""
[ "$c" = "$s" ] && p="$s" || p="$c"
version >/dev/null 2>&1 && version "=o" $p printf specimen sed msort
set -o nounset
pe

FILE=${1-data1}

# Display sample of data file, with head & tail as a last resort.
pe " || start [ first:middle:last ]"
specimen "$FILE" \
|| { pe "(head/tail)"; head -n 5 "$FILE"; pe " ||"; tail -n 5 "$FILE"; }
pe " || end"

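# Append a blank line after every second line so that each
# header/sequence pair becomes a blank-line-delimited block
# (the 2~2 address is a GNU sed extension).
# Record is a block, fields are lines, key is size of field 2.
# The last sed removes the blank lines again.

pl " Results:"
sed '2~2s/$/\n/' "$FILE" |
msort -b -n2 -c"size" 2>/dev/null |
sed '/^$/d'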
pe

exit 0
producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
printf - is a shell builtin [bash]
specimen (local) 1.17
GNU sed version 4.1.5
msort - ( /usr/bin/msort Apr 24 2008 )

 || start [ first:middle:last ]
Whole: 5:0:5 of 8 lines in file "data1"
>GGWCJUO02HEXBU
CGACATCGCATTGTAACATCGCCACATCGAATTGTGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATATCGCCACATCGCATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGCGCCACATTGCATCGTTACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGAC
>GGWCJUO02DRGID_reverse_compliment
TGTAACATCGCCACATCGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATCGTAACATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGACATCGCCACATTGCATCGTTACATCACATCGCGACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGACCTCGCATCGTAACATTGCCACATCACATCGTAGCATCGCCGCATTGCATCAATAACAT
>GGWCJUO02ITXOV_some_annotation_goes_here
AGACTCTCATCTCACCATAACACAGTATACAACACACTGAGCTCAGACTCTCAATCTCAC
>GGWCJUO02H1N1E there may be spaces in headers
GTGGAAGCGTAGTCGATGAATTACTGGTTTATCGCTGTTATACTCGTGGGTTGAATGCAGATACACGGGAATGTCGTCGCATAATTATGTG
 || end

-----
 Results:
>GGWCJUO02ITXOV_some_annotation_goes_here
AGACTCTCATCTCACCATAACACAGTATACAACACACTGAGCTCAGACTCTCAATCTCAC
>GGWCJUO02H1N1E there may be spaces in headers
GTGGAAGCGTAGTCGATGAATTACTGGTTTATCGCTGTTATACTCGTGGGTTGAATGCAGATACACGGGAATGTCGTCGCATAATTATGTG
>GGWCJUO02HEXBU
CGACATCGCATTGTAACATCGCCACATCGAATTGTGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATATCGCCACATCGCATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGCGCCACATTGCATCGTTACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGAC
>GGWCJUO02DRGID_reverse_compliment
TGTAACATCGCCACATCGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATCGTAACATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGACATCGCCACATTGCATCGTTACATCACATCGCGACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGACCTCGCATCGTAACATTGCCACATCACATCGTAGCATCGCCGCATTGCATCAATAACAT
The msort package was available in my Debian repository. See the URL noted in the script for the project home page and the long PDF documentation.
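
Note that the results shown above run from shortest to longest, while the original question asked for descending length. Rather than guess at an msort option, one hedged workaround is to reverse the order of the pairs afterwards. The sketch below does that in awk; the name reverse_pairs is made up, and it holds the whole input in memory, which may not suit very large files.

```shell
#!/usr/bin/env bash
# Sketch: reverse the order of two-line records while keeping each
# header above its own sequence. Reads the whole input into memory.

reverse_pairs() {
    awk 'NR % 2 == 1 { header[++n] = $0; next }   # odd lines: headers
         { seq[n] = $0 }                          # even lines: sequences
         END {
             for (i = n; i >= 1; i--)
                 print header[i] ORS seq[i]
         }' "$@"
}
```

It can be appended to the end of the pipeline in the script, after the final sed. Reversing also flips the relative order of any equal-length pairs.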

Best wishes ... cheers, makyo
 