LinuxQuestions.org

kmkocot 06-02-2010 02:19 PM

Trying to write a script to sort pairs of lines by the length of the 2nd in the pair
 
Hey all,

I have a set of files whose contents look like this:
Code:

>GGWCJUO02HEXBU
CGACATCGCATTGTAACATCGCCACATCGAATTGTGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATATCGCCACATCGCATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGCGCCACATTGCATCGTTACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGAC
>GGWCJUO02DRGID_reverse_compliment
TGTAACATCGCCACATCGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATCGTAACATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGACATCGCCACATTGCATCGTTACATCACATCGCGACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGACCTCGCATCGTAACATTGCCACATCACATCGTAGCATCGCCGCATTGCATCAATAACAT
>GGWCJUO02ITXOV_some_annotation_goes_here
AGACTCTCATCTCACCATAACACAGTATACAACACACTGAGCTCAGACTCTCAATCTCAC
>GGWCJUO02H1N1E there may be spaces in headers
GTGGAAGCGTAGTCGATGAATTACTGGTTTATCGCTGTTATACTCGTGGGTTGAATGCAGATACACGGGAATGTCGTCGCATAATTATGTG

These are DNA sequences: each line that starts with a greater-than sign followed by some text is a header, and the line after it is the sequence associated with that header. There are never line breaks within a DNA sequence.

I'm trying to write a script to sort these files by descending sequence length. Does anyone have any ideas how I could do this?

Thanks!
Kevin

yooy 06-02-2010 02:30 PM

http://www.linuxscrew.com/2009/04/14...y-line-length/
This may help.

There are plenty of sorting algorithms to choose from in case you have a lot of lines.
Use awk to get the first string in each line, and a substring command to remove the header.
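
A minimal sketch of the awk idea, using the common decorate / sort / undecorate pattern: awk puts the sequence length in front of each header and sequence pair, sort orders the pairs by that number in descending order, and cut and tr restore the two-line layout. The function name sort_pairs_desc is made up for illustration, and the sketch assumes the file strictly alternates header and sequence lines and contains no tab characters.

```shell
#!/usr/bin/env bash

# sort_pairs_desc (hypothetical name): read header/sequence pairs on stdin,
# write them to stdout ordered by sequence length, longest first.
# Assumes strictly alternating header and sequence lines, and no tabs in the data.
sort_pairs_desc() {
  tab=$(printf '\t')
  # Decorate: emit one line per pair as "length<TAB>header<TAB>sequence".
  awk 'NR % 2 == 1 { header = $0; next }
       { printf "%d\t%s\t%s\n", length($0), header, $0 }' |
  # Sort on the first tab-separated field only, numerically, in reverse.
  sort -t "$tab" -k1,1nr |
  # Undecorate: drop the length column, then split each pair back into two lines.
  cut -f2- |
  tr '\t' '\n'
}
```

It would be called as sort_pairs_desc < data1 > data1.sorted, where data1 is whatever the input file is named. Headers that contain spaces are fine because the tab is the only field separator.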

grail 06-02-2010 10:07 PM

What is the solution if the length is the same?
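
The question does not say what should happen with equal lengths, so the following is only a sketch of two possible choices under stated assumptions: keep tied pairs in their original input order, or order tied pairs by header text. All function names here are made up for illustration. The -s (stable) option is not in POSIX but is available in GNU and BSD sort, and the data is assumed to contain no tab characters.

```shell
#!/usr/bin/env bash

tab=$(printf '\t')

# Decorate: one line per pair, "length<TAB>header<TAB>sequence".
decorate() {
  awk 'NR % 2 == 1 { h = $0; next }
       { printf "%d\t%s\t%s\n", length($0), h, $0 }'
}

# Undecorate: drop the length column and restore the two-line layout.
undecorate() {
  cut -f2- | tr '\t' '\n'
}

# Choice 1: pairs of equal length keep their original input order.
sort_pairs_stable() {
  decorate | sort -t "$tab" -k1,1nr -s | undecorate
}

# Choice 2: pairs of equal length are ordered by header text (byte order).
sort_pairs_by_header() {
  decorate | LC_ALL=C sort -t "$tab" -k1,1nr -k2,2 | undecorate
}
```

Either function reads the pairs on stdin, for example sort_pairs_stable < data1.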

makyo 06-02-2010 11:40 PM

Hi.

The msort command is a flexible sorting program. The core of the script below turns each pair into a block by adding an empty line after it, uses the second line of the pair as the sort key, compares keys by their length, and removes the empty lines from the sorted result:
Code:

#!/usr/bin/env bash

# @(#) s1        Demonstrate block sort, key length of field, msort.
# See: http://www.billposer.org/Software/msort.html

# Infrastructure details, environment, commands for forum posts.
# Uncomment export command to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe ; pe "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
pe "(Versions displayed with local utility \"version\")"
c=$( ps | grep $$ | awk '{print $NF}' )
version >/dev/null 2>&1 && s=$(_eat $0 $1) || s=""
[ "$c" = "$s" ] && p="$s" || p="$c"
version >/dev/null 2>&1 && version "=o" $p printf specimen sed msort
set -o nounset
pe

FILE=${1-data1}

# Display sample of data file, with head & tail as a last resort.
pe " || start [ first:middle:last ]"
specimen "$FILE" \
|| { pe "(head/tail)"; head -n 5 "$FILE"; pe " ||"; tail -n 5 "$FILE"; }
pe " || end"

# Add blank line to create a block, sort, remove blank line.
# Record is a block, fields are lines, key is size of field 2.
# Note: the 2~2 address and \n in the replacement are GNU sed extensions.

pl " Results:"
sed '2~2s/$/\n/' "$FILE" |
msort -b -n2 -c"size" 2>/dev/null |
sed '/^$/d'
pe

exit 0

producing:
Code:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0
GNU bash 3.2.39
printf - is a shell builtin [bash]
specimen (local) 1.17
GNU sed version 4.1.5
msort - ( /usr/bin/msort Apr 24 2008 )

 || start [ first:middle:last ]
Whole: 5:0:5 of 8 lines in file "data1"
>GGWCJUO02HEXBU
CGACATCGCATTGTAACATCGCCACATCGAATTGTGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATATCGCCACATCGCATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGCGCCACATTGCATCGTTACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGAC
>GGWCJUO02DRGID_reverse_compliment
TGTAACATCGCCACATCGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATCGTAACATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGACATCGCCACATTGCATCGTTACATCACATCGCGACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGACCTCGCATCGTAACATTGCCACATCACATCGTAGCATCGCCGCATTGCATCAATAACAT
>GGWCJUO02ITXOV_some_annotation_goes_here
AGACTCTCATCTCACCATAACACAGTATACAACACACTGAGCTCAGACTCTCAATCTCAC
>GGWCJUO02H1N1E there may be spaces in headers
GTGGAAGCGTAGTCGATGAATTACTGGTTTATCGCTGTTATACTCGTGGGTTGAATGCAGATACACGGGAATGTCGTCGCATAATTATGTG
 || end

-----
 Results:
>GGWCJUO02ITXOV_some_annotation_goes_here
AGACTCTCATCTCACCATAACACAGTATACAACACACTGAGCTCAGACTCTCAATCTCAC
>GGWCJUO02H1N1E there may be spaces in headers
GTGGAAGCGTAGTCGATGAATTACTGGTTTATCGCTGTTATACTCGTGGGTTGAATGCAGATACACGGGAATGTCGTCGCATAATTATGTG
>GGWCJUO02HEXBU
CGACATCGCATTGTAACATCGCCACATCGAATTGTGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATATCGCCACATCGCATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGCGCCACATTGCATCGTTACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGAC
>GGWCJUO02DRGID_reverse_compliment
TGTAACATCGCCACATCGCCATCGTGACATCGCGTCGTAACATTGCATCGTAACATTGCATCGCAACATTGCATCGTAACATTGTGCCATCGTGACATCGCATCGTAACATTGCCACATCACACCGTGACATCGCCACATTGCATCGTTACATCACATCGCGACATCGCATCGTAACATCGCCACATCGCATCTTGACATCGCATCGCGACCTCGCATCGTAACATTGCCACATCACATCGTAGCATCGCCGCATTGCATCAATAACAT

msort was in my Debian repository. See the URL noted in the script for the home page and the long documentation PDF.
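
The results shown above come out in ascending order of length, while the original question asked for descending order. msort may have its own option for inverting the order, so its documentation is the place to check. As a sketch that uses only standard tools, the order of the pairs can be reversed after sorting: paste joins each pair onto one line, tac reverses the lines, and tr splits the pairs again. The function name reverse_pairs is made up for illustration, tac is assumed to be available (it is part of GNU coreutils), and the data is assumed to contain no tab characters and an even number of lines.

```shell
#!/usr/bin/env bash

# reverse_pairs (hypothetical name): reverse the order of two-line records
# read from stdin while keeping each header above its own sequence.
reverse_pairs() {
  # Join every two input lines with a tab, reverse the joined lines,
  # then turn each tab back into a newline.
  paste - - | tac | tr '\t' '\n'
}
```

With the script above saved as s1, the pipeline in its Results step could end with | reverse_pairs after the final sed to give the longest sequence first.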

Best wishes ... cheers, makyo

