LinuxQuestions.org
Visit Jeremy's Blog.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 08-27-2013, 05:20 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126

Rep: Reputation: 15
Question Need help with shell script involving arrays and for/do/etc. loops


Hi all,

I'm a biologist working with alignments of amino acid sequences from different species and trying to get rid of sequences that are poorly aligned.

Here's an example alignment:
Code:
>ARIC@m.6424
GIFAIVATCMSFVQIAQHFFHQTHHPSQKRIMRILAMVPVYGATSWISILFFQSAIYMDFIKACFEAYIIYCFLLLLTKYLHERIILPFPCCCLKVRP-NVRWVWYFKIGLLQYSWIAPLCSGIAVVLNLADVYANGEWKFNRGYITIIINISQTLALYSLVTFYTNAKGELRPFQPLPKFIVIKLIVFFIFWQSVLMSGLAAIH----------------------------------------
>BPLI@10003771
----MMAIPISLWGILQHLVNYTQPNLQRHIIRILWMVPIYAINAWFALRFPSASIYLDTLRECYEAYVIYNFMAYLLNYLKEQVKHICPFCCFPPWQMKYSFIDRCKHGALQYTIVRPVTTCIALVCQLNGAYNEGDFDFKSAYLTIINNISQIWAMYCLVLFYKAMKEELAPIKPIPKFLCVKFVVFFSFWQSVLIAILVKLDWIQDFLICIEMFLAAIAHYFSFSHKPMWDVSDMRDDVIEH
>DMEL@FBpp0110243
GLFVLSAVPVSIWHIIQHVIHFTKPILQKHIIRILWMVPIYALNAWIGLFFPKHSIYVDSLRECYEAYVIYNFMVYLLNYLKPQVPHFFPLCCMRPWVMGREFIHNCKHGILQYTVVRPITTFISVICELCGVYGEGEFAGNVAYIVVVNNISQFVAMYCLVLFYRANKEDLKPMKPIPKFLCIKAVVFFSFFQGVLLNVLVYYNIIQNFLICIEMFIAAVAHIYSFPHHPMMDISDMQEDVTEH
>DPUL@190611
GLFTIMAVPISLWDITQHLVHYNKPHMQKYIIRKLLLDSLKTFIAWVGLSFPNYAIYLDSCRECYEAYVIYNFMMFLLTYLKTHIHHIFPLCCLKPWPMGSELIHRCKHGILQYTIVRPLSAFISVICEINGVYAEGKFLTNVAYMIAINNLSQFVAMYHLILFYRAHREALQPMSPIGKFLCIKAVVFFSFFQGVIIAILFYTGAIQNFLICIEMFLAAVAHHFSFSYRPMWDVSDVRRDVAEH
>GEBO@m.27752
GIFTIIAVPISLWGIFNHLINYTQPHLQRYIIRILWMVPIYGMNAWFALRFPDAAIYLDTLRECYEAYVIYNFMSYLLSYLKPQQKHLIPFCCLPKWPMGNTLIQRCKHGVLQYTVVRPATTVIALACEIGGKYGEGDFNFTTAYLVIVNNISQIWAMYNLILFYKATKDELKPIKPVPKFLCVKFVVFLSFWQSVVIAALVKLNVIQDFLICIEMFIASIAHYYSFSHKPMWDVTDVRNDFIEH
>GEBO@m.27757
---------------------------------ILWMVPIYGMNAWFALRFPDAAIYLDTLRECYEAYVIYNFMSYLLSYLKPQQKHLIPFCCLPKWPMGNTLIQRCKHGVLQYTVVRPATTVIALACEIGGKYGEGDFNFTTAYLVIVNNISQIWAMYNLILFYKATKDELKPIKPVPKFLCVKFVVFLSFWQSVVIAALVKLNVIQDFLICIEMFIASIAHYYSFSHKPMWDVTDVRNDFIEH
>GPYR@m.58820
GLFVMMAIPISLWGILQHLVNYTQPHLQRYIIRILWMVPIYAMNAWFALRFPDEAIYLDTLRECYEAYVIYNFFAYLMSFLKQQVKHIIPFCCLPPWESGEKFLTHCKHGALQYTVIRPVTTIIALICALAGVYDEGHFSFKNAYIVVINNVSQIWAMYCLVLFYHSLKEELQPLKPVSKFLCVKFVVFFSFWQAVLIAILEAAGVIQDFCICVEMFIAAVAHYYSFPHGPMFNVSDLRDDVVEH
>LCAL@m.5119
-------------------------------------------------------------------------------------------------------------------PLRPGTTILALICEMAHKYDEGSFSFKSAYLTIINNISQVWALYCLVLFYKATKKELAPINPIWKFACVKFVVFFSFWQAVFISILEAAGAIQDFAICIEMFLAALAHHYSFTSQPMWDVSDVRNDVVEH
>LRUG@m.12391
GIFVLMALPISFWGILQHLVYYTQPHLQRHIIRVLWMVPIYALNAWLGLRFPEAAIFLDTLRECYEAYVIYNFMSYLLVYLKPSVKHIIPFCCLPKWQLGGSFIQRCKHGVLQYTIVRPTTTIIALICQLAGVYEEGDFSFNSAYIAIINNISQIWAMYCLVMFYRGTKEELKPINPIPKFMCVKAVVFLSFWQSVVIAALVKLGAIQDFLICIEMFLAAIAHYYSFSHKPMWDVSDVRDDVVEH
>MACR@m.7342
SLFIFVLTLCIYIEEMYFIYASDKFVSWRRRINILTTCPVLSALALLGIFFQRSSYLCDLVSGVYFSIIIFQFVSLMINYYLLPRRQPFCYLCLPPIVVNRKVIRNIKMLSLQTAIVQPILLYFSAVLYVDGLVIPGQLSLTNAYLQFLVLISTMSAMYGVLLMRDLTFPAIKEYHMLPKLLFIFAIMFITRIQPIILGFFATINFIHHYIMVFELLIMAILVRIIYRRVPEEPTSLLLSREPKS
>NGRE@m.7996
GVFAMMAIPISLWGILQHMVNYTQPNLQRHIIRILWMVPIYALNASLALRFPSSSIYLDVVRECYEAYVIYNFTCYLMNFLKPRVTHIIPFCCLPRWPMGQVYIDRCRHGVLQYTIVRPAMTITAVLCEMTNSYAEGDFNPTKGYITIINNISQIWAMYCLVMFYRATKEELSPIHPVPKFMCVKAVVFLSFWQAVFLAILSSFGVLQDFLICIEMFLAAIAHYFSFSHRPMWDISDVGEDVADH
>NSUC@m.63025
GFFVVMTIPISLWGILQHIIHYNQPVLQRHIVRVLWMVPIYAGDAWFALRFKEAAIYLDTMRECYEAYVIYNFMAYLMAFLKPQVKHLFPFCCLPSWTMGSSLINLCKHGALQYTVIRPVTTCIALITQLAGVYEEGDFSFKSAYLAIINNISQVWAMYCLILFYFATKEELQPMKPVSKFMCVKAVVFMSFWQQVLIAALVKLGVI--------------------------------------
>OVUL@m.3794
GLFVMMALPISLWGILQHLIHYTQAHLQRHIIRILWMVPIYALNAWFALRFPQAAIYLDTLRECYEAYVIYNFMSYLLSYLKSQVNHFFPFCLHPPWPMGRKFIRNCKHGVLQYTIIRPTTTVLALITALCGKYDEGEFNVDSAYLVFINNASQVWAMYNLILFYRAVKEELQPINPIPKFLCVKAVVFLSFWQAVLIAA---------------------------------------------
>PEDI@m.785
GIFVMGAVPISLWGILQHLIHYTKPHLQRHIIRILWMVPIYAVNAWFGLRFPNIAIYLDTLRECYEAYVIYNFLSYLLNFLKPQQKHLFPLCKFPPWTMGRPLIQRCRHGAMQYTVIRPVTTVIALICQLCGKYEEGNFSVHSAYLTIINNISQVWAMYCLVLFYKACKEELAPMKPVSKFLCVKFVVFMSFWQAVAIAILVEVGAIQDFCICIEMFVAEIG-----------------------
>FAKE@trash
-GIFVMGAVPISLWGILQHLIHYTKPHLQRHIIRILWMVPIYAVNAWFGLRFPNIAIYLDTLRECYEAYVIYNFLSYLLNFLKPQQKHLFPLCKFPPWTMGRPLIQRCRH-GAMQYTVIRPVTTVIALICQLCGKYEGNFSVHSAYLTIINNISQVWAMYCLVLFYKACKEELAPMKPVSKFLCVKFVVFMSFWQAVAIAILVEVGAIQDFCICIEMFVAEIG----------------------
I'm using a program (called infoalign) that gives a score to each sequence. Here's an example of its output:
Code:
ARIC@m.6424   60.000000
BPLI@10003771 17.842323
DMEL@FBpp0110243	33.469387
DPUL@190611   40.408165
GEBO@m.27752  20.816326
GEBO@m.27757  20.754717
GPYR@m.58820  26.122450
LCAL@m.5119   32.307693
LRUG@m.12391  20.000000
MACR@m.7342   79.591835
NGRE@m.7996   31.428572
NSUC@m.63025  26.086956
OVUL@m.3794   22.500000
PEDI@m.785    23.423424
FAKE@trash    95.495499
In a nutshell, I want to exclude sequences with a score >75. Note that the last sequence ("FAKE") in the above example is really bad and should be excluded.

Here's where I'm at but this script isn't deleting any sequences.

Code:
#!/bin/bash
for FileName in *.fa #The actual amino acid alignment is the .fa file
do
infoalign -only -name -change -sequence $FileName -outfile $FileName.info #Generates the .info file with scores for each sequence
unset arr
unset line
unset num
unset tag
unset whole
unset IFS
declare -a arr
while read -r line
do
    set -- $line
    num=$2;tag=$1
    IFS="."; set -- $num
    whole=$1
    if [ $whole -gt 75 ];then
        arr+=($tag)
    fi
    unset IFS
done < $FileName.info
exec 4< $FileName
while read -r line <&4

#I'm pretty sure the problem is in the below code but I don't know what I'm doing wrong.

do
    case "$line" in
        ">"*)
            flag=0
            set -- $line
            for i in ${arr[@]}
            do
                if [ "$i" = "$2" ];then
                    flag=1
                fi
            done
    esac
    [ $flag -eq 1 ] && continue
    echo "$line" >> $FileName.trimmed
    read NEXT <&4
    echo "$NEXT" >> $FileName.trimmed
done
exec 4<&-
done
Any assistance/advice would be greatly appreciated!

Thanks!
Kevin
 
Old 08-27-2013, 05:37 PM   #2
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,912

Rep: Reputation: 1513Reputation: 1513Reputation: 1513Reputation: 1513Reputation: 1513Reputation: 1513Reputation: 1513Reputation: 1513Reputation: 1513Reputation: 1513Reputation: 1513
This might be off the wall, but why not use awk:

awk '{if ($NF <= 75) print $1;}' inputfile >output_file

The $NF is the value of the last field, and could have been $2 if there are only two fields in the file.

This would reduce your script to something like:
Code:
for FileName in *.fa; do
    infoalign -only -name -change -sequence $FileName -outfile $FileName.info
    awk '{if ($NF <= 75) print $1;}' ${FileName}.info >>${fileName}.trimmed
done
If infoalign can output a stream to stdout, then it can be piped directly into awk (assuming you don't need the .info file afterwards). If you just need a single ".trimmed" file as a result then

Code:
for FileName in *.fa; do
    infoalign -only -name -change -sequence $FileName -outfile $FileName.info
    awk '{if ($NF <= 75) print $1;}' ${FileName}.info
done >>output.trimmed
would do.

Last edited by jpollard; 08-27-2013 at 05:50 PM. Reason: stupid typo (I left out the "for")
 
Old 08-28-2013, 10:33 AM   #3
kmkocot
Member
 
Registered: Dec 2007
Location: Tuscaloosa, AL
Posts: 126

Original Poster
Rep: Reputation: 15
Thanks jpollard but I still need a step to go from having a list of the sequences I want to producing a new file containing only those sequences. Once I saw the file outside of an array, I remembered a program called select_contigs that can help with that step.

Here's the code I came up with that seems to be working well:
Code:
#Remove highly divergent sequences using the EMBOSS program infoalign
echo "Removing highly divergent sequences using the EMBOSS program infoalign..."
for FILENAME in *.fa
do
ORTHOLOGY_GROUP=`echo $FILENAME | cut -d . -f 1`
infoalign -only -name -change -sequence $FILENAME -outfile $ORTHOLOGY_GROUP".info"
awk '{if ($NF <= 75) print $1;}' $ORTHOLOGY_GROUP".info" >> $ORTHOLOGY_GROUP".list" #Change the value from 75 here to whatever is needed
select_contigs.pl -e -p -l 9999 -n $ORTHOLOGY_GROUP".list" $FILENAME $FILENAME.reduced
done
mkdir infoalign_files
mv *.info ./infoalign_files/
mv *.list ./infoalign_files/
mv *.fa ./infoalign_files/
rename 's/.fa.reduced/.fa/g' *.fa.reduced
echo Done
echo
Thanks!
Kevin
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] Shell script defining for loops of parameters of separate program LiquoriceHats Linux - Newbie 13 07-24-2013 06:52 AM
arrays in shell script himu3118 Programming 5 04-13-2010 07:16 AM
using of arrays in loops relikwie Programming 8 09-05-2009 11:56 AM
using of arrays in loops relikwie Programming 1 09-03-2009 11:23 AM
Brian cramp with shell script for loops SheldonPlankton Programming 4 07-16-2004 06:45 AM

LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie

All times are GMT -5. The time now is 08:24 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration