Hi all,
I'm a biologist working with alignments of amino acid sequences from different species and trying to get rid of sequences that are poorly aligned.
Here's an example alignment:
Code:
>ARIC@m.6424
GIFAIVATCMSFVQIAQHFFHQTHHPSQKRIMRILAMVPVYGATSWISILFFQSAIYMDFIKACFEAYIIYCFLLLLTKYLHERIILPFPCCCLKVRP-NVRWVWYFKIGLLQYSWIAPLCSGIAVVLNLADVYANGEWKFNRGYITIIINISQTLALYSLVTFYTNAKGELRPFQPLPKFIVIKLIVFFIFWQSVLMSGLAAIH----------------------------------------
>BPLI@10003771
----MMAIPISLWGILQHLVNYTQPNLQRHIIRILWMVPIYAINAWFALRFPSASIYLDTLRECYEAYVIYNFMAYLLNYLKEQVKHICPFCCFPPWQMKYSFIDRCKHGALQYTIVRPVTTCIALVCQLNGAYNEGDFDFKSAYLTIINNISQIWAMYCLVLFYKAMKEELAPIKPIPKFLCVKFVVFFSFWQSVLIAILVKLDWIQDFLICIEMFLAAIAHYFSFSHKPMWDVSDMRDDVIEH
>DMEL@FBpp0110243
GLFVLSAVPVSIWHIIQHVIHFTKPILQKHIIRILWMVPIYALNAWIGLFFPKHSIYVDSLRECYEAYVIYNFMVYLLNYLKPQVPHFFPLCCMRPWVMGREFIHNCKHGILQYTVVRPITTFISVICELCGVYGEGEFAGNVAYIVVVNNISQFVAMYCLVLFYRANKEDLKPMKPIPKFLCIKAVVFFSFFQGVLLNVLVYYNIIQNFLICIEMFIAAVAHIYSFPHHPMMDISDMQEDVTEH
>DPUL@190611
GLFTIMAVPISLWDITQHLVHYNKPHMQKYIIRKLLLDSLKTFIAWVGLSFPNYAIYLDSCRECYEAYVIYNFMMFLLTYLKTHIHHIFPLCCLKPWPMGSELIHRCKHGILQYTIVRPLSAFISVICEINGVYAEGKFLTNVAYMIAINNLSQFVAMYHLILFYRAHREALQPMSPIGKFLCIKAVVFFSFFQGVIIAILFYTGAIQNFLICIEMFLAAVAHHFSFSYRPMWDVSDVRRDVAEH
>GEBO@m.27752
GIFTIIAVPISLWGIFNHLINYTQPHLQRYIIRILWMVPIYGMNAWFALRFPDAAIYLDTLRECYEAYVIYNFMSYLLSYLKPQQKHLIPFCCLPKWPMGNTLIQRCKHGVLQYTVVRPATTVIALACEIGGKYGEGDFNFTTAYLVIVNNISQIWAMYNLILFYKATKDELKPIKPVPKFLCVKFVVFLSFWQSVVIAALVKLNVIQDFLICIEMFIASIAHYYSFSHKPMWDVTDVRNDFIEH
>GEBO@m.27757
---------------------------------ILWMVPIYGMNAWFALRFPDAAIYLDTLRECYEAYVIYNFMSYLLSYLKPQQKHLIPFCCLPKWPMGNTLIQRCKHGVLQYTVVRPATTVIALACEIGGKYGEGDFNFTTAYLVIVNNISQIWAMYNLILFYKATKDELKPIKPVPKFLCVKFVVFLSFWQSVVIAALVKLNVIQDFLICIEMFIASIAHYYSFSHKPMWDVTDVRNDFIEH
>GPYR@m.58820
GLFVMMAIPISLWGILQHLVNYTQPHLQRYIIRILWMVPIYAMNAWFALRFPDEAIYLDTLRECYEAYVIYNFFAYLMSFLKQQVKHIIPFCCLPPWESGEKFLTHCKHGALQYTVIRPVTTIIALICALAGVYDEGHFSFKNAYIVVINNVSQIWAMYCLVLFYHSLKEELQPLKPVSKFLCVKFVVFFSFWQAVLIAILEAAGVIQDFCICVEMFIAAVAHYYSFPHGPMFNVSDLRDDVVEH
>LCAL@m.5119
-------------------------------------------------------------------------------------------------------------------PLRPGTTILALICEMAHKYDEGSFSFKSAYLTIINNISQVWALYCLVLFYKATKKELAPINPIWKFACVKFVVFFSFWQAVFISILEAAGAIQDFAICIEMFLAALAHHYSFTSQPMWDVSDVRNDVVEH
>LRUG@m.12391
GIFVLMALPISFWGILQHLVYYTQPHLQRHIIRVLWMVPIYALNAWLGLRFPEAAIFLDTLRECYEAYVIYNFMSYLLVYLKPSVKHIIPFCCLPKWQLGGSFIQRCKHGVLQYTIVRPTTTIIALICQLAGVYEEGDFSFNSAYIAIINNISQIWAMYCLVMFYRGTKEELKPINPIPKFMCVKAVVFLSFWQSVVIAALVKLGAIQDFLICIEMFLAAIAHYYSFSHKPMWDVSDVRDDVVEH
>MACR@m.7342
SLFIFVLTLCIYIEEMYFIYASDKFVSWRRRINILTTCPVLSALALLGIFFQRSSYLCDLVSGVYFSIIIFQFVSLMINYYLLPRRQPFCYLCLPPIVVNRKVIRNIKMLSLQTAIVQPILLYFSAVLYVDGLVIPGQLSLTNAYLQFLVLISTMSAMYGVLLMRDLTFPAIKEYHMLPKLLFIFAIMFITRIQPIILGFFATINFIHHYIMVFELLIMAILVRIIYRRVPEEPTSLLLSREPKS
>NGRE@m.7996
GVFAMMAIPISLWGILQHMVNYTQPNLQRHIIRILWMVPIYALNASLALRFPSSSIYLDVVRECYEAYVIYNFTCYLMNFLKPRVTHIIPFCCLPRWPMGQVYIDRCRHGVLQYTIVRPAMTITAVLCEMTNSYAEGDFNPTKGYITIINNISQIWAMYCLVMFYRATKEELSPIHPVPKFMCVKAVVFLSFWQAVFLAILSSFGVLQDFLICIEMFLAAIAHYFSFSHRPMWDISDVGEDVADH
>NSUC@m.63025
GFFVVMTIPISLWGILQHIIHYNQPVLQRHIVRVLWMVPIYAGDAWFALRFKEAAIYLDTMRECYEAYVIYNFMAYLMAFLKPQVKHLFPFCCLPSWTMGSSLINLCKHGALQYTVIRPVTTCIALITQLAGVYEEGDFSFKSAYLAIINNISQVWAMYCLILFYFATKEELQPMKPVSKFMCVKAVVFMSFWQQVLIAALVKLGVI--------------------------------------
>OVUL@m.3794
GLFVMMALPISLWGILQHLIHYTQAHLQRHIIRILWMVPIYALNAWFALRFPQAAIYLDTLRECYEAYVIYNFMSYLLSYLKSQVNHFFPFCLHPPWPMGRKFIRNCKHGVLQYTIIRPTTTVLALITALCGKYDEGEFNVDSAYLVFINNASQVWAMYNLILFYRAVKEELQPINPIPKFLCVKAVVFLSFWQAVLIAA---------------------------------------------
>PEDI@m.785
GIFVMGAVPISLWGILQHLIHYTKPHLQRHIIRILWMVPIYAVNAWFGLRFPNIAIYLDTLRECYEAYVIYNFLSYLLNFLKPQQKHLFPLCKFPPWTMGRPLIQRCRHGAMQYTVIRPVTTVIALICQLCGKYEEGNFSVHSAYLTIINNISQVWAMYCLVLFYKACKEELAPMKPVSKFLCVKFVVFMSFWQAVAIAILVEVGAIQDFCICIEMFVAEIG-----------------------
>FAKE@trash
-GIFVMGAVPISLWGILQHLIHYTKPHLQRHIIRILWMVPIYAVNAWFGLRFPNIAIYLDTLRECYEAYVIYNFLSYLLNFLKPQQKHLFPLCKFPPWTMGRPLIQRCRH-GAMQYTVIRPVTTVIALICQLCGKYEGNFSVHSAYLTIINNISQVWAMYCLVLFYKACKEELAPMKPVSKFLCVKFVVFMSFWQAVAIAILVEVGAIQDFCICIEMFVAEIG----------------------
I'm using a program (called infoalign) that gives a score to each sequence. Here's an example of its output:
Code:
ARIC@m.6424 60.000000
BPLI@10003771 17.842323
DMEL@FBpp0110243 33.469387
DPUL@190611 40.408165
GEBO@m.27752 20.816326
GEBO@m.27757 20.754717
GPYR@m.58820 26.122450
LCAL@m.5119 32.307693
LRUG@m.12391 20.000000
MACR@m.7342 79.591835
NGRE@m.7996 31.428572
NSUC@m.63025 26.086956
OVUL@m.3794 22.500000
PEDI@m.785 23.423424
FAKE@trash 95.495499
In a nutshell, I want to exclude sequences with a score >75. Note that the last sequence ("FAKE") in the above example is really bad and should be excluded.
Here's where I'm at but this script isn't deleting any sequences.
Code:
#!/bin/bash
for FileName in *.fa #The actual amino acid alignment is the .fa file
do
infoalign -only -name -change -sequence $FileName -outfile $FileName.info #Generates the .info file with scores for each sequence
unset arr
unset line
unset num
unset tag
unset whole
unset IFS
declare -a arr
while read -r line
do
set -- $line
num=$2;tag=$1
IFS="."; set -- $num
whole=$1
if [ $whole -gt 75 ];then
arr+=($tag)
fi
unset IFS
done < $FileName.info
exec 4< $FileName
while read -r line <&4
#I'm pretty sure the problem is in the below code but I don't know what I'm doing wrong.
do
case "$line" in
">"*)
flag=0
set -- $line
for i in ${arr[@]}
do
if [ "$i" = "$2" ];then
flag=1
fi
done
esac
[ $flag -eq 1 ] && continue
echo "$line" >> $FileName.trimmed
read NEXT <&4
echo "$NEXT" >> $FileName.trimmed
done
exec 4<&-
done
Any assistance/advice would be greatly appreciated!
Thanks!
Kevin