Hi all,
I have a directory full of amino acid sequence files formatted like this:
Code:
>14424|LGIG|134818
MEIVTTDSVSIVETPTWKRPLELLNS
>14424|NVEC|199771
MADEGSGFSDCSSRQYLNPLQLFEEK
>14424|CAP|112268
MTTYKRPLELLEGDSLVKICAPMVRY
>14424|HROB|107587
MSVEENKFKNSMELFESKDVVKICAP
>14424|Euprymna_scolopes_pep.fas|ESCO_227709683_c_s
QKMEHAGASWVTVHGRTPEQRGEPVN
I want to write a script that will go through and count the number of occurrences of particular strings associated with the taxon names and if there are more than X sequences for a particular taxon, delete the entire file.
Here is the code I have so far using X=1 for this example:
Code:
for FileName in *.fa
do
grep -c ESCO_ $FileName >> taxon_count.txt
grep -c CAP| $FileName >> taxon_count.txt
grep -c LGIG| $FileName >> taxon_count.txt
grep -c NVEC| $FileName >> taxon_count.txt
grep -c HROB| $FileName >> taxon_count.txt
#??? somehow grep for any integer larger than 1 in taxon_count.txt and assign the largest value recovered in the variable $sequences_per_taxon
if [ "$sequences_per_taxon" -gt 1 ] ; then
printf "More than one sequence per taxon in $FileName"
mv $FileName ./rejected/
fi
done
Can anyone help me with the missing step?
Thanks!
Kevin