I have a directory full of amino acid sequence files formatted like this:
I want to write a script that will go through and count the number of occurrences of particular strings associated with the taxon names and if there are more than X sequences for a particular taxon, delete the entire file.
Here is the code I have so far using X=1 for this example:
for FileName in *.fa
grep -c ESCO_ $FileName >> taxon_count.txt
grep -c CAP| $FileName >> taxon_count.txt
grep -c LGIG| $FileName >> taxon_count.txt
grep -c NVEC| $FileName >> taxon_count.txt
grep -c HROB| $FileName >> taxon_count.txt
#??? somehow grep for any integer larger than 1 in taxon_count.txt and assign the largest value recovered in the variable $sequences_per_taxon
if [ "$sequences_per_taxon" -gt 1 ] ; then
printf "More than one sequence per taxon in $FileName"
mv $FileName ./rejected/
Can anyone help me with the missing step?