LinuxQuestions.org - grep a file for any interger larger than X and assign largest value to a variable

- Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)

- - grep a file for any interger larger than X and assign largest value to a variable (https://www.linuxquestions.org/questions/linux-newbie-8/grep-a-file-for-any-interger-larger-than-x-and-assign-largest-value-to-a-variable-767784/)

grep a file for any interger larger than X and assign largest value to a variable

Hi all,

I have a directory full of amino acid sequence files formatted like this:

Code:

>14424|LGIG|134818

MEIVTTDSVSIVETPTWKRPLELLNS

>14424|NVEC|199771

MADEGSGFSDCSSRQYLNPLQLFEEK

>14424|CAP|112268

MTTYKRPLELLEGDSLVKICAPMVRY

>14424|HROB|107587

MSVEENKFKNSMELFESKDVVKICAP

>14424|Euprymna_scolopes_pep.fas|ESCO_227709683_c_s

QKMEHAGASWVTVHGRTPEQRGEPVN

I want to write a script that will go through and count the number of occurrences of particular strings associated with the taxon names and if there are more than X sequences for a particular taxon, delete the entire file.

Here is the code I have so far using X=1 for this example:

Code:

for FileName in *.fa

do

grep -c ESCO_ $FileName >> taxon_count.txt

grep -c CAP| $FileName >> taxon_count.txt

grep -c LGIG| $FileName >> taxon_count.txt

grep -c NVEC| $FileName >> taxon_count.txt

grep -c HROB| $FileName >> taxon_count.txt





#??? somehow grep for any integer larger than 1 in taxon_count.txt and assign the largest value recovered in the variable $sequences_per_taxon



if [ "$sequences_per_taxon" -gt 1 ] ; then

printf "More than one sequence per taxon in $FileName"

mv $FileName ./rejected/

fi

done

Can anyone help me with the missing step?

Thanks!
Kevin

Help us :}

We're not biochemists (most people here, anyway). Which part of the
strings above is your taxon. And as you seem to be slapping all matches
into the same taxon_count.txt, how do you want to go about determining
which files to actually delete?

Cheers,
Tink

use awk for parsing files. Use arrays to store your count. Its also not clear which field is your "taxon", so i assume it is. A sample:
2nd field where line starts with ">"

Code:



awk -F"|" '/^>/{ 

 taxon[$2]++

}

END{

 for(o in  taxon){

    print o,taxon[o]

 }

}

' file

Thanks for the replies.

Tinkster, I forgot to include a line deleting the taxon_count.txt file after each .fa file is processed. The taxon name abbreviations for the sequences I pasted below would be LGIG, NVEC, CAP, HROB, and ESCO respectively. The problem is the 'header line' (the line containing the greater-than sign) varies in format from taxon to taxon but maybe the best thing to do is to reformat them all before processing them with these scripts.

ghostdog74, thanks for your help. My brain is fried for tonight but tomorrow I am going to try to implement your awk suggestion into the script I have now. I may have more questions.

Thanks again to both of you.

Kevin

The problem I ran into is that all my input files are formatted differently so instead of format each one to have the taxon abbreviation in the same field (for awk), I stuck with a grep-based approach. This greps taxon abbreviations and gets rid of input files that fewer than 6 different taxa. I know its not very pretty or versatile but it gets the job done.

Code:

for FileName in *.fa

do

grep -c NEOM_ $FileName >> $FileName\.taxon_count

grep -c WARG_ $FileName >> $FileName\.taxon_count

grep -c CNIT_ $FileName >> $FileName\.taxon_count

grep -c SVEN_ $FileName >> $FileName\.taxon_count

grep -c CAPI_ $FileName >> $FileName\.taxon_count

#... others removed for brevity

grep -c CAP\| $FileName >> $FileName\.taxon_count

grep -c LGIG\| $FileName >> $FileName\.taxon_count

grep -c NVEC\| $FileName >> $FileName\.taxon_count

grep -c HROB\| $FileName >> $FileName\.taxon_count

taxon_count=`grep -v 0 $FileName\.taxon_count | wc -l`

if [ "$taxon_count" -lt 6 ] ; then

printf "Too few taxa in file $FileName \n"

mv $FileName ./rejected_few_taxa/

fi

rm -rdf *.fa.taxon_count

done

What I am trying to do now is get rid of files that have MORE than X sequences from any one taxon. Could you guys suggest a way I could search my temporary $FileName.taxon_count file for any integer larger than X and perform an operation on that file if one is found?

Thanks again!!
Kevin

that's a slow and badly written code. there's no need to use grep so many times on each file. use grep "word1|word2" to grep multiple patterns.