grep a file for any integer larger than X and assign largest value to a variable
Hi all,
I have a directory full of amino acid sequence files formatted like this: Code:
>14424|LGIG|134818
Here is the code I have so far, using X=1 for this example: Code:
for FileName in *.fa
Thanks!
Kevin
Help us :}
We're not biochemists (most people here, anyway). Which part of the strings above is your taxon? And since you seem to be slapping all matches into the same taxon_count.txt, how do you want to go about determining which files to actually delete?
Cheers,
Tink
Use awk for parsing files, and use arrays to store your counts. It's also not clear which field is your "taxon", so I assume it is the 2nd field on lines that start with ">". A sample:
Code:
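A minimal sketch of that awk approach, assuming the taxon is the second "|"-separated field of every header line, as in the one header shown in the thread. The sample file, the extra headers and sequences in it, and the variable names (taxon_counts, n_taxa, largest) are hypothetical, made up only for illustration.

```shell
# Hypothetical sample file; only the first header line comes from the thread.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.fa" <<'EOF'
>14424|LGIG|134818
MKVLAA
>20001|NVEC|55501
MSTLLK
>14425|LGIG|134819
MKKLAV
EOF

# On header lines only, split on "|" and count each value of field 2 in an array.
taxon_counts=$(awk -F'|' '/^>/ { count[$2]++ }
    END { for (t in count) print t, count[t] }' "$tmpdir/sample.fa" | sort)
printf '%s\n' "$taxon_counts"

# Number of distinct taxa in the file.
n_taxa=$(awk -F'|' '/^>/ && !seen[$2]++ { n++ }
    END { print n + 0 }' "$tmpdir/sample.fa")
echo "distinct taxa: $n_taxa"

# Largest count that is greater than X, assigned to a shell variable.
# The variable stays empty if no count exceeds X.
X=1
largest=$(printf '%s\n' "$taxon_counts" |
    awk -v x="$X" '$2 > x && $2 > max { max = $2 } END { if (max) print max }')
echo "largest count above $X: $largest"

rm -rf "$tmpdir"
```

This only works when every file keeps the taxon in the same field; if the header format differs between taxa, the field number (or the separator) would have to change per file.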
Thanks for the replies.
Tinkster, I forgot to include a line deleting the taxon_count.txt file after each .fa file is processed. The taxon name abbreviations for the sequences I pasted below would be LGIG, NVEC, CAP, HROB, and ESCO respectively. The problem is that the 'header line' (the line containing the greater-than sign) varies in format from taxon to taxon, but maybe the best thing to do is to reformat them all before processing them with these scripts.
ghostdog74, thanks for your help. My brain is fried for tonight, but tomorrow I am going to try to implement your awk suggestion in the script I have now. I may have more questions.
Thanks again to both of you.
Kevin
The problem I ran into is that all my input files are formatted differently, so instead of formatting each one to have the taxon abbreviation in the same field (for awk), I stuck with a grep-based approach. This greps for taxon abbreviations and gets rid of input files that contain fewer than 6 different taxa. I know it's not very pretty or versatile, but it gets the job done.
Code:
for FileName in *.fa
Thanks again!!
Kevin
That's slow and badly written code. There's no need to run grep so many times on each file. Use grep -E "word1|word2" to match multiple patterns in one pass (plain grep treats "|" as a literal character unless you pass -E).
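A rough sketch of the single-pass idea, assuming the five abbreviations listed earlier in the thread and the 6-taxa threshold. The sample file and its mixed header formats are hypothetical, as are the variable names (taxa, min_taxa, n_found, decision). It relies on grep -o, which GNU and BSD grep both provide but POSIX does not require.

```shell
# Hypothetical sample file with headers in different formats.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.fa" <<'EOF'
>14424|LGIG|134818
MKVLAA
>NVEC_20001
MSTLLK
>CAP-3301 extra text
MKKLAV
>14425|LGIG|134819
MKKLAV
EOF

taxa='LGIG|NVEC|CAP|HROB|ESCO'   # abbreviations named in the thread
min_taxa=6                        # threshold described in the thread

# Keep header lines, print only the matched abbreviations (-o) using
# alternation (-E), then count the distinct ones.
n_found=$(grep '^>' "$tmpdir/sample.fa" | grep -oE "$taxa" | sort -u | wc -l)
n_found=$((n_found))   # normalise: some wc implementations pad with spaces

if [ "$n_found" -lt "$min_taxa" ]; then
    decision=delete    # a real script would run: rm -- "$FileName"
else
    decision=keep
fi
echo "sample.fa: $n_found distinct taxa -> $decision"

rm -rf "$tmpdir"
```

One caveat: a short abbreviation such as CAP also matches inside longer words on a header line, so anchoring the pattern to the header's delimiters is safer when the formats are known.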