LinuxQuestions.org
Support LQ: Use code LQCO20 and save 20% on CrossOver Office
Go Back   LinuxQuestions.org > Forums > Linux > Linux - Newbie
User Name
Password
Linux - Newbie This forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices

Reply
 
Thread Tools
Old 11-08-2009, 06:30 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Posts: 30
Thanked: 0
grep a file for any interger larger than X and assign largest value to a variable


[Log in to get rid of this advertisement]
Hi all,

I have a directory full of amino acid sequence files formatted like this:
Code:
>14424|LGIG|134818
MEIVTTDSVSIVETPTWKRPLELLNS
>14424|NVEC|199771
MADEGSGFSDCSSRQYLNPLQLFEEK
>14424|CAP|112268
MTTYKRPLELLEGDSLVKICAPMVRY
>14424|HROB|107587
MSVEENKFKNSMELFESKDVVKICAP
>14424|Euprymna_scolopes_pep.fas|ESCO_227709683_c_s
QKMEHAGASWVTVHGRTPEQRGEPVN
I want to write a script that will go through and count the number of occurrences of particular strings associated with the taxon names and if there are more than X sequences for a particular taxon, delete the entire file.

Here is the code I have so far using X=1 for this example:

Code:
for FileName in *.fa
do
grep -c ESCO_ $FileName >> taxon_count.txt
grep -c CAP| $FileName >> taxon_count.txt
grep -c LGIG| $FileName >> taxon_count.txt
grep -c NVEC| $FileName >> taxon_count.txt
grep -c HROB| $FileName >> taxon_count.txt


#??? somehow grep for any integer larger than 1 in taxon_count.txt and assign the largest value recovered in the variable $sequences_per_taxon

if [ "$sequences_per_taxon" -gt 1 ] ; then
printf "More than one sequence per taxon in $FileName"
mv $FileName ./rejected/
fi
done
Can anyone help me with the missing step?

Thanks!
Kevin
linuxubuntu kmkocot is offline     Reply With Quote
Old 11-08-2009, 06:41 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :}
Posts: 18,846
Blog Entries: 1
Thanked: 160
Help us :}

We're not biochemists (most people here, anyway). Which part of the
strings above is your taxon. And as you seem to be slapping all matches
into the same taxon_count.txt, how do you want to go about determining
which files to actually delete?


Cheers,
Tink
linux Tinkster is online now     Reply With Quote
Thanked by:
Old 11-08-2009, 06:56 PM   #3
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 1,814
Blog Entries: 5
Thanked: 115
use awk for parsing files. Use arrays to store your count. Its also not clear which field is your "taxon", so i assume it is. A sample:
2nd field where line starts with ">"
Code:
awk -F"|" '/^>/{ 
 taxon[$2]++
}
END{
 for(o in  taxon){
    print o,taxon[o]
 }
}
' file
linuxfedora ghostdog74 is offline     Reply With Quote
Old 11-08-2009, 09:38 PM   #4
kmkocot
Member
 
Registered: Dec 2007
Posts: 30
Thanked: 0

Original Poster
Thanks for the replies.

Tinkster, I forgot to include a line deleting the taxon_count.txt file after each .fa file is processed. The taxon name abbreviations for the sequences I pasted below would be LGIG, NVEC, CAP, HROB, and ESCO respectively. The problem is the 'header line' (the line containing the greater-than sign) varies in format from taxon to taxon but maybe the best thing to do is to reformat them all before processing them with these scripts.

ghostdog74, thanks for your help. My brain is fried for tonight but tomorrow I am going to try to implement your awk suggestion into the script I have now. I may have more questions.

Thanks again to both of you.

Kevin
linuxubuntu kmkocot is offline     Reply With Quote
Old 11-09-2009, 11:56 AM   #5
kmkocot
Member
 
Registered: Dec 2007
Posts: 30
Thanked: 0

Original Poster
The problem I ran into is that all my input files are formatted differently so instead of format each one to have the taxon abbreviation in the same field (for awk), I stuck with a grep-based approach. This greps taxon abbreviations and gets rid of input files that fewer than 6 different taxa. I know its not very pretty or versatile but it gets the job done.

Code:
for FileName in *.fa
do
grep -c NEOM_ $FileName >> $FileName\.taxon_count
grep -c WARG_ $FileName >> $FileName\.taxon_count
grep -c CNIT_ $FileName >> $FileName\.taxon_count
grep -c SVEN_ $FileName >> $FileName\.taxon_count
grep -c CAPI_ $FileName >> $FileName\.taxon_count
#... others removed for brevity
grep -c CAP\| $FileName >> $FileName\.taxon_count
grep -c LGIG\| $FileName >> $FileName\.taxon_count
grep -c NVEC\| $FileName >> $FileName\.taxon_count
grep -c HROB\| $FileName >> $FileName\.taxon_count
taxon_count=`grep -v 0 $FileName\.taxon_count | wc -l`
if [ "$taxon_count" -lt 6 ] ; then
printf "Too few taxa in file $FileName \n"
mv $FileName ./rejected_few_taxa/
fi
rm -rdf *.fa.taxon_count
done
What I am trying to do now is get rid of files that have MORE than X sequences from any one taxon. Could you guys suggest a way I could search my temporary $FileName.taxon_count file for any integer larger than X and perform an operation on that file if one is found?

Thanks again!!
Kevin
linuxubuntu kmkocot is offline     Reply With Quote
Old 11-09-2009, 06:26 PM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 1,814
Blog Entries: 5
Thanked: 115
that's a slow and badly written code. there's no need to use grep so many times on each file. use grep "word1|word2" to grep multiple patterns.
linuxfedora ghostdog74 is offline     Reply With Quote

Reply

Bookmarks


Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Assign file name to a variable w/out knowing file name... dots Linux - Newbie 3 07-02-2009 04:19 PM
How do you assign a variable to be a variable file name in a directory? David_Elliott Programming 4 04-14-2009 11:19 AM
how to select a particular line from a file and assign it to a variable? pdklinux79 Linux - Newbie 4 06-18-2008 08:21 PM
Grep a file with a variable from an array..... OldGaf Programming 6 06-15-2008 07:16 PM
Need to assign result of grep to var in PERL amytys Programming 1 09-23-2004 07:25 PM


All times are GMT -5. The time now is 04:14 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
RSS2  LQ Podcast
RSS2  LQ Radio
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: @linuxquestions
Open Source Consulting | Domain Registration