LinuxQuestions.org


kmkocot 08-24-2009 12:51 AM

Script to isolate files with 1 instance of each of 4 diff. words (grep)
 
Hi all,

I am trying to write a script that searches through a directory of files named Moll_10000.fasta to Moll_16000.fasta, counts the instances of the strings LGIG, HROB, NVEC, and CAP in each file, and copies every file that contains exactly one instance of each string to another folder. I know grep is at least in part the tool for the job and that the {1} modifier should be involved (right?). I can't figure out what to do, though. Here's what I've got so far (some of this is probably unnecessary):

Code:

grep -c LGIG\| *.fasta | cat > LGIG_count.txt
grep -c NVEC\| *.fasta | cat > NVEC_count.txt
grep -c HROB\| *.fasta | cat > HROB_count.txt
grep -c CAP\| *.fasta | cat > CAP_count.txt
grep \:1 LGIG_count.txt >> LGIG_single_copy_count.txt
grep \:1 NVEC_count.txt >> NVEC_single_copy_count.txt
grep \:1 HROB_count.txt >> HROB_single_copy_count.txt
grep \:1 CAP_count.txt >> CAP_single_copy_count.txt

Obviously I'd have to do the sorting manually the way I have it so far. Can anyone guide me in the right direction to automate this entire process?

Thanks!
Kevin

catkin 08-24-2009 01:42 AM

Put the following in a file and give the file execute permission (chmod 744 <file name>):
Code:

#!/bin/bash
# Count the lines containing each tag in every .fasta file (output is file:count)
grep -c 'LGIG|' *.fasta > LGIG_count.txt
grep -c 'NVEC|' *.fasta > NVEC_count.txt
grep -c 'HROB|' *.fasta > HROB_count.txt
grep -c 'CAP|' *.fasta > CAP_count.txt
# Keep the files whose count is exactly 1 (':1$' does not match :10, :11, ...)
grep ':1$' LGIG_count.txt > LGIG_single_copy_count.txt
grep ':1$' NVEC_count.txt > NVEC_single_copy_count.txt
grep ':1$' HROB_count.txt > HROB_single_copy_count.txt
grep ':1$' CAP_count.txt > CAP_single_copy_count.txt

Then run it with ./<file name>
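
To automate the whole process rather than sorting the count files by hand, one option is to test the four counts for each file inside a loop. What follows is only a sketch: the function name copy_single_copy and its two directory arguments are made up for illustration, and it assumes, as the greps above suggest, that each tag appears in the file as TAG| (for example in a sequence header) and at most once per line.

```shell
#!/bin/bash
# Sketch: copy every .fasta file that has exactly one line matching each tag.
# Hypothetical names: copy_single_copy, and its source/destination arguments.
copy_single_copy() {
    local src=$1 dest=$2 file tag ok
    mkdir -p "$dest"
    for file in "$src"/*.fasta; do
        [ -e "$file" ] || continue      # the glob matched nothing
        ok=1
        for tag in LGIG HROB NVEC CAP; do
            # grep -c counts matching lines; -F treats "TAG|" as a literal string
            if [ "$(grep -c -F "${tag}|" "$file")" -ne 1 ]; then
                ok=0
                break
            fi
        done
        if [ "$ok" -eq 1 ]; then
            cp "$file" "$dest"/
        fi
    done
}
```

Called as, say, copy_single_copy . single_copy, it would copy the qualifying files from the current directory into a single_copy folder.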

berbae 08-24-2009 08:14 AM

From "The Linux cookbook: tips and techniques for everyday use" by Michael Stutz
12.2.3 Listing Only the Unique Words in Text

Presuming there is no punctuation in your text files, the tr command could be:

tr -s '[:blank:]' '\n' <text |sort|uniq
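
To see what that pipeline produces, here is a small sketch. The function name unique_words and the sample input are made up for illustration. Note that uniq collapses repeated words, so the output tells you which words occur, not how many times each one occurs.

```shell
# List the distinct blank-separated words read from standard input.
unique_words() {
    tr -s '[:blank:]' '\n' | sort | uniq
}

# Hypothetical sample input, not taken from a real .fasta file
printf 'CAP LGIG CAP\nNVEC  LGIG\n' | unique_words
# prints:
# CAP
# LGIG
# NVEC
```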

So to solve your specific problem, you can use:

Code:

for file in *.fasta; do
    # Count the distinct words that contain any of the four strings
    nb=$(tr -s '[:blank:]' '\n' <"$file" |sort|uniq|grep -e LGIG -e HROB -e NVEC -e CAP|wc -l)
    if [ "$nb" -eq 4 ]; then
        cp "$file" <destination folder>
    fi
done

Replace <destination folder> with the destination folder name, without the '<' and '>' characters.

Some adjustments may be needed, but something along those lines may solve your problem.
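
One adjustment worth considering: the loop counts distinct words, so a file with two different LGIG entries and no CAP entry also gives 4. A possible stricter check, sketched below, counts each tag separately. The function name has_one_of_each is made up, the TAG| form is assumed from the original post, and grep -o is a GNU/BSD extension rather than POSIX.

```shell
# Succeed only if each of the four tags occurs exactly once in the file "$1".
# Hypothetical helper; assumes the tags appear as "TAG|" as in the original post.
has_one_of_each() {
    # grep -o prints every match on its own line; uniq -c prefixes each
    # distinct match with the number of times it occurred.
    grep -o -E '(LGIG|HROB|NVEC|CAP)\|' "$1" | sort | uniq -c |
        awk '$1 == 1 { singles++ } END { exit !(NR == 4 && singles == 4) }'
}
```

Inside the loop, a test such as if has_one_of_each "$file"; then ... fi could then replace the comparison of nb with 4.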

Bye.

berbae 08-24-2009 08:54 AM

An afterthought: if the *.fasta files are very large, you can restrict the tr command to only the lines containing the searched strings:
Code:

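for file in *.fasta; do
    # Keep only the lines that contain one of the searched strings
    grep -e LGIG -e HROB -e NVEC -e CAP "$file" >ifile
    # Read from stdin so that wc prints the count without the file name
    zif=$(wc -l <ifile)
    if [ "$zif" -ne 0 ]; then
        nb=$(tr -s '[:blank:]' '\n' <ifile |sort|uniq|grep -e LGIG -e HROB -e NVEC -e CAP|wc -l)
        if [ "$nb" -eq 4 ]; then
            cp "$file" <destination folder>
        fi
    fi
done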

This seems better for large files.
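
The same pre-filtering can probably be done without the temporary file ifile by piping grep straight into tr. A minimal sketch, with the made-up function name count_tag_words, which prints the number that the loop compares with 4:

```shell
# Print the number of distinct tag-bearing words in the file "$1",
# filtering the lines first so tr and sort only see the relevant ones.
count_tag_words() {
    grep -e LGIG -e HROB -e NVEC -e CAP "$1" |
        tr -s '[:blank:]' '\n' | sort | uniq |
        grep -c -e LGIG -e HROB -e NVEC -e CAP
}
```

A file with no matching lines simply gives 0, so the separate emptiness check is not needed.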


All times are GMT -5.