LinuxQuestions.org - Script to isolate files with 1 instance of each of 4 diff. words (grep)

Hi all,

I am trying to write a script that searches through a directory of files named Moll_10000.fasta to Moll_16000.fasta, counts the instances of the strings LGIG, HROB, NVEC, and CAP in each file, and copies each file containing exactly one instance of each string to another folder. I know grep is at least in part the tool for the job and that the {1} modifier should be involved (right?). I can't figure out what to do, though. Here's what I've got so far (some of this is probably unnecessary):

Code:

grep -c LGIG\| *.fasta | cat > LGIG_count.txt

grep -c NVEC\| *.fasta | cat > NVEC_count.txt

grep -c HROB\| *.fasta | cat > HROB_count.txt

grep -c CAP\| *.fasta | cat > CAP_count.txt

grep \:1 LGIG_count.txt >> LGIG_single_copy_count.txt

grep \:1 NVEC_count.txt >> NVEC_single_copy_count.txt

grep \:1 HROB_count.txt >> HROB_single_copy_count.txt

grep \:1 CAP_count.txt >> CAP_single_copy_count.txt

Obviously I'd have to do the sorting manually the way I have it so far. Can anyone guide me in the right direction to automate this entire process?

Thanks!
Kevin

Put the following in a file and give the file execute permission (chmod 744 <file name>).

Code:

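#!/bin/bash

# Count the lines containing each tag followed by a literal "|" in every file.
# Output lines look like "Moll_10000.fasta:1".
grep -c 'LGIG|' *.fasta > LGIG_count.txt

grep -c 'NVEC|' *.fasta > NVEC_count.txt

grep -c 'HROB|' *.fasta > HROB_count.txt

grep -c 'CAP|' *.fasta > CAP_count.txt

# Keep only the files whose count is exactly 1. The trailing $ stops ":1"
# from also matching ":10", ":12" and so on.
grep ':1$' LGIG_count.txt > LGIG_single_copy_count.txt

grep ':1$' NVEC_count.txt > NVEC_single_copy_count.txt

grep ':1$' HROB_count.txt > HROB_single_copy_count.txt

grep ':1$' CAP_count.txt > CAP_single_copy_count.txt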

Then run it with ./<file name>
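
To avoid comparing the four lists by hand, they can be intersected automatically. What follows is only a minimal sketch, assuming each *_single_copy_count.txt file holds lines of the form filename:1 with one entry per file, and that file names contain no colons or spaces. The function name in_all_lists is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the file names that appear in every list given.
# Assumes each list has lines like "Moll_10000.fasta:1", one entry per file,
# and that file names contain no colons or whitespace.
in_all_lists() {
    local n=$#                      # number of lists passed in
    cut -d: -f1 "$@" |              # keep only the file name from each line
        sort |
        uniq -c |                   # count how many lists name each file
        awk -v n="$n" '$1 == n { print $2 }'
}

# Example use (uncomment to run against the four lists from the script):
# in_all_lists LGIG_single_copy_count.txt NVEC_single_copy_count.txt \
#              HROB_single_copy_count.txt CAP_single_copy_count.txt
```

Each name it prints is a file listed in all four single-copy lists, so the output could be fed to a loop that runs cp on every line.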

From "The Linux cookbook: tips and techniques for everyday use" by Michael Stutz
12.2.3 Listing Only the Unique Words in Text

Presuming there is no punctuation in your text files, the tr command could be:

tr -s '[:blank:]' '\n' <text |sort|uniq
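
To see what that pipeline does, here is a small sketch on made-up input. The function name unique_words and the sample text are only for illustration, and LC_ALL=C is added so that the sort order does not depend on the locale.

```shell
#!/bin/bash
# Hypothetical wrapper around the pipeline above; reads text on stdin.
unique_words() {
    tr -s '[:blank:]' '\n' |   # turn each run of spaces/tabs into one newline
        LC_ALL=C sort |        # fixed locale so the order is predictable
        uniq                   # drop adjacent duplicates
}

# Sample input: LGIG appears twice, the other words once.
printf 'LGIG HROB LGIG\nNVEC\tCAP\n' | unique_words
```

This prints CAP, HROB, LGIG and NVEC, one per line: the repeated LGIG is collapsed into a single entry.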

So to solve your specific problem, you can use:

Code:

for file in *.fasta; do

    # Number of distinct whitespace-separated words that contain one of the tags
    nb=$(tr -s '[:blank:]' '\n' <"$file" |sort|uniq|grep -e LGIG -e HROB -e NVEC -e CAP|wc -l)

    if [ "$nb" -eq 4 ]; then

        cp "$file" <destination folder>

    fi

done

Replace <destination folder> with the destination folder name, without the '<' and '>' characters.

Maybe some adjustments are to be made, but something along that line may solve your problem.

Bye.

An afterthought: if the *.fasta files are very large, you can restrict the tr command to only the lines containing the searched strings:

Code:

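for file in *.fasta; do

    # Keep only the lines that mention one of the tags
    grep -e LGIG -e HROB -e NVEC -e CAP "$file" >ifile

    # Read from stdin so wc prints the count alone, without the file name
    zif=$(wc -l <ifile)

    if [ "$zif" -ne 0 ]; then

        nb=$(tr -s '[:blank:]' '\n' <ifile |sort|uniq|grep -e LGIG -e HROB -e NVEC -e CAP|wc -l)

        if [ "$nb" -eq 4 ]; then

            cp "$file" <destination folder>

        fi

    fi

done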

This seems better for large files.
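
One caveat on both loops above: they count distinct whitespace-separated words that contain a tag, so a file with two different LGIG entries and no CAP entry could still reach 4. If "exactly one of each" has to be strict, a per-tag check with grep -c may be closer to what was asked. The following is only a sketch under some assumptions: the function name copy_single_copy is invented, each tag is taken to be followed by a literal "|" as in the original grep commands, and grep -c counts matching lines, so a tag is assumed to appear at most once per line.

```shell
#!/bin/bash
# Sketch: copy each .fasta file in which every tag matches exactly one line.
# Assumptions (not from the thread): the function name, and that each tag
# occurs at most once per line and is followed by a literal "|".
copy_single_copy() {
    local src=$1 dest=$2 file tag ok
    mkdir -p "$dest"
    for file in "$src"/*.fasta; do
        [ -e "$file" ] || continue          # glob matched nothing
        ok=1
        for tag in LGIG HROB NVEC CAP; do
            # grep -c prints the number of matching lines; -F treats "|" literally
            if [ "$(grep -cF -- "$tag|" "$file")" -ne 1 ]; then
                ok=0
                break
            fi
        done
        if [ "$ok" -eq 1 ]; then
            cp -- "$file" "$dest"/
        fi
    done
}

# Example use (directory names are placeholders):
# copy_single_copy . single_copy
```

Matching on "CAP|" rather than plain CAP also avoids counting sequence lines that happen to contain the letters C, A, P in a row.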