LinuxQuestions.org

netpumber 06-19-2014 08:20 AM

Finding specific text from a file that is within specific symbols
 
Hello. I have a big file that contains DNA sequences like this:

Quote:

>gnl|SRA|SRR035295.82647.2 FIHSSUW02I7CY6.2 length=269
TAGAGACCGAGGCGGCCGACATGTTTTGTTTTTTTTTCTTTTTTTTTTCCGTCCAACATGGAATGATTGG
TACGCATCTGCAAATTCTTTGGATGTCACAAATCTGTATGGTGCGTCTCTTCTCATCCAGTATTGCTCCT
GATCTTTTTTTGAAGTCACTTCTTGTAAGAAATCAGCAACGCCTTTCCTTGCAGGGCATTTAAATCCCAT
TGACTCAAAAAACTCAAGGACGTGTTCACGTGGGCCTTGATATACAATTTTGCCATCAG
>gnl|SRA|SRR035295.4505.2 FIHSSUW02H007H.2 length=250
AAGCAGTGGTATCAACGAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTGA
TGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGTT
ACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGACT
GTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGGCG
>gnl|SRA|SRR035296.68126.2 FIQ4L3X01D8M3K.2 length=259
AAACATAATTATACCCTTGCTGAACTCGGCACCAATACTTTGCTTGATCTTTTCTTGAGACAACCTCTTG
GAGGAAATCGGCAACGCCTTTTCTTTCAGGGCACCTAAAACCAAAACCCTCAAAATACTCTAATACATCA
CTTCTCGGCCCATGATAAATAATGACTCCTTCTGCCATCAACATAATGTCATCAAAGAGATCAAATACTT
CTGGTGCTGGTTGAAGAAGTGAAATGACCACAGAAGCGTCTGTTATATG
>gnl|SRA|SRR035294.13646.2 FIHSSUW01ERMVS.2 length=248
AGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTGA
TGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGTT
ACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGACT
GTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035296.38443.2 FIQ4L3X01BO4OB.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035296.36031.2 FIQ4L3X01DKY6J.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035295.53565.2 FIHSSUW02J2P5E.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035294.113925.2 FIHSSUW01BDZ3B.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035294.94312.2 FIHSSUW01EADXQ.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035294.74028.2 FIHSSUW01E2UJV.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
and I want to print out one of them, e.g. the sequence with id = SRR035294.94312.2

As you can see, every sequence has an id and each of them starts with a > symbol.
So is there any way to cat the file and print out only the text that contains the id and lies between two > symbols?

That is, to get back:

Quote:

>gnl|SRA|SRR035294.94312.2 FIHSSUW01EADXQ.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
Thank you.

bigrigdriver 06-19-2014 10:11 AM

You should be able to do that with awk. Something along the lines of:

Code:

awk 'BEGIN { RS = ">" } ; $1 ~ /SRR035294.94312.2/ { print $0 }' <your dna file name>
where RS is the record separator, /SRR035294.94312.2/ is the pattern to match, and print $0 prints the record that contains the match.

Caution: it's been a few years since I did anything with awk. The code above probably needs some correction to make it work.
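
For reference, a minimal sketch of the record-separator idea described above. It assumes the ">" character never appears inside a header line, and it swaps the regex match for a literal substring test with index(), so the dots in the id are not read as wildcards. The function name extract_record and the two-record sample file are made up for illustration. Because RS = ">" strips the ">" from each record, the sketch prints it back in front of the match.

```shell
#!/bin/sh
# Print one FASTA record by id, treating ">" as the record separator.
# extract_record is a hypothetical helper name, not from the thread.
extract_record() {
    # $1 = id to look for, $2 = FASTA file
    awk -v id="$1" '
        BEGIN { RS = ">" }
        # $1 is the first word of the header, e.g. gnl|SRA|SRR035294.94312.2
        # index() is a literal substring test, not a regex match.
        index($1, id) { printf ">%s", $0 }
    ' "$2"
}

# Tiny made-up sample file, only for demonstration.
sample=$(mktemp)
printf '>gnl|SRA|A.1.2 X.2 length=8\nACGT\nACGT\n>gnl|SRA|B.7.2 Y.2 length=4\nTTGA\n' > "$sample"

extract_record "B.7.2" "$sample"
# prints:
# >gnl|SRA|B.7.2 Y.2 length=4
# TTGA
```

Note that a substring test also matches longer ids that contain the query, so a full id should be passed.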

netpumber 06-19-2014 01:05 PM

OK. Thank you very much. I'll try it out.

norobro 06-19-2014 01:36 PM

If each record in your data is always five lines, grep will do what you want:
Code:

grep -A 4 "SRR035294.94312.2" file.name
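
When records are not all the same length, a fixed -A count no longer fits. One possible alternative, sketched here under the assumption that every record starts with a line beginning with ">", is a line-by-line awk filter that switches printing on at the wanted header and off at the next one. The name print_record and the sample data are hypothetical.

```shell
#!/bin/sh
# Print a FASTA record of any length by toggling a flag at header lines.
# print_record is a hypothetical helper name.
print_record() {
    # $1 = id to look for, $2 = FASTA file
    awk -v id="$1" '
        /^>/ { show = (index($0, id) > 0) }  # re-decide at every header line
        show                                 # print lines while the flag is set
    ' "$2"
}

# Made-up sample with records of different lengths.
sample=$(mktemp)
printf '>r1 a\nAAAA\n>r2 b\nCCCC\nGGGG\nTT\n>r3 c\nTTTT\n' > "$sample"

print_record "r2" "$sample"
# prints:
# >r2 b
# CCCC
# GGGG
# TT
```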

netpumber 06-19-2014 02:13 PM

Thank you, but there won't always be 5 lines.

As you can see below, I am trying to create a script that takes the string and the file to search as arguments. It should also be runnable from any directory where the files are, but there seems to be a problem with the file paths.

Code:

#!/bin/bash
STRING=$1
FILE=$(pwd)"/"$2

if [ -z "$FILE" ] && [ -z "$STRING" ]
then
    echo "Usage: fastaFind.sh <query> <fasta file>"
else
    OUTPUT=awk 'BEGIN { RS = ">" } ; $0 ~ /$STRING/ { print $0 }' "$FILE"
fi

echo $OUTPUT

Quote:

% fastaFind.sh SRR035295.53565.2 sra_data-DB.fasta
/home/../../Bioinformatics/fastaFind/fastaFind.sh: line 9: BEGIN { RS = ">" } ; $0 ~ /$STRING/ { print $0 }: No such file or directory

szboardstretcher 06-19-2014 02:23 PM

You have to use a command substitution for OUTPUT like this:

Code:

OUTPUT=$(awk 'BEGIN { RS = ">" } ; $0 ~ /$STRING/ { print $0 }' "$FILE")
Not sure that awk will work though...

netpumber 06-19-2014 02:41 PM

Hmm, it doesn't work with awk.

szboardstretcher 06-19-2014 03:04 PM

You can look into shell variables in awk to fix that up. I'm not that savvy with awk unfortunately.
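
As a rough illustration of the shell-variable point: inside single quotes the shell does not expand $STRING, so awk receives the literal text $STRING as part of the regex and never sees the id. One common way around this is awk's -v option, which assigns an awk variable from the shell before the program runs. The awk variable name id and the one-line test input are assumptions made for this sketch.

```shell
#!/bin/sh
STRING='SRR035294.94312.2'
input='>gnl|SRA|SRR035294.94312.2 FIHSSUW01EADXQ.2 length=249'

# Not working: the single quotes keep the shell from expanding $STRING,
# so awk gets the characters "$STRING" as its pattern, not the id.
broken=$(printf '%s\n' "$input" | awk '$0 ~ /$STRING/ { print "matched" }')

# Working: -v copies the shell value into the awk variable "id";
# index() then checks for it as a literal substring.
fixed=$(printf '%s\n' "$input" | awk -v id="$STRING" 'index($0, id) { print "matched" }')

echo "without -v: '$broken'"
echo "with -v:    '$fixed'"
# prints:
# without -v: ''
# with -v:    'matched'
```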

netpumber 06-19-2014 03:23 PM

It seems to work now with this edit:

Code:

#!/bin/bash
STRING=$1
FILE=$(pwd)"/"$2

if [ -z "$FILE" ]
then
    echo "Usage: fastaFind.sh <query> <fasta file>"
else
  awk  'BEGIN { RS = ">" } ; $0 ~ "'$STRING'" { print $0 }' "$FILE"
fi
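
A few things in the script above can still bite. Because FILE always contains at least the "/" added after $(pwd), the -z test can never be true, so the usage message is never shown. Prefixing $(pwd) also breaks absolute paths, and relative paths already resolve against the current directory without it. Finally, splicing $STRING unquoted into the awk program fails for values containing spaces or quotes. A possible tightened variant, written as a function so it can be tried directly, is sketched below. The name fasta_find, the error messages other than the usage line, and the sample file are assumptions. Like the script above, it prints every record whose text contains the query.

```shell
#!/bin/sh
# fasta_find QUERY FILE: print every FASTA record containing QUERY.
# fasta_find is a hypothetical name for this sketch.
fasta_find() {
    # Check the argument count instead of testing a variable that is never empty.
    if [ "$#" -ne 2 ]; then
        echo "Usage: fastaFind.sh <query> <fasta file>" >&2
        return 1
    fi
    # Use the path as given; relative paths resolve against the current directory.
    if [ ! -r "$2" ]; then
        echo "Cannot read file: $2" >&2
        return 1
    fi
    # Pass the query with -v rather than splicing it into the program text,
    # and put back the ">" that RS = ">" strips from each record.
    awk -v id="$1" 'BEGIN { RS = ">" } index($0, id) { printf ">%s", $0 }' "$2"
}

# Made-up sample file, only for demonstration.
sample=$(mktemp)
printf '>gnl|SRA|A.1.2 X.2 length=4\nACGT\n>gnl|SRA|B.7.2 Y.2 length=8\nTTGA\nCCAT\n' > "$sample"

fasta_find "B.7.2" "$sample"
# prints:
# >gnl|SRA|B.7.2 Y.2 length=8
# TTGA
# CCAT
```

In a standalone script the last lines would be replaced by a single call such as fasta_find "$@".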


All times are GMT -5.