LinuxQuestions.org
Old 06-19-2014, 09:20 AM   #1
netpumber
Member
 
Registered: Sep 2007
Location: In My Box
Distribution: Arch Linux
Posts: 384

Rep: Reputation: 32
Finding specific text from a file that is within specific symbols


Hello. I have a big file that contains DNA sequences like this:

Quote:
>gnl|SRA|SRR035295.82647.2 FIHSSUW02I7CY6.2 length=269
TAGAGACCGAGGCGGCCGACATGTTTTGTTTTTTTTTCTTTTTTTTTTCCGTCCAACATGGAATGATTGG
TACGCATCTGCAAATTCTTTGGATGTCACAAATCTGTATGGTGCGTCTCTTCTCATCCAGTATTGCTCCT
GATCTTTTTTTGAAGTCACTTCTTGTAAGAAATCAGCAACGCCTTTCCTTGCAGGGCATTTAAATCCCAT
TGACTCAAAAAACTCAAGGACGTGTTCACGTGGGCCTTGATATACAATTTTGCCATCAG
>gnl|SRA|SRR035295.4505.2 FIHSSUW02H007H.2 length=250
AAGCAGTGGTATCAACGAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTGA
TGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGTT
ACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGACT
GTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGGCG
>gnl|SRA|SRR035296.68126.2 FIQ4L3X01D8M3K.2 length=259
AAACATAATTATACCCTTGCTGAACTCGGCACCAATACTTTGCTTGATCTTTTCTTGAGACAACCTCTTG
GAGGAAATCGGCAACGCCTTTTCTTTCAGGGCACCTAAAACCAAAACCCTCAAAATACTCTAATACATCA
CTTCTCGGCCCATGATAAATAATGACTCCTTCTGCCATCAACATAATGTCATCAAAGAGATCAAATACTT
CTGGTGCTGGTTGAAGAAGTGAAATGACCACAGAAGCGTCTGTTATATG
>gnl|SRA|SRR035294.13646.2 FIHSSUW01ERMVS.2 length=248
AGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTGA
TGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGTT
ACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGACT
GTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035296.38443.2 FIQ4L3X01BO4OB.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035296.36031.2 FIQ4L3X01DKY6J.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035295.53565.2 FIHSSUW02J2P5E.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035294.113925.2 FIHSSUW01BDZ3B.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035294.94312.2 FIHSSUW01EADXQ.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
>gnl|SRA|SRR035294.74028.2 FIHSSUW01E2UJV.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
I want to print out a single sequence, e.g. the one with id SRR035294.94312.2.

As you can see, every sequence has an id, and each record starts with a > symbol.
Is there any way to read through the file and print the block of text that contains the id and sits between two > symbols?

So that I get back:

Quote:
>gnl|SRA|SRR035294.94312.2 FIHSSUW01EADXQ.2 length=249
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGTCCTCCAAGTTCGGGAAAGACAACACTTTTG
ATGGCCTTGGCTGGCACACTTGCAAAAGAGCTTAAGAGTTCGGGTAAAGTAACATATAATGGGCATGAGT
TACATGAGTTTGTACCTGAAAGAACTGCTGCTTATATCAGCCAGAATGATCTCCATATTGGAGAAATGAC
TGTAAGAGAAACATTGGCTTTCTCTGCAAGATGTCAAGG
Thank you.
 
Old 06-19-2014, 11:11 AM   #2
bigrigdriver
LQ Addict
 
Registered: Jul 2002
Location: East Centra Illinois, USA
Distribution: Debian Jessie 8.4
Posts: 5,873

Rep: Reputation: 348
You should be able to do that with awk. Something along the lines of:

Code:
awk 'BEGIN { RS = ">" } ; $1 ~ /SRR035294\.94312\.2/ { print $0 }' <your dna file name>
where RS is the record separator, /SRR035294\.94312\.2/ is the pattern to match (the dots are escaped so that they match literal dots), and print $0 prints the record whose first field matches.

Caution: it's been a few years since I did anything with awk. The code above probably needs some correction to make it work.
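
A minimal sketch of the record-separator idea, run against a made-up two-record file. The sample ids and sequences and the `find_record` name are assumptions for illustration, not taken from the real data. Because `>` is consumed as the separator, the sketch uses printf to put it back in front of the matching record.

```shell
#!/bin/bash
# Build a tiny two-record FASTA file (made-up sequences, for illustration only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
>gnl|SRA|SRR000001.1.2 READA.2 length=12
ACGTACGT
ACGT
>gnl|SRA|SRR000001.2.2 READB.2 length=8
TTGGCCAA
EOF

# RS=">" makes each record run from one ">" to the next, so the ">" itself
# is consumed as the separator. $1 is the first whitespace-delimited token
# of the header line. The dots are escaped so they match literally, and
# printf puts the ">" back without adding an extra newline.
find_record() {
    awk 'BEGIN { RS = ">" } $1 ~ /SRR000001\.2\.2$/ { printf ">%s", $0 }' "$1"
}

find_record "$sample"
```

The text before the first `>` forms an empty first record, which never matches, so it needs no special handling here.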

Last edited by bigrigdriver; 06-19-2014 at 11:14 AM.
 
Old 06-19-2014, 02:05 PM   #3
netpumber
Member
 
Registered: Sep 2007
Location: In My Box
Distribution: Arch Linux
Posts: 384

Original Poster
Rep: Reputation: 32
OK. Thank you very much. I'll try it out.

Last edited by netpumber; 06-19-2014 at 02:13 PM.
 
Old 06-19-2014, 02:36 PM   #4
norobro
Member
 
Registered: Feb 2006
Distribution: Debian Sid
Posts: 516

Rep: Reputation: 176
If each record is always five lines long (the header plus four sequence lines), grep will do what you want:
Code:
grep -F -A 4 "SRR035294.94312.2" file.name
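
If the records vary in length, a fixed `-A` count no longer fits. One line-oriented alternative, sketched here under assumptions (made-up sample data, a hypothetical `print_record` helper), is to switch printing on at the matching header and off again at the next header:

```shell
#!/bin/bash
# Made-up records of different lengths, for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>gnl|SRA|SRR000001.1.2 READA.2 length=20
ACGTACGTAC
GTACGTACGT
>gnl|SRA|SRR000001.2.2 READB.2 length=25
TTGGCCAATT
GGCCAATTGG
CCAAT
>gnl|SRA|SRR000001.3.2 READC.2 length=5
AACCG
EOF

# On every header line (starting with ">"), switch printing on or off
# depending on whether the header contains the id as a literal substring.
# Sequence lines are printed only while the switch is on.
print_record() {
    awk -v id="$1" '/^>/ { keep = (index($0, id) > 0) } keep' "$2"
}

print_record "SRR000001.2.2" "$sample"
```

Note that this is a substring test, so a shorter query that is a prefix of several ids would print all of those records.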
 
Old 06-19-2014, 03:13 PM   #5
netpumber
Member
 
Registered: Sep 2007
Location: In My Box
Distribution: Arch Linux
Posts: 384

Original Poster
Rep: Reputation: 32
Thank you, but there won't always be 5 lines.

As you can see below, I am trying to write a script that takes the search string and the file to search as arguments. It should also be runnable from whatever directory holds the files, but there seems to be a problem with the file paths.

Code:
#!/bin/bash
STRING=$1
FILE=$(pwd)"/"$2

if [ -z "$FILE" ] && [ -z "$STRING" ]
then
    echo "Usage: fastaFind.sh <query> <fasta file>"
else
    OUTPUT=awk 'BEGIN { RS = ">" } ; $0 ~ /$STRING/ { print $0 }' "$FILE"
fi

echo $OUTPUT
Quote:
% fastaFind.sh SRR035295.53565.2 sra_data-DB.fasta
/home/../../Bioinformatics/fastaFind/fastaFind.sh: line 9: BEGIN { RS = ">" } ; $0 ~ /$STRING/ { print $0 }: No such file or directory
 
Old 06-19-2014, 03:23 PM   #6
szboardstretcher
Senior Member
 
Registered: Aug 2006
Location: Detroit, MI
Distribution: GNU/Linux systemd
Posts: 3,774
Blog Entries: 1

Rep: Reputation: 1339
You have to use a command substitution for OUTPUT like this:

Code:
OUTPUT=$(awk 'BEGIN { RS = ">" } ; $0 ~ /$STRING/ { print $0 }' "$FILE")
Not sure that awk will work though...
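
A small sketch of the two shell rules in play here, using printf as a stand-in command so that it runs without any input file. The variable names `PROGRAM` and `EXPANDED` are made up for the illustration.

```shell
#!/bin/bash
STRING="SRR035295.53565.2"

# Without $( ), "OUTPUT=printf" would only set OUTPUT in the environment of
# the next word, which the shell would then try to run as a command.
# With $( ), the command runs and its standard output is stored in OUTPUT.
OUTPUT=$(printf '%s\n' "captured text")
echo "$OUTPUT"

# Single quotes stop the shell from expanding $STRING, so the program text
# that awk would receive still contains the literal characters "$STRING".
PROGRAM='$0 ~ /$STRING/ { print $0 }'
echo "$PROGRAM"

# Double quotes do allow the expansion.
EXPANDED="pattern: $STRING"
echo "$EXPANDED"
```

This is consistent with the "No such file or directory" message quoted earlier: the shell tried to run the awk program text itself as a command. It also suggests why the corrected line can still fail to match, since awk never sees the value of $STRING.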
 
Old 06-19-2014, 03:41 PM   #7
netpumber
Member
 
Registered: Sep 2007
Location: In My Box
Distribution: Arch Linux
Posts: 384

Original Poster
Rep: Reputation: 32
Hmm, it doesn't work with awk.
 
Old 06-19-2014, 04:04 PM   #8
szboardstretcher
Senior Member
 
Registered: Aug 2006
Location: Detroit, MI
Distribution: GNU/Linux systemd
Posts: 3,774
Blog Entries: 1

Rep: Reputation: 1339
You can look into shell variables in awk to fix that up. I'm not that savvy with awk unfortunately.
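
One common way to hand a shell variable to awk is the `-v` option, sketched below on made-up input. The variable names (`query`, `q`, `as_regex`, `as_literal`) and the sample ids are assumptions for the example. It also shows the difference between matching the value as a regular expression and as a literal substring.

```shell
#!/bin/bash
# Made-up sample input, for illustration only.
sample=$(mktemp)
printf '%s\n' '>id.1 first' 'AAAA' '>idx1 second' 'CCCC' > "$sample"

query="id.1"

# -v copies the shell value into the awk variable q before the program runs,
# so the awk program itself can stay inside single quotes.
# "$0 ~ q" treats q as a regular expression: the dot matches any character.
as_regex=$(awk -v q="$query" 'BEGIN { RS = ">" } $0 ~ q { print $1 }' "$sample")

# index() looks for q as a literal substring instead.
as_literal=$(awk -v q="$query" 'BEGIN { RS = ">" } index($0, q) { print $1 }' "$sample")

echo "regex match:"
echo "$as_regex"
echo "literal match:"
echo "$as_literal"
```

For sequence ids that contain dots, the literal form is usually the safer choice.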
 
Old 06-19-2014, 04:23 PM   #9
netpumber
Member
 
Registered: Sep 2007
Location: In My Box
Distribution: Arch Linux
Posts: 384

Original Poster
Rep: Reputation: 32
It seems to work now with this edit:

Code:
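#!/bin/bash
# Usage: fastaFind.sh <query> <fasta file>
STRING=$1
FILE=$(pwd)"/"$2   # assumes $2 is a path relative to the current directory

# Test the raw arguments: $FILE is never empty because of the $(pwd) prefix.
if [ -z "$1" ] || [ -z "$2" ]
then
    echo "Usage: fastaFind.sh <query> <fasta file>"
else
    # The shell splices $STRING into the awk program as a string used as a regex.
    awk 'BEGIN { RS = ">" } ; $0 ~ "'"$STRING"'" { print $0 }' "$FILE"
fi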
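
A possible variant of the script, sketched as a shell function under a few assumptions: the `fasta_find` name and the sample data are made up, the query is matched as a literal substring of the header line only, and the file argument is used unchanged so that both relative and absolute paths work. The `>` that awk consumes as the record separator is printed back.

```shell
#!/bin/bash
# fasta_find QUERY FILE: print every record whose header line contains QUERY.
# (fasta_find is a name made up for this sketch.)
fasta_find() {
    if [ "$#" -ne 2 ]; then
        echo "Usage: fasta_find <query> <fasta file>" >&2
        return 1
    fi
    if [ ! -r "$2" ]; then
        echo "fasta_find: cannot read '$2'" >&2
        return 1
    fi
    # $2 is used unchanged: the shell resolves a relative path against the
    # current directory, and an absolute path must not get a prefix.
    # With RS=">", the header is the first line of each record; the query is
    # compared as a literal substring of that header only.
    awk -v q="$1" 'BEGIN { RS = ">" }
        { split($0, lines, "\n") }
        index(lines[1], q) { printf ">%s", $0 }' "$2"
}

# Made-up sample data, for illustration only.
dir=$(mktemp -d)
cat > "$dir/reads.fasta" <<'EOF'
>gnl|SRA|SRR000001.1.2 READA.2 length=4
ACGT
>gnl|SRA|SRR000001.2.2 READB.2 length=8
TTGG
CCAA
EOF

(cd "$dir" && fasta_find "SRR000001.2.2" reads.fasta)   # relative path
fasta_find "SRR000001.2.2" "$dir/reads.fasta"            # absolute path
```

Passing the query with `-v` keeps it out of the awk program text, so characters such as quotes or slashes in the query cannot change the program. One caveat: awk interprets backslash escapes in `-v` values, which should not matter for ids like the ones in this thread.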
 