LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

07-31-2015, 06:52 PM   #1
sarahgaughan
LQ Newbie
 
Registered: Jul 2015
Posts: 1

Rep: Reputation: Disabled
Need help searching a file and then moving parts of the file to a different file


Hi. I'm very new to Linux, and I'm trying to search a file with a lot of DNA sequence reads and the file looks something like this (with about 50 million more of these):


+
6<6//EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEE EAEAAEEEEEEEEEAEEEEAEEEEEEAAAAAAA6AEE
@Mhy_Loup-47_3:11505:25853:8077 1:N:0:0
GCAAGTCACACACACACACACACACACAGGTAGCCGGCCGCAGCTGAGTTCTCCCTACAAGAAAGGGTGCAAAGAGCTAGCCTTTTTGAACGGGCAACTC ATGATAAAGGAGATCGGAAGAGCGGTTAAGCAGGAATGC
+
E6/<AEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE//</E//EE/EE/<//EE///////AE/EEEEE/E//EE/6E/////<///EE/E/////AE/E/////E/E////</A/</A/A
@Mhy_Loup-47_3:11505:20508:8086 1:N:0:0
TGCACTAGAACATTTTTTGTGTCTCCAGATATGCCTCCTCTTTGCAAATTTCTCATAATCTCATATCCAGTAACCCATGCTCTGTACTTTATTCTTTAAA TCAAAGCAGTGCTGTTATATTTGTCTTACACTAAAACTATT
+
6EE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE EEEEEEAEEEEEEEEEEEEEEEEEAEEEAEAEEEEEAEEEE
@
+
6<6//EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEE EAEAAEEEEEEEEEAEEEEAEEEEEEAAAAAAA6AEE
@Mhy_Loup-47_3:11505:25853:8077 1:N:0:0
GCAAGTCACACACACACACACACACACAGGTAGCCGGCCGCAGCTGAGTTCTCCCTACAAGAAAGGGTGCAAAGAGCTAGCCTTTTTGAACGGGCAACTC ATGATAAAGGAGATCGGAAGAGCGGTTAAGCAGGAATGC


There are 48 individuals in this file, each denoted by an identifier such as Mhy_Loup-47. I would like a command that searches the file for one identifier (with grep, for example) and moves only the sequences with that identifier to a separate file. I've tried several variations, but cat and grep don't seem to work; from what I've read they may only work on whole files, not on part of a file. I know you can write a Perl script to do this, but frankly I have zero experience with that, and I'm also kind of in a time crunch. Can anybody help me? I would be ever so grateful!
 
07-31-2015, 09:14 PM   #2
Keith Hedger
Senior Member
 
Registered: Jun 2010
Location: Wiltshire, UK
Distribution: Linux From Scratch, Slackware64, Partedmagic
Posts: 2,252

Rep: Reputation: 559
Try:
Code:
sed -n '/^@Mhy_/p' storage/extSdCard/File1
@Mhy_Loup-47_3:11505:25853:8077 1:N:0:0GCAAGTCACACACACACACACACACACAGGTAGCCGGCCGCAGCTGAGTTCTCCCTACAAGAAAGGGTGCAAAGAGC
TAGCCTTTTTGAACGGGCAACTC ATGATAAAGGAGATCGGAAGAGCGGTTAAGCAGGAATGC
@Mhy_Loup-47_3:11505:20508:8086 1:N:0:0TGCACTAGAACATTTTTTGTGTCTCCAGATATGCCTCCTCTTTGCAAATTTCTCATAATCTCATATCCAGTAACCCA
TGCTCTGTACTTTATTCTTTAAA TCAAAGCAGTGCTGTTATATTTGTCTTACACTAAAACTATT
@Mhy_Loup-47_3:11505:25853:8077 1:N:0:0GCAAGTCACACACACACACACACACACAGGTAGCCGGCCGCAGCTGAGTTCTCCCTACAAGAAAGGGTGCAAAGAGC
TAGCCTTTTTGAACGGGCAACTC ATGATAAAGGAGATCGGAAGAGCGGTTAAGCAGGAATGC
Just make the '@Mhy_' pattern more specific (for example '@Mhy_Loup-47_') to filter for a single individual.
Everything up to the closing quote is the command; storage/extSdCard/File1 is the path to the file.
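
To get the complete record for one individual into its own file, the same idea can be extended. This is only a sketch: it assumes every read occupies exactly four lines (header, sequence, +, quality), and the file names reads.fastq and Mhy_Loup-47.fastq, as well as the second identifier Mhy_Other-12, are made up for the demonstration.

```shell
# Build a tiny stand-in for the real file (hypothetical name and contents).
printf '%s\n' \
  '@Mhy_Loup-47_3:11505:25853:8077 1:N:0:0' 'GCAAGTCACA' '+' '6<6//EEEEE' \
  '@Mhy_Other-12_3:11505:20508:8086 1:N:0:0' 'TGCACTAGAA' '+' 'E6/<AEEEEE' > reads.fastq

# For every header line of the wanted individual: print it (p), then
# three times fetch the next line (n) and print that too.
# -n switches off sed's default printing, so nothing else is output.
sed -n '/^@Mhy_Loup-47_/{p;n;p;n;p;n;p;}' reads.fastq > Mhy_Loup-47.fastq

cat Mhy_Loup-47.fastq
```

The trailing underscore in the pattern is there so that Mhy_Loup-47 does not also match a longer name such as Mhy_Loup-470, should one exist.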

Last edited by Keith Hedger; 07-31-2015 at 09:18 PM.
 
07-31-2015, 09:34 PM   #3
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,429

Rep: Reputation: 1348
It would be good to know the actual delimiter between the first and second line of sequence data.
Code:
awk '/^@Mhy/ {getline;print}' input.txt
GCAAGTCACACACACACACACACACACAGGTAGCCGGCCGCAGCTGAGTTCTCCCTACAAGAAAGGGTGCAAAGAGCTAGCCTTTTTGAACGGGCAACTC ATGATAAAGGAGATCGGAAGAGCGGTTAAGCAGGAATGC
TGCACTAGAACATTTTTTGTGTCTCCAGATATGCCTCCTCTTTGCAAATTTCTCATAATCTCATATCCAGTAACCCATGCTCTGTACTTTATTCTTTAAA 
GCAAGTCACACACACACACACACACACAGGTAGCCGGCCGCAGCTGAGTTCTCCCTACAAGAAAGGGTGCAAAGAGCTAGCCTTTTTGAACGGGCAACTC ATGATAAAGGAGATCGGAAGAGCGGTTAAGCAGGAATGC
Note that the entry for @Mhy_Loup-47_3:11505:20508:8086 1:N:0:0 fails because the sequence data is actually held on two lines, which requires an additional getline to access.
Code:
awk '/^@Mhy/ {getline;x=$1;getline;print x $1}' input.txt
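
One way to settle the delimiter question is to look at the raw bytes of a sequence line. The sketch below uses a made-up, shortened read in a hypothetical sample.fastq; od -c shows a space as a blank column and a line break as \n. If the gap turns out to be a space inside a single line, deleting whitespace before printing gives the full sequence without a second getline.

```shell
# Hypothetical sample: one read whose sequence line contains a space.
printf '%s\n' '@Mhy_Loup-47_3:11505:20508:8086 1:N:0:0' 'TGCAC TAGAA' '+' '6EE<E EEEEE' > sample.fastq

# Dump line 2 character by character to see what separates the two pieces.
sed -n '2p' sample.fastq | od -c

# Same getline idea as above, but strip spaces, tabs and carriage returns
# from the sequence line before printing it.
awk '/^@Mhy/ { getline; gsub(/[ \t\r]/, ""); print }' sample.fastq > seqs.txt
cat seqs.txt
```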

Last edited by allend; 07-31-2015 at 10:32 PM.
 
07-31-2015, 10:40 PM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,832

Rep: Reputation: 1820
Quote:
Originally Posted by allend
It would be good to know the actual delimiter between the first and second line of sequence data.
Indeed.
The stated requirements are inadequate to say the least.

What is the likelihood the data is spread over more than two lines ... or has intervening tokens ....
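
Both questions can be checked against the file itself before choosing a command. A rough sketch, assuming the expected layout is four lines per read, with the header starting with @ and the third line starting with +; check_layout and the two sample files are hypothetical names, and the reads are shortened stand-ins.

```shell
# Report any line that breaks the assumed 4-line layout.
check_layout() {
  awk '
    NR % 4 == 1 && !/^@/   { print "line " NR ": expected @ header";    bad = 1 }
    NR % 4 == 3 && !/^[+]/ { print "line " NR ": expected + separator"; bad = 1 }
    END {
      if (NR % 4 != 0) { print NR " lines, not a multiple of 4"; bad = 1 }
      if (!bad) print "looks like 4-line records"
      exit bad + 0
    }
  ' "$1"
}

# One well-formed read, and one whose sequence is wrapped over two lines.
printf '%s\n' '@Mhy_Loup-47_3:11505:25853:8077 1:N:0:0' 'GCAAGTCACA' '+' '6<6//EEEEE' > good.fastq
printf '%s\n' '@Mhy_Loup-47_3:11505:20508:8086 1:N:0:0' 'TGCAC' 'TAGAA' '+' '6EE<EEEEEE' > wrapped.fastq

check_layout good.fastq    > good.report
check_layout wrapped.fastq > wrapped.report || true   # non-zero exit signals a problem
cat good.report wrapped.report
```

On a file with tens of millions of reads a broken layout produces a lot of output, so piping the result through head is probably wise.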
 
07-31-2015, 11:43 PM   #5
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,429

Rep: Reputation: 1348
Quote:
What is the likelihood the data is spread over more than two lines ...
Good point. Assuming that a line containing a single "+" marks the end of the sequence, then perhaps
Code:
awk '/^@Mhy/ {x=""; while ((getline) > 0 && $1 != "+") x = x $1; print x}' input.txt
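
The commands so far print sequences to the terminal. For the original goal of one file per individual, a single pass with awk may be enough. This is a sketch under two assumptions that are not confirmed in this thread: every read is exactly four lines, and the individual's name is the header text before the first colon with the trailing _3 removed. The file name reads.fastq, the second individual Mhy_Loup-48 and the shortened reads are invented for the demonstration.

```shell
# Three shortened reads: two for Mhy_Loup-47, one for a hypothetical Mhy_Loup-48.
printf '%s\n' \
  '@Mhy_Loup-47_3:11505:25853:8077 1:N:0:0' 'GCAAGTCACA' '+' '6<6//EEEEE' \
  '@Mhy_Loup-48_3:11505:20508:8086 1:N:0:0' 'TGCACTAGAA' '+' 'E6/<AEEEEE' \
  '@Mhy_Loup-47_3:11505:25853:8090 1:N:0:0' 'ACACACAGGT' '+' '6EE<EEEEEE' > reads.fastq

awk '
  NR % 4 == 1 {                 # header line of each 4-line record
    id = substr($0, 2)          # drop the leading "@"
    sub(/:.*/, "", id)          # keep the text before the first ":"
    sub(/_[0-9]+$/, "", id)     # assumption: the trailing _3 is not part of the name
    out = id ".fastq"           # e.g. Mhy_Loup-47.fastq
  }
  { print > out }               # all four lines go to the current output file
' reads.fastq

ls Mhy_Loup-*.fastq
```

awk keeps each output file open until it exits, so 48 individuals means 48 open files, which should be within the limits of current awk implementations.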

Last edited by allend; 08-01-2015 at 12:03 AM.
 
08-01-2015, 09:57 AM   #6
onebuck
Moderator
 
Registered: Jan 2005
Location: Midwest USA, Central Illinois
Distribution: Slackware®
Posts: 12,541
Blog Entries: 23

Rep: Reputation: 1943
Moderator response

Hi,

Welcome to LQ!
Quote:
Originally Posted by sarahgaughan
Hi. I'm very new to Linux, and I'm trying to search a file with a lot of DNA sequence reads and the file looks something like this (with about 50 million more of these):

Code:
+
6<6//EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEEEAEAAEEEEEEEEEAEEEEAEEEEEEAAAAAAA6AEE
@Mhy_Loup-47_3:11505:25853:8077 1:N:0:0
GCAAGTCACACACACACACACACACACAGGTAGCCGGCCGCAGCTGAGTTCTCCCTACAAGAAAGGGTGCAAAGAGCTAGCCTTTTTGAACGGGCAACTCATGATAAAGGAGATCGGAAGAGCGGTTAAGCAGGAATGC
+
E6/<AEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE//</E//EE/EE/<//EE///////AE/EEEEE/E//EE/6E/////<///EE/E/////AE/E/////E/E////</A/</A/A
@Mhy_Loup-47_3:11505:20508:8086 1:N:0:0
TGCACTAGAACATTTTTTGTGTCTCCAGATATGCCTCCTCTTTGCAAATTTCTCATAATCTCATATCCAGTAACCCATGCTCTGTACTTTATTCTTTAAATCAAAGCAGTGCTGTTATATTTGTCTTACACTAAAACTATT
+
6EE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEAEEEAEAEEEEEAEEEE
@
+
6<6//EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEEEAEAAEEEEEEEEEAEEEEAEEEEEEAAAAAAA6AEE
@Mhy_Loup-47_3:11505:25853:8077 1:N:0:0
GCAAGTCACACACACACACACACACACAGGTAGCCGGCCGCAGCTGAGTTCTCCCTACAAGAAAGGGTGCAAAGAGCTAGCCTTTTTGAACGGGCAACTCATGATAAAGGAGATCGGAAGAGCGGTTAAGCAGGAATGC
There are 48 individuals in this file, each denoted by an identifier such as Mhy_Loup-47. I would like a command that searches the file for one identifier (with grep, for example) and moves only the sequences with that identifier to a separate file. I've tried several variations, but cat and grep don't seem to work; from what I've read they may only work on whole files, not on part of a file. I know you can write a Perl script to do this, but frankly I have zero experience with that, and I'm also kind of in a time crunch. Can anybody help me? I would be ever so grateful!
What have you done to find a solution to the problem, other than post here?

We will help you once you have made an attempt yourself. Tell us what you have tried, and then maybe someone will be able to assist. Please consider reading: http://www.linuxquestions.org/questi...#faq_lqwelcome

P.S. When posting data or code, please use vbcode tags. The code tag is the # sign above the reply window: highlight the text, then select # for data or code. The balloon to the left of # is for quotes. Using vbcode tags makes your posts cleaner and therefore easier to read. Look at how I placed the data above within vbcode tags.
Hope this helps.
Have fun & enjoy!


Last edited by onebuck; 08-01-2015 at 09:59 AM.
 
  


All times are GMT -5.
