LinuxQuestions.org
Old 03-05-2015, 07:03 AM   #1
nouse
LQ Newbie
 
Registered: Sep 2013
Posts: 21

Rep: Reputation: Disabled
Splitting a huge textfile by regular expressions


Hi!

I have a fasta file with biological DNA sequences.
FASTA files are built like this:
>This_is_a_FASTA_header
TTTATATATAGACGATGACGATGACA
>The_next_sequence_begins
GGGCACAGTAGCAGA
>And_another
TGCGAGAGGTAGTAGAT

In my case, every header line (starting with ">") carries one of 360 indices directly after the ">":
>001_blabla
....
>360_blabla

I want to split my big combined FASTA file into 360 separate files, one per index, each containing all the sequences that share that index.

Thank you very much!
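
For reference, the split described above can be sketched in a few lines of Python (chosen here only for illustration; the thread itself does not settle on a tool). The sketch assumes every header has the form ">NNN_..." with a three-digit index, as in the examples above. The function name split_fasta_by_index and the output naming scheme <index>.fasta are hypothetical, not something from the thread.

```python
import re
from pathlib import Path

# Assumption: every header looks like ">NNN_anything" with a three-digit index.
HEADER_RE = re.compile(r"^>(\d{3})_")


def split_fasta_by_index(fasta_path, out_dir):
    """Write each record to <out_dir>/<index>.fasta; return records per index."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}    # index -> open output file
    counts = {}     # index -> number of records written
    current = None  # output file of the record being read
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    match = HEADER_RE.match(line)
                    if match is None:
                        raise ValueError(f"unexpected header: {line.rstrip()}")
                    index = match.group(1)
                    if index not in handles:
                        # One output file per index, kept open until the end.
                        handles[index] = open(out_dir / f"{index}.fasta", "w")
                        counts[index] = 0
                    current = handles[index]
                    counts[index] += 1
                # Lines before the first header are skipped; header and
                # sequence lines go to the file of the current record.
                if current is not None:
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

A call such as split_fasta_by_index("combined.fasta", "by_sample") (both names are placeholders) would read the input once and keep up to 360 output files open at the same time, which is normally within the default open-file limit on Linux. Existing output files with the same names are overwritten.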
 
Old 03-05-2015, 07:09 AM   #2
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,883
Blog Entries: 13

Rep: Reputation: 4930
Script or program would work fine. What have you tried thus far?

I'm guessing that whatever you employed to resolve this earlier problem would also be suitable here: "Command Line: Splitting a txt file according to regular expressions in each line".
 
Old 03-05-2015, 07:15 AM   #3
nouse
LQ Newbie
 
Registered: Sep 2013
Posts: 21

Original Poster
Rep: Reputation: Disabled
Ha, I forgot I encountered something similar recently. It was solved in a different way at that time, and somebody else took care of it.

The thing I need to do is called "dereplication". The numbers in the header represent samples, and I want to have individual files per sample. There are some bioinformatics tools that do this, but they failed for me, so I figured I could do it in shell. I am no expert here, though, and I haven't tried anything yet; that is why I am posting in Linux - Newbie.
 
Old 03-05-2015, 07:33 AM   #4
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,883
Blog Entries: 13

Rep: Reputation: 4930
Quote:
Originally Posted by nouse
Ha, I forgot I encountered something similar recently. It was solved in a different way at that time, and somebody else took care of it.

The thing I need to do is called "dereplication". The numbers in the header represent samples, and I want to have individual files per sample. There are some bioinformatics tools that do this, but they failed for me, so I figured I could do it in shell. I am no expert here, though, and I haven't tried anything yet; that is why I am posting in Linux - Newbie.
If you need help getting started with shell scripting, there are a few links in my signature which describe BASH scripting, including a blog entry I wrote on the subject. LQ is not a place where people work out solutions for you. It is here to help you learn how to do these things for yourself, add to your knowledge, maintain a record of it (note that it would have been good to update your prior thread with the solution you arrived at and mark that thread as SOLVED), and put you in a position to offer similar assistance to someone else, likely someone working with these very same types of data. I suspect they would appreciate benefiting from your accumulated knowledge.

The best thing to do is to start a program or script. When you get stuck, post your efforts and describe where you are stuck, and people will respond with suggestions for how to get to your next step.

Yes, that all comes across as "we're not here to do your work for you", but to me a larger reason is that many, many people ask for solutions, and I find that their initial questions end up lacking what they ultimately wanted or needed. Once they figure out a simple or hard step, they suddenly realize they need 20 more steps to get to where they really wanted to be. I find it's better that they understand every step on their own, because it helps them to ultimately determine their solution; for instance, they may find that they can shorten a step or remove it entirely. My point is that people think one thing when they first ask, then later realize they could or should change their thinking and move in a slightly different direction. I absolutely do not wish to spend time writing a highly specific solution only to find that it became a throwaway; I experience enough of that with my own projects.

Last edited by rtmistler; 03-06-2015 at 07:00 AM.
 
Old 03-05-2015, 06:04 PM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,140

Rep: Reputation: 4123
My take on something similar a while back - here.
 
Old 03-05-2015, 07:48 PM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,008

Rep: Reputation: 3193
Also, whilst we may not be prepared to do the work for you, have you done anything to find this answer yourself?

Maybe you should try the search option for this forum? Searching for 'DNA sequences' returned 48 results which may be of use.
I also remember a few other users specifically referring to these types of strings and this file format.
 
  


All times are GMT -5. The time now is 10:35 AM.
