Splitting a huge textfile by regular expressions

nouse · 03-05-2015, 07:03 AM

Hi!

I have a fasta file with biological DNA sequences.
Fasta files are built like this:
>This_is_a_FASTA_header
TTTATATATAGACGATGACGATGACA
>The_next_sequence_begins
GGGCACAGTAGCAGA
>And_another
TGCGAGAGGTAGTAGAT

In my case all the header lines (starting with ">") carry one of 360 indices directly after the ">":
>001_blabla
....
>360_blabla

I want to split my big combined fasta file into 360 single files with sequences sharing the same index.

Thank you very much!

rtmistler · 03-05-2015, 07:09 AM

Script or program would work fine. What have you tried thus far?

I'm guessing whatever you used to resolve your earlier problem would also be suitable here: "Command Line: Splitting a txt file according to regular expressions in each line".

nouse · 03-05-2015, 07:15 AM

Ha, I forgot I encountered something similar recently. It was solved in a different way at that time, and somebody else took care of it.

The thing I need to do is called "dereplication". The numbers in the header represent samples, and I want to have individual files per sample. There are some bioinformatics tools that do this, but they failed for me, so I figured I could do it in shell. I am no expert here, though, and I haven't tried anything yet; that is why I am posting in Linux-Newbie.

rtmistler · 03-05-2015, 07:33 AM

Quote:

Originally Posted by nouse

Ha, I forgot I encountered something similar recently. It was solved in a different way at that time, and somebody else took care of it.

The thing I need to do is called "dereplication". The numbers in the header represent samples, and I want to have individual files per sample. There are some bioinformatics tools that do this, but they failed for me, so I figured I could do it in shell. I am no expert here, though, and I haven't tried anything yet; that is why I am posting in Linux-Newbie.

If you need help getting started with shell scripting, there are a few links in my signature which describe BASH scripting, including a blog I have written on the subject. LQ is not a place where people work out solutions for you. Instead, it is here to help you learn how to do these things for yourself, add to your knowledge, maintain a record of that (note that a good thing to do would have been to update your prior thread with the solution you arrived at and mark that thread as SOLVED), and put you in a position where you could offer similar assistance to someone else, likely someone working with these very same types of data. I suspect they would appreciate benefiting from your accumulated knowledge.

The best thing to do is to start a program or script. When you get stuck, post your efforts and describe where you are stuck, and people will respond with suggestions on how to get to your next step.

Yes, that all comes across as "we're not here to do your work for you", but to me a larger reason is that many, many people ask for solutions, and I find that their initial questions end up lacking what they ultimately wanted or needed. Once they figure out a simple or hard step, suddenly they get the idea that they need to do 20 more steps to get to where they really wanted to be. I find it's better that they understand every step on their own, because it helps them determine their eventual solution; for instance, they may find that they can shorten a step or remove it entirely. My point is that people think one thing when they first ask, then later realize they could or should change their thinking and move in a slightly different direction. I absolutely do not wish to spend time writing a highly specific solution only to find that it became a throwaway; I experience enough of that with my own projects.

syg00 · 03-05-2015, 06:04 PM

My take on something similar a while back - here.

grail · 03-05-2015, 07:48 PM

Also, whilst we may not be prepared to do the work for you, have you done anything to find this answer yourself?

Maybe you should try the search option for this forum? Searching for 'DNA sequences' gave me 48 results which may be of use.
I also remember a few other users specifically referring to these types of strings and file format.
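
For anyone landing on this thread later, the split described in the first post could be sketched roughly as follows. This is not a solution anyone in the thread posted, only a minimal Python sketch under stated assumptions: every header looks like ">001_blabla" (digits, then an underscore), a record is a header line plus all lines up to the next header, and the function name split_fasta_by_index and the output naming "<index>.fasta" are made up for illustration.

```python
import re
from contextlib import ExitStack
from pathlib import Path

# Assumption: headers look like ">001_blabla"; the digits are the sample index.
HEADER = re.compile(r"^>(\d+)_")


def split_fasta_by_index(fasta_path, out_dir):
    """Copy each record into out_dir/<index>.fasta and return the indices seen."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}    # index string -> open output file
    current = None  # file receiving the record currently being read
    with ExitStack() as stack, open(fasta_path) as src:
        for line in src:
            if line.startswith(">"):
                match = HEADER.match(line)
                if match is None:
                    raise ValueError(f"header without numeric index: {line.rstrip()}")
                index = match.group(1)
                if index not in handles:
                    # One output file per index, opened once and kept open
                    # until the whole input has been read.
                    handles[index] = stack.enter_context(
                        open(out_dir / f"{index}.fasta", "w")
                    )
                current = handles[index]
            if current is not None:
                # Header and sequence lines both go to the current sample's file;
                # anything before the first header is skipped.
                current.write(line)
    return sorted(handles)
```

Calling split_fasta_by_index("combined.fasta", "by_sample") (both names are placeholders) would read the input once, line by line, so the size of the file should not matter much. It keeps one output file open per index, which for 360 indices is normally below the default open-file limit, but that depends on the system.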