Linux - Newbie
This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
I have a FASTA file with biological DNA sequences.
FASTA files are built like this:
>This_is_a_FASTA_header
TTTATATATAGACGATGACGATGACA
>The_next_sequence_begins
GGGCACAGTAGCAGA
>And_another
TGCGAGAGGTAGTAGAT
In my case, every header line (starting with ">") carries one of 360 indices directly after the ">":
>001_blabla
....
>360_blabla
I want to split my big combined FASTA file into 360 separate files, each holding the sequences that share the same index.
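One possible way to approach the split in shell is a short awk sketch like the one below. It is not taken from any reply in this thread. It assumes that every header has the form ">NNN_description", that the index is everything between the ">" and the first underscore, and that each sequence belongs to the most recent header above it. The function name `split_fasta_by_index` and the output naming scheme (`NNN.fasta`) are made up for illustration.

```shell
#!/bin/sh
# Sketch: split a combined FASTA file into one file per sample index.
# Assumption: headers look like ">NNN_description" (index before the first "_").
# split_fasta_by_index is a hypothetical helper name, not an existing tool.
split_fasta_by_index() {
    input=$1
    outdir=${2:-.}
    mkdir -p "$outdir"
    awk -v outdir="$outdir" '
        /^>/ {
            # Header line: take the text between ">" and the first "_" as the index.
            split(substr($0, 2), parts, "_")
            newfile = outdir "/" parts[1] ".fasta"
            if (newfile != out) {
                # Close the previous file so that only one output file is open at a time.
                if (out != "") close(out)
                out = newfile
            }
        }
        # Append the header and its sequence lines to the file for the current index.
        out != "" { print >> out }
    ' "$input"
}
```

It could be called as `split_fasta_by_index combined.fasta per_sample` (both names are placeholders). Because the sketch appends to its output files, running it twice into the same directory would duplicate records, so it is safest to start with an empty output directory.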
Ha, I forgot I encountered something similar recently. It was solved in a different way at that time, and somebody else took care of it.
The thing I need to do is called "dereplication". The numbers in the header represent samples, and I want to have individual files per sample. There are some bioinformatic tools that do this, but they failed for me, so I figured I could do it in shell. I am no expert here, though, and I haven't tried anything yet, which is why I am posting in Linux-Newbie.
If you need help getting started with shell scripting, there are a few links in my signature that describe BASH scripting. I have also written a blog about BASH scripting, which is linked there too. LQ is not a place where people work out solutions for you. It is here to help you learn how to do these things for yourself, add to your knowledge, and keep a record of what you learned. (Note that a good thing to have done would have been to update your prior thread with the solution you arrived at and mark that thread as SOLVED.) That also puts you in a position to offer similar assistance to someone else, likely someone working with these very same types of data. I suspect they would appreciate benefiting from your accumulated knowledge.
The best thing to do is to start a program or script. When you get stuck, post your efforts and describe where you are stuck, and people will respond with suggestions on how to get to your next step.
Yes, that all comes across as "we're not here to do your work for you", but to me a larger reason is that many, many people ask for solutions, and I find that their initial questions end up lacking what they ultimately wanted or needed. Once they figure out a simple or hard step, they suddenly realize that they need 20 more steps to get to where they really wanted to be. I find it is better that they understand every step on their own, because that helps them determine their eventual solution; for instance, they may find that they can shorten a step or remove it entirely. My point is that people think one thing when they first ask, then later realize they could or should change their thinking and move in a slightly different direction. I absolutely do not wish to spend time writing a highly specific solution only to find that it became a throwaway; I experience enough of that with my own projects.
Also, whilst we may not be prepared to do the work for you, have you done anything to find this answer yourself?
Maybe you should try the search option for this forum? Searching for 'DNA sequences' gave me 48 results, which may be of use.
I also remember a few other users specifically referring to these types of strings and this file format.