LinuxQuestions.org
Latest LQ Deal: Complete CCNA, CCNP & Red Hat Certification Training Bundle
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 06-18-2014, 02:50 AM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Rep: Reputation: 15
Question Need help with a script to format an input file for comparative genomics (awk?)


Hi all,

I am comparing the genomes of a few different animals and want to see where a bunch of different genes occur along the genome (synteny). I have a file that contains the information in this format:
Code:
AQUE|Contig13321 79728 80756 2551
MBRE|scaffold_8 185118 185240 2551
MLEI|ML1541 963166 963471 2551
NVEC|scaffold_30 1487213 1487575 2551
PBAC|636656210 9855 10283 2551
TADH|scaffold_10 2275716 2275934 2551
AQUE|Contig12562 21299 22057 2632
MBRE|scaffold_33 180983 181267 2632
MLEI|ML0180 177903 179321 2632
NVEC|scaffold_50 174943 175098 2632
PBAC|636638427 5565 5798 2632
TADH|scaffold_8 463904 464299 2632
Field 1: Taxon abbreviaton|contig/scaffold name (~=chromosome/piece of the genome)
Field 2: Start coordinate of gene on that contig/scaffold
Field 3: End coordinate of gene on that contig/scaffold
Field 4: Gene number

So, for each gene I have 5 lines showing where that gene occurs on the genome. I need to convert this format to the following format (shown for just gene 2551):
Code:
AQUE|Contig13321 79728 80756 MBRE|scaffold_8 185118 185240 #2551
AQUE|Contig13321 79728 80756 MLEI|ML1541 963166 963471 #2551
AQUE|Contig13321 79728 80756 NVEC|scaffold_30 1487213 1487575 #2551
AQUE|Contig13321 79728 80756 PBAC|636656210 9855 10283 #2551
AQUE|Contig13321 79728 80756 TADH|scaffold_10 2275716 2275934 #2551
MBRE|scaffold_8 185118 185240 MLEI|ML1541 963166 963471 #2551
MBRE|scaffold_8 185118 185240 NVEC|scaffold_30 1487213 1487575 #2551
MBRE|scaffold_8 185118 185240 PBAC|636656210 9855 10283 #2551
MBRE|scaffold_8 185118 185240 TADH|scaffold_10 2275716 2275934 #2551
MLEI|ML1541 963166 963471 NVEC|scaffold_30 1487213 1487575 #2551
MLEI|ML1541 963166 963471 PBAC|636656210 9855 10283 #2551
MLEI|ML1541 963166 963471 TADH|scaffold_10 2275716 2275934 #2551
NVEC|scaffold_30 1487213 1487575 PBAC|636656210 9855 10283 #2551
NVEC|scaffold_30 1487213 1487575 TADH|scaffold_10 2275716 2275934 #2551
PBAC|636656210 9855 10283 TADH|scaffold_10 2275716 2275934 #2551
Basically, I need to provide a pairwise list of contig/scaffold name, start, and end positions for each possible pair of the 5 species. I am comparing 5 species for all genes and each gene occurs exactly once for each species. In the future I might be interested in including genes that are absent (or at least not sampled) from one or two of the species. I would also like the gene name to hang there at the end of each line as a comment but that isn't necessary for the program.

I basically have no idea where to begin and was wondering if anyone could point me in the right direction. It seems like this might be a job for a awk?

Thanks so much!
Kevin

Last edited by kmkocot; 06-18-2014 at 02:51 AM.
 
Old 06-18-2014, 03:48 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,423

Rep: Reputation: 2823Reputation: 2823Reputation: 2823Reputation: 2823Reputation: 2823Reputation: 2823Reputation: 2823Reputation: 2823Reputation: 2823Reputation: 2823Reputation: 2823
awk, ruby or perl should be able to handle this

My thought would be:

1. grab original line (contains Contig)
2. Display required output of the next 5 records with original pre-pended. At the same time store each line in an array
3. Loop over the array -1 and a inner loop of array +1 and append items
4. goto 1 until end
 
Old 06-18-2014, 05:21 AM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 15,040

Rep: Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914
Haven't we all been through something (very) similar before ?.
 
Old 06-19-2014, 01:36 AM   #4
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
Thanks grail. syg, I have lots of genome-ey problems these days.
 
Old 06-19-2014, 09:44 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 15,040

Rep: Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914Reputation: 1914
Quote:
Originally Posted by kmkocot View Post
I have lots of genome-ey problems these days.
Give your boss a kick up the arse and get him/her to pay someone to help you out.
Could be fixed in less than 30 minutes on a laptop down by the river whilst throwing a frisbee for the mutt. Would let you do what you do and let someone else handle things like this.
 
1 members found this post helpful.
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
awk commands/script to extract lines with a specific format oldfogey Linux - Software 5 10-02-2012 02:50 PM
redirecting input from file in awk script Trd300 Linux - Newbie 51 09-27-2012 06:57 PM
[SOLVED] awk script that will convert such a line to an LDAP record in this format KaRt Linux - Newbie 4 07-27-2012 07:03 AM
[SOLVED] How to read an external file and print it with a definite format within AWK script tia_chofi Linux - Newbie 2 12-13-2011 05:26 AM
Plz tell me, how to get input in awk script intikhabalam Linux - General 1 07-27-2008 08:01 AM


All times are GMT -5. The time now is 06:06 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration