[SOLVED] Need help with a script to format an input file for comparative genomics (awk?)

kmkocot · 06-18-2014, 01:50 AM

Hi all,

I am comparing the genomes of a few different animals and want to see where a bunch of different genes occur along the genome (synteny). I have a file that contains the information in this format:

Code:

AQUE|Contig13321 79728 80756 2551
MBRE|scaffold_8 185118 185240 2551
MLEI|ML1541 963166 963471 2551
NVEC|scaffold_30 1487213 1487575 2551
PBAC|636656210 9855 10283 2551
TADH|scaffold_10 2275716 2275934 2551
AQUE|Contig12562 21299 22057 2632
MBRE|scaffold_33 180983 181267 2632
MLEI|ML0180 177903 179321 2632
NVEC|scaffold_50 174943 175098 2632
PBAC|636638427 5565 5798 2632
TADH|scaffold_8 463904 464299 2632

Field 1: Taxon abbreviation|contig/scaffold name (~=chromosome/piece of the genome)
Field 2: Start coordinate of gene on that contig/scaffold
Field 3: End coordinate of gene on that contig/scaffold
Field 4: Gene number

So, for each gene I have 6 lines (one per species) showing where that gene occurs in each genome. I need to convert this format to the following format (shown for just gene 2551):

Code:

AQUE|Contig13321 79728 80756 MBRE|scaffold_8 185118 185240 #2551
AQUE|Contig13321 79728 80756 MLEI|ML1541 963166 963471 #2551
AQUE|Contig13321 79728 80756 NVEC|scaffold_30 1487213 1487575 #2551
AQUE|Contig13321 79728 80756 PBAC|636656210 9855 10283 #2551
AQUE|Contig13321 79728 80756 TADH|scaffold_10 2275716 2275934 #2551
MBRE|scaffold_8 185118 185240 MLEI|ML1541 963166 963471 #2551
MBRE|scaffold_8 185118 185240 NVEC|scaffold_30 1487213 1487575 #2551
MBRE|scaffold_8 185118 185240 PBAC|636656210 9855 10283 #2551
MBRE|scaffold_8 185118 185240 TADH|scaffold_10 2275716 2275934 #2551
MLEI|ML1541 963166 963471 NVEC|scaffold_30 1487213 1487575 #2551
MLEI|ML1541 963166 963471 PBAC|636656210 9855 10283 #2551
MLEI|ML1541 963166 963471 TADH|scaffold_10 2275716 2275934 #2551
NVEC|scaffold_30 1487213 1487575 PBAC|636656210 9855 10283 #2551
NVEC|scaffold_30 1487213 1487575 TADH|scaffold_10 2275716 2275934 #2551
PBAC|636656210 9855 10283 TADH|scaffold_10 2275716 2275934 #2551

Basically, I need to provide a pairwise list of contig/scaffold name, start, and end positions for each possible pair of the 6 species. I am comparing 6 species for all genes, and each gene occurs exactly once in each species. In the future I might be interested in including genes that are absent (or at least not sampled) from one or two of the species. I would also like the gene number to hang there at the end of each line as a comment, but that isn't necessary for the program.

I basically have no idea where to begin and was wondering if anyone could point me in the right direction. It seems like this might be a job for awk?

Thanks so much!
Kevin

grail · 06-18-2014, 02:48 AM

awk, ruby or perl should be able to handle this

My thought would be:

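1. Grab the first line of a gene group (the AQUE line, which contains "Contig").
2. Print the required output for the next 5 records of that gene, with the first line prepended. At the same time, store each of those lines in an array.
3. Loop over the array from the first element to the second-to-last, with an inner loop from the following element to the last, and print each pair.
4. Go to 1 and repeat until the end of the file.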
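
The steps above can be sketched roughly as follows. This is only one possible approach, written in Python rather than awk, and it groups lines by the gene number in field 4 instead of keying on the "Contig" line. The function name pairwise_lines and the demo records are made up for illustration; the sketch assumes four whitespace-separated fields per line.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_lines(lines):
    """Yield one output line for every pair of records that share a gene number."""
    groups = defaultdict(list)  # gene number -> records, kept in input order
    for line in lines:
        fields = line.split()
        if len(fields) != 4:  # skip blank or malformed lines
            continue
        name, start, end, gene = fields
        groups[gene].append(f"{name} {start} {end}")
    for gene, records in groups.items():
        # combinations() pairs each record with every later record,
        # i.e. the outer/inner loop described in step 3.
        for left, right in combinations(records, 2):
            yield f"{left} {right} #{gene}"


# Tiny demo using three of the records for gene 2551 from the first post.
demo = [
    "AQUE|Contig13321 79728 80756 2551",
    "MBRE|scaffold_8 185118 185240 2551",
    "MLEI|ML1541 963166 963471 2551",
]
for out in pairwise_lines(demo):
    print(out)
```

Any iterable of lines works as input, so an open file handle can be passed in place of the demo list. Because the pairs are built per gene number, a gene that is missing from one or two species would simply produce fewer pairs, and a gene with a single record produces none.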

syg00 · 06-18-2014, 04:21 AM

Haven't we all been through something (very) similar before?

kmkocot · 06-19-2014, 12:36 AM

Thanks grail. syg, I have lots of genome-ey problems these days.

syg00 · 06-19-2014, 08:44 AM

Quote:

Originally Posted by kmkocot

I have lots of genome-ey problems these days.

Give your boss a kick up the arse and get him/her to pay someone to help you out.
Could be fixed in less than 30 minutes on a laptop down by the river whilst throwing a frisbee for the mutt. Would let you do what you do and let someone else handle things like this.