Hi all,
I am comparing the genomes of a few different animals and want to see where a bunch of different genes occur along the genome (synteny). I have a file that contains the information in this format:
Code:
AQUE|Contig13321 79728 80756 2551
MBRE|scaffold_8 185118 185240 2551
MLEI|ML1541 963166 963471 2551
NVEC|scaffold_30 1487213 1487575 2551
PBAC|636656210 9855 10283 2551
TADH|scaffold_10 2275716 2275934 2551
AQUE|Contig12562 21299 22057 2632
MBRE|scaffold_33 180983 181267 2632
MLEI|ML0180 177903 179321 2632
NVEC|scaffold_50 174943 175098 2632
PBAC|636638427 5565 5798 2632
TADH|scaffold_8 463904 464299 2632
Field 1: Taxon abbreviation|contig/scaffold name (~=chromosome/piece of the genome)
Field 2: Start coordinate of gene on that contig/scaffold
Field 3: End coordinate of gene on that contig/scaffold
Field 4: Gene number
So, for each gene I have 6 lines (one per species) showing where that gene occurs in each genome. I need to convert this format to the following format (shown for just gene 2551):
Code:
AQUE|Contig13321 79728 80756 MBRE|scaffold_8 185118 185240 #2551
AQUE|Contig13321 79728 80756 MLEI|ML1541 963166 963471 #2551
AQUE|Contig13321 79728 80756 NVEC|scaffold_30 1487213 1487575 #2551
AQUE|Contig13321 79728 80756 PBAC|636656210 9855 10283 #2551
AQUE|Contig13321 79728 80756 TADH|scaffold_10 2275716 2275934 #2551
MBRE|scaffold_8 185118 185240 MLEI|ML1541 963166 963471 #2551
MBRE|scaffold_8 185118 185240 NVEC|scaffold_30 1487213 1487575 #2551
MBRE|scaffold_8 185118 185240 PBAC|636656210 9855 10283 #2551
MBRE|scaffold_8 185118 185240 TADH|scaffold_10 2275716 2275934 #2551
MLEI|ML1541 963166 963471 NVEC|scaffold_30 1487213 1487575 #2551
MLEI|ML1541 963166 963471 PBAC|636656210 9855 10283 #2551
MLEI|ML1541 963166 963471 TADH|scaffold_10 2275716 2275934 #2551
NVEC|scaffold_30 1487213 1487575 PBAC|636656210 9855 10283 #2551
NVEC|scaffold_30 1487213 1487575 TADH|scaffold_10 2275716 2275934 #2551
PBAC|636656210 9855 10283 TADH|scaffold_10 2275716 2275934 #2551
Basically, I need a pairwise list of contig/scaffold name, start, and end positions for every possible pair of the 6 species (15 pairs per gene). I am comparing 6 species for all genes, and each gene occurs exactly once in each species. In the future I might want to include genes that are absent (or at least not sampled) from one or two of the species. I would also like the gene number to appear at the end of each line as a comment, but that isn't necessary for the program.
I basically have no idea where to begin and was wondering if anyone could point me in the right direction. It seems like this might be a job for awk?
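For reference, here is a minimal sketch of one way the conversion could be approached, written in Python rather than awk. It assumes the fields are whitespace-separated, that all lines for a gene share the same value in field 4, and that pairs should come out in the same order the lines appear in the input. The function name pairwise_lines and the three-line sample are only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_lines(lines):
    """Yield one output line per pair of records that share a gene number."""
    by_gene = defaultdict(list)  # gene number -> records, in input order
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: silently skip blank or malformed lines
        name, start, end, gene = fields
        by_gene[gene].append((name, start, end))
    for gene, records in by_gene.items():
        # combinations() keeps input order, so the first record of each
        # pair always appears earlier in the file than the second.
        for first, second in combinations(records, 2):
            yield " ".join(first + second) + " #" + gene


# Small demonstration on three of the lines for gene 2551.
sample = [
    "AQUE|Contig13321 79728 80756 2551",
    "MBRE|scaffold_8 185118 185240 2551",
    "MLEI|ML1541 963166 963471 2551",
]
for out in pairwise_lines(sample):
    print(out)
```

Because the records are grouped by gene number before pairing, a gene that is missing from one or two species would simply produce fewer pairs instead of breaking the run. To process a real file, the function could be given an open file handle instead of the sample list.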
Thanks so much!
Kevin