[SOLVED] Shell script to delete amino acid sequences that don't overlap with others

kmkocot · 05-08-2013, 04:31 PM

Hi all,

I'm working with alignments of amino acid sequences. My goal is to make a phylogenetic tree for each alignment showing the relationships among the sequences. Here's the problem: some of the sequences are incomplete and the software can't make a tree if two or more sequences don't overlap. I'd like to write a script that looks at each alignment and if two or more sequences don't overlap by at least X number (maybe 20?) A-Z characters, it deletes the shortest sequence in the alignment and re-runs the check.

Here's an example:

Code:

>BARE@m.16041
SSESEGEAGLRQKLDWEWTERLDVTPDNDFAREKMFYDLTMACSKEALARLSSLNYPVTRPDDYFAQMAKSDEHMKKVRQKLLSKQSSIEKSEQAKKIREMRKYGKKVQHEVLQKRQKEKKAV
>PARA@m.359
DSDEELREGLLSCLTLPWVERLDIVSDDDFKREMFIYRQAQAAVLQAIPKLRKVGVATRRPEDYLAQMAKSDQHMVKVRARLLSKQKETENREKAKKLREMRKRGKKIQQEVLAQRQKEKKEN
>PFUC@aug1.0_108.1_50634.t1
LSKRNLQAGMKQKLNLEWIERLDLTNANDFTRELRFYRQAQASVLEGIPRLQALDIQTKRPEDYFAEMAKSDDHMKRVREKLLEKQMSIDRREKVRKQRDLRKYGKKVQQEVLLKRQKEKKER
>LCAL@m.22729
NSDDELQEKMKEKLNLEWIERLDLTNENDFKREMRFYRQAQSSVIQGIERLHKLKILTKRPDDYFAQMAKSDEHMNKVREKLVAKQSSMEKSEKAKKLRELRKFGKKVQQDVLKKRQKDKKE-
>CTOR@m.12274
-------AGLQQCLDLPWVERLDVSSTDDFKREMKFYRQAQGAVLESLPRLQALRIKTSRPSDYLAEMAKSDQHMKKVREKLISKKIAIERSEKAKKLRELRK--------------------
>BARE@m.16042
-----------------------------LRPGKMFYDLTMACSKEALARLSSLNYPVTRPDDYFAQMAKSDEHMKKVRQKLLSKQSSIEKSEQAKKIREMRKYGKKVQHEVLQKRQKEKKAR
>LHYA@m.308
---------MKAKLGLDWVERLDLTSINDFKREMKFYRQAQSSVLEGISKLHALGIPTKRPEDYFAQMAKSDDHMKKIREKLLAKQIGMERSEKAKKFRELRKYGKKVQHEVIQ---------
>MGRO@m.26921
DSDFELQTSMKQCLHLPWVERLDQISEDDFKRELTFYRQAQGTVLKALYELDKLKILTRRPEDYFAQMAKTDIHMKKVREKLMGKKLSMERSEKAKKLRELRKYGKKVQQEVLMNRQKQKKER
>LRUG@m.27419
DSDKELQTGLKQKTTQNWIERLDMTNDNDFQRELKFYRQAQATVLEAIPKLHKLGVLTKRPEDYFAQMAKSDDHMKKVREKLLSKQTAIDRSEKAKKMRELRKYGKKVQRDVLQKRQKEKKDR
>LPEC@m.2341
-------RGLLQKLGLDWVERLDITCPDDFVREKMFYDLAQAAATEGLSRLTELRYPTKRPDDYFAQMAKSDEHMRKVRSKLLSKEAVAERSEQAKKLREMRKFGKQVQQD------------
>TPO2@m.8293
-----------------------------------------------------------------------------VREKLLSKEQALERSEKAKKLRNLRKYGKKVQTEVIQKRHKEKREK
>OVUL@m.15913
-----LQDDMENILNLNWIERLDYTSMNDFQRELCFYKQAQSAVLSLMPKLRTNGIPTKRPDDYFAEMAKSDLHM------------------------------------------------

Dashes represent missing data. In this example the sequences TPO2@m.8293 and OVUL@m.15913 don't overlap. I'd like the script to delete the TPO2@m.8293 sequence and then re-run the check keeping all other sequences because they all overlap with each other by 20+ non gap characters.

Does anyone have any suggestions as to how to go about doing this? It seems like awk is the tool for the job, at least in part, but I'm having a hard time getting started. Any help would be greatly appreciated.

Thanks!
Kevin

chrism01 · 05-08-2013, 11:40 PM

It seems to me you want the String Fns here http://www.grymoire.com/Unix/Awk.html#uh-40 aka 'AWK Table 10'.
Something like gsub to remove all leading '-', then compare len() of what's left of each. If len() of both are >= 20, then they 'overlap' by at least 20 ?

grail · 05-09-2013, 03:16 AM

I think chrism01's suggestion is good, however, is it likely that you could have lines with missing data not at the ends?

Code:

>LRUG@m.27419
DSDKELQTGLKQKTTQNWIERLDMTNDNDFQRELKFYRQAQATVL-----LHKLGVLTKRPEDYFAQMAKSDDHMKKVREKLLSKQTAI----------ELRKYGKKVQRDVLQKRQKEKKDR

Is the above possible? This would make the overlapping harder to resolve.

I also assume that overlapping means that characters overlap between lines but are not required to be the same characters

Dsw0002 · 05-09-2013, 10:41 AM

I'm not sure how efficient this would be, but storing the beginning position and end position of amino acids in each sequence, then comparing those for each line could get you what you want. This would get tricky for sequences like grail mentioned.

H_TeXMeX_H · 05-09-2013, 12:07 PM

I think the algorithm needed is more complicated. I'm thinking a character by character comparison of every sequence to every other other sequence, counting the matches, sorting them by matches and then removing the one with the least number of matches ... ?

schneidz · 05-09-2013, 01:09 PM

seems more practical in c:

Code:

for(i=0; i < strlen(string1); i++)
{
 if(string1[i] != '-')
 {
  if(string1[i] == string2[i])
   match++;
 }
}

grail · 05-10-2013, 04:23 AM

hmmm ... I am not sure that logic would work schneidz as I do not believe the characters that are not dashes are required to match, just that at the point in the string where you have a character
you also have a character in the other string and this continues consecutively for the minimum (20 characters here).

OP ... Is this a correct analogy?

kmkocot · 05-10-2013, 11:13 AM

grail, your are correct. Dsw and texmex are thinking along the same lines I was.

To "overlap" by my definition, there have to be A-Z characters in place but they don't have to be the same A-Z characters (the differences / what's in common among sequences is important for the phylogenetic tree reconstruction).

Also, it is possible that there would be gaps in the middle of the sequence here and there.

I'm still thinking about this and trying to source some help from a local friend.

Thanks,
Kevin

grail · 05-11-2013, 03:53 AM

ok ... its not pretty, but this seems to work with the current test set:

Code:

#!/usr/bin/awk -f

{
    new = $0
    getline current
}

NR > 2{

    j = split(previous,a,"")
    split(current,b,"")

    for(i=1;i<=j;i++)
        if(a[i] != "-" && b[i] != "-")
            if(a[i] == "-" || b[i] == "-")
                cnt=0
            else
                cnt++

    if(cnt > 20)
        print old RT previous

    cnt = 0
}

{
    old = new
    previous = current
}

END{    print old RT previous }

H_TeXMeX_H · 05-11-2013, 04:49 AM

Here's my attempt in C, just as an exercise:

PHP Code:



#include <stdio.h> 
#include <stdlib.h> 
 
#define MAXID 255 
#define MAXSEQ 1000 
 
struct sequence 
{ 
    char id [MAXID]; 
    char seq [MAXSEQ]; 
    unsigned int length; 
}; 
 
int main (int argc, char * argv[]) 
{ 
    // parse argc 
    if (3 != argc) 
    { 
        printf("Usage:\n%s data.txt min_overlaps\nExample:\n%s data.txt 20\n", argv[0], argv[0]); 
        return 1; 
    } 
 
    // ! init 
    int j; 
    char string [MAXSEQ]; 
 
    // init 
    unsigned int i = 0; 
    unsigned int k = 0; 
    unsigned int numrec = 0; 
    unsigned int overlap = 0; 
    int state = 0; 
 
    // open file 
    FILE * infile = fopen (argv[1], "r"); 
    if (NULL == infile) 
    { 
        fprintf (stderr, "ERROR: cannot open %s\n", argv[1]); 
        return 1; 
    } 
 
    // get number of records 
    while (! feof (infile)) 
    { 
        fscanf (infile, "%s", string); 
        if ('>' == string[0]) 
        { 
            numrec++; 
        } 
    } 
    rewind (infile); 
    struct sequence seqarray [numrec]; 
 
    // parse 
    while (! feof (infile)) 
    { 
        if (0 == state) 
        { 
            // looking for id 
            fscanf (infile, "%s", seqarray[i].id); 
            if ('>' == seqarray[i].id[0]) 
            { 
                state = 1; 
            } 
        } 
        else if (1 == state) 
        { 
            // looking for seq 
            fscanf (infile, "%s", seqarray[i].seq); 
            if (0 != seqarray[i].seq[0]) 
            { 
                state = 0; 
                // get length of current sequence 
                seqarray[i].length = 0; 
                while (0 != seqarray[i].seq[k]) 
                { 
                    if ('-' != seqarray[i].seq[k]) 
                    { 
                        seqarray[i].length++; 
                    } 
                    k++; 
                } 
                k = 0; 
                // check current sequence to all those before 
                for (j = i-1; j >= 0; j--) 
                { 
                    while ((0 != seqarray[i].seq[k]) && (0 != seqarray[j].seq[k])) 
                    { 
                        if ((seqarray[i].seq[k] == seqarray[j].seq[k]) && ('-' != seqarray[i].seq[k])) 
                        { 
                            overlap++; 
                        } 
                        k++; 
                    } 
                    k = 0; 
                    // delete sequences with overlaps below minimum 
                    if (overlap < atoi (argv[2])) 
                    { 
                        if (seqarray[i].length > seqarray[j].length) 
                        { 
                            /*printf ("delete = %s\n", seqarray[j].id); 
                            printf ("keep = %s\n", seqarray[i].id);*/ 
                            seqarray[j].id[0] = 0; 
                        } 
                        else if (seqarray[i].length < seqarray[j].length) 
                        { 
                            /*printf ("delete = %s\n", seqarray[i].id); 
                            printf ("keep = %s\n", seqarray[j].id);*/ 
                            seqarray[i].id[0] = 0; 
                        } 
                    } 
                    overlap = 0; 
                } 
                i++; 
            } 
        } 
    } 
 
    // output 
    for (k = 0; k < numrec; k++) 
    { 
        if (0 != seqarray[k].id[0]) 
        { 
            //printf ("%d\n%d\n", k, seqarray[k].length); 
            printf ("%s\n%s\n", seqarray[k].id, seqarray[k].seq); 
        } 
    } 
 
    fclose (infile); 
 
    return 0; 
}

The output of './parse data.txt 20' is:

Code:

>BARE@m.16041
SSESEGEAGLRQKLDWEWTERLDVTPDNDFAREKMFYDLTMACSKEALARLSSLNYPVTRPDDYFAQMAKSDEHMKKVRQKLLSKQSSIEKSEQAKKIREMRKYGKKVQHEVLQKRQKEKKAV
>PARA@m.359
DSDEELREGLLSCLTLPWVERLDIVSDDDFKREMFIYRQAQAAVLQAIPKLRKVGVATRRPEDYLAQMAKSDQHMVKVRARLLSKQKETENREKAKKLREMRKRGKKIQQEVLAQRQKEKKEN
>PFUC@aug1.0_108.1_50634.t1
LSKRNLQAGMKQKLNLEWIERLDLTNANDFTRELRFYRQAQASVLEGIPRLQALDIQTKRPEDYFAEMAKSDDHMKRVREKLLEKQMSIDRREKVRKQRDLRKYGKKVQQEVLLKRQKEKKER
>LCAL@m.22729
NSDDELQEKMKEKLNLEWIERLDLTNENDFKREMRFYRQAQSSVIQGIERLHKLKILTKRPDDYFAQMAKSDEHMNKVREKLVAKQSSMEKSEKAKKLRELRKFGKKVQQDVLKKRQKDKKE-
>CTOR@m.12274
-------AGLQQCLDLPWVERLDVSSTDDFKREMKFYRQAQGAVLESLPRLQALRIKTSRPSDYLAEMAKSDQHMKKVREKLISKKIAIERSEKAKKLRELRK--------------------
>BARE@m.16042
-----------------------------LRPGKMFYDLTMACSKEALARLSSLNYPVTRPDDYFAQMAKSDEHMKKVRQKLLSKQSSIEKSEQAKKIREMRKYGKKVQHEVLQKRQKEKKAR
>LHYA@m.308
---------MKAKLGLDWVERLDLTSINDFKREMKFYRQAQSSVLEGISKLHALGIPTKRPEDYFAQMAKSDDHMKKIREKLLAKQIGMERSEKAKKFRELRKYGKKVQHEVIQ---------
>MGRO@m.26921
DSDFELQTSMKQCLHLPWVERLDQISEDDFKRELTFYRQAQGTVLKALYELDKLKILTRRPEDYFAQMAKTDIHMKKVREKLMGKKLSMERSEKAKKLRELRKYGKKVQQEVLMNRQKQKKER
>LRUG@m.27419
DSDKELQTGLKQKTTQNWIERLDMTNDNDFQRELKFYRQAQATVLEAIPKLHKLGVLTKRPEDYFAQMAKSDDHMKKVREKLLSKQTAIDRSEKAKKMRELRKYGKKVQRDVLQKRQKEKKDR
>LPEC@m.2341
-------RGLLQKLGLDWVERLDITCPDDFVREKMFYDLAQAAATEGLSRLTELRYPTKRPDDYFAQMAKSDEHMRKVRSKLLSKEAVAERSEQAKKLREMRKFGKQVQQD------------

However, note that it deletes
>TPO2@m.8293 because of OVUL@m.15913
>OVUL@m.15913 because of BARE@m.16042

Hope that is correct. There may be bugs in the code.

grail · 05-11-2013, 05:08 AM

@ H_TeXMeX_H - can you explain the second exclusion? My understanding is that a line is only deleted if it has no overlap with the previous line.

@ OP - just thought of something else, which my code does not allow for, if there are consecutive non-matching lines to the last valid entry, would each line in turn be deleted?

H_TeXMeX_H · 05-11-2013, 05:31 AM

Quote:

Originally Posted by grail

@ H_TeXMeX_H - can you explain the second exclusion? My understanding is that a line is only deleted if it has no overlap with the previous line.

My take on it is that each line is compared to every other.

The OP should clarify.

schneidz · 05-11-2013, 03:52 PM

still not understanding. am i rediculously underestimating the scope of this problem again ?:

Code:

for(i=0; i < strlen(string1); i++)
{
 if((string1[i] != '-') && (string2[i] != '-'))
   match++;
}

Dsw0002 · 05-13-2013, 02:16 PM

Here's my shot at it with Java. Deletes all lines that do not overlap with every other line by at least 20 amino acids, starting with lines with the most non-overlapping lines. If multiple lines have the same number of non-overlapping lines, deletes the line with the fewest amino acids. Not pretty or efficient, but it gets the job done.

Code:

import java.io.*;
import java.util.ArrayList;

public class AlignmentCompare {

	public static void main(String[] args) throws IOException {

		boolean notFinished = true;
		String line1;
		String line2;
		String writeLine;
		int overlaps;
		int badLine;
		int i;
		int j;
		
		//Dr. Steve Brule has an opinion on not providing input files.
		if (args[0] == null) {
			System.out.print("Ever wonder why ice cubes taste so boring? " +
					"It’s cuz you make ‘em outta stupid water, you bimbo! " +
					"Put some fruit juice in there and freeze it into ice " +
					"cubes, and put THAT in your milk. " +
					"Also, gimme a file to run this on, ya dummy!)");
			System.exit(0);
		}
		//Runs until each line overlaps with every other by at least 20.
		while (notFinished) {
			File alignmentFilename = new File(args[0]);
			File tempFilename = new File("myTempFile.txt");
			
			//Reader for iterating through lines to be compared.
			BufferedReader AlignmentReader1 = new BufferedReader(
					new FileReader(alignmentFilename));
			//Reader for going through remaining lines to compare to Reader1 lines.
			BufferedReader AlignmentReader2 = null;

			BufferedWriter AlignmentWriter = new BufferedWriter(new FileWriter(
					tempFilename));

			line1 = null;
			line2 = null;
			writeLine = null;
			overlaps = 0;
			badLine = 0;
			i = 0;
			j = 0;
			
			//ArrayList of ArrayList that holds the line numbers that don't overlaps
			// for each line.
			ArrayList<ArrayList<Integer>> noOverlaps = new ArrayList<ArrayList<Integer>>(
					0);
			//Stores the number of amino acids in each line.
			ArrayList<Integer> lineLengths = new ArrayList<Integer>(0);
			int lineToDelete;

			//While loop for comparisons of 1 line to each other line.
			while ((line1 = AlignmentReader1.readLine()) != null) {
				//ArrayList to keep track of the line numbers that don't overlap with
				// with the line currently being looked at.
				ArrayList<Integer> currentNoOverlaps = new ArrayList<Integer>(0);
				AlignmentReader2 = new BufferedReader(new FileReader(
						alignmentFilename));
				//If the current line is a header, skip it.
				if (line1.charAt(0) == '>') {
					line1 = AlignmentReader1.readLine();
					i++;
				}
				//Line 1 shouldn't be compared to line 1. Move down to the line after
				// the line that Reader1 just read.
				j = i;
				for (int inc = 0; inc < i; inc++) {
					@SuppressWarnings("unused")
					String temp = AlignmentReader2.readLine();

				}
				//Here we use Reader2 to start reading in all lines after Reader1's
				// current line for comparison.
				while ((line2 = AlignmentReader2.readLine()) != null) {
					//Skip headers
					if (line2.charAt(0) == '>') {
						line2 = AlignmentReader2.readLine();
						j++;
					}
					overlaps = 0;
					//Counting overlaps.
					for (int k = 0; k < line1.length() && k < line2.length(); k++) {
						//If both chars at position k aren't gaps, increment.
						if (checkForGaps(line1, k) == -1
								&& checkForGaps(line2, k) == -1) {
							overlaps++;
						}
					}
					//If we don't have more than 20 overlaps, add this line to our list.
					if (overlaps <= 20) {
						currentNoOverlaps.add(j);
					}
					j++;
				}
				//Reader2 has iterated through the whole file. Close reader2.
				AlignmentReader2.close();
				//Get the length of this line in amino acids.
				lineLengths.add(findLength(line1, i));
				//Add this line's results to the ArrayList of ArrayLists.
				noOverlaps.add(currentNoOverlaps);
				i++;
			}
			//Reader1 has iterated through the whole file. Close reader1.
			AlignmentReader1.close();
			
			//To determine which line is the worst and should be deleted, we
			// count the number of lines that don't overlap with it.
			int[] overlapCounts;
			overlapCounts = new int[noOverlaps.size()];

			for (int l = 0; l < noOverlaps.size(); l++) {
				for (int m = 0; m < noOverlaps.get(l).size(); m++) {
					badLine = noOverlaps.get(l).get(m);
					if (badLine != 0) {
						badLine = (badLine + 1) / 2;
						//Since comparisons don't go backwards(1 was recorded as not
						// overlapping with 2, but not vice versa) we need to account
						// for this.
						overlapCounts[badLine - 1]++;
						overlapCounts[l]++;
					}
				}
			}
			
			//Use findLargestAndSmallest to determine which line has the most lines that
			// do not overlaps with it.
			lineToDelete = findLargestAndSmallest(overlapCounts, lineLengths);
			
			//Stop when all lines overlap.
			if (lineToDelete == 0) {
				notFinished = false;
				break;
			}
			//Start readering in file for writing.
			AlignmentReader1 = new BufferedReader(new FileReader(
					alignmentFilename));

				// Find the lineToDelete and don't write it.
				i = 1;
				while ((writeLine = AlignmentReader1.readLine()) != null) {
					//Doubling lineToDelete to account for headers.
					if ((i != (lineToDelete * 2))
							&& ((i + 1) != (lineToDelete * 2))) {
						AlignmentWriter.write(writeLine);
						AlignmentWriter.newLine();
					}
					i++;
				}
			
			AlignmentReader1.close();
			AlignmentWriter.close();
			tempFilename.renameTo(alignmentFilename);
		}
	}

	//Method to check if the current char is a gap.
	public static int checkForGaps(String line, int inc) {
		String badChars = "-X?";
		int check = badChars.indexOf(line.charAt(inc));
		return check;
	}
	
	//Determines which line has the most non-overlapping lines and returns it.
	public static int findLargestAndSmallest(int[] integers,
			ArrayList<Integer> lengths) {
		boolean conflict = false;
		int largestNonOverlaps = 0;
		int smallestLength = 0;
		int position = 0;
		//If two lines have the same number of non-overlapping lines, this ArrayList
		// holds their positions.
		ArrayList<Integer> choices = new ArrayList<Integer>(1);
		
		//Checking which has the most non-overlaps.
		for (int i = 0; i < integers.length; i++) {
			if (integers[i] > largestNonOverlaps) {
				conflict = false;
				choices.clear();
				choices.add(i + 1);
				largestNonOverlaps = integers[i];
				//Save which line has the most non-overlaps.
				position = i + 1;
				//If equal to the current line with most non-overlaps,
				// add to choices without clearing it.
			} else if (integers[i] == largestNonOverlaps) {
				conflict = true;
				choices.add(i + 1);
			}
		}
		//If there are no overlaps, we're done.
		if (largestNonOverlaps == 0) 
			return 0;
		//If two lines have the same amount of non-overlapping lines,
		// find the one with the smallest number of amino acids.
		if (conflict) {
			smallestLength = lengths.get(choices.get(0));
			position = choices.get(0);
			for (int j = 1; j < choices.size(); j++) {
				if (lengths.get((choices.get(j)-1)) < smallestLength) {
					smallestLength = lengths.get((choices.get(j)-1));
					position = choices.get(j);
				}
			}
		}
		//Returns the line number of line with the most non-overlaps/smallest length
		// of amino acids.
		return position;
	}
	
	//Finds the length in amino acids of a line.
	public static int findLength(String line, int position) {
		int length = 0;
		for (int i = 0; i < line.length(); i++) {
			if (checkForGaps(line, i) == -1)
				length++;
		}
		return length;
	}
}

kmkocot · 05-14-2013, 07:49 PM

Wow! That's amazing. I compiled it with javac and it does exactly what I wanted. Thanks!!!