[SOLVED] Shell script to delete amino acid sequences that don't overlap with others
ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
Shell script to delete amino acid sequences that don't overlap with others
Hi all,
I'm working with alignments of amino acid sequences. My goal is to make a phylogenetic tree for each alignment showing the relationships among the sequences. Here's the problem: some of the sequences are incomplete and the software can't make a tree if two or more sequences don't overlap. I'd like to write a script that looks at each alignment and if two or more sequences don't overlap by at least X number (maybe 20?) A-Z characters, it deletes the shortest sequence in the alignment and re-runs the check.
Dashes represent missing data. In this example the sequences TPO2@m.8293 and OVUL@m.15913 don't overlap. I'd like the script to delete the TPO2@m.8293 sequence and then re-run the check keeping all other sequences because they all overlap with each other by 20+ non gap characters.
Does anyone have any suggestions as to how to go about doing this? It seems like awk is the tool for the job, at least in part, but I'm having a hard time getting started. Any help would be greatly appreciated.
It seems to me you want the String Fns here http://www.grymoire.com/Unix/Awk.html#uh-40 aka 'AWK Table 10'.
Something like gsub to remove all leading '-', then compare len() of what's left of each. If len() of both are >= 20, then they 'overlap' by at least 20 ?
I'm not sure how efficient this would be, but storing the beginning position and end position of amino acids in each sequence, then comparing those for each line could get you what you want. This would get tricky for sequences like grail mentioned.
I think the algorithm needed is more complicated. I'm thinking a character by character comparison of every sequence to every other other sequence, counting the matches, sorting them by matches and then removing the one with the least number of matches ... ?
hmmm ... I am not sure that logic would work schneidz as I do not believe the characters that are not dashes are required to match, just that at the point in the string where you have a character
you also have a character in the other string and this continues consecutively for the minimum (20 characters here).
grail, your are correct. Dsw and texmex are thinking along the same lines I was.
To "overlap" by my definition, there have to be A-Z characters in place but they don't have to be the same A-Z characters (the differences / what's in common among sequences is important for the phylogenetic tree reconstruction).
Also, it is possible that there would be gaps in the middle of the sequence here and there.
I'm still thinking about this and trying to source some help from a local friend.
struct sequence { char id [MAXID]; char seq [MAXSEQ]; unsigned int length; };
int main (int argc, char * argv[]) { // parse argc if (3 != argc) { printf("Usage:\n%s data.txt min_overlaps\nExample:\n%s data.txt 20\n", argv[0], argv[0]); return 1; }
// ! init int j; char string [MAXSEQ];
// init unsigned int i = 0; unsigned int k = 0; unsigned int numrec = 0; unsigned int overlap = 0; int state = 0;
// open file FILE * infile = fopen (argv[1], "r"); if (NULL == infile) { fprintf (stderr, "ERROR: cannot open %s\n", argv[1]); return 1; }
// get number of records while (! feof (infile)) { fscanf (infile, "%s", string); if ('>' == string[0]) { numrec++; } } rewind (infile); struct sequence seqarray [numrec];
// parse while (! feof (infile)) { if (0 == state) { // looking for id fscanf (infile, "%s", seqarray[i].id); if ('>' == seqarray[i].id[0]) { state = 1; } } else if (1 == state) { // looking for seq fscanf (infile, "%s", seqarray[i].seq); if (0 != seqarray[i].seq[0]) { state = 0; // get length of current sequence seqarray[i].length = 0; while (0 != seqarray[i].seq[k]) { if ('-' != seqarray[i].seq[k]) { seqarray[i].length++; } k++; } k = 0; // check current sequence to all those before for (j = i-1; j >= 0; j--) { while ((0 != seqarray[i].seq[k]) && (0 != seqarray[j].seq[k])) { if ((seqarray[i].seq[k] == seqarray[j].seq[k]) && ('-' != seqarray[i].seq[k])) { overlap++; } k++; } k = 0; // delete sequences with overlaps below minimum if (overlap < atoi (argv[2])) { if (seqarray[i].length > seqarray[j].length) { /*printf ("delete = %s\n", seqarray[j].id); printf ("keep = %s\n", seqarray[i].id);*/ seqarray[j].id[0] = 0; } else if (seqarray[i].length < seqarray[j].length) { /*printf ("delete = %s\n", seqarray[i].id); printf ("keep = %s\n", seqarray[j].id);*/ seqarray[i].id[0] = 0; } } overlap = 0; } i++; } } }
// output for (k = 0; k < numrec; k++) { if (0 != seqarray[k].id[0]) { //printf ("%d\n%d\n", k, seqarray[k].length); printf ("%s\n%s\n", seqarray[k].id, seqarray[k].seq); } }
@ H_TeXMeX_H - can you explain the second exclusion? My understanding is that a line is only deleted if it has no overlap with the previous line.
@ OP - just thought of something else, which my code does not allow for, if there are consecutive non-matching lines to the last valid entry, would each line in turn be deleted?
Here's my shot at it with Java. Deletes all lines that do not overlap with every other line by at least 20 amino acids, starting with lines with the most non-overlapping lines. If multiple lines have the same number of non-overlapping lines, deletes the line with the fewest amino acids. Not pretty or efficient, but it gets the job done.
Code:
import java.io.*;
import java.util.ArrayList;
public class AlignmentCompare {
public static void main(String[] args) throws IOException {
boolean notFinished = true;
String line1;
String line2;
String writeLine;
int overlaps;
int badLine;
int i;
int j;
//Dr. Steve Brule has an opinion on not providing input files.
if (args[0] == null) {
System.out.print("Ever wonder why ice cubes taste so boring? " +
"It’s cuz you make ‘em outta stupid water, you bimbo! " +
"Put some fruit juice in there and freeze it into ice " +
"cubes, and put THAT in your milk. " +
"Also, gimme a file to run this on, ya dummy!)");
System.exit(0);
}
//Runs until each line overlaps with every other by at least 20.
while (notFinished) {
File alignmentFilename = new File(args[0]);
File tempFilename = new File("myTempFile.txt");
//Reader for iterating through lines to be compared.
BufferedReader AlignmentReader1 = new BufferedReader(
new FileReader(alignmentFilename));
//Reader for going through remaining lines to compare to Reader1 lines.
BufferedReader AlignmentReader2 = null;
BufferedWriter AlignmentWriter = new BufferedWriter(new FileWriter(
tempFilename));
line1 = null;
line2 = null;
writeLine = null;
overlaps = 0;
badLine = 0;
i = 0;
j = 0;
//ArrayList of ArrayList that holds the line numbers that don't overlaps
// for each line.
ArrayList<ArrayList<Integer>> noOverlaps = new ArrayList<ArrayList<Integer>>(
0);
//Stores the number of amino acids in each line.
ArrayList<Integer> lineLengths = new ArrayList<Integer>(0);
int lineToDelete;
//While loop for comparisons of 1 line to each other line.
while ((line1 = AlignmentReader1.readLine()) != null) {
//ArrayList to keep track of the line numbers that don't overlap with
// with the line currently being looked at.
ArrayList<Integer> currentNoOverlaps = new ArrayList<Integer>(0);
AlignmentReader2 = new BufferedReader(new FileReader(
alignmentFilename));
//If the current line is a header, skip it.
if (line1.charAt(0) == '>') {
line1 = AlignmentReader1.readLine();
i++;
}
//Line 1 shouldn't be compared to line 1. Move down to the line after
// the line that Reader1 just read.
j = i;
for (int inc = 0; inc < i; inc++) {
@SuppressWarnings("unused")
String temp = AlignmentReader2.readLine();
}
//Here we use Reader2 to start reading in all lines after Reader1's
// current line for comparison.
while ((line2 = AlignmentReader2.readLine()) != null) {
//Skip headers
if (line2.charAt(0) == '>') {
line2 = AlignmentReader2.readLine();
j++;
}
overlaps = 0;
//Counting overlaps.
for (int k = 0; k < line1.length() && k < line2.length(); k++) {
//If both chars at position k aren't gaps, increment.
if (checkForGaps(line1, k) == -1
&& checkForGaps(line2, k) == -1) {
overlaps++;
}
}
//If we don't have more than 20 overlaps, add this line to our list.
if (overlaps <= 20) {
currentNoOverlaps.add(j);
}
j++;
}
//Reader2 has iterated through the whole file. Close reader2.
AlignmentReader2.close();
//Get the length of this line in amino acids.
lineLengths.add(findLength(line1, i));
//Add this line's results to the ArrayList of ArrayLists.
noOverlaps.add(currentNoOverlaps);
i++;
}
//Reader1 has iterated through the whole file. Close reader1.
AlignmentReader1.close();
//To determine which line is the worst and should be deleted, we
// count the number of lines that don't overlap with it.
int[] overlapCounts;
overlapCounts = new int[noOverlaps.size()];
for (int l = 0; l < noOverlaps.size(); l++) {
for (int m = 0; m < noOverlaps.get(l).size(); m++) {
badLine = noOverlaps.get(l).get(m);
if (badLine != 0) {
badLine = (badLine + 1) / 2;
//Since comparisons don't go backwards(1 was recorded as not
// overlapping with 2, but not vice versa) we need to account
// for this.
overlapCounts[badLine - 1]++;
overlapCounts[l]++;
}
}
}
//Use findLargestAndSmallest to determine which line has the most lines that
// do not overlaps with it.
lineToDelete = findLargestAndSmallest(overlapCounts, lineLengths);
//Stop when all lines overlap.
if (lineToDelete == 0) {
notFinished = false;
break;
}
//Start readering in file for writing.
AlignmentReader1 = new BufferedReader(new FileReader(
alignmentFilename));
// Find the lineToDelete and don't write it.
i = 1;
while ((writeLine = AlignmentReader1.readLine()) != null) {
//Doubling lineToDelete to account for headers.
if ((i != (lineToDelete * 2))
&& ((i + 1) != (lineToDelete * 2))) {
AlignmentWriter.write(writeLine);
AlignmentWriter.newLine();
}
i++;
}
AlignmentReader1.close();
AlignmentWriter.close();
tempFilename.renameTo(alignmentFilename);
}
}
//Method to check if the current char is a gap.
public static int checkForGaps(String line, int inc) {
String badChars = "-X?";
int check = badChars.indexOf(line.charAt(inc));
return check;
}
//Determines which line has the most non-overlapping lines and returns it.
public static int findLargestAndSmallest(int[] integers,
ArrayList<Integer> lengths) {
boolean conflict = false;
int largestNonOverlaps = 0;
int smallestLength = 0;
int position = 0;
//If two lines have the same number of non-overlapping lines, this ArrayList
// holds their positions.
ArrayList<Integer> choices = new ArrayList<Integer>(1);
//Checking which has the most non-overlaps.
for (int i = 0; i < integers.length; i++) {
if (integers[i] > largestNonOverlaps) {
conflict = false;
choices.clear();
choices.add(i + 1);
largestNonOverlaps = integers[i];
//Save which line has the most non-overlaps.
position = i + 1;
//If equal to the current line with most non-overlaps,
// add to choices without clearing it.
} else if (integers[i] == largestNonOverlaps) {
conflict = true;
choices.add(i + 1);
}
}
//If there are no overlaps, we're done.
if (largestNonOverlaps == 0)
return 0;
//If two lines have the same amount of non-overlapping lines,
// find the one with the smallest number of amino acids.
if (conflict) {
smallestLength = lengths.get(choices.get(0));
position = choices.get(0);
for (int j = 1; j < choices.size(); j++) {
if (lengths.get((choices.get(j)-1)) < smallestLength) {
smallestLength = lengths.get((choices.get(j)-1));
position = choices.get(j);
}
}
}
//Returns the line number of line with the most non-overlaps/smallest length
// of amino acids.
return position;
}
//Finds the length in amino acids of a line.
public static int findLength(String line, int position) {
int length = 0;
for (int i = 0; i < line.length(); i++) {
if (checkForGaps(line, i) == -1)
length++;
}
return length;
}
}
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.