LinuxQuestions.org - [SOLVED] Script to delete aligned single-character columns with no field separator?

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - Script to delete aligned single-character columns with no field separator? (https://www.linuxquestions.org/questions/programming-9/script-to-delete-aligned-single-character-columns-with-no-field-separator-803126/)

Script to delete aligned single-character columns with no field separator?

Hey all,

I have a bunch of DNA sequence files formatted like this:

Code:

>KOG0001|CAPI|Contig1

ATGC-ATGCC

>KOG0001|NEOM|Contig95

ATGC-ATG-C

>KOG0001|ACAL|Contig95

ATGC-ATG--

>KOG0001|WARG|Contig3

ATGCXATGXX

>KOG0001|LCIN|Contig99

ATGC?ATG??

>KOG0001|NVEC|Contig33

ATGCNATGNN

The lines beginning with greater-than symbols are the sequence descriptors and the lines immediately after each descriptor with A-Z characters, dashes, and question marks are the aligned DNA sequences. The sequences are always the same length within a file and never span/wrap across more than one line.

I am trying to write a script to remove positions in the sequences that are only represented by a -, X, ?, or N (these represent gaps or missing data). Also, if there is exactly one non-gap/missing character in a position it is also useless (there is nothing to compare it to) so I would like to remove those positions as well.

The desired output of this script if the input file above was use would be the following:

Code:

>KOG0001|CAPI|Contig1

ATGCATGC

>KOG0001|NEOM|Contig95

ATGCATGC

>KOG0001|ACAL|Contig95

ATGCATG-

>KOG0001|WARG|Contig3

ATGCATGX

>KOG0001|LCIN|Contig99

ATGCATG?

>KOG0001|NVEC|Contig33

ATGCATGN

Position 5 (from the left) was removed because it was all gap/missing characters. Position 9 was removed because only one character was a non-gap/missing character. Position 10 was retained because there were 2 non-gap/missing characters.

I'm really not sure where to start here. My first concern is I can't figure out how to tell awk to treat each character in lines not containing a greater-than symbol as a separate field. After that, I'm thinking I should use set up a counter to count the number of lines with gap/missing characters comparing that to the total number of lines not containing greater-than signs? Any help would be greatly appreciated!

Kevin

Try incorporating cut, e.g.

Code:

cut -c 1-4,6-8,10

Code:

$ echo "ATGC-ATGCC" | cut -c 1-4,6-8,10

ATGCATGC

First, let's replay the algorithm: As I understand it: "For each of the characters (-, X, ?, or N), on lines NOT beginning with ">", remove the character unless it occurs in a pair. For pairs, leave one."

Here is a partial solution:

Code:

sed '/>/!s/[-X\?N]//g' filename

This deletes every instance of any of the 4 characters. Getting the right action when there is a pair is going to take more thought.....

sed/grep was my first thought too, but if he knows beforehand that he wants to delete columns 5 and 9, then cut is probably simpler

Code:

awk 'BEGIN{FS=OFS=""}$5!~/[ATCG]/{$5=""}$9!~/[ATCG]/{$9=""}{if(NR%2==0){print}}' dna

I think the mentioned positions 5, 9 and 10 are referred only to the posted example. In a general case, we have to count how many A, T, G or C are in each position of the sequence and exclude positions with 0 or only 1 occurrence of the four bases.

In other words we have to parse the whole file to count items and print the result at the end. A memory problem can arise if the file is very large, but awk is indeed the most suitable tool:

Code:

BEGIN { FS = "" }



!/^>/ {



  sequence[NR] = $0

  

  for ( i = 1; i <= NF; i++ )

    position[i] += ($i ~ /[ATGC]/) 

  

}



/^>/ {



  header[NR] = $0

  

}

  

END {



  for ( j = 1; j <= NR; j++ ) {

    if ( j in header) print header[j]

    if ( j in sequence ) {

    $0 = sequence[j]

    for ( i = 1; i <= NF; i++)

      if ( position[i] > 1 ) printf "%s", $i

    printf "\n"

    }

  }



}

Hope this helps.

How about:

Code:

awk -F "" '!/^>/{for(i=1;i<=NF;i++)if($i !~ /[-X\?N]/ || $i == $(i+1))printf $i;printf "\n"}/^>/' dna

Maybe only the Original Poster can clarify, but I'm still sure the problem has to be solved "in vertical", not "horizontally" line by line. For me the problem is to compare DNA sequences of the same length (10 positions in the posted example) and for each position:
1. if there are gaps in all sequences, the position must be removed
2. if there are gaps in all sequences but one, the position must be removed
3. if there are at least two bases, the position must be retained in order to compare them.

In the example we have:

Code:

ATGC-ATGCC

ATGC-ATG-C

ATGC-ATG--

ATGCXATGXX

ATGC?ATG??

ATGCNATGNN

----------

6666066612  <-- total number of DNA bases (non-gaps) for each position

where the 5th and 9th positions must be removed, the others retained. Hence, the only chance is to read the entire file and then print out the result. kmkocot, is this correct? We're looking forward to your answer... ;)

Hey all,

Thanks for all the replies!!! I'm sorry If my original post wasn't totally clear. I had a hard time trying to articulate this.

colucix you are exactly right. The problem needs to be solved vertically. The columns with <2 non-gap/missing characters should be deleted while columns with 2+ should not. The sequences vary in composition so the script needs to look at each column individually for each file. I have to run and teach now but I will to sit down with my books and digest these scripts this evening.

Thanks again!
Kevin

colucix,

Your script worked brilliantly! Thanks all!

Kevin

Not sure if this extended question should be at kmkocot or colucix:
I realize this is now solved, but I just had a question (to wrap my head around it).
Would it have been possible for the first combination "ATGC-ATGCC" to have had a different ending,
say "ATGC-ATGTC" (not sure if TC is valid or not but something other than a double character at the end)
based on the other dna references still being there?

Quote:

Originally Posted by grail (Post 3943425)

Would it have been possible for the first combination "ATGC-ATGCC" to have had a different ending, say "ATGC-ATGTC" (not sure if TC is valid or not but something other than a double character at the end)

Hi grail :) Why not? DNA sequences are practically infinite if you consider the length of DNA polymers and changes due to mutation. What are you thinking about?

Well I was thinking that if the last two letters of the dna did not repeat then your vertical math would have been different (I could be wrong here)
The way it has worked out, this time maybe just with these strings, the final output had all the dna lines finishing with the same length, namely
8 characters, but if the first line were to end differently then it would have retained 9 characters or they all would have but the the lower
strands would have double bodgy characters (X?-N) at the end???

Nope. Maybe there is a bug somewhere, but the main code only counts items:

Code:

for ( i = 1; i <= NF; i++ )

    position[i] += ($i ~ /[ATGC]/)

this creates/updates an array whose indexes are the field numbers and whose values are incremented by one if the expression is true. This means that it counts only occurrences of A, T, G and C for each position in the sequence (and each position independently from the others).

When the first sequence is parsed, the array "position" will be

Code:

(/ 1, 1, 1, 1, 0, 1, 1, 1, 1, 1 /)

(forgive the fortran notation). When the second sequence is parsed it will be

Code:

(/ 2, 2, 2, 2, 0, 2, 2, 2, 1, 2 /)

and so on. Only in the END block (after the entire file has been parsed and the array contains the totals) it checks how many A, T, G, C (non-gaps) are in each position. At that point some "columns" will be left out (the same columns/positions for all the sequences):

Code:

    $0 = sequence[j]

    for ( i = 1; i <= NF; i++)

      if ( position[i] > 1 ) printf "%s", $i

    printf "\n"

Take in mind that all the sequences in the input file have the same length. This code relates to NF (of the lines containing a DNA sequence) and NR, so that it should work for any length of the sequence (it may differ for different input files) and for any number of lines in the file (until memory limits are reached... since it stores the entire content of the file in memory).

Ok, I get your code (and fortran away its cool ;)), so if the first or second strand had a bodgy character (X?-N) as the last
character, hence making the total for that column not greater than one, we would have removed both of the final Cs in the first strand.

So I guess my questions was related to the fact that we were not getting rid of one of the Cs due to it being a double but because the vertical
math did not support it.

Thanks