Script to delete aligned single-character columns with no field separator?
Hey all,
I have a bunch of DNA sequence files formatted like this: Code:
>KOG0001|CAPI|Contig1 I am trying to write a script to remove positions in the sequences that are only represented by a -, X, ?, or N (these represent gaps or missing data). Also, if there is exactly one non-gap/missing character in a position it is also useless (there is nothing to compare it to) so I would like to remove those positions as well. The desired output of this script if the input file above was use would be the following: Code:
>KOG0001|CAPI|Contig1 I'm really not sure where to start here. My first concern is I can't figure out how to tell awk to treat each character in lines not containing a greater-than symbol as a separate field. After that, I'm thinking I should use set up a counter to count the number of lines with gap/missing characters comparing that to the total number of lines not containing greater-than signs? Any help would be greatly appreciated! Kevin |
Try incorporating cut, e.g.
Code:
cut -c 1-4,6-8,10 Code:
$ echo "ATGC-ATGCC" | cut -c 1-4,6-8,10 |
First, let's replay the algorithm: As I understand it: "For each of the characters (-, X, ?, or N), on lines NOT beginning with ">", remove the character unless it occurs in a pair. For pairs, leave one."
Here is a partial solution: Code:
sed '/>/!s/[-X\?N]//g' filename |
sed/grep was my first thought too, but if he knows beforehand that he wants to delete columns 5 and 9, then cut is probably simpler
|
Code:
awk 'BEGIN{FS=OFS=""}$5!~/[ATCG]/{$5=""}$9!~/[ATCG]/{$9=""}{if(NR%2==0){print}}' dna |
I think the mentioned positions 5, 9 and 10 are referred only to the posted example. In a general case, we have to count how many A, T, G or C are in each position of the sequence and exclude positions with 0 or only 1 occurrence of the four bases.
In other words we have to parse the whole file to count items and print the result at the end. A memory problem can arise if the file is very large, but awk is indeed the most suitable tool: Code:
BEGIN { FS = "" } |
How about:
Code:
awk -F "" '!/^>/{for(i=1;i<=NF;i++)if($i !~ /[-X\?N]/ || $i == $(i+1))printf $i;printf "\n"}/^>/' dna |
Maybe only the Original Poster can clarify, but I'm still sure the problem has to be solved "in vertical", not "horizontally" line by line. For me the problem is to compare DNA sequences of the same length (10 positions in the posted example) and for each position:
1. if there are gaps in all sequences, the position must be removed 2. if there are gaps in all sequences but one, the position must be removed 3. if there are at least two bases, the position must be retained in order to compare them. In the example we have: Code:
ATGC-ATGCC |
Hey all,
Thanks for all the replies!!! I'm sorry If my original post wasn't totally clear. I had a hard time trying to articulate this. colucix you are exactly right. The problem needs to be solved vertically. The columns with <2 non-gap/missing characters should be deleted while columns with 2+ should not. The sequences vary in composition so the script needs to look at each column individually for each file. I have to run and teach now but I will to sit down with my books and digest these scripts this evening. Thanks again! Kevin |
colucix,
Your script worked brilliantly! Thanks all! Kevin |
Not sure if this extended question should be at kmkocot or colucix:
I realize this is now solved, but I just had a question (to wrap my head around it). Would it have been possible for the first combination "ATGC-ATGCC" to have had a different ending, say "ATGC-ATGTC" (not sure if TC is valid or not but something other than a double character at the end) based on the other dna references still being there? |
Quote:
|
Well I was thinking that if the last two letters of the dna did not repeat then your vertical math would have been different (I could be wrong here)
The way it has worked out, this time maybe just with these strings, the final output had all the dna lines finishing with the same length, namely 8 characters, but if the first line were to end differently then it would have retained 9 characters or they all would have but the the lower strands would have double bodgy characters (X?-N) at the end??? |
Nope. Maybe there is a bug somewhere, but the main code only counts items:
Code:
for ( i = 1; i <= NF; i++ ) When the first sequence is parsed, the array "position" will be Code:
(/ 1, 1, 1, 1, 0, 1, 1, 1, 1, 1 /) Code:
(/ 2, 2, 2, 2, 0, 2, 2, 2, 1, 2 /) Code:
$0 = sequence[j] |
Ok, I get your code (and fortran away its cool ;)), so if the first or second strand had a bodgy character (X?-N) as the last
character, hence making the total for that column not greater than one, we would have removed both of the final Cs in the first strand. So I guess my questions was related to the fact that we were not getting rid of one of the Cs due to it being a double but because the vertical math did not support it. Thanks |
All times are GMT -5. The time now is 08:36 PM. |