LinuxQuestions.org
Did you know LQ has a Linux Hardware Compatibility List?
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
Search this Thread
Old 04-20-2010, 03:37 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 98

Rep: Reputation: 15
Script to delete aligned single-character columns with no field separator?


Hey all,

I have a bunch of DNA sequence files formatted like this:
Code:
>KOG0001|CAPI|Contig1
ATGC-ATGCC
>KOG0001|NEOM|Contig95
ATGC-ATG-C
>KOG0001|ACAL|Contig95
ATGC-ATG--
>KOG0001|WARG|Contig3
ATGCXATGXX
>KOG0001|LCIN|Contig99
ATGC?ATG??
>KOG0001|NVEC|Contig33
ATGCNATGNN
The lines beginning with greater-than symbols are the sequence descriptors and the lines immediately after each descriptor with A-Z characters, dashes, and question marks are the aligned DNA sequences. The sequences are always the same length within a file and never span/wrap across more than one line.

I am trying to write a script to remove positions in the sequences that are only represented by a -, X, ?, or N (these represent gaps or missing data). Also, if there is exactly one non-gap/missing character in a position it is also useless (there is nothing to compare it to) so I would like to remove those positions as well.

The desired output of this script if the input file above was use would be the following:
Code:
>KOG0001|CAPI|Contig1
ATGCATGC
>KOG0001|NEOM|Contig95
ATGCATGC
>KOG0001|ACAL|Contig95
ATGCATG-
>KOG0001|WARG|Contig3
ATGCATGX
>KOG0001|LCIN|Contig99
ATGCATG?
>KOG0001|NVEC|Contig33
ATGCATGN
Position 5 (from the left) was removed because it was all gap/missing characters. Position 9 was removed because only one character was a non-gap/missing character. Position 10 was retained because there were 2 non-gap/missing characters.

I'm really not sure where to start here. My first concern is I can't figure out how to tell awk to treat each character in lines not containing a greater-than symbol as a separate field. After that, I'm thinking I should use set up a counter to count the number of lines with gap/missing characters comparing that to the total number of lines not containing greater-than signs? Any help would be greatly appreciated!

Kevin
 
Old 04-20-2010, 03:52 PM   #2
AlucardZero
Senior Member
 
Registered: May 2006
Location: USA
Distribution: Debian
Posts: 4,608

Rep: Reputation: 517Reputation: 517Reputation: 517Reputation: 517Reputation: 517Reputation: 517
Try incorporating cut, e.g.
Code:
cut -c 1-4,6-8,10
Code:
$ echo "ATGC-ATGCC" | cut -c 1-4,6-8,10
ATGCATGC

Last edited by AlucardZero; 04-20-2010 at 03:56 PM.
 
Old 04-20-2010, 04:13 PM   #3
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728
First, let's replay the algorithm: As I understand it: "For each of the characters (-, X, ?, or N), on lines NOT beginning with ">", remove the character unless it occurs in a pair. For pairs, leave one."

Here is a partial solution:
Code:
sed '/>/!s/[-X\?N]//g' filename
This deletes every instance of any of the 4 characters. Getting the right action when there is a pair is going to take more thought.....
 
Old 04-20-2010, 04:22 PM   #4
AlucardZero
Senior Member
 
Registered: May 2006
Location: USA
Distribution: Debian
Posts: 4,608

Rep: Reputation: 517Reputation: 517Reputation: 517Reputation: 517Reputation: 517Reputation: 517
sed/grep was my first thought too, but if he knows beforehand that he wants to delete columns 5 and 9, then cut is probably simpler
 
Old 04-20-2010, 04:23 PM   #5
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,964
Blog Entries: 11

Rep: Reputation: 865Reputation: 865Reputation: 865Reputation: 865Reputation: 865Reputation: 865Reputation: 865
Code:
awk 'BEGIN{FS=OFS=""}$5!~/[ATCG]/{$5=""}$9!~/[ATCG]/{$9=""}{if(NR%2==0){print}}'  dna
 
Old 04-20-2010, 05:09 PM   #6
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,458

Rep: Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941
I think the mentioned positions 5, 9 and 10 are referred only to the posted example. In a general case, we have to count how many A, T, G or C are in each position of the sequence and exclude positions with 0 or only 1 occurrence of the four bases.

In other words we have to parse the whole file to count items and print the result at the end. A memory problem can arise if the file is very large, but awk is indeed the most suitable tool:
Code:
BEGIN { FS = "" }

!/^>/ {

  sequence[NR] = $0
  
  for ( i = 1; i <= NF; i++ )
    position[i] += ($i ~ /[ATGC]/) 
   
}

/^>/ {

  header[NR] = $0
  
}
  
END {

  for ( j = 1; j <= NR; j++ ) {
    if ( j in header) print header[j]
    if ( j in sequence ) {
    $0 = sequence[j]
    for ( i = 1; i <= NF; i++)
      if ( position[i] > 1 ) printf "%s", $i
    printf "\n"
    }
  }

}
Hope this helps.
 
1 members found this post helpful.
Old 04-20-2010, 08:34 PM   #7
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,424

Rep: Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876
How about:
Code:
awk -F "" '!/^>/{for(i=1;i<=NF;i++)if($i !~ /[-X\?N]/ || $i == $(i+1))printf $i;printf "\n"}/^>/' dna

Last edited by grail; 04-21-2010 at 10:27 AM.
 
Old 04-21-2010, 04:07 AM   #8
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,458

Rep: Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941
Maybe only the Original Poster can clarify, but I'm still sure the problem has to be solved "in vertical", not "horizontally" line by line. For me the problem is to compare DNA sequences of the same length (10 positions in the posted example) and for each position:
1. if there are gaps in all sequences, the position must be removed
2. if there are gaps in all sequences but one, the position must be removed
3. if there are at least two bases, the position must be retained in order to compare them.

In the example we have:
Code:
ATGC-ATGCC
ATGC-ATG-C
ATGC-ATG--
ATGCXATGXX
ATGC?ATG??
ATGCNATGNN
----------
6666066612   <-- total number of DNA bases (non-gaps) for each position
where the 5th and 9th positions must be removed, the others retained. Hence, the only chance is to read the entire file and then print out the result. kmkocot, is this correct? We're looking forward to your answer...
 
Old 04-21-2010, 09:17 AM   #9
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 98

Original Poster
Rep: Reputation: 15
Hey all,

Thanks for all the replies!!! I'm sorry If my original post wasn't totally clear. I had a hard time trying to articulate this.

colucix you are exactly right. The problem needs to be solved vertically. The columns with <2 non-gap/missing characters should be deleted while columns with 2+ should not. The sequences vary in composition so the script needs to look at each column individually for each file. I have to run and teach now but I will to sit down with my books and digest these scripts this evening.

Thanks again!
Kevin
 
Old 04-21-2010, 04:51 PM   #10
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 98

Original Poster
Rep: Reputation: 15
Smile

colucix,

Your script worked brilliantly! Thanks all!

Kevin
 
Old 04-21-2010, 09:26 PM   #11
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,424

Rep: Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876
Not sure if this extended question should be at kmkocot or colucix:
I realize this is now solved, but I just had a question (to wrap my head around it).
Would it have been possible for the first combination "ATGC-ATGCC" to have had a different ending,
say "ATGC-ATGTC" (not sure if TC is valid or not but something other than a double character at the end)
based on the other dna references still being there?
 
Old 04-22-2010, 03:08 AM   #12
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,458

Rep: Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941
Quote:
Originally Posted by grail View Post
Would it have been possible for the first combination "ATGC-ATGCC" to have had a different ending, say "ATGC-ATGTC" (not sure if TC is valid or not but something other than a double character at the end)
Hi grail Why not? DNA sequences are practically infinite if you consider the length of DNA polymers and changes due to mutation. What are you thinking about?
 
Old 04-22-2010, 06:33 AM   #13
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,424

Rep: Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876
Well I was thinking that if the last two letters of the dna did not repeat then your vertical math would have been different (I could be wrong here)
The way it has worked out, this time maybe just with these strings, the final output had all the dna lines finishing with the same length, namely
8 characters, but if the first line were to end differently then it would have retained 9 characters or they all would have but the the lower
strands would have double bodgy characters (X?-N) at the end???
 
Old 04-22-2010, 08:47 AM   #14
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,458

Rep: Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941Reputation: 1941
Nope. Maybe there is a bug somewhere, but the main code only counts items:
Code:
for ( i = 1; i <= NF; i++ )
    position[i] += ($i ~ /[ATGC]/)
this creates/updates an array whose indexes are the field numbers and whose values are incremented by one if the expression is true. This means that it counts only occurrences of A, T, G and C for each position in the sequence (and each position independently from the others).

When the first sequence is parsed, the array "position" will be
Code:
(/ 1, 1, 1, 1, 0, 1, 1, 1, 1, 1 /)
(forgive the fortran notation). When the second sequence is parsed it will be
Code:
(/ 2, 2, 2, 2, 0, 2, 2, 2, 1, 2 /)
and so on. Only in the END block (after the entire file has been parsed and the array contains the totals) it checks how many A, T, G, C (non-gaps) are in each position. At that point some "columns" will be left out (the same columns/positions for all the sequences):
Code:
    $0 = sequence[j]
    for ( i = 1; i <= NF; i++)
      if ( position[i] > 1 ) printf "%s", $i
    printf "\n"
Take in mind that all the sequences in the input file have the same length. This code relates to NF (of the lines containing a DNA sequence) and NR, so that it should work for any length of the sequence (it may differ for different input files) and for any number of lines in the file (until memory limits are reached... since it stores the entire content of the file in memory).

Last edited by colucix; 04-22-2010 at 08:50 AM.
 
Old 04-22-2010, 10:10 AM   #15
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,424

Rep: Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876Reputation: 1876
Ok, I get your code (and fortran away its cool ), so if the first or second strand had a bodgy character (X?-N) as the last
character, hence making the total for that column not greater than one, we would have removed both of the final Cs in the first strand.

So I guess my questions was related to the fact that we were not getting rid of one of the Cs due to it being a double but because the vertical
math did not support it.

Thanks
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
print FS (field separator) in awk wakatana Programming 5 11-05-2009 08:17 AM
how to keep Field Separator in AWK when using a sub statement tmcguinness Programming 4 02-09-2009 02:24 PM
shell script/command for converting columns/table onto a single line skuz_ball Programming 9 11-30-2007 03:02 AM
perl input field separator Tinkster Programming 5 10-18-2004 04:08 PM
My field separator changes when using awk Helene Programming 3 05-01-2004 08:10 AM


All times are GMT -5. The time now is 09:11 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration