-   Programming (
-   -   Script to get position #s of first and last A-Z character on certain lines? (

kmkocot 10-03-2010 06:49 PM

Script to get position #s of first and last A-Z character on certain lines?
Hi all,

I have a set of files containing DNA or amino acid sequences from various organisms:



The even-numbered lines beginning with a greater-than symbol are the headers with 3 fields separated by pipe symbols. The number in the first field is the gene ID, the 4-letter abbreviation in the second field is the organism ID, and the third field contains a unique sequence ID. There may be 2+ sequences from an organism (e.g., NEOM and TEST). The lines containing only A-Z characters, question marks, and dashes are amino acid sequences (-, ?, and X represent gaps or missing data). These example sequences are really short and in reality most sequences are several hundred amino acids long.

I would like to write a script to print the range of character positions between the first and last non-missing data characters on the amino acid sequence lines. Internal missing data characters flanked by A-Z characters are OK. Here's what I have in mind for the desired output:


>14432|LGIG|186221        3-20
>14432|CAP1|21057        1-20
>14432|HROB|156827        3-20
>14432|NVEC|159589        2-19
>14432|NEOM|short1        17-20
>14432|NEOM|short2        2-8
>14432|TEST|123456        9-20
>14432|TEST|234567        3-20

Can anyone help me get the position of the first and last non-missing data characters (while allowing missing data characters in the middle of the sequence)? I'm sure it is a simple sed or awk command but I can't figure it out. I think I can produce the output file I want once I have figured those commands out.

My ultimate goal is to write a script that can make composite sequences from two or more non-overlapping sequences (e.g., the two sequences from NEOM). I may also want to merge sequences that partially overlap (e.g., those from TEST) but that would complicate things. Is this a logical first step for such a script or would you do it differently?


Sergei Steshenko 10-03-2010 07:19 PM

Well, in Perl regular expressions can be used: -> .

In general, there is BioPerl: - maybe this or similar task has already been solved.

GrapefruiTgirl 10-03-2010 07:38 PM

awk - count DNA bits & bytes.
Gotta like these interesting DNA problems you keep having. :)

Try this:

#!/usr/bin/awk -f

BEGIN { RS = "\n>"

 if ($1 != "") {
  for (x=1; x<=strlen; x++){
  if (substr($2,x,1) ~ /[A-Z]/) {start = x; break}
  for (x=strlen; x>start; x--) {
  if (substr($2,x,1) ~ /[A-Z]/) {end = x; break}

  printf ">" $1"\t"start" - "end"\n"


sasha@reactor: ./dna-positions.awk dnadata
>14432|LGIG|186221      3 - 20
>14432|CAP1|21057      1 - 20
>14432|HROB|156827      3 - 20
>14432|NVEC|159589      2 - 19
>14432|NEOM|short1      17 - 20
>14432|NEOM|short2      2 - 8
>14432|TEST|123456      9 - 20
>14432|TEST|234567      3 - 20

Only problem I have with it, and it's a minor one in my opinion, is the empty line first produced because of what I used as a record separator.


grail 10-03-2010 09:39 PM

Another alternative in awk would be:

awk '/^>/{printf "%s\t",$0;next}{start=match($0,/[A-Z]+.*[A-Z]/);print start"-"start+RLENGTH-1}' file

ghostdog74 10-03-2010 11:51 PM


import string
for line in open("file"):
    if ">" not in line:
        a = [ n+1 for n,c in enumerate(line) if c in letters ]
        R="%s-%s" % (a[0],a[-1])
    if R:print p,R


$ ./
>14432|LGIG|186221 3-20
>14432|CAP1|21057 1-20
>14432|HROB|156827 3-20
>14432|NVEC|159589 2-19
>14432|NEOM|short1 17-20
>14432|NEOM|short2 2-8
>14432|TEST|123456 9-20
>14432|TEST|234567 3-20

There is also BioPython if you are interested.

kurumi 10-04-2010 01:00 AM


$ ruby -ne '$_.chomp!;s=$_ if />/;(a=/[A-Z]/=~$_;a+=1; b=/[A-Z].[^A-Z]*$/=~$_;b+=2; print "#{s}, #{a}-#{b}\n") if !/>/' dna

kmkocot 10-04-2010 12:02 PM

Thanks guys! All of those methods work perfectly. Can you think of an easy way I can filter the output so that only organisms with >1 sequence are shown?

Using my example the output would then be like this:

>14432|NEOM|short1        17-20
>14432|NEOM|short2        2-8
>14432|TEST|123456        9-20
>14432|TEST|234567        3-20

Thanks again,

GrapefruiTgirl 10-04-2010 12:16 PM

I don't quite understand the question in post #7. :scratch:

Can you please show an example of something that would get filtered out?

Or am I being dense and missing something obvious?! :)

grail 10-04-2010 06:56 PM

I am with Ggirl, are you asking for the lines that start with >1 or that the position of the first characters is greater than 1 or that the length between characters
is greater than 1?

kmkocot 10-06-2010 02:08 PM

Sorry if that wasn't clear. The headers are formatted like this:

NAME = the name of the organism, always a 4-character abbreviation with A-Z and 0-9 characters only. I'd like to be able to filter the output to just include sequences from organisms with more than one sequence present in the input file.

In the example input file I originally gave, NEOM and TEST were the only organisms with more than one sequence. Therefore, I would like to filter the output to just include the two sequences from each of those organisms.

One idea I had was to make a list of all possible organism abbreviations, grep -c each and if the result was less than 2, do someting like sed '/NAME/,+1d'. The problem with this is my list of organisms varies from input file to input file. Any better ideas would be greatly appreciated.


grail 10-07-2010 09:21 AM

Well I am still not a 100% I am on the right track ... but would this be what you mean?

#!/usr/bin/awk -f

BEGIN{ FS="|" }


    if(++count[name] > 1)




    for(x in count)
        if(count[x] > 1)
            print line[x]

All times are GMT -5. The time now is 10:01 AM.