LinuxQuestions.org
Download your favorite Linux distribution at LQ ISO.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
Search this Thread
Old 10-03-2010, 07:49 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 108

Rep: Reputation: 15
Script to get position #s of first and last A-Z character on certain lines?


Hi all,

I have a set of files containing DNA or amino acid sequences from various organisms:

Code:
>14432|LGIG|186221
--MISVLAMA-NRITAAEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALXSIC?AEKR
>14432|NVEC|159589
-VRVNVLN-ALNSICNAEX-
>14432|NEOM|short1
----------------AEKR
>14432|NEOM|short2
-VRVNVLN------------
>14432|TEST|123456
--------XALNSICNAEKR
>14432|TEST|234567
--RMNVLADALXSICNAEKR
The even-numbered lines beginning with a greater-than symbol are the headers with 3 fields separated by pipe symbols. The number in the first field is the gene ID, the 4-letter abbreviation in the second field is the organism ID, and the third field contains a unique sequence ID. There may be 2+ sequences from an organism (e.g., NEOM and TEST). The lines containing only A-Z characters, question marks, and dashes are amino acid sequences (-, ?, and X represent gaps or missing data). These example sequences are really short and in reality most sequences are several hundred amino acids long.

I would like to write a script to print the range of character positions between the first and last non-missing data characters on the amino acid sequence lines. Internal missing data characters flanked by A-Z characters are OK. Here's what I have in mind for the desired output:

Code:
>14432|LGIG|186221	3-20
>14432|CAP1|21057	1-20
>14432|HROB|156827	3-20
>14432|NVEC|159589	2-19
>14432|NEOM|short1	17-20
>14432|NEOM|short2	2-8
>14432|TEST|123456	9-20
>14432|TEST|234567	3-20
Can anyone help me get the position of the first and last non-missing data characters (while allowing missing data characters in the middle of the sequence)? I'm sure it is a simple sed or awk command but I can't figure it out. I think I can produce the output file I want once I have figured those commands out.

My ultimate goal is to write a script that can make composite sequences from two or more non-overlapping sequences (e.g., the two sequences from NEOM). I may also want to merge sequences that partially overlap (e.g., those from TEST) but that would complicate things. Is this a logical first step for such a script or would you do it differently?

Thanks!!!
Kevin
 
Old 10-03-2010, 08:19 PM   #2
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453Reputation: 453Reputation: 453Reputation: 453Reputation: 453
Well, in Perl regular expressions can be used: http://perldoc.perl.org -> http://perldoc.perl.org/perlretut.html .

In general, there is BioPerl: http://www.bioperl.org/wiki/Main_Page - maybe this or similar task has already been solved.
 
1 members found this post helpful.
Old 10-03-2010, 08:38 PM   #3
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550Reputation: 550Reputation: 550Reputation: 550Reputation: 550Reputation: 550
awk - count DNA bits & bytes.

Gotta like these interesting DNA problems you keep having.

Try this:
Code:
#!/usr/bin/awk -f


BEGIN { RS = "\n>"
}

{
 if ($1 != "") {
  strlen=length($2)
  for (x=1; x<=strlen; x++){
   if (substr($2,x,1) ~ /[A-Z]/) {start = x; break}
  }
  for (x=strlen; x>start; x--) {
   if (substr($2,x,1) ~ /[A-Z]/) {end = x; break}
  }

  printf ">" $1"\t"start" - "end"\n"
 }
}
Code:
sasha@reactor: ./dna-positions.awk dnadata
>14432|LGIG|186221      3 - 20
>14432|CAP1|21057       1 - 20
>14432|HROB|156827      3 - 20
>14432|NVEC|159589      2 - 19
>14432|NEOM|short1      17 - 20
>14432|NEOM|short2      2 - 8
>14432|TEST|123456      9 - 20
>14432|TEST|234567      3 - 20
sasha@reactor:
Only problem I have with it, and it's a minor one in my opinion, is the empty line first produced because of what I used as a record separator.

Cheers!

Last edited by GrapefruiTgirl; 10-03-2010 at 08:41 PM. Reason: EDIT: Fixed that empty line thing.
 
1 members found this post helpful.
Old 10-03-2010, 10:39 PM   #4
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,654

Rep: Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967
Another alternative in awk would be:
Code:
awk '/^>/{printf "%s\t",$0;next}{start=match($0,/[A-Z]+.*[A-Z]/);print start"-"start+RLENGTH-1}' file
 
1 members found this post helpful.
Old 10-04-2010, 12:51 AM   #5
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241Reputation: 241Reputation: 241
Python,
Code:
import string
letters=string.ascii_letters
for line in open("file"):
    a=[];R=""
    if ">" not in line:
        a = [ n+1 for n,c in enumerate(line) if c in letters ]
        R="%s-%s" % (a[0],a[-1])
    if R:print p,R
    p=line.rstrip()
output
Code:
$ ./python.py
>14432|LGIG|186221 3-20
>14432|CAP1|21057 1-20
>14432|HROB|156827 3-20
>14432|NVEC|159589 2-19
>14432|NEOM|short1 17-20
>14432|NEOM|short2 2-8
>14432|TEST|123456 9-20
>14432|TEST|234567 3-20
There is also BioPython if you are interested.

Last edited by ghostdog74; 10-04-2010 at 12:56 AM.
 
1 members found this post helpful.
Old 10-04-2010, 02:00 AM   #6
kurumi
Member
 
Registered: Apr 2010
Posts: 223

Rep: Reputation: 45
Code:
$ ruby -ne '$_.chomp!;s=$_ if />/;(a=/[A-Z]/=~$_;a+=1; b=/[A-Z].[^A-Z]*$/=~$_;b+=2; print "#{s}, #{a}-#{b}\n") if !/>/' dna
 
1 members found this post helpful.
Old 10-04-2010, 01:02 PM   #7
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 108

Original Poster
Rep: Reputation: 15
Thanks guys! All of those methods work perfectly. Can you think of an easy way I can filter the output so that only organisms with >1 sequence are shown?

Using my example the output would then be like this:
Code:
>14432|NEOM|short1	17-20
>14432|NEOM|short2	2-8
>14432|TEST|123456	9-20
>14432|TEST|234567	3-20
Thanks again,
Kevin
 
Old 10-04-2010, 01:16 PM   #8
GrapefruiTgirl
Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 550Reputation: 550Reputation: 550Reputation: 550Reputation: 550Reputation: 550
I don't quite understand the question in post #7.

Can you please show an example of something that would get filtered out?

Or am I being dense and missing something obvious?!
 
Old 10-04-2010, 07:56 PM   #9
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,654

Rep: Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967
I am with Ggirl, are you asking for the lines that start with >1 or that the position of the first characters is greater than 1 or that the length between characters
is greater than 1?
 
Old 10-06-2010, 03:08 PM   #10
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 108

Original Poster
Rep: Reputation: 15
Sorry if that wasn't clear. The headers are formatted like this:
Code:
>Gene_number|NAME|some_other_annotation
NAME = the name of the organism, always a 4-character abbreviation with A-Z and 0-9 characters only. I'd like to be able to filter the output to just include sequences from organisms with more than one sequence present in the input file.

In the example input file I originally gave, NEOM and TEST were the only organisms with more than one sequence. Therefore, I would like to filter the output to just include the two sequences from each of those organisms.

One idea I had was to make a list of all possible organism abbreviations, grep -c each and if the result was less than 2, do someting like sed '/NAME/,+1d'. The problem with this is my list of organisms varies from input file to input file. Any better ideas would be greatly appreciated.

Thanks!
Kevin
 
Old 10-07-2010, 10:21 AM   #11
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,654

Rep: Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967Reputation: 1967
Well I am still not a 100% I am on the right track ... but would this be what you mean?
Code:
#!/usr/bin/awk -f

BEGIN{ FS="|" }

/^>/{
    name=$2

    if(++count[name] > 1)
	line[name]=line[name]"\n"$0
    else 
	line[name]=$0

    next
}

{
    len=match($0,/[A-Z]+.*[A-Z]/)

    line[name]=line[name]"\t"len"-"len+RLENGTH-1
}

END{
    for(x in count)
	if(count[x] > 1)
	    print line[x]
}
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] string editing: how to remove lines consisting of a single character? recomboDNA Programming 2 07-16-2010 09:55 AM
How do I remove lines that have more than one of a certain character in them? SentralOrigin Linux - General 2 01-14-2009 02:40 PM
Replacing character position within the string of a file scroy Linux - Newbie 13 11-08-2008 06:45 PM
Inspected a character on a particular position in a word kushalkoolwal Programming 10 07-03-2008 09:37 PM
vi command to find and delete all lines beginning with a character geomatt Linux - Software 8 12-20-2004 04:30 AM


All times are GMT -5. The time now is 01:55 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration