LinuxQuestions.org
Old 02-01-2012, 10:32 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Rep: Reputation: 15
script to truncate lines containing a ">" character to 40 characters


Hi all,

I have a file that looks like this:
Code:
>APOM|Contig4256_149_40_404
MVRLKYWDTKTGKEIAKFVVFNDGEWIIITPEGYFNASKNGAKHLNVRISPTEVASIDQYYDSFYRPDLVKTALRGAKIEEKLRLADIKPAPDVEIVKTPTQVTGDE
>APOM|Contig4254_238_764_1497_some_annotation_that_nobody_cares_about_goes_here
MSLYSDIDLMIIYQDIEGYKSKEIIHKLLYILWDSGLKLGHRVHNLDEIFEVADSDVTIKTAILESRFIDG
>APOM|Contig4253_284_333_1161
MPLKLIFMEQNWYVAIVDRDEGFRFLRIFFITDVKNSARRSYERDLSEADSKRYAQFLKNFQNPMTKFNR
>APOM|Contig4252_94_473_1631
MGCHLGQEDIREFSLDLKEATSTIRGLIKGKILGLSTHNIAEVEEANYLDLDYI
If a header (a line beginning with a ">" symbol) is longer than 40 characters, I want to truncate it to 40 characters.

Desired output:
Code:
>APOM|Contig4256_149_40_404
MVRLKYWDTKTGKEIAKFVVFNDGEWIIITPEGYFNASKNGAKHLNVRISPTEVASIDQYYDSFYRPDLVKTALRGAKIEEKLRLADIKPAPDVEIVKTPTQVTGDE
>APOM|Contig4254_238_764_1497_some_annot
MSLYSDIDLMIIYQDIEGYKSKEIIHKLLYILWDSGLKLGHRVHNLDEIFEVADSDVTIKTAILESRFIDG
>APOM|Contig4253_284_333_1161
MPLKLIFMEQNWYVAIVDRDEGFRFLRIFFITDVKNSARRSYERDLSEADSKRYAQFLKNFQNPMTKFNR
>APOM|Contig4252_94_473_1631
MGCHLGQEDIREFSLDLKEATSTIRGLIKGKILGLSTHNIAEVEEANYLDLDYI
I tried writing a simple shell script, but I can't quite work out how to select only the lines beginning with ">".

Here's what I've got so far:
Code:
while read filename
    do
        header=${^>} #This is the line I'm stuck on
        if [ ${#header} -gt 40 ]
            then
                nfile=$(echo $header | cut -c1-40)
                echo $nfile
            else
                echo $filename
        fi
    done < all.fasta > all.fasta.truncated
Any suggestions would be greatly appreciated.

Thanks!
Kevin
 
Old 02-01-2012, 10:51 PM   #2
jthill
Member
 
Registered: Mar 2010
Distribution: Arch
Posts: 211

Rep: Reputation: 67
sed will do it; it's built for stuff like this. For my money, save anything but the most trivial bash scripts until after you know how to use sed and awk, and maybe before perl/python/whatnot and C. sed in particular: so many programs, scripts and (on Windows) commercial tools get written because people don't know how much it can do, or how conveniently. Yes, it kinda raises the bar on "quirky".
Code:
sed 's/^\(>.......................................\).*/\1/' <in >out
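That's 39 dots. Together with the leading ">", the captured group is 40 characters, and everything after it is dropped.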
 
Old 02-02-2012, 12:12 AM   #3
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,241

Rep: Reputation: 2325
bash string slicing
Code:
s1=">1234567890"
s2="${s1:0:1}"
if [[ "$s2" = '>' ]]
then
    s3="${s1:0:3}"
    echo $s3
fi
Obviously you can shrink that, but it's clearer & easier to debug in this form.
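For what it's worth, here is a rough sketch of how that slicing could be dropped into a loop over the whole file, in pure bash. The function name truncate_headers and the optional width argument are made up for illustration; they are not from the posts above.

```shell
#!/bin/bash
# Hypothetical helper: reads FASTA text on stdin, writes the result to stdout.
# Optional first argument is the maximum header length (defaults to 40).
truncate_headers() {
    local max=${1:-40} line
    # IFS= and -r keep leading whitespace and backslashes intact;
    # the || clause still handles a last line that lacks a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ ${line:0:1} == '>' ]]; then
            # Header line: keep only the first $max characters.
            printf '%s\n' "${line:0:max}"
        else
            # Sequence line: pass through unchanged.
            printf '%s\n' "$line"
        fi
    done
}
```

With the file names from the first post, it would be called as: truncate_headers < all.fasta > all.fasta.truncated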
 
Old 02-02-2012, 12:21 AM   #4
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 370
jthill's command can be compressed a bit:
Code:
sed 's/^\(>.\{39\}\).*/\1/' <in >out
That way, you do not need to count periods/dots manually.
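A quick way to check that the interval form does what is wanted is to feed it one long header, one short header and one long sequence line. This is only a sketch: the sample lines are copied from the first post, and truncate_sed is a wrapper name invented here.

```shell
# Hypothetical wrapper around the sed command above; reads stdin, writes stdout.
truncate_sed() {
    # Capture ">" plus the next 39 characters and drop the rest of the line.
    # Lines shorter than 40 characters, or not starting with ">", do not match
    # and pass through untouched.
    sed 's/^\(>.\{39\}\).*/\1/'
}

printf '%s\n' \
    '>APOM|Contig4254_238_764_1497_some_annotation_that_nobody_cares_about_goes_here' \
    '>APOM|Contig4253_284_333_1161' \
    'MSLYSDIDLMIIYQDIEGYKSKEIIHKLLYILWDSGLKLGHRVHNLDEIFEVADSDVTIKTAILESRFIDG' |
    truncate_sed
```

Only the first line should come out shortened; the short header and the sequence line are left alone.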

And similar to chrism01's solution, if you are partial to python:
Code:
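#!/usr/bin/env python3
import sys

# Truncate header lines (those starting with '>') to 40 characters;
# print every other line unchanged apart from trailing whitespace.
for line in sys.stdin:
    line = line.rstrip()
    if line.startswith('>'):
        print(line[:40])
    else:
        print(line)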

Last edited by Dark_Helmet; 02-02-2012 at 12:22 AM.
 
Old 02-02-2012, 12:27 AM   #5
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,241

Rep: Reputation: 2325
In Perl I'd replace
Code:
s2="${s1:0:1}"
if [[ "$s2" = '>' ]]
with
Code:
if( $s1 =~ /^>/ )
but wanted to keep soln in pure bash.

Incidentally, bash also has a regex operator '=~', but I can't get it to accept the '^' anchor char (i.e. the string must start with '>'). I keep getting a syntax error.
Anyone know if it can be done?
I.e. I wanted to say
Code:
if [[ $s1 =~ ^> ]]
 
Old 02-02-2012, 12:42 AM   #6
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 370
The syntax error is not from the caret--it's from the greater-than sign. I assume bash is trying to interpret it as output redirection.

I escaped it, and all was well:
Code:
if [[ $s1 =~ ^\> ]]
EDIT:
Quote:
Originally Posted by chrism01
but wanted to keep soln in pure bash
Oh I know. The OP had started in bash, but I just couldn't help myself

Last edited by Dark_Helmet; 02-02-2012 at 12:44 AM.
 
Old 02-02-2012, 02:13 AM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,254

Rep: Reputation: 2686
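For sed you can use -r to switch to extended regular expressions and drop the backslash escaping (BSD sed spells this option -E, which GNU sed also accepts):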
Code:
sed -r 's/^(>.{39}).*/\1/' file
Possible awk solution:
Code:
awk '/^>/{$0 = substr($0,1,40)}1' file
And maybe some Ruby:
Code:
ruby -pe '$_ = $_.chomp[0,40] + "\n" if $_.start_with?(">")' file
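If the 40 might change later, the awk approach can take the width as a variable instead of hard-coding it. A minimal sketch, assuming a wrapper function called truncate_headers_awk (a name invented here) and anchoring the match at the start of the line so that only real headers are touched:

```shell
# Hypothetical wrapper: reads stdin, writes stdout.
# Optional first argument is the maximum header length (defaults to 40).
truncate_headers_awk() {
    # -v passes the shell value into awk as the variable "max";
    # the trailing 1 is the usual awk idiom for "print every line".
    awk -v max="${1:-40}" '/^>/ { $0 = substr($0, 1, max) } 1'
}
```

Usage would be along the lines of: truncate_headers_awk 40 < all.fasta > all.fasta.truncated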
 
Old 02-02-2012, 08:42 PM   #8
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,241

Rep: Reputation: 2325
@Dark_Helmet: darn, could have sworn I tried that syntax ... anyway it works

I probably just escaped the caret instead ...
 
Old 02-03-2012, 05:39 PM   #9
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
Thanks all!
 
Old 02-04-2012, 07:03 AM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
In the old single-bracket test, > and < are treated as redirections, and you have to first backslash escape them to \> and \<, at which point they become greater-than/less-than operators. The newer double-bracket test does not treat them as redirectors, so the unescaped values are greater-than/less-than, and escaping or quoting them makes them literal.

Note that these are string, not integer comparisons.

http://mywiki.wooledge.org/BashFAQ/031
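To make the string-versus-integer point concrete, here is a small sketch comparing 10 and 9 both ways inside the double-bracket test. The variable names are just for illustration.

```shell
#!/bin/bash
# Inside [[ ]], an unescaped > compares its operands as strings,
# so "10" sorts before "9" because "1" sorts before "9".
if [[ 10 > 9 ]]; then string_result=true; else string_result=false; fi

# -gt compares the same operands as integers.
if [[ 10 -gt 9 ]]; then integer_result=true; else integer_result=false; fi

echo "string: $string_result, integer: $integer_result"
```

The string comparison comes out false and the integer comparison true, which is why -gt (or an arithmetic test) is the one to use for numbers.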


When using the regex operator in "[[", it's often better to store the expression in a separate variable first. Then you don't have to worry about escaping anything inside the test construct itself.

Code:
re='^>'
if [[ $s1 =~ $re ]]
Be sure not to quote the regex variable here, or else it will be treated as a string of literal characters.

In this particular instance though, you can also use a simple glob, like so.

Code:
if [[ $s1 == '>'* ]]
Notice again how you can quote/escape the character(s) that need to be matched literally, but be careful to leave the actual globbing character(s) unescaped.
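The three variants can be lined up side by side as a rough check. This is a sketch assuming bash 3.2 or later with default compatibility settings; the sample header is from the first post and the result variable names are made up.

```shell
#!/bin/bash
s1='>APOM|Contig4253_284_333_1161'
re='^>'

# Unquoted variable: treated as a regex, so ^ anchors the > to the start.
[[ $s1 =~ $re ]]   && unquoted=match || unquoted=nomatch

# Quoted variable: treated as the literal two-character string "^>",
# which does not occur anywhere in s1.
[[ $s1 =~ "$re" ]] && quoted=match   || quoted=nomatch

# Glob: the quoted > is literal, the unquoted * matches the rest.
[[ $s1 == '>'* ]]  && glob=match     || glob=nomatch

echo "unquoted=$unquoted quoted=$quoted glob=$glob"
```

Only the quoted regex variable fails to match, which is the pitfall described above.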

Last edited by David the H.; 02-04-2012 at 07:05 AM. Reason: fixet mestike
 
  

