LinuxQuestions.org


kmkocot 02-01-2012 09:32 PM

script to truncate lines containing a ">" character to 40 characters
 
Hi all,

I have a file that looks like this:
Code:

>APOM|Contig4256_149_40_404
MVRLKYWDTKTGKEIAKFVVFNDGEWIIITPEGYFNASKNGAKHLNVRISPTEVASIDQYYDSFYRPDLVKTALRGAKIEEKLRLADIKPAPDVEIVKTPTQVTGDE
>APOM|Contig4254_238_764_1497_some_annotation_that_nobody_cares_about_goes_here
MSLYSDIDLMIIYQDIEGYKSKEIIHKLLYILWDSGLKLGHRVHNLDEIFEVADSDVTIKTAILESRFIDG
>APOM|Contig4253_284_333_1161
MPLKLIFMEQNWYVAIVDRDEGFRFLRIFFITDVKNSARRSYERDLSEADSKRYAQFLKNFQNPMTKFNR
>APOM|Contig4252_94_473_1631
MGCHLGQEDIREFSLDLKEATSTIRGLIKGKILGLSTHNIAEVEEANYLDLDYI

If a header (line beginning with a ">" symbol) is longer than 40 characters, I want to truncate it to be 40 characters.

Desired output:
Code:

>APOM|Contig4256_149_40_404
MVRLKYWDTKTGKEIAKFVVFNDGEWIIITPEGYFNASKNGAKHLNVRISPTEVASIDQYYDSFYRPDLVKTALRGAKIEEKLRLADIKPAPDVEIVKTPTQVTGDE
>APOM|Contig4254_238_764_1497_some_annot
MSLYSDIDLMIIYQDIEGYKSKEIIHKLLYILWDSGLKLGHRVHNLDEIFEVADSDVTIKTAILESRFIDG
>APOM|Contig4253_284_333_1161
MPLKLIFMEQNWYVAIVDRDEGFRFLRIFFITDVKNSARRSYERDLSEADSKRYAQFLKNFQNPMTKFNR
>APOM|Contig4252_94_473_1631
MGCHLGQEDIREFSLDLKEATSTIRGLIKGKILGLSTHNIAEVEEANYLDLDYI

I tried writing a simple shell script, but I can't quite work out how to select only the lines beginning with ">".

Here's what I've got so far:
Code:

while read filename
    do
        header=${^>} #This is the line I'm stuck on
        if [ ${#header} -gt 40 ]
            then
                nfile=$(echo $header | cut -c1-40)
                echo $nfile
            else
                echo $filename
        fi
    done < all.fasta > all.fasta.truncated

Any suggestions would be greatly appreciated.

Thanks!
Kevin

jthill 02-01-2012 09:51 PM

sed'll do it; it's built for stuff like this. For my money, save anything but the most trivial bash scripts until after you know how to use sed and awk, and maybe before perl/python/whatnot and C. sed in particular: so many programs, scripts and (on Windows) commercial tools get written because people don't know how much it can do, and how conveniently. Yes, it kinda raises the bar on "quirky".
Code:

sed 's/^\(>.......................................\).*/\1/' <in >out
That's 39 dots.

chrism01 02-01-2012 11:12 PM

bash string slicing
Code:

s1=">1234567890"
s2="${s1:0:1}"
if [[ "$s2" = '>' ]]
then
    s3="${s1:0:3}"
    echo $s3
fi

Obviously you can shrink that, but it's clearer & easier to debug in this form.
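
For reference, here is a minimal sketch of how the slicing above could be applied to the whole file from the original question. The function name truncate_headers and its optional length argument are made up for illustration; the input and output file names are the ones from the first post.

```shell
#!/usr/bin/env bash
# truncate_headers (hypothetical name): reads text on stdin, cuts lines that
# start with ">" down to at most $1 characters (default 40), and passes
# every other line through unchanged.
truncate_headers() {
    local max=${1:-40} line
    # IFS= and -r keep leading whitespace and backslashes intact;
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ ${line:0:1} == '>' ]]; then
            printf '%s\n' "${line:0:max}"
        else
            printf '%s\n' "$line"
        fi
    done
}

# Small demonstration on two lines taken from the question
printf '%s\n' \
    '>APOM|Contig4254_238_764_1497_some_annotation_that_nobody_cares_about_goes_here' \
    'MSLYSDIDLMIIYQDIEGYKSKEIIHKLLYILWDSGLKLGHRVHNLDEIFEVADSDVTIKTAILESRFIDG' |
    truncate_headers 40
```

On the real file this would be called as truncate_headers 40 < all.fasta > all.fasta.truncated. A pure-bash loop like this is noticeably slower than sed or awk on large files, so treat it as a learning aid rather than the recommended tool.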

Dark_Helmet 02-01-2012 11:21 PM

jthill's command can be compressed a bit:
Code:

sed 's/^\(>.\{39\}\).*/\1/' <in >out
That way, you do not need to count periods/dots manually. :)
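
A quick way to check the compressed expression is to feed it one long header, one short header and one sequence line. This is only a sketch: the wrapper name truncate_with_sed is invented for the example, and the sample lines are taken from the original question.

```shell
#!/usr/bin/env bash
# truncate_with_sed (hypothetical wrapper): the sed command from above,
# reading stdin and writing stdout.
# ">" plus 39 more characters makes 40; anything after that is dropped.
# Lines that do not start with ">" or are shorter than 40 characters do not
# match the pattern and are printed unchanged.
truncate_with_sed() {
    sed 's/^\(>.\{39\}\).*/\1/'
}

printf '%s\n' \
    '>APOM|Contig4254_238_764_1497_some_annotation_that_nobody_cares_about_goes_here' \
    '>APOM|Contig4253_284_333_1161' \
    'MPLKLIFMEQNWYVAIVDRDEGFRFLRIFFITDVKNSARRSYERDLSEADSKRYAQFLKNFQNPMTKFNR' |
    truncate_with_sed
```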

And similar to chrism01's solution, if you are partial to python:
Code:

#!/usr/bin/python
import sys

for inputLine in sys.stdin:
  if( inputLine[0] == '>' ):
    print ( inputLine[:40].rstrip() )
  else:
    print ( inputLine.rstrip() )


chrism01 02-01-2012 11:27 PM

In Perl I'd replace
Code:

s2="${s1:0:1}"
if [[ "$s2" = '>' ]]

with
Code:

if( $s1 =~ /^>/ )
but wanted to keep soln in pure bash.

Incidentally, bash also has a regex operator '=~', but I can't get it to accept the '^' anchor character, i.e. the string must start with '>'. I keep getting a syntax error.
Does anyone know if it can be done?
I.e. I wanted to say
Code:

if [[ $s1 =~ ^> ]]

Dark_Helmet 02-01-2012 11:42 PM

The syntax error is not from the caret--it's from the greater-than sign. I assume bash is trying to interpret it as output redirection.

I escaped it, and all was well:
Code:

if [[ $s1 =~ ^\> ]]
EDIT:
Quote:

Originally Posted by chrism01
but wanted to keep soln in pure bash

Oh I know. The OP had started in bash, but I just couldn't help myself :)

grail 02-02-2012 01:13 AM

For sed you can use -r (extended regular expressions; GNU sed also accepts -E) to avoid the backslash escaping:
Code:

sed -r 's/^(>.{39}).*/\1/' file
Possible awk solution:
Code:

awk '/^>/{$0 = substr($0,1,40)}1' file
And maybe some Ruby:
Code:

ruby -pe '$_ = $_.chomp[0,40] + "\n" if /^>/' file

chrism01 02-02-2012 07:42 PM

@Dark_Helmet: darn, could have sworn I tried that syntax ... anyway it works :)

I probably just escaped the caret instead ...

kmkocot 02-03-2012 04:39 PM

Thanks all!

David the H. 02-04-2012 06:03 AM

In the old single-bracket test, > and < are treated as redirections, and you have to first backslash escape them to \> and \<, at which point they become greater-than/less-than operators. The newer double-bracket test does not treat them as redirectors, so the unescaped values are greater-than/less-than, and escaping or quoting them makes them literal.

Note that these are string, not integer comparisons.

http://mywiki.wooledge.org/BashFAQ/031
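
To illustrate the point about string versus integer comparison, here is a small sketch. The two function names are made up for the example.

```shell
#!/usr/bin/env bash
# Inside [[ ]], > compares the two operands as strings, character by character.
compare_as_strings() { [[ $1 > $2 ]]; }

# Inside (( )), > compares them as integers.
compare_as_integers() { (( $1 > $2 )); }

# As strings, "10" sorts before "9" because "1" sorts before "9".
if compare_as_strings 10 9; then
    echo "string test:  10 > 9"
else
    echo "string test:  10 is not > 9"
fi

if compare_as_integers 10 9; then
    echo "integer test: 10 > 9"
else
    echo "integer test: 10 is not > 9"
fi
```

If you want an integer comparison while staying inside the double brackets, -gt and -lt do that, as in [[ 10 -gt 9 ]].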


When using the regex operator in "[[", it's often better to store the expression in a separate variable first. Then you don't have to worry about escaping anything inside the test construct itself.

Code:

re='^>'
if [[ $s1 =~ $re ]]

Be sure not to quote the regex variable here, or else it will be treated as a string of literal characters.
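
A short sketch of the difference, assuming bash 3.2 or later with default settings. The function names and the test strings are invented for the example.

```shell
#!/usr/bin/env bash
re='^>'

# Unquoted: the variable is used as a regex, so ^ anchors to the start.
is_header_regex() { [[ $1 =~ $re ]]; }

# Quoted: the contents are matched literally, so this looks for the
# two-character sequence ^> anywhere in the string.
is_header_quoted() { [[ $1 =~ "$re" ]]; }

for s in '>seq1' 'a^>b' 'MVRLKY'; do
    if is_header_regex "$s"; then
        echo "unquoted regex matches: $s"
    fi
    if is_header_quoted "$s"; then
        echo "quoted regex matches:   $s"
    fi
done
```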

In this particular instance though, you can also use a simple glob, like so.

Code:

if [[ $s1 == '>'* ]]
Notice again how you can quote/escape the character(s) that need to be matched literally, but be careful to leave the actual globbing character(s) unescaped.
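
Here is a small sketch of what happens when the globbing character is left unquoted versus quoted. Again, the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# * left unquoted: it is a glob, so this matches any string starting with ">".
starts_with_gt() { [[ $1 == '>'* ]]; }

# * inside the quotes: it is literal, so this only matches the exact string ">*".
equals_gt_star() { [[ $1 == '>*' ]]; }

for s in '>APOM|Contig4252_94_473_1631' '>*' 'MGCHLGQEDIREF'; do
    if starts_with_gt "$s"; then
        echo "glob pattern matches:    $s"
    fi
    if equals_gt_star "$s"; then
        echo "literal pattern matches: $s"
    fi
done
```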

