[SOLVED] sed or awk help - need to remove text on each line before a regular expression

kmkocot · 10-28-2009, 06:03 PM

Hi all,

I have a file that looks like this:

Code:

>14219|LGIG|61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>14237|LGIG|86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>14286|LGIG|234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>14297|LGIG|139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

I want to go through and remove the first two regions (but not the greater-than symbol) from every other line so that my file looks like this:

Code:

>61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

Could anyone help me set up a script to make this happen? It seems like sed and awk could both do this easily but I can't figure out the syntax to tell either to remove all text except the > before the |LGIG|.

Thanks!
Kevin

Telemachos · 10-28-2009, 06:24 PM

Someone can probably do it in awk or sed, but here's a short Perl version. Save this as (say) reader and run it as perl reader filename:

Code:

#!/usr/bin/env perl
use strict;
use warnings;

while (<>) {
    print && next if m/^[A-Z]/;
    print '>', (split /\|/, $_)[2];
}

pixellany · 10-28-2009, 06:32 PM

Code:

sed  '/LGIG/s/^>[0-9]\+|LGIG|/>/' filename > newfilename

On any line containing "LGIG", this matches ">" at the beginning of the line, followed by at least one digit and then "|LGIG|", and replaces the whole match with ">".
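A note on portability: `\+` is a GNU extension and not part of POSIX basic regular expressions, so the command above may not behave the same with every sed. Here is a minimal sketch of a variant that avoids it, assuming every header line starts with ">" and the ID you want is whatever follows the last "|". The file name sample.fa is made up for this demo.

```shell
# Build a small demo file in a temp directory (sample.fa is a hypothetical name).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.fa" <<'EOF'
>14219|LGIG|61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>14237|LGIG|86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
EOF

# On lines starting with ">", replace everything up to the last "|" with ">".
# ".*" is greedy, so the match runs to the final "|" on the line.
# Sequence lines do not start with ">" and pass through unchanged.
sed_result=$(sed 's/^>.*|/>/' "$tmpdir/sample.fa")
printf '%s\n' "$sed_result"

rm -rf "$tmpdir"
```

Unlike the original command, this does not check for the literal "LGIG", so it would also rewrite headers carrying a different middle field.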

vikas027 · 10-28-2009, 06:34 PM

Here is the awk code.

This could probably be done in one line with sed, or with a sed/awk combination, but my knowledge of sed/awk is limited.

/tmp/file is your base file and /tmp/file1 is what you need.

Code:

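for i in `cat /tmp/file`
do
# -q keeps grep from echoing the header line to the terminal
echo "$i" | grep -q "^>"
if [ $? -eq 0 ]
then
# put the ">" back in front of the third field
echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1
else
echo "$i" >> /tmp/file1
fi
done;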

OR a one liner

Code:

for i in `cat /tmp/file`; do echo "$i" | grep -q "^>"; if [ $? -eq 0 ]; then echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1; else echo "$i" >> /tmp/file1; fi; done;

Hope this helps.

ghostdog74 · 10-28-2009, 06:38 PM

Quote:

Originally Posted by kmkocot

Could anyone help me set up a script to make this happen?

Next time, show the script you have tried so far.

Code:

$ awk -F"|" '/^>/{print ">"$NF;next}1' file
>61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

custangro · 10-28-2009, 06:39 PM

Quote:

Originally Posted by vikas027

Here is the awk code.

This could probably be done in one line with sed, or with a sed/awk combination, but my knowledge of sed/awk is limited.

/tmp/file is your base file and /tmp/file1 is what you need.

Code:

for i in `cat /tmp/file`
do
# -q keeps grep from echoing the header line to the terminal
echo "$i" | grep -q "^>"
if [ $? -eq 0 ]
then
# put the ">" back in front of the third field
echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1
else
echo "$i" >> /tmp/file1
fi
done;

OR a one liner

Code:

for i in `cat /tmp/file`; do echo "$i" | grep -q "^>"; if [ $? -eq 0 ]; then echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1; else echo "$i" >> /tmp/file1; fi; done;

Hope this helps.

you don't need to cat the file...

Code:

for i in $(< /tmp/file)
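One caveat, as far as I know: `$(< file)` is a bash/ksh/zsh feature rather than POSIX sh, and either form of the `for` loop splits the file on any whitespace, not only on newlines. That is harmless for the sample data, which has no spaces, but it matters if a header ever carries a description. A rough sketch of the difference follows; demo.txt and its contents are made up for illustration, and `$(cat ...)` is used so it also runs in plain sh.

```shell
tmpdir=$(mktemp -d)
# Hypothetical two-line file; the first line contains spaces.
printf '%s\n' '>1|LGIG|2 some description' 'MSFEFT' > "$tmpdir/demo.txt"

# for + command substitution: the shell splits on spaces, tabs and newlines.
for_count=0
for word in $(cat "$tmpdir/demo.txt"); do
    for_count=$((for_count + 1))
done

# while + read: one iteration per line, regardless of spaces inside it.
read_count=0
while IFS= read -r line; do
    read_count=$((read_count + 1))
done < "$tmpdir/demo.txt"

echo "for loop iterations:   $for_count"
echo "while loop iterations: $read_count"

rm -rf "$tmpdir"
```

The `for` loop sees four words here while the `while read` loop sees the two actual lines.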

larryhaja · 10-28-2009, 06:40 PM

Code:

#!/bin/sh

for i in `echo test.txt | xargs grep '^>'`; do
  local_var=$(echo $i | awk -F '|' '{print $3}')
  sed -i "/${local_var}/s|>.*|>${local_var}|" test.txt
done

custangro · 10-28-2009, 06:59 PM

Quote:

Originally Posted by larryhaja

Code:

#!/bin/sh

for i in `echo test.txt | xargs grep '^>'`; do
  local_var=$(echo $i | awk -F '|' '{print $3}')
  sed -i "/${local_var}/s|>.*|>${local_var}|" test.txt
done

I don't think echo would work

ghostdog74 · 10-28-2009, 07:15 PM

Quote:

Originally Posted by vikas027

Code:

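for i in `cat /tmp/file`
do
# -q keeps grep from echoing the header line to the terminal
echo "$i" | grep -q "^>"
if [ $? -eq 0 ]
then
# put the ">" back in front of the third field
echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1
else
echo "$i" >> /tmp/file1
fi
done;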

If doing it in shell, here's one approach

Code:

while IFS="|" read -r a b c
do
    case "$a" in
    ">"*) echo ">$c" ;;   # header: c holds everything after the second "|"
    *)    echo "$a" ;;    # sequence line: no "|", so the whole line is in a
    esac
done < "file"

Instead of calling grep and awk for each line, you can use the shell's built-ins, which is much faster.

Quote:

OR a one liner

Code:

for i in `cat /tmp/file`; do echo "$i" | grep -q "^>"; if [ $? -eq 0 ]; then echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1; else echo "$i" >> /tmp/file1; fi; done;

Hope this helps.

Never cram your code into a one-liner like that.

lutusp · 10-28-2009, 07:24 PM

Quote:

Originally Posted by kmkocot

Hi all,

I have a file that looks like this:

Code:

>14219|LGIG|61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>14237|LGIG|86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>14286|LGIG|234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>14297|LGIG|139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

I want to go through and remove the first two regions (but not the greater-than symbol) from every other line so that my file looks like this:

Code:

>61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

Could anyone help me set up a script to make this happen? It seems like sed and awk could both do this easily but I can't figure out the syntax to tell either to remove all text except the > before the |LGIG|.

Thanks!
Kevin

Code:

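# bash only: [[ ]] is not available in plain sh
while IFS= read -r line
do
   # header lines start with ">": keep only what follows the last "|"
   [[ $line == ">"* ]] && echo ">${line##*|}" || echo "$line"
done < data.txt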

Output:

Code:

>61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

larryhaja · 10-28-2009, 08:08 PM

Quote:

Originally Posted by custangro

I don't think echo would work

It does on my machine.

ghostdog74 · 10-28-2009, 09:18 PM

Quote:

Originally Posted by custangro

I don't think echo would work

It would work because he is piping the file name to xargs, which hands it to grep. That step is practically redundant, though, and the loop can be written as something like:

Code:

grep "^>" test.txt | while read line
do
 ....
done

custangro · 10-28-2009, 11:05 PM

Quote:

Originally Posted by ghostdog74

It would work because he is piping the file name to xargs, which hands it to grep. That step is practically redundant, though, and the loop can be written as something like:

Code:

grep "^>" test.txt | while read line
do
 ....
done

Ok yes your code looks like something that would work...

Maybe I'm just unfamiliar with "xargs" but wouldn't echo test.txt just print "test.txt"?

-C

ghostdog74 · 10-28-2009, 11:33 PM

Quote:

Originally Posted by custangro

but wouldn't echo test.txt just print "test.txt"?

-C

Of course. When you pipe this echo to xargs, xargs just takes "test.txt" and passes it to grep as an argument. That's all there is to it: an extra, redundant step.
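If you want to convince yourself, here is a small sketch that runs both forms on a throwaway copy of the data and compares the results. The temp directory and the shortened sequence lines are made up for the demo.

```shell
tmpdir=$(mktemp -d)
# Throwaway test.txt with two headers and two (shortened) sequence lines.
printf '%s\n' '>14219|LGIG|61640' 'MSFEFT' '>14237|LGIG|86853' 'PPAGPQ' > "$tmpdir/test.txt"

# echo prints the file name; xargs appends it to grep's argument list.
via_xargs=$(echo "$tmpdir/test.txt" | xargs grep '^>')

# Passing the file name directly gives grep the very same argument.
direct=$(grep '^>' "$tmpdir/test.txt")

if [ "$via_xargs" = "$direct" ]; then
    echo "same output"
else
    echo "outputs differ"
fi

rm -rf "$tmpdir"
```

Both commands end up running grep with the same pattern and the same single file name, so the echo and xargs pair only adds two extra processes.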

kmkocot · 10-29-2009, 10:06 AM

Thanks all! I really appreciate your help!