LinuxQuestions.org - [SOLVED] sed or awk help - need to remove text on each line before a regular expression

kmkocot

10-28-2009 06:03 PM

sed or awk help - need to remove text on each line before a regular expression

Hi all,

I have a file that looks like this:

Code:

>14219|LGIG|61640

MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA

>14237|LGIG|86853

PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS

>14286|LGIG|234779

MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL

>14297|LGIG|139771

QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

I want to remove the first two pipe-delimited fields (but not the greater-than symbol) from every header line, i.e. every other line, so that my file looks like this:

Code:

>61640

MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA

>86853

PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS

>234779

MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL

>139771

QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

Could anyone help me set up a script to make this happen? It seems like sed and awk could both do this easily, but I can't figure out the syntax to tell either one to remove everything up to and including |LGIG| while keeping the leading >.

Thanks!
Kevin

Telemachos

10-28-2009 06:24 PM

Someone can probably do it in awk or sed, but here's a short Perl version. Save this as (say) reader and run it as perl reader filename:

Code:

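#!/usr/bin/env perl
use strict;
use warnings;

while (<>) {
    # Sequence lines start with an uppercase letter: print them unchanged.
    print && next if m/^[A-Z]/;
    # Header lines: keep only the third |-separated field (it still carries the newline).
    print '>', (split /\|/, $_)[2];
}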

pixellany

10-28-2009 06:32 PM

Code:

sed '/LGIG/s/^>[0-9]\+|LGIG|/>/' filename > newfilename

On every line containing "LGIG", this matches ">" at the beginning of the line, followed by one or more digits and then "|LGIG|", and replaces the whole match with ">".
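A note on portability, as a sketch rather than a definitive answer: \+ is a GNU sed extension, so on other seds the same match can be spelled [0-9][0-9]*. If the middle field is not always LGIG, deleting everything up to the last "|" on header lines also works, assuming the ID itself never contains a "|". The function names below are made up for illustration and are not from this thread.

```shell
# Hypothetical helper: same substitution as above, written in POSIX BRE.
strip_header_prefix() {
    # [0-9][0-9]* means "one or more digits" without relying on GNU's \+
    sed 's/^>[0-9][0-9]*|LGIG|/>/' "$@"
}

# Hypothetical helper: does not depend on the literal LGIG field.
strip_to_last_field() {
    # .* is greedy, so ^>.*| consumes everything up to the LAST pipe
    sed 's/^>.*|/>/' "$@"
}

sample='>14219|LGIG|61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA'

result_a=$(printf '%s\n' "$sample" | strip_header_prefix)
result_b=$(printf '%s\n' "$sample" | strip_to_last_field)
printf '%s\n' "$result_a"
```

Both functions read standard input when called without arguments, or take file names the way sed itself does.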

vikas027

10-28-2009 06:34 PM

Here is the awk code.

This could probably also be done in one line with sed, or with a sed/awk combination, but my knowledge of sed/awk is limited.

/tmp/file is your input file and /tmp/file1 is the output you need.

Code:

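# Note: for ... in `cat` splits on whitespace, so this assumes lines contain no spaces.
for i in `cat /tmp/file`
do
    # -q keeps grep from echoing the header line to the terminal
    if echo "$i" | grep -q "^>"
    then
        # header line: keep the ">" and the third |-separated field
        echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1
    else
        echo "$i" >> /tmp/file1
    fi
done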

OR a one liner ;)

Code:

for i in `cat /tmp/file`; do if echo "$i" | grep -q "^>"; then echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1; else echo "$i" >> /tmp/file1; fi; done

Hope this helps.

ghostdog74

10-28-2009 06:38 PM

Quote:

Originally Posted by kmkocot (Post 3735886)

Could anyone help me set up a script to make this happen?

Next time, show the script you have tried so far.

Code:

$ awk -F"|" '/^>/{print ">"$NF;next}1' file

>61640

MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA

>86853

PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS

>234779

MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL

>139771

QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI
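For readers new to awk, the one-liner above can be spelled out as follows. This is only an annotated sketch of the same idea; the function name keep_last_field is made up for illustration. Note that $NF is the last field, so this matches the requested output only when the wanted ID is the final |-separated field, as it is in the sample data.

```shell
# Hypothetical wrapper around the awk program, one rule per line.
keep_last_field() {
    awk -F'|' '
        /^>/ { print ">" $NF; next }   # header: print ">" plus the last field, skip other rules
        { print }                      # any other line: print it unchanged (same as the bare 1)
    ' "$@"
}

sample='>14219|LGIG|61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA'

result=$(printf '%s\n' "$sample" | keep_last_field)
printf '%s\n' "$result"
```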

custangro

10-28-2009 06:39 PM

Quote:

Originally Posted by vikas027 (Post 3735908)

Code:

for i in `cat /tmp/file`

do

echo "$i" | grep "^>"

if [ $? -eq 0 ]

then

echo "$i" | awk -F"|" '{print $3}' >> /tmp/file1

else

echo "$i" >> /tmp/file1

fi

done;

OR a one liner ;)

Code:

for i in `cat /tmp/file`; do echo "$i" | grep "^>"; if [ $? -eq 0 ]; then echo "$i" | awk -F"|" '{print $3}' >> /tmp/file1; else echo "$i" >> /tmp/file1; fi; done;

Hope this helps.

You don't need to cat the file; bash can read it directly:

Code:

for i in $(< /tmp/file)
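One caveat worth sketching: in bash, $(< file) avoids starting cat, but the for loop still iterates over whitespace-separated words, not lines. The data in this thread has no spaces, so it makes no difference here, but a header with a description after the ID would be split. The sample line below is invented purely to show the effect.

```shell
tmp=$(mktemp)
# Invented sample: the header carries a free-text description with spaces.
printf '%s\n' '>14219|LGIG|61640 sample description' 'MSFEFTIPINLD' > "$tmp"

# for + command substitution: one iteration per whitespace-separated word
for_count=0
for i in $(cat "$tmp"); do
    for_count=$((for_count + 1))
done

# while + read: one iteration per line, with spaces and backslashes preserved
read_count=0
while IFS= read -r line; do
    read_count=$((read_count + 1))
done < "$tmp"

rm -f "$tmp"
echo "for: $for_count iterations, while read: $read_count iterations"
```

The two lines of input yield four iterations in the for loop and two in the while loop.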

larryhaja

10-28-2009 06:40 PM

Code:

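#!/bin/sh

# Loop over the header lines (those starting with ">").
for i in `echo test.txt | xargs grep '^>'`; do
  # The third |-separated field is the ID to keep.
  local_var=$(echo "$i" | awk -F '|' '{print $3}')
  # Anchor the address to "|ID" at the end of the line, so an ID that is a
  # substring of another line cannot trigger the rewrite there.
  sed -i "/|${local_var}\$/s|>.*|>${local_var}|" test.txt
done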

custangro

10-28-2009 06:59 PM

Quote:

Originally Posted by larryhaja (Post 3735916)

Code:

#!/bin/sh



for i in `echo test.txt | xargs grep '^>'`; do

  local_var=$(echo $i | awk -F '|' '{print $3}')

  sed -i "/${local_var}/s|>.*|>${local_var}|" test.txt

done

I don't think echo would work

ghostdog74

10-28-2009 07:15 PM

Quote:

Originally Posted by vikas027 (Post 3735908)

Code:

for i in `cat /tmp/file`

do

echo "$i" | grep "^>"

if [ $? -eq 0 ]

then

echo "$i" | awk -F"|" '{print $3}' >> /tmp/file1

else

echo "$i" >> /tmp/file1

fi

done;

If you are doing it in the shell, here's one approach:

Code:

while IFS="|" read -r a b c
do
    case "$a" in
    ">"*) echo ">$c" ;;
    # non-header lines: re-join only the fields that are actually present
    *) echo "$a${b:+|$b}${c:+|$c}" ;;
    esac
done < "file"

Instead of calling grep and awk for each line, you can use the shell's built-in features, which is much faster.

Quote:

OR a one liner ;)

Code:

for i in `cat /tmp/file`; do echo "$i" | grep "^>"; if [ $? -eq 0 ]; then echo "$i" | awk -F"|" '{print $3}' >> /tmp/file1; else echo "$i" >> /tmp/file1; fi; done;

Hope this helps.

Never cram your code into a one-liner like that.

lutusp

10-28-2009 07:24 PM

Quote:

Originally Posted by kmkocot (Post 3735886)

Hi all,

I have a file that looks like this:

Code:

>14219|LGIG|61640

MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA

>14237|LGIG|86853

PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS

>14286|LGIG|234779

MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL

>14297|LGIG|139771

QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

I want to go through and remove the first two regions (but not the greater-than symbol) from every other line so that my file looks like this:

Code:

>61640

MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA

>86853

PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS

>234779

MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL

>139771

QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

Code:

while IFS= read -r line
do
  # header lines start with ">": keep ">" plus everything after the last "|"
  [[ $line == ">"* ]] && echo ">${line##*|}" || echo "$line"
done < data.txt

Output:

Code:

>61640

MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA

>86853

PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS

>234779

MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL

>139771

QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

larryhaja

10-28-2009 08:08 PM

Quote:

Originally Posted by custangro (Post 3735932)

I don't think echo would work

It does on my machine.

ghostdog74

10-28-2009 09:18 PM

Quote:

Originally Posted by custangro (Post 3735932)

I don't think echo would work

It would work, because he is piping the file name to xargs, which passes it on to grep. That is practically redundant and can be written as something like:

Code:

grep "^>" test.txt | while read line

do

 ....

done

custangro

10-28-2009 11:05 PM

Quote:

Originally Posted by ghostdog74 (Post 3736041)

it would work because he is piping to xargs and grep, which is practically redundant and can be written just something like..

Code:

grep "^>" test.txt | while read line

do

 ....

done

Ok yes your code looks like something that would work...

Maybe I'm just unfamiliar with "xargs" but wouldn't echo test.txt just print "test.txt"?

-C

ghostdog74

10-28-2009 11:33 PM

Quote:

Originally Posted by custangro (Post 3736101)

but wouldn't echo test.txt just print "test.txt"?

-C

Of course. When you pipe this echo to xargs, xargs just takes "test.txt" and passes it to grep as an argument. That's all there is to it: an extra, redundant step.
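To see the equivalence concretely, here is a small sketch that compares the two forms on a throwaway file. The temporary file and its contents are made up for the demonstration; only the two grep invocations come from the thread.

```shell
tmp=$(mktemp)
printf '%s\n' '>14219|LGIG|61640' 'MSFEFTIPINLD' '>14237|LGIG|86853' 'PPAGPQQP' > "$tmp"

# xargs reads the file name from stdin and appends it to the grep command line
via_xargs=$(echo "$tmp" | xargs grep '^>')

# the same grep call with the file name given directly
direct=$(grep '^>' "$tmp")

rm -f "$tmp"

if [ "$via_xargs" = "$direct" ]; then
    echo "same output"
fi
```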

kmkocot

10-29-2009 10:06 AM

Thanks all! I really appreciate your help!

All times are GMT -5.