LinuxQuestions.org


kmkocot 10-28-2009 06:03 PM

sed or awk help - need to remove text on each line before a regular expression
 
Hi all,

I have a file that looks like this:
Code:

>14219|LGIG|61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>14237|LGIG|86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>14286|LGIG|234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>14297|LGIG|139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

I want to go through and remove the first two regions (but not the greater-than symbol) from every other line so that my file looks like this:

Code:

>61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

Could anyone help me set up a script to make this happen? It seems like sed and awk could both do this easily but I can't figure out the syntax to tell either to remove all text except the > before the |LGIG|.

Thanks!
Kevin

Telemachos 10-28-2009 06:24 PM

Someone can probably do it in awk or sed, but here's a short Perl version. Save this as (say) reader and run it as perl reader filename:
Code:

#!/usr/bin/env perl
use strict;
use warnings;

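while (<>) {
    # Sequence lines (anything not starting with ">") pass through unchanged.
    print && next unless m/^>/;
    # Header lines: keep ">" plus the third "|"-separated field.
    print '>', (split /\|/, $_)[2];
}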


pixellany 10-28-2009 06:32 PM

Code:

sed  '/LGIG/s/^>[0-9]\+|LGIG|/>/' filename > newfilename
For any line containing "LGIG", matches, at the beginning of the line, ">", at least one digit, then "|LGIG|"; and replaces it with ">".
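
As a variation on the command above, here is a minimal sketch that does not hard-code the "LGIG" tag. It assumes every header line starts with ">" and that the field to keep is always the last "|"-separated one; the function name strip_header_prefix is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): keep only the last
# "|"-separated field of each header line, whatever the middle tag is.
strip_header_prefix() {
    # On lines starting with ">", delete everything between ">" and the
    # last "|". The ".*" is greedy, so it runs up to the final "|".
    sed '/^>/s/^>.*|/>/' "$@"
}

# Sample input taken from the original post.
printf '%s\n' '>14219|LGIG|61640' \
    'MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA' |
    strip_header_prefix
```

Called with a file name instead of a pipe (strip_header_prefix filename > newfilename), it behaves like the sed command above. It avoids "\+", so it should not depend on GNU sed.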

vikas027 10-28-2009 06:34 PM

Here is the awk code.

This could probably also be done in one line with sed, or with a sed/awk combination, but my knowledge of sed/awk is limited.

/tmp/file is your base file and /tmp/file1 is what you need.

Code:

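for i in `cat /tmp/file`
do
# -q keeps grep from echoing the matched header to the terminal
if echo "$i" | grep -q "^>"
then
# header line: keep the ">" and print only the third field
echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1
else
# sequence line: copy unchanged
echo "$i" >> /tmp/file1
fi
done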

OR a one liner ;)
Code:

for i in `cat /tmp/file`; do if echo "$i" | grep -q "^>"; then echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1; else echo "$i" >> /tmp/file1; fi; done
Hope this helps.

ghostdog74 10-28-2009 06:38 PM

Quote:

Originally Posted by kmkocot (Post 3735886)
Could anyone help me set up a script to make this happen?

Next time, show the script you have already tried.
Code:

$ awk -F"|" '/^>/{print ">"$NF;next}1' file
>61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI


custangro 10-28-2009 06:39 PM

Quote:

Originally Posted by vikas027 (Post 3735908)
Here, is the awk code.

This could be done in one line too with sed or probably with sed/awk combination But my knowledge is limited to sed/awk.

/tmp/file is your base file and /tmp/file1 is what you need.

Code:

for i in `cat /tmp/file`
do
echo "$i" | grep "^>"
if [ $? -eq 0 ]
then
echo "$i" | awk -F"|" '{print $3}' >> /tmp/file1
else
echo "$i" >> /tmp/file1
fi
done;

OR a one liner ;)
Code:

for i in `cat /tmp/file`; do echo "$i" | grep "^>"; if [ $? -eq 0 ]; then echo "$i" | awk -F"|" '{print $3}' >> /tmp/file1; else echo "$i" >> /tmp/file1; fi; done;
Hope this helps.

you don't need to cat the file...

Code:

for i in $(< /tmp/file)

larryhaja 10-28-2009 06:40 PM

Code:

#!/bin/sh

for i in `echo test.txt | xargs grep '^>'`; do
  local_var=$(echo "$i" | awk -F '|' '{print $3}')
  # Anchor on the whole header so an ID that is a substring of another header is left alone
  sed -i "s/^>.*|${local_var}\$/>${local_var}/" test.txt
done


custangro 10-28-2009 06:59 PM

Quote:

Originally Posted by larryhaja (Post 3735916)
Code:

#!/bin/sh

for i in `echo test.txt | xargs grep '^>'`; do
  local_var=$(echo $i | awk -F '|' '{print $3}')
  sed -i "/${local_var}/s|>.*|>${local_var}|" test.txt
done


I don't think echo would work

ghostdog74 10-28-2009 07:15 PM

Quote:

Originally Posted by vikas027 (Post 3735908)
Code:

for i in `cat /tmp/file`
do
echo "$i" | grep "^>"
if [ $? -eq 0 ]
then
echo "$i" | awk -F"|" '{print $3}' >> /tmp/file1
else
echo "$i" >> /tmp/file1
fi
done;


If doing it in shell, here's one approach
Code:

while IFS="|" read -r a b c
do
    case "$a" in
    ">"*) echo ">$c" ;;
    # sequence lines contain no "|", so the whole line is in $a
    *) echo "$a" ;;
    esac
done < "file"

Instead of calling grep and awk for each line, you can use the shell's built-in features, which is much faster.
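
To illustrate the point about staying inside the shell, here is a rough sketch that uses only builtins (read, case, printf) and parameter expansion, so no external process is started per line. The function name shorten_headers is hypothetical, and it assumes the field to keep is the last "|"-separated one on each header line.

```shell
#!/bin/sh
# Hypothetical function name; everything inside the loop is a shell builtin.
shorten_headers() {
    # "|| [ -n "$line" ]" also handles a final line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            # Header with at least one "|": strip through the last "|".
            '>'*'|'*) printf '>%s\n' "${line##*|}" ;;
            # Everything else (sequence lines, plain headers): unchanged.
            *)        printf '%s\n' "$line" ;;
        esac
    done
}

# Sample input taken from the original post.
printf '%s\n' '>14286|LGIG|234779' \
    'MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL' |
    shorten_headers
```

For a real file it would be run as shorten_headers < file > newfile. On large files awk or sed will likely still be faster than any shell loop; the gain here is only relative to spawning grep and awk once per line.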


Quote:

OR a one liner ;)
Code:

for i in `cat /tmp/file`; do echo "$i" | grep "^>"; if [ $? -eq 0 ]; then echo "$i" | awk -F"|" '{print $3}' >> /tmp/file1; else echo "$i" >> /tmp/file1; fi; done;
Hope this helps.
Never cram your code into a one-liner like that.

lutusp 10-28-2009 07:24 PM

Quote:

Originally Posted by kmkocot (Post 3735886)
Hi all,

I have a file that looks like this:
Code:

>14219|LGIG|61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>14237|LGIG|86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>14286|LGIG|234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>14297|LGIG|139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

I want to go through and remove the first two regions (but not the greater-than symbol) from every other line so that my file looks like this:

Code:

>61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI

Could anyone help me set up a script to make this happen? It seems like sed and awk could both do this easily but I can't figure out the syntax to tell either to remove all text except the > before the |LGIG|.

Thanks!
Kevin

Code:

while IFS= read -r line
do
  # header lines start with ">"; keep only the text after the last "|"
  [[ $line == ">"* ]] && echo ">${line##*|}" || echo "$line"
done < data.txt

Output:

Code:

>61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI


larryhaja 10-28-2009 08:08 PM

Quote:

Originally Posted by custangro (Post 3735932)
I don't think echo would work

It does on my machine.

ghostdog74 10-28-2009 09:18 PM

Quote:

Originally Posted by custangro (Post 3735932)
I don't think echo would work

It works because he is piping the file name to xargs, which hands it to grep as an argument. That step is redundant; the loop can be written as something like:
Code:

grep "^>" test.txt | while read line
do
 ....
done


custangro 10-28-2009 11:05 PM

Quote:

Originally Posted by ghostdog74 (Post 3736041)
it would work because he is piping to xargs and grep, which is practically redundant and can be written just something like..
Code:

grep "^>" test.txt | while read line
do
 ....
done


Ok yes your code looks like something that would work...

Maybe I'm just unfamiliar with "xargs" but wouldn't echo test.txt just print "test.txt"?

-C

ghostdog74 10-28-2009 11:33 PM

Quote:

Originally Posted by custangro (Post 3736101)
but wouldn't echo test.txt just print "test.txt"?

-C

Of course. When you pipe this echo to xargs, xargs just takes "test.txt" and passes it to grep as an argument. That's all there is to it: an extra, redundant step.
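
A small sketch of that equivalence, for anyone who wants to check it. The temporary directory and sample contents are made up for the demonstration, and it assumes mktemp -d is available, as it is on typical Linux systems.

```shell
#!/bin/sh
# Build a small sample file in a temporary directory (illustrative only).
tmpdir=$(mktemp -d)
sample="$tmpdir/test.txt"
printf '%s\n' '>14219|LGIG|61640' 'MSFEFTIPINLDCLL' \
    '>14237|LGIG|86853' 'PPAGPQQPMVSPN' > "$sample"

# xargs reads the file name from stdin and appends it to the grep command line...
via_xargs=$(echo "$sample" | xargs grep '^>')
# ...so the result is the same as naming the file directly.
direct=$(grep '^>' "$sample")

[ "$via_xargs" = "$direct" ] && echo "same output"
rm -rf "$tmpdir"
```

Both variables end up holding the two header lines, which is why the echo and xargs step can simply be dropped.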

kmkocot 10-29-2009 10:06 AM

Thanks all! I really appreciate your help!


All times are GMT -5.