LinuxQuestions.org
Old 10-28-2009, 07:03 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Rep: Reputation: 15
sed or awk help - need to remove text on each line before a regular expression


Hi all,

I have a file that looks like this:
Code:
>14219|LGIG|61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>14237|LGIG|86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>14286|LGIG|234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>14297|LGIG|139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI
I want to go through and remove the first two regions (but not the greater-than symbol) from every other line so that my file looks like this:

Code:
>61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI
Could anyone help me set up a script to make this happen? It seems like sed and awk could both do this easily but I can't figure out the syntax to tell either to remove all text except the > before the |LGIG|.

Thanks!
Kevin
 
Old 10-28-2009, 07:24 PM   #2
Telemachos
Member
 
Registered: May 2007
Distribution: Debian
Posts: 754

Rep: Reputation: 59
Someone can probably do it in awk or sed, but here's a short Perl version. Save this as (say) reader and run it as perl reader filename:
Code:
#!/usr/bin/env perl
use strict;
use warnings;

while (<>) {
    print && next if m/^[A-Z]/;
    print '>', (split /\|/, $_)[2];
}
 
Old 10-28-2009, 07:32 PM   #3
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 738
Code:
sed  '/LGIG/s/^>[0-9]\+|LGIG|/>/' filename > newfilename
For any line containing "LGIG", this matches ">" at the beginning of the line, followed by at least one digit and then "|LGIG|", and replaces all of that with ">".
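A note for anyone on a non-GNU sed: `\+` is a GNU extension, so a more portable spelling repeats the digit class instead. What follows is a minimal sketch, assuming the same header layout as the sample file; the function names `strip_prefix` and `strip_to_last_field` are made up for illustration.

```shell
# Hypothetical helpers; both read the named files, or stdin if none are given.

# Same substitution as the command above, written without the GNU-only \+ operator.
strip_prefix() {
    sed 's/^>[0-9][0-9]*|LGIG|/>/' "$@"
}

# More general variant: on header lines, drop everything between ">" and the
# last "|" (the .* is greedy), whatever the first two fields contain.
strip_to_last_field() {
    sed '/^>/s/^>.*|/>/' "$@"
}
```

Usage would look like `strip_prefix filename > newfilename`.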
 
Old 10-28-2009, 07:34 PM   #4
vikas027
Senior Member
 
Registered: May 2007
Location: Sydney
Distribution: RHEL, CentOS, Debian, OS X
Posts: 1,298

Rep: Reputation: 102

Here is the awk code.

This could probably be done in one line with sed, or with a sed/awk combination, but my knowledge of sed and awk is limited.

/tmp/file is your input file and /tmp/file1 is the output you need.

Code:
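for i in `cat /tmp/file`
do
# Header lines start with ">"; -q keeps grep from printing them to the terminal
if echo "$i" | grep -q "^>"
then
# Put the ">" back in front of the third |-separated field
echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1
else
echo "$i" >> /tmp/file1
fi
done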
OR a one liner
Code:
for i in `cat /tmp/file`; do if echo "$i" | grep -q "^>"; then echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1; else echo "$i" >> /tmp/file1; fi; done
Hope this helps.

Last edited by vikas027; 10-28-2009 at 07:37 PM. Reason: Need to print $3 in awk
 
Old 10-28-2009, 07:38 PM   #5
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by kmkocot View Post
Could anyone help me set up a script to make this happen?
Next time, show the script you have tried so far.
Code:
$ awk -F"|" '/^>/{print ">"$NF;next}1' file
>61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI
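
For readers new to awk, the one-liner can be spelled out as follows. This is only an expanded sketch of the same command, not a different method; `strip_headers_awk` is a hypothetical wrapper name, and it assumes every header line contains at least one "|".

```shell
# Hypothetical wrapper; reads the named files, or stdin if none are given.
strip_headers_awk() {
    awk -F'|' '
        # Header line: print ">" followed by the last |-separated field,
        # then skip the remaining rules for this line.
        /^>/ { print ">" $NF; next }
        # Every other line: print unchanged (the bare "1" in the one-liner
        # is shorthand for this rule).
        { print }
    ' "$@"
}
```

Using `$NF` rather than `$3` means the command keeps the last field even if a header has more or fewer than three fields.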
 
Old 10-28-2009, 07:39 PM   #6
custangro
Senior Member
 
Registered: Nov 2006
Location: California
Distribution: Fedora , CentOS , RHEL
Posts: 1,971
Blog Entries: 1

Rep: Reputation: 208
Quote:
Originally Posted by vikas027 View Post
Code:
for i in `cat /tmp/file`
you don't need to cat the file...

Code:
for i in $(< /tmp/file)
 
Old 10-28-2009, 07:40 PM   #7
larryhaja
Member
 
Registered: Jul 2008
Distribution: Slackware 13.1
Posts: 305

Rep: Reputation: 80
Code:
#!/bin/sh

for i in `echo test.txt | xargs grep '^>'`; do
  local_var=$(echo $i | awk -F '|' '{print $3}')
  sed -i "/${local_var}/s|>.*|>${local_var}|" test.txt
done
 
Old 10-28-2009, 07:59 PM   #8
custangro
Senior Member
 
Registered: Nov 2006
Location: California
Distribution: Fedora , CentOS , RHEL
Posts: 1,971
Blog Entries: 1

Rep: Reputation: 208
Quote:
Originally Posted by larryhaja View Post
Code:
#!/bin/sh

for i in `echo test.txt | xargs grep '^>'`; do
  local_var=$(echo $i | awk -F '|' '{print $3}')
  sed -i "/${local_var}/s|>.*|>${local_var}|" test.txt
done
I don't think echo would work
 
Old 10-28-2009, 08:15 PM   #9
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by vikas027 View Post
Code:
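for i in `cat /tmp/file`
do
# Header lines start with ">"; -q keeps grep from printing them to the terminal
if echo "$i" | grep -q "^>"
then
# Put the ">" back in front of the third |-separated field
echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1
else
echo "$i" >> /tmp/file1
fi
done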
If you are doing it in the shell, here's one approach:
Code:
while IFS="|" read -r a b c
do
    case "$a" in
    # Header line: keep ">" and print only the third field
    ">"*) echo ">$c";;
    # Sequence line: it contains no "|", so $a holds the whole line
    *) echo "$a";;
    esac
done < "file"
Instead of calling grep and awk for each line, you can use the shell's built-in commands, which is much faster.
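
The point about shell built-ins can also be sketched with parameter expansion, which avoids splitting the line on "|" at all. This is a minimal POSIX sh sketch under the assumption that header lines start with ">"; the function name `strip_headers` is hypothetical.

```shell
# Hypothetical helper: reads stdin, writes the cleaned-up lines to stdout.
strip_headers() {
    # IFS= and -r keep each line exactly as written; the extra test
    # handles a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ">"*)
                rest=${line#>}                    # drop the leading ">"
                printf '>%s\n' "${rest##*|}"      # keep text after the last "|"
                ;;
            *)
                printf '%s\n' "$line"             # sequence line: unchanged
                ;;
        esac
    done
}
```

It would be run as `strip_headers < file > newfile`. No external process is started per line, which is where the speed difference comes from.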


Quote:
OR a one liner
Code:
for i in `cat /tmp/file`; do if echo "$i" | grep -q "^>"; then echo "$i" | awk -F"|" '{print ">" $3}' >> /tmp/file1; else echo "$i" >> /tmp/file1; fi; done
Hope this helps.
Never cram your code into a one-liner like that.
 
Old 10-28-2009, 08:24 PM   #10
lutusp
Member
 
Registered: Sep 2009
Distribution: Fedora
Posts: 835

Rep: Reputation: 102
Quote:
Originally Posted by kmkocot View Post
Could anyone help me set up a script to make this happen?
Code:
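# Read the file directly; IFS= and -r keep each line exactly as written
while IFS= read -r line
do
   # Only lines that start with ">" are headers
   [[ $line == ">"* ]] && echo ">${line##*|}" || echo "$line"
done < data.txt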
Output:

Code:
>61640
MSFEFTIPINLDCLLSKTNVSQYVVEEVLPLRIIPGAVQDFKFAVRNDNFA
>86853
PPAGPQQPMVSPNKIVNAATFCRFGQEYIHEIITKATEIFGS
>234779
MYIASFVLKMVSNRFLVKVAIGGAIFTLTSISGMKIYIENKFQRQDFYLKSMDLL
>139771
QENQSDISQALNQQSDLIEGIYEGGLTIWECGIDLVNYLI
 
Old 10-28-2009, 09:08 PM   #11
larryhaja
Member
 
Registered: Jul 2008
Distribution: Slackware 13.1
Posts: 305

Rep: Reputation: 80
Quote:
Originally Posted by custangro View Post
I don't think echo would work
It does on my machine.
 
Old 10-28-2009, 10:18 PM   #12
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by custangro View Post
I don't think echo would work
It would work because he is piping the file name to xargs, which passes it to grep. That is redundant, though, and can be written as something like:
Code:
grep "^>" test.txt | while read line
do
 ....
done
 
Old 10-29-2009, 12:05 AM   #13
custangro
Senior Member
 
Registered: Nov 2006
Location: California
Distribution: Fedora , CentOS , RHEL
Posts: 1,971
Blog Entries: 1

Rep: Reputation: 208
Quote:
Originally Posted by ghostdog74 View Post
It would work because he is piping the file name to xargs, which passes it to grep. That is redundant, though, and can be written as something like:
Code:
grep "^>" test.txt | while read line
do
 ....
done
Ok yes your code looks like something that would work...

Maybe I'm just unfamiliar with "xargs" but wouldn't echo test.txt just print "test.txt"?

-C
 
Old 10-29-2009, 12:33 AM   #14
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by custangro View Post
but wouldn't echo test.txt just print "test.txt"?

-C
Of course. When you pipe that echo to xargs, xargs just takes "test.txt" and passes it to grep as an argument. That's all there is to it: an extra, redundant step.
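
To see the equivalence for yourself, something like the following should do. It is only a sketch: the temporary file and the variable names `via_xargs` and `direct` are made up for the demonstration.

```shell
# Build a small throwaway input file (hypothetical sample data).
tmp=$(mktemp)
printf '>14219|LGIG|61640\nMSFEF\n' > "$tmp"

# echo prints the file name; xargs turns that name into an argument for grep.
via_xargs=$(echo "$tmp" | xargs grep '^>')

# Calling grep on the file directly does the same job in one step.
direct=$(grep '^>' "$tmp")

# Both variables now hold the same header line.
[ "$via_xargs" = "$direct" ] && echo "same output"

rm -f "$tmp"
```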
 
Old 10-29-2009, 11:06 AM   #15
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
Thanks all! I really appreciate your help!
 
  

