LinuxQuestions.org
Old 01-07-2008, 02:13 PM   #1
new_2_unix
LQ Newbie
 
Registered: Oct 2007
Posts: 26

Rep: Reputation: 15
how to look for the shortest match using regex, basically the opposite of .*


hi,

i have a problem with the following situation:

suppose i have the string "Scrapple from the apple."

if i use the regular expression "a.*e", it matches "apple from the apple", because .* is greedy: it matches the longest string that satisfies the regular expression.

it won't match "apple" or "apple from the", even though these also start with an "a" and end with an "e".

my problem is that instead of the longest match, i want the shortest match. i've looked at the tutorials, but am still at a loss on how to do this. any help will be much appreciated.
 
Old 01-07-2008, 02:45 PM   #2
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
Here is just one crude method (using sed):

sed 's/a.\{,3\}e/G/g' filename

This matches the pattern "a" + at most 3 characters + "e", and replaces every occurrence with "G".
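A minimal sketch of this bounded-repetition idea, run on the sample sentence from post #1. It spells the interval as \{0,3\}, which is the form POSIX specifies (the \{,3\} shorthand above may not be accepted by every sed); the variable names are made up for illustration.

```shell
# Sample sentence from the first post.
line='Scrapple from the apple.'

# "a", then at most 3 arbitrary characters, then "e".
# The upper bound is what stops the match from running on to the last "e".
bounded=$(printf '%s\n' "$line" | sed 's/a.\{0,3\}e/G/g')

printf '%s\n' "$bounded"
```

The catch is that the bound has to be picked by hand: too small and longer matches are missed, too large and a match can again swallow more than intended.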
 
Old 01-07-2008, 02:52 PM   #3
Poetics
Senior Member
 
Registered: Jun 2003
Location: California
Distribution: Slackware
Posts: 1,181

Rep: Reputation: 49
You want the "non-greedy" matching operators. In Perl, for example, .+? matches as few characters as possible (but at least one). Beware of .*? though -- it will happily match 0 characters and stop.
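POSIX regexes, as used by sed, have no non-greedy operators, so a common workaround there is a negated bracket expression: instead of .*, match "anything except the terminating character". A minimal sketch on the sentence from post #1 (the variable names are only illustrative):

```shell
line='Scrapple from the apple.'

# Greedy: .* runs on to the LAST "e" on the line.
greedy=$(printf '%s\n' "$line" | sed 's/a.*e/[&]/')

# Shortest: [^e]* cannot step over an "e", so the match ends at the FIRST one.
shortest=$(printf '%s\n' "$line" | sed 's/a[^e]*e/[&]/')

# The brackets in the output mark what each pattern matched.
printf '%s\n' "$greedy" "$shortest"
```

This only works when the terminator is a single character; a multi-character terminator such as a closing tag needs a different trick.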
 
Old 01-07-2008, 03:12 PM   #4
new_2_unix
LQ Newbie
 
Registered: Oct 2007
Posts: 26

Original Poster
Rep: Reputation: 15
here is an example of what i'm trying to do.
i'm trying to delete everything between and including <tag1> and </tag1>. but anything that's outside of this should not be deleted.

i'm doing this with a sed script, and the regex is not working.

[HTML]
<html><body><tag1>This is inside tag1. This should be deleted.</tag1>This is the first statement outside of tag1. This should NOT be deleted.<tag1>This is once again inside tag1. This should be deleted as well.</tag1> This is the second statement outside tag1. This should NOT be deleted.</body></html>
[/HTML]

i've tried the following:
in this one the problem is that it deletes the first line outside <tag1> as well.
Code:
$ cat test1 | sed 's/<tag1>.*<\/tag1>//'

<html><body>This is the second statement outside tag1. This should not be deleted.</body></html>
in this one the problem is that it does not delete anything:
Code:
$ cat test1 | sed 's/<tag1>.+?<\/tag1>//'

<html><body><tag1>This is inside tag1. This should be deleted.</tag1>This is the first statement outside tag1. This should not be deleted.<tag1>This is once again inside tag1. This should be deleted as well.</tag1>This is the second statement outside tag1. This should not be deleted.</body></html>
i've tried several variants of the regex above... but am still at a loss on how to do this... any guidance will be helpful. thanks!
 
Old 01-07-2008, 03:27 PM   #5
Poetics
Senior Member
 
Registered: Jun 2003
Location: California
Distribution: Slackware
Posts: 1,181

Rep: Reputation: 49
What about using "s/(^.*<tag1>).*(<\/tag1>.*$)/\1\2/"? (In sed this needs extended regexes, e.g. sed -r, or backslashes before the parentheses.) That will save everything from the start of the line up to and including <tag1>, and then everything from </tag1> to the end of the line, chopping out the middle. An inelegant solution, I know, but something that may work until something better comes along.
 
Old 01-07-2008, 03:32 PM   #6
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
In your first example, the regex is "greedy", i.e. it goes all the way to the last instance of </tag1>.
In addition to my earlier crude solution (max # of characters), you could also do this:

sed -e 's/\/tag1/TAGONE/' -e 's/<tag1>.*<TAGONE>//'

(By replacing only the first instance of "/tag1" you create an unambiguous endpoint for the second sed command.)
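The same unambiguous-endpoint idea can be stretched to cover every <tag1>...</tag1> pair on a line: turn each closing tag into a single placeholder character, then delete up to the nearest placeholder with a negated bracket expression. A rough sketch, assuming the character @ never occurs in the input (that choice, the shortened sample line, and the variable names are assumptions for illustration):

```shell
# Shortened stand-in for the sample line from post #4.
line='<b><tag1>drop one</tag1>keep one<tag1>drop two</tag1>keep two</b>'

# 1st expression: replace every closing tag with the placeholder "@".
# 2nd expression: delete from each <tag1> up to the NEAREST "@";
#                 [^@]* cannot step over a placeholder.
# 3rd expression: turn any leftover "@" (a stray closing tag) back into </tag1>.
cleaned=$(printf '%s\n' "$line" | sed \
  -e 's/<\/tag1>/@/g' \
  -e 's/<tag1>[^@]*@//g' \
  -e 's/@/<\/tag1>/g')

printf '%s\n' "$cleaned"
```

Like the other sed commands in this thread, this works one line at a time, so a tag pair that spans several lines is not handled.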

My favorite SED tutorial:
http://www.grymoire.com/Unix/Sed.html

Quote:
.+?
I don't know if this is legal--I have never seen it in Bash. "+" means one or more, and "?" means optional---I don't know what the combo would mean. Maybe try it on something simple.....
 
Old 01-07-2008, 03:42 PM   #7
Poetics
Senior Member
 
Registered: Jun 2003
Location: California
Distribution: Slackware
Posts: 1,181

Rep: Reputation: 49
Quote:
Originally Posted by pixellany
I don't know if this is legal--I have never seen it in Bash. "+" means one or more, and "?" means optional---I don't know what the combo would mean. Maybe try it on something simple.....
His question just mentioned regular expressions, without specifying a context. In Perl (where I do 99% of my personal regex work) one can put a '?' after a greedy operator to turn it into a non-greedy one.
 
Old 01-08-2008, 09:29 AM   #8
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
This works:
Code:
sed 's/[0-9]\{2\},\+\?[0-9]\{2\}/DDD/g' filename
This finds all instances of 2 digits + an optional run of 1 or more commas + 2 more digits.

From all my reading, I had no idea that the \+\? construct (bold/underlined in the original command) would work.

It seems that this has a very different meaning from the perl one.

EDIT--PS: Works in grep, too...

Last edited by pixellany; 01-08-2008 at 09:35 AM.
 
Old 01-08-2008, 10:21 AM   #9
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241
Quote:
Originally Posted by new_2_unix
i'm doing this with a sed script, and the regex is not working.
Sometimes regex can be a pain if you don't understand it well enough. Until you get to know it better, here's one way without regex (at least not too much):
Code:
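awk 'BEGIN { FS = "</tag1>" }   # split each line at every closing tag
{
  for (i = 1; i <= NF; i++) {
    if ($i ~ /<tag1>/) {
      # this field holds an opening tag: keep only the text before it
      split($i, a, /<tag1>/)
      printf "%s", a[1]
    } else {
      # no opening tag in this field: keep it whole
      printf "%s", $i
    }
  }
  printf "\n"   # printf instead of print, so the kept pieces stay on one line
}
' "file"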
 
  

