how to look for the shortest match using regex, basically the opposite of .*

new_2_unix · 01-07-2008, 01:13 PM

hi,

i'm having a problem in the following situation:

suppose, i have a string "Scrapple from the apple."

then, if i use the regular expression "a.*e", it will match: "apple from the apple", because by definition using the .* will match the longest string that will match the regular expression.

and it won't match: "apple" OR "apple from the" even though these also start with "a" and end with an "e".

my problem is that instead of looking for the longest match, i want the shortest match. i've looked at the tutorials, but am still at a loss on how to do this. any help will be much appreciated.
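For reference, when the match should stop at the first occurrence of a single character, one common workaround in sed is a negated bracket expression in place of `.`. The sketch below only illustrates that idea on the sample string; the function names `greedy` and `shortest` are made up here, and the square brackets in the replacement are just there to make the matched text visible.

```shell
# Mark what each pattern matches by wrapping the match in [ ].
greedy()   { sed 's/a.*e/[&]/'; }      # .* runs to the last "e" on the line
shortest() { sed 's/a[^e]*e/[&]/'; }   # [^e]* cannot cross an "e", so it stops at the first one

printf '%s\n' 'Scrapple from the apple.' | greedy
# prints: Scr[apple from the apple].
printf '%s\n' 'Scrapple from the apple.' | shortest
# prints: Scr[apple] from the apple.
```

This only works when the end of the match is one character; a multi-character terminator such as a closing tag needs a different trick.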

pixellany · 01-07-2008, 01:45 PM

Here is just one crude method (using sed):

sed 's/a.\{,3\}e/G/g' filename

This matches the pattern: "a" + a maximum of 3 characters + "e", and replaces it with "G" for all occurrences.
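The bounded-repetition idea can be tried on the sample string from the first post. This is only a sketch: the `\{,3\}` spelling relies on a GNU extension, so the version below writes the lower bound explicitly as `\{0,3\}`, which POSIX basic regular expressions define. The function name `short_e` is hypothetical.

```shell
# Replace "a" + at most 3 characters + "e" with "G" (POSIX BRE interval).
short_e() {
  sed 's/a.\{0,3\}e/G/g'
}

printf '%s\n' 'Scrapple from the apple.' | short_e
# prints: ScrG from the G.
```

The cap has to be picked to suit the input. With a limit of 20 instead of 3, the pattern would again match "apple from the apple" as a whole, because sed still takes the longest match the limit allows.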

Poetics · 01-07-2008, 01:52 PM

You want the "non-greedy" matching operators. In Perl, for example, .+? matches as few characters as possible (but at least one) before the rest of the pattern has to match. Be careful with .*? though: it will happily match zero characters and stop.

new_2_unix · 01-07-2008, 02:12 PM

here is an example of what i'm trying to do.
i'm trying to delete everything between and including <tag1> and </tag1>. but anything that's outside of this should not be deleted.

i'm doing this with a sed script, and the regex is not working.

[HTML]
<html><body><tag1>This is inside tag1. This should be deleted.</tag1>This is the first statement outside of tag1. This should NOT be deleted.<tag1>This is once again inside tag1. This should be deleted as well.</tag1> This is the second statement outside tag1. This should NOT be deleted.</body></html>
[/HTML]

i've tried the following:
in this one the problem is that it deletes the first statement outside <tag1> as well.

Code:

$ cat test1 | sed 's/<tag1>.*<\/tag1>//'

<html><body>This is the second statement outside tag1. This should not be deleted.</body></html>

in this one the problem is that it does not delete anything:

Code:

$ cat test1 | sed 's/<tag1>.+?<\/tag1>//'

<html><body><tag1>This is inside tag1. This should be deleted.</tag1>This is the first statement outside tag1. This should not be deleted.<tag1>This is once again inside tag1. This should be deleted as well.</tag1>This is the second statement outside tag1. This should not be deleted.</body></html>

i've tried several variants of the regex above... but am still at a loss on how to do this... any guidance will be helpful. thanks!

Poetics · 01-07-2008, 02:27 PM

What about using "s/(^.*<tag1>).*(<\/tag1>.*$)/\1\2/"? That will save everything up to and including <tag1> from the start of the line, and then save everything after and including </tag1>, to the end of the line, chopping the middle. An inelegant solution, I know, but something that may work until something better comes along.

pixellany · 01-07-2008, 02:32 PM

In your first example, the regex is "greedy", i.e. it goes all the way to the last instance of </tag1>.
In addition to my earlier crude solution (max # of characters), you could also do this:

sed -e 's/\/tag1/TAGONE/' -e 's/<tag1>.*<TAGONE>//'

(By replacing only the first instance of "/tag1", you create an unambiguous endpoint for the second sed command.)
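The placeholder idea can be stretched to cover every pair on a line. A minimal sketch, under the assumption that the byte \001 never occurs in the input: every closing tag is first swapped for that single byte, and then a negated bracket expression stops each match at the nearest one. The function name `strip_tag1_sed` is made up for illustration.

```shell
# Delete every <tag1>...</tag1> pair on a line with plain sed.
# Assumption: the byte \001 never occurs in the input.
strip_tag1_sed() {
  sep=$(printf '\001')
  # 1) turn each closing tag into the one-byte placeholder
  # 2) delete from each opening tag up to the nearest placeholder
  sed "s|</tag1>|$sep|g; s|<tag1>[^$sep]*$sep||g"
}

printf '%s\n' 'A<tag1>x</tag1>B<tag1>y</tag1>C' | strip_tag1_sed
# prints: ABC
```

A closing tag with no opening tag before it would be left behind as the placeholder byte, so this assumes the tags are well paired and do not span lines.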

My favorite SED tutorial:
http://www.grymoire.com/Unix/Sed.html

Quote:

.+?

I don't know if this is legal--I have never seen it in Bash. "+" means one or more, and "?" means optional---I don't know what the combo would mean. Maybe try it on something simple.....
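Trying it on something simple: in sed's default (basic) regular expression syntax, an unescaped + or ? is an ordinary character, so `.+?` there means "any one character followed by a literal +?". The sketch below shows that reading; the function name `try_plus_q` and the `[gone]` replacement text are made up for the demonstration.

```shell
# In a basic regular expression, "+" and "?" without backslashes are literals.
try_plus_q() {
  sed 's/<tag1>.+?<\/tag1>/[gone]/'
}

printf '%s\n' '<tag1>text</tag1>' | try_plus_q
# prints: <tag1>text</tag1>   (no match, line unchanged)
printf '%s\n' '<tag1>x+?</tag1>' | try_plus_q
# prints: [gone]              (matches only a literal "+?")
```

That would explain why the earlier attempt with `.+?` in sed deleted nothing.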

Poetics · 01-07-2008, 02:42 PM

Quote:

Originally Posted by pixellany

I don't know if this is legal--I have never seen it in Bash. "+" means one or more, and "?" means optional---I don't know what the combo would mean. Maybe try it on something simple.....

His question just mentioned regular expressions, context-unspecific. In perl (where I do 99% of my personal RegExp work) one can use the '?' operator to turn a greedy operator into a non-greedy one.

pixellany · 01-08-2008, 08:29 AM

This works:

Code:

sed 's/[0-9]\{2\},\+\?[0-9]\{2\}/DDD/g' filename

this finds all instances of 2 digits + optional 1 or more commas, plus 2 more digits.

From all my reading, I had no idea that the \+\? construct would work.

It seems that this has a very different meaning from the perl one.

EDIT--PS: Works in grep, too...
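A small check of that reading, assuming GNU sed (where `\+` and `\?` are extensions to basic regular expressions): "optional one-or-more commas" amounts to "zero or more commas", so the pattern should behave like the same pattern written with `,*`. The function names `gnu_form` and `posix_form` are made up for the comparison.

```shell
# Assumes GNU sed: \+ and \? are GNU extensions to basic regular expressions.
gnu_form()   { sed 's/[0-9]\{2\},\+\?[0-9]\{2\}/DDD/g'; }
# Same idea spelled with the standard "*" operator.
posix_form() { sed 's/[0-9]\{2\},*[0-9]\{2\}/DDD/g'; }

printf '%s\n' '1234 12,34 12,,34' | gnu_form
printf '%s\n' '1234 12,34 12,,34' | posix_form
# both print: DDD DDD DDD
```

So here the trailing `\?` makes the group optional rather than non-greedy, which is the difference from the Perl meaning.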

ghostdog74 · 01-08-2008, 09:21 AM

Quote:

Originally Posted by new_2_unix

i'm doing this with a sed script, and the regex is not working.

Sometimes regex can be a pain, if you don't understand it enough. Until you get to know it better, here's one way without regex (at least not too much)

Code:

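awk 'BEGIN { FS = "</tag1>" }   # split each line at every closing tag
{
  for (i = 1; i <= NF; i++) {
    if ($i ~ /<tag1>/) {
      # keep only the text before the opening tag
      split($i, a, /<tag1>/)
      print a[1]
    } else {
      # no opening tag in this piece, so keep all of it
      print $i
    }
  }
}
' "file"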
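As a variation on the same mostly-regex-free idea, here is a sketch that walks each line with awk's `index()` and `substr()` and writes the kept text back out as a single line. The function name `strip_tag1_awk` is hypothetical, and the sketch assumes a tag pair never spans more than one line.

```shell
# Remove each <tag1>...</tag1> pair, keeping the remaining text on one line.
strip_tag1_awk() {
  awk '{
    out = ""
    rest = $0
    # look for the next opening tag in what is left of the line
    while ((s = index(rest, "<tag1>")) > 0) {
      tail = substr(rest, s)
      e = index(tail, "</tag1>")     # nearest closing tag after it
      if (e == 0) break              # unclosed tag: leave the rest untouched
      out = out substr(rest, 1, s - 1)
      rest = substr(tail, e + length("</tag1>"))
    }
    print out rest
  }' "$@"
}

printf '%s\n' 'A<tag1>x</tag1>B<tag1>y</tag1>C' | strip_tag1_awk
# prints: ABC
```

Since `index()` returns the first position of a plain string, each opening tag is paired with the nearest closing tag, which gives the shortest-match behaviour without any regex quantifier.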