LinuxQuestions.org - how to look for the shortest match using regex, bascially the opposite of .*

hi,

I'm having a problem in the following situation:

Suppose I have the string "Scrapple from the apple."

If I use the regular expression "a.*e", it matches "apple from the apple", because .* is greedy: it matches the longest string that satisfies the regular expression.

It won't match "apple" or "apple from the", even though these also start with "a" and end with "e".

My problem is that instead of the longest match, I want the shortest match. I've looked at the tutorials, but am still at a loss on how to do this. Any help will be much appreciated.

Here is just one crude method (using sed):

sed 's/a.\{,3\}e/G/g' filename

This matches the pattern "a" + at most 3 characters + "e", and replaces every occurrence with "G".
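For comparison, here is a rough Python sketch of the same bounded-repetition idea. It is only an illustration: Python writes the bound as {0,3} without backslashes, and the variable names are made up for this example.

```python
import re

text = "Scrapple from the apple."

# "a", then at most three arbitrary characters, then "e".
# {0,3} is still greedy, but the upper bound keeps it from
# running on to a later "e" further down the line.
result = re.sub(r"a.{0,3}e", "G", text)

print(result)  # ScrG from the G.
```

The catch is that you have to know the maximum length of the text you want to match in advance, which is why this is only a crude workaround.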

You want the "non-greedy" matching operators. In perl, for example, .+? matches as few characters as possible, starting with one (beware of .*? -- it will happily match 0 characters and stop).
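To make the greedy versus non-greedy difference concrete, here is a minimal sketch in Python, whose re module uses the same Perl-style ? suffix. The variable names are just for this example.

```python
import re

text = "Scrapple from the apple."

# Greedy: .* grabs as much as possible, so the match runs to the LAST "e".
greedy = re.search(r"a.*e", text).group()

# Non-greedy: .*? grabs as little as possible, so the match stops at the
# FIRST "e" that follows the first "a".
lazy = re.search(r"a.*?e", text).group()

print(greedy)  # apple from the apple
print(lazy)    # apple
```

Note that a non-greedy match still starts at the leftmost "a" it can find; it only shortens where the match ends. Neither form returns "apple from the".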

here is an example of what i'm trying to do.
i'm trying to delete everything between and including <tag1> and </tag1>. but anything that's outside of this should not be deleted.

i'm doing this with a sed script, and the regex is not working.

[HTML]
<html><body><tag1>This is inside tag1. This should be deleted.</tag1>This is the first statement outside of tag1. This should NOT be deleted.<tag1>This is once again inside tag1. This should be deleted as well.</tag1> This is the second statement outside tag1. This should NOT be deleted.</body></html>
[/HTML]

i've tried the following:
in this one the problem is that it deletes the first line outside <tag1> as well.

Code:

$ cat test1 | sed 's/<tag1>.*<\/tag1>//'



<html><body>This is the second statement outside tag1. This should not be deleted.</body></html>

in this one the problem is that it does not delete anything:

Code:

$ cat test1 | sed 's/<tag1>.+?<\/tag1>//'



<html><body><tag1>This is inside tag1. This should be deleted.</tag1>This is the first statement outside tag1. This should not be deleted.<tag1>This is once again inside tag1. This should be deleted as well.</tag1>This is the second statement outside tag1. This should not be deleted.</body></html>

i've tried several variants of the regex above... but am still at a loss on how to do this... any guidance will be helpful. thanks!
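For what it's worth, sed's regular expressions have no non-greedy quantifier, and in sed's basic syntax an unescaped + or ? is just a literal character, which would explain why the second attempt deletes nothing. In a tool that does support Perl-style quantifiers, the job is short. Below is a sketch in Python, using a shortened stand-in for the test1 contents (the sample string is made up for illustration):

```python
import re

html = (
    "<html><body><tag1>inside, delete me</tag1>first outside, keep me"
    "<tag1>inside again, delete me</tag1> second outside, keep me</body></html>"
)

# .*? stops at the nearest </tag1>, so each <tag1>...</tag1> pair is
# removed on its own and the text between the pairs survives.
# re.DOTALL lets "." cross newlines in case a tag spans several lines.
cleaned = re.sub(r"<tag1>.*?</tag1>", "", html, flags=re.DOTALL)

print(cleaned)
# <html><body>first outside, keep me second outside, keep me</body></html>
```

The equivalent one-liner in perl would be along the lines of perl -pe 's/<tag1>.*?<\/tag1>//g' test1.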

What about using 's/^\(.*<tag1>\).*\(<\/tag1>.*\)$/\1\2/'? That saves everything from the start of the line up to and including <tag1>, and everything from </tag1> to the end of the line, chopping the middle. Note that the leading .* is itself greedy, so on a line with several <tag1> pairs only the contents of the last pair are removed. An inelegant solution, I know, but something that may work until something better comes along.

In your first example, the regex is "greedy", i.e. it goes all the way to the last instance of </tag1>.
In addition to my earlier crude solution (max # of characters), you could also do this:

sed -e 's/\/tag1/TAGONE/' -e 's/<tag1>.*<TAGONE>//'

(By replacing only the first instance of "/tag1" you create an unambiguous endpoint for the second sed command.)
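The same "unambiguous endpoint" idea can be stretched to handle every pair on the line, not just the first. The sketch below is one possible variation, not the command above: it swaps each closing tag for a single marker character and then uses a negated character class, so no non-greedy operator is needed. The function name and the choice of \x01 as the marker are assumptions for illustration; the marker must not occur in the input.

```python
import re

MARK = "\x01"  # assumed never to appear in the input text

def strip_tag1(line: str) -> str:
    # Step 1: collapse every closing tag into one marker character.
    marked = line.replace("</tag1>", MARK)
    # Step 2: [^MARK]* cannot run past a marker, so even a greedy *
    # stops at the nearest closing tag.
    stripped = re.sub("<tag1>[^" + MARK + "]*" + MARK, "", marked)
    # Step 3: put back any closing tag that had no matching opener.
    return stripped.replace(MARK, "</tag1>")

print(strip_tag1("a<tag1>x</tag1>b<tag1>y</tag1>c"))  # abc
```

Because the pattern only relies on a negated character class, the same two-step approach should carry over to sed, which supports [^...] even though it lacks non-greedy matching.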

My favorite SED tutorial:
http://www.grymoire.com/Unix/Sed.html

Quote:

.+?

I don't know if this is legal; I have never seen it in Bash. "+" means one or more, and "?" means optional; I don't know what the combination would mean. Maybe try it on something simple.

Quote:

Originally Posted by pixellany (Post 3014714)

I don't know if this is legal; I have never seen it in Bash. "+" means one or more, and "?" means optional; I don't know what the combination would mean. Maybe try it on something simple.

His question just mentioned regular expressions, context-unspecific. In perl (where I do 99% of my personal RegExp work) one can use the '?' operator to turn a greedy operator into a non-greedy one.

This works:

Code:

sed 's/[0-9]\{2\},\+\?[0-9]\{2\}/DDD/g' filename

This finds all instances of 2 digits, then an optional run of 1 or more commas, then 2 more digits.

From all my reading, I had no idea that the \+\? construct would work.

It seems that this has a very different meaning from the perl one.

EDIT--PS: Works in grep, too...
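To show how the two readings differ, here is a small Python sketch. Going by the behaviour described above, the sed construct is written out as an optional group, and it is compared with the Perl-style non-greedy form. The sample string and variable names are made up for this illustration.

```python
import re

sample = "12,,34 5678"

# What the sed pattern ,\+\? does per the description above:
# "one or more commas", made optional as a whole.
sed_like = re.sub(r"[0-9]{2}(?:,+)?[0-9]{2}", "DDD", sample)

# Perl-style ,+? means: at least one comma, matched non-greedily.
perl_like = re.sub(r"[0-9]{2},+?[0-9]{2}", "DDD", sample)

print(sed_like)   # DDD DDD
print(perl_like)  # DDD 5678
```

The Perl-style form still requires at least one comma, so "5678" is left alone; the sed-style form accepts zero commas and replaces it too.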

Quote:

Originally Posted by new_2_unix (Post 3014695)

i'm doing this with a sed script, and the regex is not working.

Sometimes regex can be a pain, if you don't understand it enough. Until you get to know it better, here's one way without regex (at least not too much)

Code:

awk 'BEGIN { FS = "</tag1>" }   # split each line at every closing tag
{
  for (i = 1; i <= NF; i++) {
    if ($i ~ /<tag1>/) {
      # keep only the text before the opening tag
      split($i, a, /<tag1>/)
      print a[1]
    } else {
      print $i
    }
  }
}
' "file"
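The same split-instead-of-regex idea can be sketched in Python with plain string methods. This is a loose equivalent rather than a translation: the function name is hypothetical, and unlike the awk script, which prints each kept piece on its own line, it joins the pieces back into one string.

```python
def drop_tag1(line: str) -> str:
    kept = []
    # Every chunk except possibly the last one ended with </tag1> in the input.
    for chunk in line.split("</tag1>"):
        # Keep only the text before <tag1>; if the chunk has no <tag1>,
        # the whole chunk is kept.
        kept.append(chunk.split("<tag1>", 1)[0])
    return "".join(kept)

print(drop_tag1("a<tag1>x</tag1>b<tag1>y</tag1>c"))  # abc
```

Like the awk version, this assumes the tags are well formed and not nested.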