LinuxQuestions.org


vincix 03-02-2017 05:03 AM

regex basic laziness/greediness
 
Given the text: <html>Just some code</html>
Code:

<.*>
This match is obviously greedy. I understand that. It matches everything from the first < to the last >
Code:

<[^>]*>
This only matches up to the first closing angle bracket, namely "<html>" (and not "Just some code").
Yet I don't understand how it does this. What is happening exactly? Why is the match lazy and why doesn't it match the whole line up to the last ">"?
[^>]* means any number of any characters, except for the closing angle bracket ">". Given this premise, why shouldn't this also work as a greedy match, after all? I don't really see the difference.

pan64 03-02-2017 05:42 AM

If I understand it correctly:
it will look for < first, then for any number of characters that are not >, and finally for a >. So this regexp cannot go "further": it will stop at the first > after the first <.
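To see that behaviour concretely, here is a minimal sketch using Python's built-in re module (the choice of Python is an assumption; the thread does not name a tool). It runs both patterns from the question against the sample line.

```python
import re

text = "<html>Just some code</html>"

# Greedy dot: .* first runs to the end of the line, then backs up
# only as far as needed to find a closing '>'.
greedy = re.search(r"<.*>", text).group(0)

# Negated class: [^>]* can never consume a '>', so the first '>'
# after the opening '<' is the only place the match can end.
negated = re.search(r"<[^>]*>", text).group(0)

print(greedy)   # <html>Just some code</html>
print(negated)  # <html>

# findall lists every non-overlapping match of the second pattern.
all_tags = re.findall(r"<[^>]*>", text)
print(all_tags)  # ['<html>', '</html>']
```

The quantifier in the second pattern is still greedy; it simply runs out of permitted characters at the first >.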

vincix 03-02-2017 07:15 AM

Yeah, now that I've thought it over and over again, it does make sense, even though I was perfectly aware of what the second regex is doing. It's just a bit weird. So you're telling it to include any number of any characters, except the >, but when it eventually does encounter >, it stops. So there can't be more than one > in the match, whereas in the former example there could be any number of >.

grail 03-02-2017 07:54 AM

Just to clarify, it only matches a single '>' because that is the last item in the regex. What I mean is, if you had written:
Code:

<[^>]*
The above would result in there not being any '>' in the match at all.


Also, and this is probably just nitpicky, but the asterisk actually stands for zero or more of the preceding item (in our case, a character that is not '>').
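Both points can be checked quickly. The following is a small sketch, assuming Python's re module; the "<>" input and the + variant are illustrative additions that do not appear in the thread.

```python
import re

text = "<html>Just some code</html>"

# Without the trailing '>', the match ends just before the first '>'.
no_close = re.search(r"<[^>]*", text).group(0)
print(no_close)  # <html

# '*' allows zero repetitions, so an empty pair of brackets still matches.
empty = re.search(r"<[^>]*>", "<>").group(0)
print(empty)  # <>

# '+' requires at least one character, so "<>" no longer matches.
plus_result = re.search(r"<[^>]+>", "<>")
print(plus_result)  # None
```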

vincix 03-02-2017 07:56 AM

Yes, you're right. I had actually omitted > at the end by mistake and saw that it didn't match it. Good point. It's easier to understand.

pan64 03-02-2017 08:00 AM

I would suggest you try https://regex101.com/ to check your regexps; there is an explanation on the right side.

syg00 03-02-2017 06:16 PM

Character classes are (effectively) possessive here: each character is checked against the class in turn, and the match moves on at the first character that fails. The * is still greedy, but backtracking into it can never help, since none of the characters it consumed could match the '>' that follows. That saves compute cycles - my interpretation, I have no innate knowledge of the various regex engines.
It is often (very) worthwhile to use techniques that short-circuit greediness if you know in advance the likely format of the data.

sundialsvcs 03-02-2017 06:47 PM

This regex:
Code:

<[^>]*>
means: "'<', followed by zero or more occurrences of a character that is not '>', followed by '>'."

And that's it.

Although the regex is written to be "greedy" (which is the default), the nature of this particular regex dictates that it will stop at the first '>' character that it finds. Therefore, "greedy vs. non-greedy" is irrelevant in t-h-i-s case, strictly due to the nature of t-h-i-s regex.

- - -
The greedy regex ... "<.*>" would grab the leftmost "<" and the rightmost ">" since this would be the longest possible match.

The non-greedy regex ... "<.*?>" would grab the leftmost "<" and the next-thereafter-occurring ">" since this would be the shortest possible match.

Regular expressions are extremely subtle, and must be aggressively tested against actual data.
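As one way of testing against data, here is a hedged sketch in Python's re module (an assumed tool; the "<a><b>x" sample and the trailing x are made up for illustration). It compares the three forms on the original line, then shows a case where the non-greedy and negated-class forms disagree.

```python
import re

text = "<html>Just some code</html>"

patterns = {
    "greedy": r"<.*>",
    "lazy": r"<.*?>",
    "negated": r"<[^>]*>",
}

# First match of each pattern on the line from the question.
results = {name: re.search(p, text).group(0) for name, p in patterns.items()}
for name, matched in results.items():
    print(name, "->", matched)

# Once more pattern follows the '>', lazy and negated-class can differ:
# the lazy dot may still step over a '>' to make the rest succeed,
# while [^>] can never cross one.
sample = "<a><b>x"
lazy_x = re.search(r"<.*?>x", sample).group(0)
negated_x = re.search(r"<[^>]*>x", sample).group(0)
print(lazy_x)     # <a><b>x
print(negated_x)  # <b>x
```

So the non-greedy form and the negated class agree on simple input, but they are not interchangeable in general.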

vincix 03-03-2017 01:06 AM

Yes, I've already begun to understand their subtlety. I tended to underestimate their complexity when it came to these elementary situations, as I thought that I just needed to learn them fast in order to get to more complex stuff. But they're already rather complex as they are, and I guess that's what makes them so powerful.

chrism01 03-03-2017 04:31 AM

I'd just like to add that in fact there are many regex 'engines' and they do NOT all work the same.
I can do no better than highly recommend http://regex.info/book.html :)

As an example, some tools have their own regex engine built in, but also support a '-pcre' option that calls a more powerful (& different) engine, based on the Perl regex engine (PCRE = Perl Compatible Regular Expressions).
Usually not the full Perl regex, but most of the capability.


All times are GMT -5.