regex basic laziness/greediness
Given the text: <html>Just some code</html>
Code:
<.*>
Code:
<[^>]*>
Yet I don't understand how it does this. What is happening exactly? Why is the match lazy and why doesn't it match the whole line up to the last ">"? [^>]* means any number of any characters, except for the closing angle bracket ">". Given this premise, why shouldn't this also work as a greedy match, after all? I don't really see the difference. |
If I understand it correctly:
it will look for < first, then for any number of characters other than >, and finally for a >. So this regexp cannot go any "further"; it will stop at the first > after the first <. |
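To make that concrete, here is a small sketch using Python's built-in re module (Python is my choice for illustration; the thread does not say which tool is in use, and the variable names are made up):

```python
import re

text = "<html>Just some code</html>"

# Greedy dot: .* runs to the end of the line, then backs up to the last '>'.
greedy_dot = re.search(r"<.*>", text).group(0)

# Negated class: [^>]* cannot step over a '>', so the match ends at the first one.
negated_class = re.search(r"<[^>]*>", text).group(0)

print(greedy_dot)     # <html>Just some code</html>
print(negated_class)  # <html>
```

Both quantifiers are greedy; the only difference is which characters each one is allowed to consume.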
Yeah, now that I thought it over and over again it might make sense, even though I was perfectly aware of what the second regex is doing. It's just a bit weird. So you're telling it to include any number of any characters, except the >, but when it eventually does encounter >, it stops. So there can't be more than one > in the match, whereas in the former example there could be any number of >.
|
Just to clarify, it only matches a single '>' because that is the last item in the regex. What I mean is, if you had written:
Code:
<[^>]*
Also, and this is probably just nitpicky, but the asterisk actually stands for: zero or more of the preceding item (in our case, any character that is not '>'). |
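A quick way to see both points, sketched with Python's re module (an assumption on my part; the sample string "a <> b" and the variable names are invented for the example):

```python
import re

text = "<html>Just some code</html>"

# Without the closing '>' in the pattern, the match stops just before the first '>'.
no_closing = re.search(r"<[^>]*", text).group(0)

# '*' allows zero repetitions, so an empty pair of brackets still matches.
empty_tag = re.search(r"<[^>]*>", "a <> b").group(0)

print(no_closing)  # <html
print(empty_tag)   # <>
```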
Yes, you're right. I had actually omitted > at the end by mistake and saw that it didn't match it. Good point. It's easier to understand.
|
I would suggest you try https://regex101.com/ to check your regexps; there is an explanation on the right side.
|
Character classes are (effectively) possessive here: each character is checked against the class in turn, and since [^>] can never match '>', the run always ends at the first '>'. The quantifier is still greedy by default, but backtracking would only hand back characters that are not '>', so it could never produce a different match; that would be compute cycles spent for no gain. This is my interpretation; I have no inside knowledge of the various regex engines.
It is often (very) worthwhile to use techniques that short-circuit greediness if you know in advance the likely format of the data. |
This regex:
Code:
<[^>]*>
And that's it. Although the regex is written to be "greedy" (which is the default), the nature of this particular regex dictates that it will stop at the first '>' character that it finds. Therefore, "greedy vs. non-greedy" is irrelevant in t-h-i-s case, strictly due to the nature of t-h-i-s regex.
- - -
The greedy regex ... "<.*>" would grab the leftmost "<" and the rightmost ">" since this would be the longest possible match.
The non-greedy regex ... "<.*?>" would grab the leftmost "<" and the next-thereafter-occurring ">" since this would be the shortest possible match.
Regular expressions are extremely subtle, and must be aggressively tested against actual data. |
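In the spirit of testing against actual data, here is a rough check of those claims with Python's re module (my choice of engine for the example; other engines may behave differently, and the variable names are only illustrative):

```python
import re

text = "<html>Just some code</html>"

# With the dot, greedy and non-greedy give different results.
greedy = re.findall(r"<.*>", text)   # one match spanning the whole line
lazy = re.findall(r"<.*?>", text)    # each tag matched separately

# With the negated class, adding '?' changes nothing on this input.
class_greedy = re.findall(r"<[^>]*>", text)
class_lazy = re.findall(r"<[^>]*?>", text)

print(greedy)        # ['<html>Just some code</html>']
print(lazy)          # ['<html>', '</html>']
print(class_greedy)  # ['<html>', '</html>']
print(class_lazy)    # ['<html>', '</html>']
```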
Yes, I've already begun to understand their subtlety. I tended to underestimate their complexity when it came to these elementary situations, as I thought that I just needed to learn them fast in order to get to more complex stuff. But they're already rather complex as they are, and I guess that's what makes them so powerful.
|
I'd just like to add that in fact there are many regex 'engines' and they do NOT all work the same.
I can do no better than highly recommend http://regex.info/book.html :) As an example, some tools have their own regex engine built in, but also support a '-pcre' option that calls a more powerful (& different) engine, modelled on the Perl regex engine (PCRE = Perl Compatible Regular Expressions). Usually not the full Perl regex, but most of the capability. |