This match is obviously greedy. I understand that. It matches everything from the first < to the last >.
Code:
<[^>]*>
This only matches the text within the angle brackets, namely "html" (and not "Just some code").
Yet I don't understand how it does this. What is happening exactly? Why is the match lazy and why doesn't it match the whole line up to the last ">"?
[^>]* means any number of any characters, except for the closing angle bracket ">". Given this premise, why shouldn't this also work as a greedy match, after all? I don't really see the difference.
If I understand it correctly:
It will look for < first, then for any number of characters that are not >, and finally for a >. So this regexp cannot go any "further": it stops at the first > after the first <.
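To make those three steps concrete, here is a rough hand-written matcher in Python that does the same thing step by step. The function name match_tag and the sample line are my own assumptions for illustration; this is a sketch of the logic, not how any real regex engine is implemented.

```python
import re

def match_tag(text):
    """Hand-rolled stand-in for re.search(r"<[^>]*>", text); illustrative only."""
    start = text.find("<")              # step 1: look for '<'
    while start != -1:
        pos = start + 1
        # step 2: [^>]* consumes every character that is not '>'
        while pos < len(text) and text[pos] != ">":
            pos += 1
        # step 3: the very next character must be '>'
        if pos < len(text):
            return text[start:pos + 1]
        # ran off the end without finding '>': try the next '<'
        start = text.find("<", start + 1)
    return None

sample = "<html>Just some code</html>"  # assumed sample line
hand_rolled = match_tag(sample)
from_engine = re.search(r"<[^>]*>", sample).group(0)
print(hand_rolled, from_engine)
```

The inner loop can never step over a >, which is why the match has nowhere to go after the first closing bracket.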
Yeah, now that I thought it over and over again it might make sense, even though I was perfectly aware of what the second regex is doing. It's just a bit weird. So you're telling it to include any number of any characters, except the >, but when it eventually does encounter >, it stops. So there can't be more than one > in the match, whereas in the former example there could be any number of >.
Character classes are (effectively) possessive - each character is checked against the class in turn. If it were greedy, it would always have to backtrack to the (new) beginning of the string to start the compare. Too expensive in compute cycles for no gain - my interpretation, I have no innate knowledge of the various regex engines.
It is often (very) worthwhile to use techniques that short-circuit greediness if you know in advance the likely format of the data.
<[^>]*> means: "'<', followed by zero or more occurrences of a character that is not '>', followed by '>'."
And that's it.
Although the regex is written to be "greedy" (which is the default), the nature of this particular regex dictates that it will stop at the first '>' character that it finds. Therefore, "greedy vs. non-greedy" is irrelevant in t-h-i-s case, strictly due to the nature of t-h-i-s regex.
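A quick way to check that greediness is irrelevant for this regex is to run it with and without the lazy modifier and compare. This is a small sketch using Python's re module; the sample line is an assumption on my part.

```python
import re

sample = "<html>Just some code</html>"  # assumed sample line

# Same negated character class, once greedy and once lazy.
greedy_class = re.findall(r"<[^>]*>", sample)
lazy_class = re.findall(r"<[^>]*?>", sample)

print(greedy_class)
print(lazy_class)
print(greedy_class == lazy_class)
```

Both variants find the same tags, because the class itself refuses to consume a '>'.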
- - -
The greedy regex ... "<.*>" would grab the leftmost "<" and the rightmost ">" since this would be the longest possible match.
The non-greedy regex ... "<.*?>" would grab the leftmost "<" and the next-thereafter-occurring ">" since this would be the shortest possible match.
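For comparison, the two dot-based patterns can be tried side by side. This is a minimal sketch with Python's re module, again on an assumed sample line.

```python
import re

sample = "<html>Just some code</html>"  # assumed sample line

# Greedy: .* runs to the end of the line, then backs off to the last '>'.
greedy = re.search(r"<.*>", sample).group(0)

# Non-greedy: .*? grows one character at a time and stops at the first '>'.
lazy = re.search(r"<.*?>", sample).group(0)

print(greedy)
print(lazy)
```

On this line the greedy pattern returns the whole string and the non-greedy one returns only the opening tag.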
Regular expressions are extremely subtle, and must be aggressively tested against actual data.
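One example of that subtlety, as a sketch in Python: the lazy dot and the negated class look interchangeable, but they stop agreeing as soon as something has to follow the '>'. The test string and the trailing x are hypothetical, chosen only to expose the difference.

```python
import re

data = "<a><b>x"  # hypothetical test data

# Lazy dot: allowed to step over a '>' if that is the only way to reach 'x'.
lazy_dot = re.search(r"<.*?>x", data).group(0)

# Negated class: can never contain a '>', so the match must start at "<b>".
negated = re.search(r"<[^>]*>x", data).group(0)

print(lazy_dot)
print(negated)
```

Here the lazy version returns both tags plus the x, while the negated class returns only the last tag plus the x, which is the kind of thing that only shows up when you test against real data.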
Last edited by sundialsvcs; 03-02-2017 at 07:02 PM.
Yes, I've already begun to understand their subtlety. I tended to underestimate their complexity when it came to these elementary situations, as I thought that I just needed to learn them fast in order to get to more complex stuff. But they're already rather complex as they are, and I guess that's what makes them so powerful.
I'd just like to add that in fact there are many regex 'engines' and they do NOT all work the same.
I can do no better than highly recommend http://regex.info/book.html
As an example, some tools have their own regex engine built in, but also support a '-pcre' option that calls a more powerful (& different) engine, based on the Perl regex engine (PCRE = Perl Compatible Regular Expressions).
Usually not the full Perl regex, but most of the capability.
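One small illustration of engines disagreeing, as a sketch: Python's re module is a Perl-style backtracking engine, so with alternation it takes the first alternative that works at the leftmost position. An engine that follows the POSIX leftmost-longest rule would report "ab" for the first pattern below instead. Only the Python side is actually run here; the POSIX behaviour is stated, not demonstrated.

```python
import re

text = "ab"

# Perl-style engines try alternatives left to right and keep the first that matches.
first_alt = re.search(r"a|ab", text).group(0)

# Swapping the order of the alternatives changes the result in such engines.
swapped = re.search(r"ab|a", text).group(0)

print(first_alt)
print(swapped)
```

So the same pattern can give different matches depending on which tool runs it, which is another reason to test with the engine you will actually use.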