Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place! |
Notices |
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Are you new to LinuxQuestions.org? Visit the following links:
Site Howto |
Site FAQ |
Sitemap |
Register Now
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
|
|
03-02-2017, 05:03 AM
|
#1
|
Senior Member
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240
Rep:
|
regex basic laziness/greediness
Given the text: <html>Just some code</html>
This match is obviously greedy. I understand that. It matches everything from the first < to the last >
This only matches the text within the angle brackets, namely "html" (and not "Just some code").
Yet I don't understand how it does this. What is happening exactly? Why is the match lazy and why doesn't it match the whole line up to the last ">"?
[^>]* means any number of any characters, except for the closing angle bracket ">". Given this premise, why shouldn't this also work as a greedy match, after all? I don't really see the difference.
|
|
|
03-02-2017, 05:42 AM
|
#2
|
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,716
|
If I understand it correctly:
it will look for < first, next look for anything but > (any number of anything) and finally a >. So this regexp cannot go "further", will stop at the first > (after the first <)
|
|
1 members found this post helpful.
|
03-02-2017, 07:15 AM
|
#3
|
Senior Member
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240
Original Poster
Rep:
|
Yeah, now that I thought it over and over again it might make sense, even though I was perfectly aware of what the second regex is doing. It's just a bit weird. So you're telling it to include any number of any characters, except the >, but when it eventually does encounter >, it stops. So there can't be more than one > in the match, whereas in the former example there could be any number of >.
|
|
|
03-02-2017, 07:54 AM
|
#4
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,018
|
Just to clarify, it only matches a single '>' because it is the last item in the regex, what I mean is, if you had of written:
The above would result in there not being any '>' at all
Also, and this is probably just nit picky, but the asterisk actually stands for :- zero or more of the preceding character (in our case, not '>')
|
|
1 members found this post helpful.
|
03-02-2017, 07:56 AM
|
#5
|
Senior Member
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240
Original Poster
Rep:
|
Yes, you're right. I had actually omitted > at the end by mistake and saw that it didn't match it. Good point. It's easier to understand.
|
|
|
03-02-2017, 08:00 AM
|
#6
|
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,716
|
I would suggest you to try https://regex101.com/ to check your regexps, there is an explanation on the right side.
|
|
1 members found this post helpful.
|
03-02-2017, 06:16 PM
|
#7
|
LQ Veteran
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,248
|
Character groups are (effectively) possessive - each character is checked against the group in turn. If it was greedy, it would always have to backtrack to the (new) beginning of the string to start the compare. Too expensive in compute cycles for no gain - my interpretation, I have no innate knowledge of the various regex engines.
It is often (very) worthwhile to use techniques that short-circuit greediness if you know in advance the likely format of the data.
|
|
|
03-02-2017, 06:47 PM
|
#8
|
LQ Guru
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,861
|
This regex: means: "'<', followed by zero or more occurrences of a character that is not '>', followed by '>'."
And that's it.
Although the regex is written to be "greedy" (which is the default), the nature of this particular regex dictates that it will stop at the first '>' character that it finds. Therefore, "greedy vs. non-greedy" is irrelevant in t-h-i-s case, strictly due to the nature of t-h-i-s regex.
- - -
The greedy regex ... " <.*>" would grab the leftmost " <" and the rightmost " >" since this would be the longest possible match.
The non-greedy regex ... " <.*?>" would grab the leftmost " <" and the next-thereafter-occurring " >" since this would be the shortest possible match.
Regular expressions are extremely subtle, and must be aggressively tested against actual data.
Last edited by sundialsvcs; 03-02-2017 at 07:02 PM.
|
|
|
03-03-2017, 01:06 AM
|
#9
|
Senior Member
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240
Original Poster
Rep:
|
Yes, I've already begun to understand their subtlety. I tended to underestimate their complexity when it came to these elementary situations, as I thought that I just needed to learn them fast in order to get to more complex stuff. But they're already rather complex as they are, and I guess that's what makes them so powerful.
|
|
|
03-03-2017, 04:31 AM
|
#10
|
LQ Guru
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,397
|
I'd just like to add that in fact there are many regex 'engines' and they do NOT all work the same.
I can do no better than highly recommend http://regex.info/book.html
As an example, some tools have their own regex engine built in, but also support a '-pcre' option that calls a more powerful (& different) engine, based on the Perl regex engine (pcre = perl-compatible regex engine).
Usually not the full Perl regex, but most of the capability.
|
|
|
All times are GMT -5. The time now is 10:07 AM.
|
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.
|
Latest Threads
LQ News
|
|