LinuxQuestions.org
Share your knowledge at the LQ Wiki.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 03-02-2017, 05:03 AM   #1
vincix
Senior Member
 
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240

Rep: Reputation: 103Reputation: 103
regex basic laziness/greediness


Given the text: <html>Just some code</html>
Code:
<.*>
This match is obviously greedy. I understand that. It matches everything from the first < to the last >
Code:
<[^>]*>
This only matches the text within the angle brackets, namely "html" (and not "Just some code").
Yet I don't understand how it does this. What is happening exactly? Why is the match lazy and why doesn't it match the whole line up to the last ">"?
[^>]* means any number of any characters, except for the closing angle bracket ">". Given this premise, why shouldn't this also work as a greedy match, after all? I don't really see the difference.
 
Old 03-02-2017, 05:42 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,716

Rep: Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553
If I understand it correctly:
it will look for < first, next look for anything but > (any number of anything) and finally a >. So this regexp cannot go "further", will stop at the first > (after the first <)
 
1 members found this post helpful.
Old 03-02-2017, 07:15 AM   #3
vincix
Senior Member
 
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240

Original Poster
Rep: Reputation: 103Reputation: 103
Yeah, now that I thought it over and over again it might make sense, even though I was perfectly aware of what the second regex is doing. It's just a bit weird. So you're telling it to include any number of any characters, except the >, but when it eventually does encounter >, it stops. So there can't be more than one > in the match, whereas in the former example there could be any number of >.
 
Old 03-02-2017, 07:54 AM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,018

Rep: Reputation: 3199Reputation: 3199Reputation: 3199Reputation: 3199Reputation: 3199Reputation: 3199Reputation: 3199Reputation: 3199Reputation: 3199Reputation: 3199Reputation: 3199
Just to clarify, it only matches a single '>' because it is the last item in the regex, what I mean is, if you had of written:
Code:
<[^>]*
The above would result in there not being any '>' at all


Also, and this is probably just nit picky, but the asterisk actually stands for :- zero or more of the preceding character (in our case, not '>')
 
1 members found this post helpful.
Old 03-02-2017, 07:56 AM   #5
vincix
Senior Member
 
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240

Original Poster
Rep: Reputation: 103Reputation: 103
Yes, you're right. I had actually omitted > at the end by mistake and saw that it didn't match it. Good point. It's easier to understand.
 
Old 03-02-2017, 08:00 AM   #6
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,716

Rep: Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553Reputation: 7553
I would suggest you to try https://regex101.com/ to check your regexps, there is an explanation on the right side.
 
1 members found this post helpful.
Old 03-02-2017, 06:16 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,248

Rep: Reputation: 4156Reputation: 4156Reputation: 4156Reputation: 4156Reputation: 4156Reputation: 4156Reputation: 4156Reputation: 4156Reputation: 4156Reputation: 4156Reputation: 4156
Character groups are (effectively) possessive - each character is checked against the group in turn. If it was greedy, it would always have to backtrack to the (new) beginning of the string to start the compare. Too expensive in compute cycles for no gain - my interpretation, I have no innate knowledge of the various regex engines.
It is often (very) worthwhile to use techniques that short-circuit greediness if you know in advance the likely format of the data.
 
Old 03-02-2017, 06:47 PM   #8
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,861
Blog Entries: 4

Rep: Reputation: 3995Reputation: 3995Reputation: 3995Reputation: 3995Reputation: 3995Reputation: 3995Reputation: 3995Reputation: 3995Reputation: 3995Reputation: 3995Reputation: 3995
This regex:
Code:
<[^>]*>
means: "'<', followed by zero or more occurrences of a character that is not '>', followed by '>'."

And that's it.

Although the regex is written to be "greedy" (which is the default), the nature of this particular regex dictates that it will stop at the first '>' character that it finds. Therefore, "greedy vs. non-greedy" is irrelevant in t-h-i-s case, strictly due to the nature of t-h-i-s regex.

- - -
The greedy regex ... "<.*>" would grab the leftmost "<" and the rightmost ">" since this would be the longest possible match.

The non-greedy regex ... "<.*?>" would grab the leftmost "<" and the next-thereafter-occurring ">" since this would be the shortest possible match.

Regular expressions are extremely subtle, and must be aggressively tested against actual data.

Last edited by sundialsvcs; 03-02-2017 at 07:02 PM.
 
Old 03-03-2017, 01:06 AM   #9
vincix
Senior Member
 
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240

Original Poster
Rep: Reputation: 103Reputation: 103
Yes, I've already begun to understand their subtlety. I tended to underestimate their complexity when it came to these elementary situations, as I thought that I just needed to learn them fast in order to get to more complex stuff. But they're already rather complex as they are, and I guess that's what makes them so powerful.
 
Old 03-03-2017, 04:31 AM   #10
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,397

Rep: Reputation: 2777Reputation: 2777Reputation: 2777Reputation: 2777Reputation: 2777Reputation: 2777Reputation: 2777Reputation: 2777Reputation: 2777Reputation: 2777Reputation: 2777
I'd just like to add that in fact there are many regex 'engines' and they do NOT all work the same.
I can do no better than highly recommend http://regex.info/book.html

As an example, some tools have their own regex engine built in, but also support a '-pcre' option that calls a more powerful (& different) engine, based on the Perl regex engine (pcre = perl-compatible regex engine).
Usually not the full Perl regex, but most of the capability.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Confusing issue with Perl regEx - Regex check seems to require variable being set EnderX Programming 1 09-07-2013 04:36 AM
[SOLVED] Another regex basic Q Linux_Kidd Programming 7 11-17-2011 09:05 AM
[SOLVED] differences between shell regex and php regex and perl regex and javascript and mysql golden_boy615 Linux - General 2 04-19-2011 01:10 AM
LXer: Linus Torvalds on regression, laziness and having his code rejected LXer Syndicated Linux News 0 01-21-2009 09:11 PM
When did the world balance rub? Was it the greediness that caused it? snama General 55 07-03-2007 05:31 AM

LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie

All times are GMT -5. The time now is 10:07 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration