Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game. |
01-01-2024, 02:44 PM
|
#1
|
Senior Member
Registered: Dec 2015
Posts: 1,639
Rep:
|
[regexp] Meaning of '?' after '*'
Code:
/\<!--(.|\s)*?-->/m
Matches HTML-comments. The pattern has given me a hard time and the final version is partly copied from a code-example on the Web. I have added the multi-line option.
I see that it works with the HTML-files at my disposal, but I do not understand the question mark at this position after '*'. During my experiments, my idea had been to match the closing '-->' greedily, as in the following pattern:
Code:
/\<!--(.|\s)*(-->)?/m
At least I could claim to understand the ? here (though I probably don't either).
When I omit the question mark, the script runs seemingly forever; I have never been patient enough to wait for a result. The first pattern above makes the routine return almost immediately and has always been successful.
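To see what the optional group in the second pattern does, here is a minimal Ruby sketch. It is a simplification, not the original script: it uses a plain `.` with the /m option instead of `(.|\s)`, and the sample string is made up.

```ruby
# Simplified variant of the second pattern (assumption: `.` with /m
# stands in for `(.|\s)`; the input string is invented for illustration).
text = "a<!--b-->c"

m = text.match(/<!--.*(-->)?/m)

# The greedy `.*` runs to the end of the input, and the optional group
# is then satisfied by matching nothing at all.
p m[0]   # => "<!--b-->c"
p m[1]   # => nil
```

The closing `-->` is swallowed by the greedy part, so making it optional does not stop the match at the end of the comment.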
Thank you for any clarification.
Last edited by Michael Uplawski; 01-01-2024 at 02:45 PM.
|
|
|
01-01-2024, 02:50 PM
|
#2
|
LQ Guru
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS, Manjaro
Posts: 5,911
|
You do not say which regexp engine is involved, but I presume we are talking about JavaScript.
The documentation suggests that the ? is in the expression to make the preceding * quantifier non-greedy instead of the default greedy behavior.
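A small Ruby sketch of the difference between the greedy and the non-greedy quantifier (the sample string is an assumption; Ruby is used here because it turns out later in the thread to be the language in question):

```ruby
# Invented sample input with two comments.
html = "x<!-- one -->y<!-- two -->z"

greedy = html[/<!--.*-->/m]    # `*` takes as much as possible
lazy   = html[/<!--.*?-->/m]   # `*?` takes as little as possible

p greedy   # => "<!-- one -->y<!-- two -->"
p lazy     # => "<!-- one -->"
```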
|
|
2 members found this post helpful.
|
01-01-2024, 10:34 PM
|
#3
|
Senior Member
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,924
|
Also how is `(.|\s)` different to `.` ?
|
|
|
01-02-2024, 01:06 AM
|
#4
|
Senior Member
Registered: Dec 2015
Posts: 1,639
Original Poster
Rep:
|
Quote:
Originally Posted by NevemTeve
Also how is `(.|\s)` different to `.` ?
|
It is not.
I am reminded by this code example (copied from the Web) that both are matched. Not that I had not used .* instead of (.|\s)*, but as most of my trials were s... sub-optimal, all kinds of doubt and Jeffrey Friedl crossed my mind. The fact that the overly explicit notation finally *works* as I wanted it to was enough to keep me from touching the keyboard for a while.
But what does "one, more or none of the previous" gain by adding '?'? That the rule quoted here should be optional does not make sense to me.
Last edited by Michael Uplawski; 01-02-2024 at 01:09 AM.
Reason: Kraut2English
|
|
|
01-02-2024, 01:08 AM
|
#5
|
Senior Member
Registered: Dec 2015
Posts: 1,639
Original Poster
Rep:
|
Quote:
Originally Posted by wpeckham
You do not say which regexp engine is involved, but I presume we are talking about JavaScript.
The documentation suggests that the ? is in the expression to make the preceding * quantifier non-greedy instead of the default greedy behavior.
|
I understand, but do not comprehend why this should be necessary...
My script is (of course, and for the rest of my life) in Ruby.
|
|
|
01-02-2024, 01:55 AM
|
#6
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,017
|
|
|
2 members found this post helpful.
|
01-02-2024, 02:39 AM
|
#7
|
Senior Member
Registered: Dec 2015
Posts: 1,639
Original Poster
Rep:
|
LAZINESS
Quote:
Originally Posted by grail
|
“ as few as possible”
Several ways to explain my difficulties:
- Greediness occupied my mind, while it should have been “Laziness”
- I feared to miss the first closing “-->” in the comment and concentrated on this detail
- Having read the greediness/laziness chapters in Jeffrey Friedl's book much more often than I ever had need for them, my brain got muddy
Retranslated from English to Regexp to German to English:
.*? will “collect” as few matches as possible (from .*), just enough to comply with the entire rule. So first, after having matched nothing (. ~= nothing), the engine checks whether a closing --> follows; if it does, all is well. As this is not the case (there is not “nothing” between <!-- and -->), only then is another match tried with “something” (. ~= anything). This works just as well, and immediately --> is supposed to follow. It does not. And so on.
My own initial idea was to find '-->' as quickly as possible. Lookahead may be a way to achieve this, but I do not care to try it. The book is back on its shelf.
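The step-by-step description can be mimicked by hand. The sketch below is only an illustration of the idea, not how the regex engine is implemented; `lazy_comment_end` is a hypothetical helper and the input string is invented.

```ruby
# Hypothetical helper: gives `.*?` zero characters first, then one more
# on each round, and stops as soon as "-->" follows.
def lazy_comment_end(text, start)
  body = start + 4                       # first index after "<!--"
  (0..text.length - body).each do |n|
    # n characters have been handed to `.*?`; does "-->" follow now?
    return body + n + 3 if text[body + n, 3] == "-->"
  end
  nil                                    # no closing "-->" found
end

text = "<!--ab-->rest-->"
stop = lazy_comment_end(text, 0)
simulated = text[0...stop]

p simulated                                  # => "<!--ab-->"
p simulated == text[/<!--.*?-->/m]           # => true
```

The second `-->` is never looked at, because the search ends at the first position where the rest of the pattern fits.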
Sorry folks.
[Solved]
And thank you for helping out.
Last edited by Michael Uplawski; 01-02-2024 at 02:41 AM.
|
|
|
01-03-2024, 10:46 AM
|
#8
|
Senior Member
Registered: Dec 2011
Location: Simplicity
Posts: 2,918
|
The *? is a minimum match; the match will span to the first -->
The * is a greedy match; the match will span to the last -->
The --> is optional; the match will span until the very end. The additional wildcard expression might cost extra time. Effectively it is
You can say the ? is a modifier of the preceding quantifier; it modifies greediness to minimum.
This minimum match is from Perl/PCRE; it is not defined in ERE or BRE.
grep -P understands it; grep -o prints just the match:
Code:
echo 'bla1<!--bla2-->bla3<!--bla4-->bla5' | grep -Po '<!--.*?-->'
prints the two minimum matches, while
Code:
echo 'bla1<!--bla2-->bla3<!--bla4-->bla5' | grep -Po '<!--.*-->'
prints the one greedy match.
(With color support you can see it without the -o option. But sometimes the color support seems buggy...)
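For comparison, roughly the same experiment in Ruby, a sketch that assumes `String#scan` as the counterpart of `grep -o`:

```ruby
line = "bla1<!--bla2-->bla3<!--bla4-->bla5"

lazy   = line.scan(/<!--.*?-->/)   # every minimum match
greedy = line.scan(/<!--.*-->/)    # the single greedy match

p lazy     # => ["<!--bla2-->", "<!--bla4-->"]
p greedy   # => ["<!--bla2-->bla3<!--bla4-->"]
```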
Last edited by MadeInGermany; 01-03-2024 at 11:08 AM.
|
|
|
01-04-2024, 01:05 AM
|
#9
|
Senior Member
Registered: Dec 2015
Posts: 1,639
Original Poster
Rep:
|
Quote:
Originally Posted by MadeInGermany
The * is a greedy match; the match will span to the last -->
|
That is why I wanted to avoid it by “insisting on the very first -->” (wrong) instead of “insisting on the last anything before -->” (right).
Ruby's engine is Onigmo, which is Oniguruma with a little Perl. Put another way, PHP with Perl. Quite PCRE, a lot Perl-like. As far as I could identify differences, they concern patterns that I do not use. Talking about them would render Ruby way more incompatible with PCRE than it ever will be for anybody, in reality.
Last edited by Michael Uplawski; 01-04-2024 at 01:06 AM.
Reason: kraut2English
|
|
|
01-04-2024, 08:47 AM
|
#10
|
Senior Member
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,708
|
Quote:
Originally Posted by NevemTeve
Also how is `(.|\s)` different to `.` ?
|
"\s" means "white space", and is generally equivalent to "[\n\t ]" (can include other whitespace characters).
On the other hand "." means either "all characters" or "all except newline" (depending on regex engine and mode); in the latter case . is equivalent to "[^\n]".
So the expression is similar to "([^\n]|[\n\t ])", and will result in matching all characters, but it's simpler to enable the "dot all" flag (usually "s") and just use ".*?".
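In Ruby the "dot all" behaviour is switched on with the /m option rather than "s". A short sketch with an invented two-line comment:

```ruby
text = "<!-- line one\nline two -->"

no_flag  = text[/<!--.*?-->/]           # `.` stops at the newline
with_m   = text[/<!--.*?-->/m]          # /m lets `.` match "\n" too
explicit = text[/<!--(?:.|\s)*?-->/]    # the alternation covers "\n" via \s

p no_flag            # => nil
p with_m == text     # => true
p explicit == text   # => true
```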
In this instance, an even more efficient way to do that would be a greedy match of "[^-]" combined with a negative lookahead for the terminating pattern, e.g: "([^-]+|-(?!->))*"
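A sketch of that pattern in Ruby, wrapped in the comment delimiters. Assumptions: the group is made non-capturing so that `scan` returns whole matches, the sample text is invented, and no timing comparison has been made here.

```ruby
# Greedy run of non-dashes, or a single dash that does not start "-->".
unrolled_re = /<!--(?:[^-]+|-(?!->))*-->/
lazy_re     = /<!--.*?-->/m

text = "a<!-- one - two -- three\nfour -->b<!-- five -->c"

unrolled = text.scan(unrolled_re)
lazy     = text.scan(lazy_re)

p unrolled.length    # => 2
p unrolled == lazy   # => true
```

No "dot all" flag is needed here, because the negated class `[^-]` matches newlines as well.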
(And of course one should be wary of parsing HTML with regex, and generally prefer to use an existing, well-tested HTML parser instead.)
|
|
|
01-04-2024, 09:45 AM
|
#11
|
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,702
|
Just use www.regex101.com; the pattern will be nicely explained (and you can also check how it works).
https://regex101.com/r/uRhHob/1
Last edited by pan64; 01-04-2024 at 01:44 PM.
|
|
1 members found this post helpful.
|
01-04-2024, 04:21 PM
|
#12
|
Senior Member
Registered: Dec 2015
Posts: 1,639
Original Poster
Rep:
|
Quote:
Originally Posted by boughtonp
(And of course one should be wary of parsing HTML with regex, and generally prefer to use an existing, well-tested HTML parser instead.)
|
I am using an XML parser, but the comments get in the way before I handle individual tags. In the program in question, I have to eliminate successive rows full of tab characters ('\t') and a lot of empty lines. I chose to do this, and also to eliminate HTML comments, before the actual code parsing takes place.
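Such a clean-up pass might look like the following sketch. It is not the actual program: `preclean` is a hypothetical name, the input is invented, and the order of the three steps is an assumption.

```ruby
# Hypothetical pre-processing step run before the parser sees the input.
def preclean(html)
  html
    .gsub(/<!--.*?-->/m, "")   # drop comments (lazy match, dot matches "\n")
    .gsub(/^[ \t]+$/, "")      # empty out lines holding only tabs or spaces
    .gsub(/\n{2,}/, "\n")      # collapse runs of empty lines
end

raw   = "<p>a</p>\n\t\t\t\n\t\t\n<!-- note\nmore -->\n\n\n<p>b</p>\n"
clean = preclean(raw)

p clean   # => "<p>a</p>\n<p>b</p>\n"
```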
|
|
|
01-04-2024, 04:36 PM
|
#13
|
LQ Guru
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS, Manjaro
Posts: 5,911
|
XML is not HTML, or even close. A tool for one may not act in a useful way when applied to the other.
|
|
|
01-04-2024, 11:33 PM
|
#14
|
Senior Member
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,924
|
Off: XML is the poor man's SGML. Neither of those is meant to be parsed with regular expressions, e.g. what seems to be a 'comment' might actually be inside an attribute or a CDATA[
Code:
<input type="text" value="<!-- not comment -->">
<![CDATA[ <!-- not comment --> ]]>
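A short Ruby sketch of what happens when the lazy comment pattern from this thread is applied to these two lines (the pattern choice is an assumption; any purely textual comment pattern behaves the same way):

```ruby
comment_re = /<!--.*?-->/m

attr  = '<input type="text" value="<!-- not comment -->">'
cdata = '<![CDATA[ <!-- not comment --> ]]>'

stripped_attr  = attr.gsub(comment_re, "")
stripped_cdata = cdata.gsub(comment_re, "")

puts stripped_attr    # prints: <input type="text" value="">
puts stripped_cdata   # prints: <![CDATA[  ]]>
```

In both cases the regex removes text that a parser would not treat as a comment.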
|
|
1 members found this post helpful.
|
01-05-2024, 02:46 AM
|
#15
|
Senior Member
Registered: Dec 2015
Posts: 1,639
Original Poster
Rep:
|
Quote:
Originally Posted by NevemTeve
Off: XML is the poor man's SGML. Neither of those is meant to be parsed with regular expressions, e.g. what seems to be a 'comment' might actually be inside an attribute or a CDATA[
Code:
<input type="text" value="<!-- not comment -->">
<![CDATA[ <!-- not comment --> ]]>
|
You may doubt and it is a good thing to doubt. We are, though, not doing rocket science. You would know, if I did (everybody would).
I am using an xml-parser (... actually, only a few would) which qualifies as an HTML-parser as well, and I will not explain why this is so natural a thing that you need not worry anyway. Skip this part. My program was working and *I* only had problems following its actions in a log file that is created automatically. It had been *my idea* to clear things up before the parser comes into play, and *I* state afterwards that this was not so bad an idea (outside of rocket science, that is).
The essence of this thread is that there are concepts which need to be *actively* kept apart from each other, because their *uses* seem so similar that it is too late when you stumble over only one of them, seemingly apt to help you. Maybe add that examples are not superfluous when you try to understand lookahead, laziness and greediness.
No need for XML. My fault to have mentioned it.
Last edited by Michael Uplawski; 01-05-2024 at 02:50 AM.
Reason: exitus
|
|
|
All times are GMT -5. The time now is 01:57 PM.
|