LinuxQuestions.org
Old 01-01-2024, 02:44 PM   #1
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,639
Blog Entries: 40

Rep: Reputation: Disabled
[regexp] Meaning of '?' after '*'


Code:
/\<!--(.|\s)*?-->/m
Matches HTML-comments. The pattern has given me a hard time and the final version is partly copied from a code-example on the Web. I have added the multi-line option.

I see that it works with the HTML-files at my disposal, but do not understand the question-mark at this position after '*'. During my experiments, it had been my idea to match the closing '-->' greedily like in
Code:
/\<!--(.|\s)*(-->)?/m
At least there I could claim to understand the ? (though I probably do not, either).

When I omit the question mark, the script spends an eternity trying to match something, and I have never been patient enough to wait for a result. The first pattern above makes the routine return almost immediately and has always been successful.

Thank you for any clarification.

Last edited by Michael Uplawski; 01-01-2024 at 02:45 PM.
 
Old 01-01-2024, 02:50 PM   #2
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS, Manjaro
Posts: 5,922

Rep: Reputation: 2816
You do not say which regexp engine is involved, but I presume we are talking about javascript.

The documentation suggests that the ? is in the expression to make the preceding * quantifier non-greedy instead of the default greedy behavior.
 
2 members found this post helpful.
Old 01-01-2024, 10:34 PM   #3
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,928
Blog Entries: 1

Rep: Reputation: 1891
Also how is `(.|\s)` different to `.` ?
 
Old 01-02-2024, 01:06 AM   #4
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,639

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
Quote:
Originally Posted by NevemTeve View Post
Also how is `(.|\s)` different to `.` ?
It is not.
This code example (copied from the Web) reminds me that both are matched. Not that I had not used .* instead of (.|\s)*, but as most of my trials were s... sub-optimal, all kinds of doubt and Jeffrey Friedl crossed my mind. The fact that the overly explicit notation finally *works* as I wanted was enough to keep me from touching the keyboard for a while.

But what does "none, one or more of the previous" gain from an added '?'? That the rule quoted here should be optional does not make sense to me.

Last edited by Michael Uplawski; 01-02-2024 at 01:09 AM. Reason: Kraut2English
 
Old 01-02-2024, 01:08 AM   #5
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,639

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
Quote:
Originally Posted by wpeckham View Post
You do not say which regexp engine is involved, but I presume we are talking about javascript.

The documentation suggests that the ? is in the expression to make the preceding * quantifier non-greedy instead of the default greedy behavior.
I understand, but do not comprehend why this would be necessary...
My script is (of course, and for the rest of my life) in Ruby.
 
Old 01-02-2024, 01:55 AM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,018

Rep: Reputation: 3199
Try this

https://stackoverflow.com/questions/...-a-phrase-left
 
2 members found this post helpful.
Old 01-02-2024, 02:39 AM   #7
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,639

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
LAZINESS

Quote:
Originally Posted by grail View Post
as few as possible

Several ways to explain my difficulties:
  • Greediness occupied my mind, while it should have been “Laziness”
  • I was afraid of missing the first closing “-->” in the comment and concentrated on this detail
  • Having read the greediness/laziness chapters in Jeffrey Friedl's book much more often than I ever had need for them, my brain got muddy

Retranslated from English to Regexp to German to English:
.*? will “collect” as few characters as possible (from .*), just enough to satisfy the entire rule. Thus, if a closing --> appears after nothing has been matched (. ~= nothing), all is well. As this is not the case (there is not “nothing” between <!-- and -->), only then is another match tried with “something” (. ~= anything). This works just as well, and immediately --> is supposed to follow. It does not. And so on.
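
The difference described here can be seen in a minimal Ruby sketch. The sample string and the variable names are invented for illustration; only the two patterns come from this thread.

```ruby
# Two comments on one line (made-up sample input).
html = "a<!-- one -->b<!-- two -->c"

# Lazy: .*? takes as few characters as possible, so each match
# stops at the first "-->" that follows its "<!--".
lazy = html.scan(/<!--.*?-->/m)

# Greedy: .* takes as many characters as possible, so a single
# match runs from the first "<!--" to the last "-->".
greedy = html.scan(/<!--.*-->/m)

p lazy
p greedy
```

With the lazy quantifier, scan returns one entry per comment; with the greedy one, it returns a single entry that also swallows the text between the comments.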

My own initial idea was to find '-->' as quickly as possible. Lookahead may be a way to achieve this, but I do not care to try it. The book is back on its shelf.

Sorry folks.
[Solved]
And thank you for helping out.

Last edited by Michael Uplawski; 01-02-2024 at 02:41 AM.
 
Old 01-03-2024, 10:46 AM   #8
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Distribution: Mint/MATE
Posts: 2,927

Rep: Reputation: 1246
Code:
/<!--.*?-->/m
The *? is a minimum match; the match will span to the first -->
Code:
/<!--.*-->/m
The * is a greedy match; the match will span to the last -->
Code:
/<!--.*(-->)?/m
The --> is optional; the match will span until the very end. The additional wildcard expression might cost extra time. Effectively it is
Code:
/<!--.*/m
You can say the ? is a modifier of the preceding quantifier; it changes greediness to a minimum match.
This minimum match comes from perl/PCRE; it is not defined in ERE or BRE.
grep -P understands it; grep -o prints just the match:
Code:
echo 'bla1<!--bla2-->bla3<!--bla4-->bla5' | grep -Po '<!--.*?-->'
prints the two minimum matches, while
Code:
echo 'bla1<!--bla2-->bla3<!--bla4-->bla5' | grep -Po '<!--.*-->'
prints the one greedy match.
(With color support you can see it without the -o option. But sometimes the color support seems buggy...)
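
The claim that the optional (-->)? variant is effectively the same as a bare .* can be checked in Ruby as well. This is only a sketch that reuses the sample string from the grep commands above; the variable names are made up.

```ruby
html = "bla1<!--bla2-->bla3<!--bla4-->bla5"

# The greedy .* runs to the end of the string; the optional group
# then matches nothing, so it adds no constraint at all.
with_optional    = html[/<!--.*(-->)?/m]
without_optional = html[/<!--.*/m]

# The capture group never takes part in the match, so it is nil.
captured = html.match(/<!--.*(-->)?/m)[1]

p with_optional
p without_optional
p captured
```

Both patterns return everything from the first <!-- to the end of the input, which is why the optional group does not help to find the closing -->.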

Last edited by MadeInGermany; 01-03-2024 at 11:08 AM.
 
Old 01-04-2024, 01:05 AM   #9
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,639

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
Quote:
Originally Posted by MadeInGermany View Post
Code:
/<!--.*-->/m
The * is a greedy match; the match will span to the last -->
That is why I wanted to avoid it by “insisting on the very first -->” (wrong) instead of “insisting on the last anything before -->” (right).

Ruby's engine is Onigmo, which is Oniguruma with a little Perl. Put another way, PHP with Perl. Quite PCRE, a lot Perl-like. As far as I could identify differences, they concern patterns that I do not use. Talking about them would render Ruby way more incompatible with PCRE than it ever will be for anybody, in reality.

Last edited by Michael Uplawski; 01-04-2024 at 01:06 AM. Reason: kraut2English
 
Old 01-04-2024, 08:47 AM   #10
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,723

Rep: Reputation: 2631
Quote:
Originally Posted by NevemTeve View Post
Also how is `(.|\s)` different to `.` ?
"\s" means "whitespace", and is generally equivalent to "[\n\t ]" (can include other whitespace characters).

On the other hand "." means either "all characters" or "all except newline" (depending on regex engine and mode); in the latter case . is equivalent to "[^\n]"

So the expression is similar to "([^\n]|[\n\t ])", and will result in matching all characters, but it's simpler to enable the "dot all" flag (usually "s") and just use ".*?".
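
For Ruby in particular, note that the "dot all" behaviour is switched on with the m flag rather than s. A small sketch with an invented two-line comment shows the three cases discussed here:

```ruby
# A comment that spans two lines (made-up sample input).
html = "<!-- line one\nline two -->"

# Without /m, "." does not match the newline, so the plain dot fails.
plain_dot   = html.match?(/<!--.*?-->/)

# The (.|\s) alternation picks up the newline through \s.
alternation = html.match?(/<!--(.|\s)*?-->/)

# With /m, Ruby's "." matches the newline too, so .*? is enough.
dotall      = html.match?(/<!--.*?-->/m)

p plain_dot
p alternation
p dotall
```

So in Ruby, once /m is set, the (.|\s) alternation adds nothing over a plain dot.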

In this instance, an even more efficient way to do that would be a greedy match of "[^-]" combined with a negative lookahead for the terminating pattern, e.g: "([^-]+|-(?!->))*"
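
As a rough Ruby sketch of that lookahead variant: the sample string is invented, and a non-capturing group (?:...) is used here instead of the plain parentheses so that scan returns whole matches. No timing was measured, so the efficiency claim is not verified by this example; it only shows that both patterns find the same comments.

```ruby
# Sample input with hyphens inside the first comment (made up).
html = "a<!-- x - y -- z -->b<!-- two -->c"

# Consume runs of non-hyphens, or a single hyphen that does not
# start the terminating "-->"; then require the terminator itself.
lookahead = /<!--(?:[^-]+|-(?!->))*-->/

# The lazy pattern discussed earlier in the thread.
lazy = /<!--.*?-->/m

by_lookahead = html.scan(lookahead)
by_lazy      = html.scan(lazy)

p by_lookahead
p by_lazy
```

Because [^-] also matches a newline, this variant does not need the m flag.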

(And of course one should be wary of parsing HTML with regex, and generally prefer to use an existing, well-tested HTML parser instead.)

 
Old 01-04-2024, 09:45 AM   #11
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,733

Rep: Reputation: 7556
Just use www.regex101.com; it explains the pattern nicely (and you can also check how it works).
https://regex101.com/r/uRhHob/1

Last edited by pan64; 01-04-2024 at 01:44 PM.
 
1 members found this post helpful.
Old 01-04-2024, 04:21 PM   #12
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,639

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
Quote:
Originally Posted by boughtonp View Post
(And of course one should be wary of parsing HTML with regex, and generally prefer to use an existing, well-tested HTML parser instead.)
I am using an XML parser, but the comments are obstructive before I handle individual tags. In the program in question, I have to eliminate successive rows full of tabulators ('\t') and a lot of empty lines. I chose to do this and also to eliminate HTML comments before the actual code parsing takes place.
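
A clean-up pass of this kind might look roughly like the following in Ruby. The helper name strip_noise and the sample input are hypothetical; the actual program is not shown in this thread, so this is only a sketch of the three steps described.

```ruby
# Hypothetical helper, not the original poster's code.
def strip_noise(html)
  html
    .gsub(/<!--.*?-->/m, "")  # drop comments (lazy match, /m lets "." cross newlines)
    .gsub(/^\t+$/, "")        # empty out lines that hold nothing but tabs
    .gsub(/\n{2,}/, "\n")     # collapse runs of empty lines into one line break
end

# Made-up input: a tab-only line, an empty line and a two-line comment.
raw = "<p>a</p>\n\t\t\t\n\n<!-- c\nmore -->\n<p>b</p>\n"
cleaned = strip_noise(raw)
p cleaned
```

The order matters in this sketch: comments and tab-only lines are removed first, so that the empty lines they leave behind are collapsed in the last step.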
 
Old 01-04-2024, 04:36 PM   #13
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS, Manjaro
Posts: 5,922

Rep: Reputation: 2816
XML is not HTML, or even close. A tool for one may not act in a useful way when applied to the other.
 
Old 01-04-2024, 11:33 PM   #14
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,928
Blog Entries: 1

Rep: Reputation: 1891
Off: XML is the poor man's SGML. Neither of those is meant to be parsed with regular expressions, e.g. what seems to be a 'comment' might actually be inside an attribute or a CDATA section:
Code:
<input type="text" value="<!-- not comment -->">
<![CDATA[ <!-- not comment --> ]]>
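
To illustrate the risk, here is a short Ruby sketch that applies the lazy comment pattern from this thread to exactly these two lines. The variable names are made up.

```ruby
html = <<~HTML
  <input type="text" value="<!-- not comment -->">
  <![CDATA[ <!-- not comment --> ]]>
HTML

# The regex knows nothing about attributes or CDATA sections,
# so it removes the comment-like text in both places.
stripped = html.gsub(/<!--.*?-->/m, "")
puts stripped
```

The attribute value ends up empty and the CDATA section loses its content, although neither was a real comment.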
 
1 members found this post helpful.
Old 01-05-2024, 02:46 AM   #15
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,639

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
Quote:
Originally Posted by NevemTeve View Post
Off: XML is the poor man's SGML. Neither of those is meant to be parsed with regular expressions, e.g. what seems to be a 'comment' might actually be inside an attribute or a CDATA section:
Code:
<input type="text" value="<!-- not comment -->">
<![CDATA[ <!-- not comment --> ]]>
You may doubt and it is a good thing to doubt. We are, though, not doing rocket science. You would know, if I did (everybody would).

I am using an XML parser (... actually. Only a few would) which qualifies as an HTML parser as well, and I will not explain why this is so natural a thing that you need not worry, anyway. Skip this part. My program was working and *I* only had problems following its actions in a log file that is created automatically. It had been *my idea* to clear things up before the parser comes into play, and *I* state afterwards that this was not so bad an idea (outside of rocket science, that is).

The essence of this thread is that there are concepts which need to be *actively* kept apart from each other, because their *uses* seem so similar that it is too late when you stumble over only one of them, seemingly apt to help you. Maybe add that examples are not superfluous when you try to understand lookahead, laziness and greediness.

No need for XML. My fault to have mentioned it.

Last edited by Michael Uplawski; 01-05-2024 at 02:50 AM. Reason: exitus
 
Tags
regexp syntax