[regex] when would you prefer capture groups or String tokenizers?
Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Good afternoon.
I am revisiting Jeffrey Friedl's great book on Regular Expressions. Each time, I wonder if I should learn Perl, just for the fun of it. And each time, I conclude that I have done enough programming without Perl and fear the steep learning curve.
But remembering my past solutions in code, whenever strings had to be matched, split into pieces, or analyzed in any way, I have to admit that I avoided Regular Expressions if I could do the same thing with a simple tokenizer. When you can define a delimiter, many string functions and methods let you split up and compare strings fragment by fragment, and you will not need to know much about Regular Expressions, even though these functions and methods often accept a Regular Expression as a parameter.
Would you formulate a rule, or just share an experience, about giving precedence to one over the other?
I shall provide code-examples...
Code:
hulk@hogan:~$ irb
irb(main):009:0> "hey, there is a 50€-note lying on the table".scan(/.*,/)
=> ["hey,"]
irb(main):010:0> "hey, there is a 50€-note lying on the table".scan(/\d+/)
=> ["50"]
irb(main):031:0> str = "a223abb233b".match(/(\d+)a/)[1]
=> "223"
irb(main):022:0> /a+(\d+)a.*(\1)b/.match "aaa233a234a233b"
=> #<MatchData "aaa233a234a233b" 1:"233" 2:"233">
Edit: Terribly dumb and not really correct in C:
Code:
#include <string.h>
#include <stdio.h>
int main(void) {
    const char *str = "Hey, there is nothing lying on the table";
    const char *where = strstr(str, "nothing");
    if (where != NULL)          /* strstr returns NULL when the substring is absent */
        printf("%s\n", where);
    return 0;
}
Do not take this as an example for anything. I leave it here for authenticity.
Last edited by Michael Uplawski; 02-14-2021 at 03:19 PM.
Reason: now with actual capture group... not back-reference.
[regex] when would you prefer back-references or String tokenizers?
Uh? Your title asks about back-references, your question text makes no further references, or even allusions, to back-references.
Since it's not clear what you're actually asking, I've no idea if any of the following is what you're after or not...
Quote:
Originally Posted by Michael Uplawski
I revisit Jeffrey Friedl's great book on Regular Expressions. Each time I wonder if I should learn Perl, just for the fun of it.
Don't conflate "Perl" and "regex".
Sure, Perl has a powerful and flexible regex engine, but there is far more to the language than that, and I'm sure it's possible to use/learn it without regex (if for some reason one wanted to).
It is certainly possible to learn regex without touching Perl.
Quote:
But remembering my past solutions in code, whenever strings had to be matched, split into pieces, or analyzed in any way, I have to admit that I avoided Regular Expressions if I could do the same thing with a simple tokenizer.
If there's a simple direct solution, use it.
I would never choose to write ".*," to match the first word of a sentence - the first choice would be a function that delimited the string with commas, e.g. "ListFirst(string)" or "string.split(',',2)[0]" (or similar), and the second would be patterns like "[^,]*," or "\w+(?=,)" depending on the specific need.
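In Ruby, the language of the examples earlier in the thread, those alternatives might look as follows; the sample string is borrowed from the first post:

```ruby
str = "hey, there is a 50€-note lying on the table"

# Split at the first comma and take the leading fragment:
str.split(",", 2)[0]   # => "hey"

# Match everything up to (but not including) the first comma:
str[/[^,]*/]           # => "hey"

# Match a word only if a comma follows (the lookahead consumes nothing):
str[/\w+(?=,)/]        # => "hey"
```

All three return the same token here; the split variant is the most obvious, the lookahead the most precise about intent.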
Quote:
you will not need to know much about Regular Expressions, even though these functions and methods often accept a Regular Expression as a parameter.
There is not much you can do with regex that can't also be done another way. That doesn't mean regex isn't an incredibly useful tool when applied appropriately.
Quote:
Would you formulate a rule, or just share an experience, about giving precedence to one over the other?
Same way you choose what to eat for dinner - a combination of experience, preference, and available options.
When you know regex, it can be a quick and concise way to describe text you want to do something with.
Even when I intend to use a proper parser for some data, it can be a great 90% solution for the first draft that lets me focus on the bulk of the code, and come back to the precise format later.
And it's generally a lot quicker to use tools like awk/grep/sed than doing ad-hoc string tokenization.
Quote:
Originally Posted by boughtonp
Uh? Your title asks about back-references, your question text makes no further references, or even allusions, to back-references. Since it's not clear what you're actually asking, I've no idea if any of the following is what you're after or not...
I confused two expressions, “back-references” and “capture groups”. Although they are related, my question was more about the latter, as you have noted. Before “the book”, I did not know these existed, although I had seen the syntax earlier.
Btw., there is a back-reference in my last Ruby example.
Quote:
I would never choose to write ".*," to match the first word of a sentence - the first choice would be a function that delimited the string with commas, e.g. "ListFirst(string)" or "string.split(',',2)[0]" (or similar), and the second would be patterns like "[^,]*," or "\w+(?=,)" depending on the specific need.
Yes, that is why I rather use String#split(delimiter) than a regex. Now I wonder in how many cases the subsequent processing could be facilitated if a pertinent regex permitted direct access to any of the created tokens. It certainly depends on the needs and the objective of the exercise. On the downside, the more a regex is capable of doing, the less maintainable it may become. I do not know this for sure, for lack of experience.
Quote:
There is not much you can do with regex that can't also be done another way.
My insecurity originates from this fact, I guess.
Quote:
When you know regex, it can be a quick and concise way to describe text you want to do something with.
When working in a team, does it not increase the need for communication? I like to comment my code, but describing a regex so that it is fully understood by everybody is certainly less fun.
Quote:
Even when I intend to use a proper parser for some data, it can be a great 90% solution for the first draft that lets me focus on the bulk of the code, and come back to the precise format later.
Voilà. This is a thing to consider and to keep in mind. “Prototyping” with regex should even help to highlight possible pitfalls and things to keep an eye on when you later devise a simpler tokenizer, should it not?
Quote:
And it's generally a lot quicker to use tools like awk/grep/sed than doing ad-hoc string tokenization.
With an emphasis on “ad-hoc”.
See? This is not all futile.
I just cannot do all the coding and do not have enough ideas to cover all these aspects, let alone the experience.
TY.
Michael
Last edited by Michael Uplawski; 02-14-2021 at 11:11 AM.
Reason: Kraut2English
Quote:
Originally Posted by Michael Uplawski
I confused two expressions, “back-references” and “capture groups”. Although they are related, my question was more about the latter, as you have noted. Before “the book”, I did not know these existed, although I had seen the syntax earlier. Btw., there is a back-reference in my last Ruby example.
Ah, as is evident, I didn't really pay a great deal of attention to the examples. (I was going to comment on each one, then ...didn't, for some reason.)
I think I understand the angle you're coming from now, and hopefully the rest of this post answers it better.
Quote:
Now I wonder in how many cases the following actions could be facilitated if a pertinent regex permits the direct access to any of the created tokens.
I'm not sure I catch what you mean with "following actions", but this is what capture groups do - they capture the text, and can be used via in-pattern back-references, replacement string back-references, or output as part of the match data (in languages with such functionality).
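The in-pattern and match-data uses are already shown in the Ruby examples earlier in the thread; the remaining one, back-references in a replacement string, could look like this (the date rewriting is merely an illustration):

```ruby
# \1..\3 in the replacement string refer back to the capture groups:
"2021-02-14".sub(/(\d{4})-(\d{2})-(\d{2})/, '\3.\2.\1')
# => "14.02.2021"

# gsub applies the same rewriting to every match in the string:
"from 2021-02-14 to 2021-02-16".gsub(/(\d{4})-(\d{2})-(\d{2})/, '\3.\2.\1')
# => "from 14.02.2021 to 16.02.2021"
```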
Modern regex implementations (e.g. Perl/Python/Java) allow named capture groups as well as the traditional positional ones, and getting an array of matches, each with named key/value pairs for the groups, can be a nice way to tokenize. There is no maintenance problem when formatted sensibly...
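Ruby supports named groups as well; a small sketch, where the pattern and input are made up for illustration:

```ruby
# Tokens are retrieved by name instead of by position:
m = "2021-02-14 15:19".match(/(?<date>\d{4}-\d{2}-\d{2})\s+(?<time>\d{2}:\d{2})/)
m[:date]          # => "2021-02-14"
m[:time]          # => "15:19"

# MatchData#named_captures returns all groups as key/value pairs:
m.named_captures  # => {"date"=>"2021-02-14", "time"=>"15:19"}
```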
Quote:
When working in a team, does it not increase the need for communication?
The less experienced a team (in any technology), the greater the need for communication, and there's nothing wrong with that - how else would anyone learn?
Quote:
I like to comment my code, but describing a regex to be fully understood by everybody is certainly less fun.
It sounds like you're saying you like to waste time by repeating yourself?
Comments should be used to explain code that cannot be readily understood by a proficient developer reading through the code (and should generally focus on why not what).
Inexperienced developers should not be relying on comments; when they encounter code they don't understand, that's their opportunity to learn.
At the same time, it's important to remember that most regex implementations do not require everything compacted into dense single-line strings that regex is sometimes infamous for, and have a comment mode flag that ignores unescaped whitespace and allows "#" to start comments.
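In Ruby that comment-mode flag is /x; the back-reference pattern from the first post, rewritten in comment mode, stays functionally identical:

```ruby
compact  = /a+(\d+)a.*(\1)b/

readable = /
  a+      # one or more leading "a"s
  (\d+)   # capture a run of digits
  a .*    # an "a", then anything
  (\1)    # the same digits again, via back-reference
  b       # a final "b"
/x        # unescaped whitespace is ignored, "#" starts a comment

s = "aaa233a234a233b"
compact.match(s)[1]   # => "233"
readable.match(s)[1]  # => "233"
```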
Quote:
Originally Posted by boughtonp
Modern regex implementations (e.g. Perl/Python/Java) allow named capture groups as well as the traditional positional ones, and getting an array of matches, each with named key/value pairs for the groups, can be a nice way to tokenize. There is no maintenance problem when formatted sensibly...
I have never used more than two capture groups, I believe. And my regexes are mostly for single-shot matches, but I see the charm of named capture groups.
The test of email addresses is a nice example. May I ask whether the non-capturing groups in this precise case really make a difference? Maybe it depends on the number of mail addresses being tested.
Also, the use of comment-mode renders the code comparable to that of any other kind of tokenizer. Meaning that the amount of detail covered by the regex becomes an obvious advantage if the targeted data is relatively unspecific.
Quote:
Originally Posted by Michael Uplawski
May I ask whether the non-capturing groups in this precise case really make a difference? Maybe it depends on the number of mail addresses being tested.
You mean in performance terms?
I'm not sure I've ever measured it, but to me it's mostly an indication of intent - I'm grouping this but explicitly don't care about its value.
Thinking about it, I'm getting curious - since a regex engine needs to store backtracking information, it's going to have start position and length of every unit/atom already, so the difference between an unused capture group and a non-capturing group might effectively only be an extra int/ID being assigned to that particular section, or perhaps even only a boolean with counting done on retrieval.
If the capture group causes text to be internally duplicated then that could start adding up, given long enough text and/or enough matches, but even then - compared to the bloat of modern software - it'd likely take a lot before it became significant.
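One way to find out would be a quick (and admittedly unscientific) benchmark; a Ruby sketch, where the numbers will of course vary by engine, input, and machine:

```ruby
require "benchmark"

text = "x123y " * 200
capturing     = /(\d+)y/   # stores the digits as group 1
non_capturing = /(?:\d+)y/ # groups without capturing

n = 20_000
Benchmark.bm(14) do |b|
  b.report("capturing")     { n.times { capturing.match(text) } }
  b.report("non-capturing") { n.times { non_capturing.match(text) } }
end
```

Both patterns find the same overall match; only the first makes the digits retrievable afterwards.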
You've made me want to go investigate how different regex engines do it and see if there are any meaningful differences between them.
This is what Russ Cox, the author of RE2, has to say about backreferences, though:
Quote:
Backreferences. As mentioned earlier, no one knows how to implement regular expressions with backreferences efficiently, though no one can prove that it's impossible either. (Specifically, the problem is NP-complete, meaning that if someone did find an efficient implementation, that would be major news to computer scientists and would win a million dollar prize.) The simplest, most effective strategy for backreferences, taken by the original awk and egrep, is not to implement them. This strategy is no longer practical: users have come to rely on backreferences for at least occasional use, and backreferences are part of the POSIX standard for regular expressions. Even so, it would be reasonable to use Thompson's NFA simulation for most regular expressions, and only bring out backtracking when it is needed. A particularly clever implementation could combine the two, resorting to backtracking only to accommodate the backreferences.
At least, his own RE2 library implements capturing groups.
On non-capturing groups, Jeffrey Friedl writes that either they are practical in that they avoid global variables being “used up” (my words) by uninteresting data, or they are, in fact and as boughtonp already said, signalling intent. The gain in efficiency should depend on the amount of data munged, or on the number of times a regex is applied in a loop or similar.
I am not sure about the gain in readability where capturing groups are mixed with non-capturing ones, maybe in comment-mode. In my opinion, though, they do signal well that a value is not used in the later analysis, and will continue to do so when you return much later to adapt your code to new requirements.
I will find out how qutebrowser integrates spell-checking; bear with my English for now.
Cheerio.
Last edited by Michael Uplawski; 02-16-2021 at 12:20 AM.
Reason: ... bear. Wow. Really ... Darn.
Quote:
Originally Posted by boughtonp
You've made me want to go investigate how different regex engines do it and see if there are any meaningful differences between them.
I am not in a position to contribute wise words to this endeavor; I can just add that some tools switch between DFA and NFA engines as needed, and I guess that this can apply to PCREs, too. It is therefore really important to evaluate *engines* first, and *then* to see what is under the hood of any particular tool.
You only look at NFAs, I know. However, if there are results, keep them as talkative as possible. You never know when this thread on LQ comes up in a search result and who will use it in an argument.
Last edited by Michael Uplawski; 02-16-2021 at 06:45 AM.