[regex] when would you prefer capture groups or String tokenizers?
Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Good afternoon.
I am revisiting Jeffrey Friedl's great book on Regular Expressions. Each time, I wonder if I should learn Perl, just for the fun of it. And each time, I conclude that I have done enough programming without Perl and fear the steep learning curve.
But remembering my past solutions in code, whenever strings had to be matched, split into pieces, or analyzed in any way, I have to admit that I avoided Regular Expressions if I could do the same thing with a simple tokenizer. When you can define a delimiter, many string functions and methods let you split up and compare strings fragment by fragment, and you will not need to know much about Regular Expressions, even though these functions and methods often accept a Regular Expression as a parameter.
Would you formulate a rule, or just share an experience, about giving precedence to one over the other?
I shall provide code-examples...
Code:
hulk@hogan:~$ irb
irb(main):009:0> "hey, there is a 50€-note lying on the table".scan(/.*,/)
=> ["hey,"]
irb(main):010:0> "hey, there is a 50€-note lying on the table".scan(/\d+/)
=> ["50"]
irb(main):031:0> str = "a223abb233b".match(/(\d+)a/)[1]
=> "223"
irb(main):022:0> /a+(\d+)a.*(\1)b/.match "aaa233a234a233b"
=> #<MatchData "aaa233a234a233b" 1:"233" 2:"233">
Edit: Terribly dumb and not really correct in C:
Code:
#include <string.h>
#include <stdio.h>
int main(void) {
    const char *str = "Hey, there is nothing lying on the table";
    const char *where = strstr(str, "nothing");
    if (where != NULL)          /* strstr returns NULL when the substring is absent */
        printf("%s\n", where);
    return 0;
}
Do not take this as an example for anything. I leave it here for authenticity.
Last edited by Michael Uplawski; 02-14-2021 at 03:19 PM.
Reason: now with actual capture group... not back-reference.
[regex] when would you prefer back-references or String tokenizers?
Uh? Your title asks about back-references, your question text makes no further references, or even allusions, to back-references.
Since it's not clear what you're actually asking, I've no idea if any of the following is what you're after or not...
Quote:
Originally Posted by Michael Uplawski
I revisit Jeffrey Friedl's great book on Regular Expressions. Each time I wonder if I should learn Perl, just for the fun of it.
Don't conflate "Perl" and "regex".
Sure, Perl has a powerful and flexible regex engine, but there is far more to the language than that, and I'm sure it's possible to use/learn it without regex (if for some reason one wanted to).
It is certainly possible to learn regex without touching Perl.
Quote:
But remembering my past solutions in code, whenever strings had to be matched, split into pieces, or analyzed in any way, I have to admit that I avoided Regular Expressions if I could do the same thing with a simple tokenizer.
If there's a simple direct solution, use it.
I would never choose to write ".*," to match the first word of a sentence - the first choice would be a function that delimited the string with commas, e.g. "ListFirst(string)" or "string.split(',',2)[0]" (or similar), and the second would be patterns like "[^,]*," or "\w+(?=,)" depending on the specific need.
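In Ruby, the language of the examples earlier in the thread, those alternatives might look as follows; the sample string is borrowed from the first post:

```ruby
str = "hey, there is a 50€-note lying on the table"

# Split at the first comma and take the leading fragment:
str.split(",", 2)[0]   # => "hey"

# Match everything up to (but not including) the first comma:
str[/[^,]*/]           # => "hey"

# Match a word only if a comma follows (the lookahead consumes nothing):
str[/\w+(?=,)/]        # => "hey"
```

All three return the same token here; the split variant is the most obvious, the lookahead the most precise about intent.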
Quote:
you will not need to know much about Regular Expressions, even though these functions and methods often accept a Regular Expression as a parameter.
There is not much you can do with regex that can't also be done another way. That doesn't mean regex isn't an incredibly useful tool when applied appropriately.
Quote:
Would you formulate a rule, or just share an experience, about giving precedence to one over the other?
Same way you choose what to eat for dinner - a combination of experience, preference, and available options.
When you know regex, it can be a quick and concise way to describe text you want to do something with.
Even when I intend to use a proper parser for some data, it can be a great 90% solution for the first draft that lets me focus on the bulk of the code, and come back to the precise format later.
And it's generally a lot quicker to use tools like awk/grep/sed than doing ad-hoc string tokenization.
Quote:
Originally Posted by boughtonp
Uh? Your title asks about back-references, your question text makes no further references, or even allusions, to back-references. Since it's not clear what you're actually asking, I've no idea if any of the following is what you're after or not...
I confused two expressions, “back-references” and “capture groups”. Although they are related, my question was more about the latter, as you have noted. Before “the book”, I did not know these existed, although I had seen the syntax earlier.
Btw., there is a back-reference in my last Ruby example.
Quote:
I would never choose to write ".*," to match the first word of a sentence - the first choice would be a function that delimited the string with commas, e.g. "ListFirst(string)" or "string.split(',',2)[0]" (or similar), and the second would be patterns like "[^,]*," or "\w+(?=,)" depending on the specific need.
Yes, that is why I rather use String#split(delimiter) than a regex. Now I wonder in how many cases the subsequent processing could be facilitated if a pertinent regex permitted direct access to any of the created tokens. It certainly depends on the needs and the objective of the exercise. On the downside, the more a regex is capable of doing, the less maintainable it may become. I do not know this for sure, for lack of experience.
Quote:
There is not much you can do with regex that can't also be done another way.
My insecurity originates from this fact, I guess.
Quote:
When you know regex, it can be a quick and concise way to describe text you want to do something with.
When working in a team, does it not increase the need for communication? I like to comment my code, but describing a regex so that it is fully understood by everybody is certainly less fun.
Quote:
Even when I intend to use a proper parser for some data, it can be a great 90% solution for the first draft that lets me focus on the bulk of the code, and come back to the precise format later.
Voilà. This is a thing to consider and to keep in mind. “Prototyping” with regex should even help to highlight possible pitfalls and things to keep an eye on when you later devise a simpler tokenizer, should it not?
Quote:
And it's generally a lot quicker to use tools like awk/grep/sed than doing ad-hoc string tokenization.
With an emphasis on “ad-hoc”.
See? This is not all futile.
I just cannot do all the coding and do not have enough ideas to cover all these aspects, let alone the experience.
TY.
Michael
Last edited by Michael Uplawski; 02-14-2021 at 11:11 AM.
Reason: Kraut2English
Quote:
Originally Posted by Michael Uplawski
I confused two expressions, “back-references” and “capture groups”. Although they are related, my question was more about the latter, as you have noted. Before “the book”, I did not know these existed, although I had seen the syntax earlier. Btw., there is a back-reference in my last Ruby example.
Ah, as is evident, I didn't really pay a great deal of attention to the examples. (I was going to comment on each one, then ...didn't, for some reason.)
I think I understand the angle you're coming from now, and hopefully the rest of this post answers it better.
Quote:
Now I wonder in how many cases the following actions could be facilitated if a pertinent regex permits the direct access to any of the created tokens.
I'm not sure I catch what you mean with "following actions", but this is what capture groups do - they capture the text, and can be used via in-pattern back-references, replacement string back-references, or output as part of the match data (in languages with such functionality).
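The in-pattern and match-data uses are already shown in the Ruby examples earlier in the thread; the remaining one, back-references in a replacement string, could look like this (the date rewriting is merely an illustration):

```ruby
# \1..\3 in the replacement string refer back to the capture groups:
"2021-02-14".sub(/(\d{4})-(\d{2})-(\d{2})/, '\3.\2.\1')
# => "14.02.2021"

# gsub applies the same rewriting to every match in the string:
"from 2021-02-14 to 2021-02-16".gsub(/(\d{4})-(\d{2})-(\d{2})/, '\3.\2.\1')
# => "from 14.02.2021 to 16.02.2021"
```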
Modern regex implementations (e.g. Perl/Python/Java) allow named capture groups as well as the traditional positional ones, and getting an array of matches, each with named key/value pairs for the groups, can be a nice way to tokenize. There is no maintenance problem when formatted sensibly...
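Ruby supports named groups as well; a small sketch, where the pattern and input are made up for illustration:

```ruby
# Tokens are retrieved by name instead of by position:
m = "2021-02-14 15:19".match(/(?<date>\d{4}-\d{2}-\d{2})\s+(?<time>\d{2}:\d{2})/)
m[:date]          # => "2021-02-14"
m[:time]          # => "15:19"

# MatchData#named_captures returns all groups as key/value pairs:
m.named_captures  # => {"date"=>"2021-02-14", "time"=>"15:19"}
```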
Quote:
When working in a team, does it not increase the need for communication?
The less experienced a team (in any technology), the greater the need for communication, and there's nothing wrong with that - how else would anyone learn?
Quote:
I like to comment my code, but describing a regex to be fully understood by everybody is certainly less fun.
It sounds like you're saying you like to waste time by repeating yourself?
Comments should be used to explain code that cannot be readily understood by a proficient developer reading through the code (and should generally focus on why not what).
Inexperienced developers should not be relying on comments; when they encounter code they don't understand, that's their opportunity to learn.
At the same time, it's important to remember that most regex implementations do not require everything compacted into dense single-line strings that regex is sometimes infamous for, and have a comment mode flag that ignores unescaped whitespace and allows "#" to start comments.
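In Ruby that comment-mode flag is /x; the back-reference pattern from the first post, rewritten in comment mode, stays functionally identical:

```ruby
compact  = /a+(\d+)a.*(\1)b/

readable = /
  a+      # one or more leading "a"s
  (\d+)   # capture a run of digits
  a .*    # an "a", then anything
  (\1)    # the same digits again, via back-reference
  b       # a final "b"
/x        # unescaped whitespace is ignored, "#" starts a comment

s = "aaa233a234a233b"
compact.match(s)[1]   # => "233"
readable.match(s)[1]  # => "233"
```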
Quote:
Originally Posted by boughtonp
Modern regex implementations (e.g. Perl/Python/Java) allow named capture groups as well as the traditional positional ones, and getting an array of matches, each with named key/value pairs for the groups, can be a nice way to tokenize. There is no maintenance problem when formatted sensibly...
I have never used more than two capture groups, I believe. And my regexes are mostly for single-shot matches, but I see the charm of named capture groups.
The test of email addresses is a nice example. May I ask whether the non-capturing groups in this precise case really make a difference? Maybe it depends on the number of mail addresses being tested.
Also, the use of comment-mode renders the code comparable to that of any other kind of tokenizer. Meaning that the amount of detail covered by the regex becomes an obvious advantage if the targeted data is relatively unspecific.
Quote:
Originally Posted by Michael Uplawski
May I ask whether the non-capturing groups in this precise case really make a difference? Maybe it depends on the number of mail addresses being tested.
You mean in performance terms?
I'm not sure I've ever measured it, but to me it's mostly an indication of intent - I'm grouping this but explicitly don't care about its value.
Thinking about it, I'm getting curious - since a regex engine needs to store backtracking information, it's going to have start position and length of every unit/atom already, so the difference between an unused capture group and a non-capturing group might effectively only be an extra int/ID being assigned to that particular section, or perhaps even only a boolean with counting done on retrieval.
If the capture group causes text to be internally duplicated then that could start adding up, given long enough text and/or enough matches, but even then - compared to the bloat of modern software - it'd likely take a lot before it became significant.
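One way to find out would be a quick (and admittedly unscientific) benchmark; a Ruby sketch, where the numbers will of course vary by engine, input, and machine:

```ruby
require "benchmark"

text = "x123y " * 200
capturing     = /(\d+)y/   # stores the digits as group 1
non_capturing = /(?:\d+)y/ # groups without capturing

n = 20_000
Benchmark.bm(14) do |b|
  b.report("capturing")     { n.times { capturing.match(text) } }
  b.report("non-capturing") { n.times { non_capturing.match(text) } }
end
```

Both patterns find the same overall match; only the first makes the digits retrievable afterwards.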
You've made me want to go investigate how different regex engines do it and see if there are any meaningful differences between them.
This is what Russ Cox, the author of RE2, has to say about backreferences, though:
Quote:
Backreferences. As mentioned earlier, no one knows how to implement regular expressions with backreferences efficiently, though no one can prove that it's impossible either. (Specifically, the problem is NP-complete, meaning that if someone did find an efficient implementation, that would be major news to computer scientists and would win a million dollar prize.) The simplest, most effective strategy for backreferences, taken by the original awk and egrep, is not to implement them. This strategy is no longer practical: users have come to rely on backreferences for at least occasional use, and backreferences are part of the POSIX standard for regular expressions. Even so, it would be reasonable to use Thompson's NFA simulation for most regular expressions, and only bring out backtracking when it is needed. A particularly clever implementation could combine the two, resorting to backtracking only to accommodate the backreferences.
At least, his own RE2 library implements capturing groups.
On non-capturing groups, Jeffrey Friedl writes that either they are practical in that they avoid global variables being “used up” (my words) by uninteresting data, or they are, in fact and as boughtonp already said, signalling intent. The gain in efficiency should depend on the amount of data munged, or on the number of times a regex is applied in a loop or similar.
I am not sure about the gain in readability where capturing groups are mixed with non-capturing ones, maybe in comment-mode. In my opinion, though, they do signal well that a value is not used in the later analysis, and will continue to do so when you return much later to adapt your code to new requirements.
I will find out how qutebrowser integrates spell-checking; bear with my English for now.
Cheerio.
Last edited by Michael Uplawski; 02-16-2021 at 12:20 AM.
Reason: ... bear. Wow. Really ... Darn.
Quote:
Originally Posted by boughtonp
You've made me want to go investigate how different regex engines do it and see if there are any meaningful differences between them.
I am not in a position to contribute wise words to this endeavor; I can just add that some tools switch between DFA and NFA engines as needed, and I guess that this can apply to PCREs, too. It is therefore really important to evaluate *engines* first, and *then* to see what is under the hood of any particular tool.
You only look at NFAs, I know. However, if there are results, keep them as talkative as possible. You never know when this thread on LQ comes up in a search result and who will use it in an argument.
Last edited by Michael Uplawski; 02-16-2021 at 06:45 AM.