Seeking interesting regex samples

Turbocapitalist · 03-17-2020, 04:33 AM

I am looking to collect samples of interesting or particularly useful regular expressions. POSIX, Extended, or PCRE are all good. I've seen RexEgg, Ryan's Tutorial's Regular Expressions!, and the Regular Expressions Tutorial, to name three. I've also cruised through a bit of code, but the syntax is not something that can easily be found generically with grep. So I am asking for what you recall:

Which interesting and/or useful patterns have you seen, made, and/or used?

Myself, I have recently dealt with this one to place commas in a rather long number:

Code:

s/(\d)(?=(\d{3})+$)/$1,/g;
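
For anyone who wants to try this outside Perl, here is a minimal Python sketch of the same substitution. The function name add_commas is made up for illustration, and the sketch assumes the input is a plain string of digits with nothing after them.

```python
import re

def add_commas(digits: str) -> str:
    # A digit gets a comma after it when everything that follows, up to the
    # end of the string, is one or more complete groups of three digits.
    return re.sub(r"(\d)(?=(\d{3})+$)", r"\1,", digits)

print(add_commas("1234567"))
```

The lookahead does the counting without consuming anything, which is why a single pass with the g flag (or re.sub's default replace-all behaviour) is enough.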

Strip leading and trailing white space:

Code:

s/^\s+//; s/\s+$//;

Strip hidden Unicode word joiner:

Code:

s/\x{2060}//g;
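
The last two clean-ups can be combined in one small Python sketch. The names tidy and WORD_JOINER are hypothetical; the word joiner is removed first because it is not whitespace and would otherwise shield spaces next to it from the trim.

```python
import re

WORD_JOINER = "\u2060"  # U+2060 WORD JOINER, invisible in most fonts

def tidy(text: str) -> str:
    # Drop every hidden word joiner, then trim whitespace at both ends.
    text = text.replace(WORD_JOINER, "")
    return re.sub(r"^\s+|\s+$", "", text)

print(repr(tidy("  foo\u2060bar \n")))
```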

However, it has not occurred to me to keep note of them over the years and most have been one-offs. It would be interesting to know which regular expressions have been particularly useful or interesting, especially the longer, more complex ones.

business_kid · 03-17-2020, 05:02 AM

Chase up SpamAssassin rules for good examples of PCREs. IMHO, POSIX REs vary a lot in how they're interpreted, and are less useful as a result. You can't do something clever, send it out into the world, and realistically expect it to do the same clever thing everywhere, sadly.

https://spamassassin.apache.org/

SpamAssassin has extra rulesets, and they were more useful than the standard rules back when I was using it. Of course that landscape changes rapidly as spammers adapt. For instance, when I was using it, 75% of spam was composed using M$ FrontPage. I didn't want to see anything composed with FrontPage, so I could rule out 75% of spam on that alone. I'm sure people have wised up in 15 years.

Turbocapitalist · 03-17-2020, 05:15 AM

Thanks. That's the kind of trove I am looking for.

grail · 03-17-2020, 06:58 AM

Whilst doing a rails tutorial, way back when, I found these 2 quite interesting:

Code:

# check an email address (this comes with the caveat it is nowhere near extensive)
VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i

# password checker - between 6 & 20 chars long, at least 1 digit, lower case letter, upper case letter and symbol, no 2 same characters side by side eg NN no, nN yes
VALID_PASSWORD_REGEX = /\A((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%])(?!.*(.)\2).{6,20})\z/

You can test them here

boughtonp · 03-17-2020, 08:23 AM

Quote:

Originally Posted by Turbocapitalist

I am looking to collect samples of interesting or particularly useful regular expressions.

Why?

I'd consider most of the patterns I write useful (or I wouldn't write them), but interesting is definitely subjective.

Turbocapitalist · 03-17-2020, 08:31 AM

Yes, they are all useful, or at least most of them are. However, some end up saving more effort than others just because they happen to be a better fit for a particular task.

Why am I looking? It is because I intend to try to explain how to use them and wish to provide better examples than I can come up with on my own as well as provide examples from outside my areas of interest or specialty. I notice that authors and even whole projects tend to use the patterns in a rather narrow way and I am no exception to that limitation. Thus I would prefer to collect a broader range of examples in order to try to make it easier to understand the principles involved.

boughtonp · 03-17-2020, 08:31 AM

Quote:

Originally Posted by grail

Whilst doing a rails tutorial, way back when, I found these 2 quite interesting:

Code:

# check an email address (this comes with the caveat it is nowhere near extensive)
VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i

# password checker - between 6 & 20 chars long, at least 1 digit, lower case letter, upper case letter and symbol, no 2 same characters side by side eg NN no, nN yes
VALID_PASSWORD_REGEX = /\A((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%])(?!.*(.)\2).{6,20})\z/

You can test them here

Please do not use either of those patterns, they are both broken.

If you need a valid email, send a message with a verification code.
If you need a secure password, use a library like zxcvbn which performs actual strength checks and does not set insecure length limits.
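
To make "broken" concrete, here is a rough Python sketch that runs a few inputs through both patterns. The patterns are transliterated from the Ruby above (\z becomes \Z, and re.ASCII is an assumption made to mimic Ruby's ASCII-only \w and \d); the test strings are invented for illustration.

```python
import re

# Python spellings of the two quoted Ruby patterns.
EMAIL = re.compile(r"\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\Z", re.I | re.A)
PASSWORD = re.compile(
    r"\A((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%])(?!.*(.)\2).{6,20})\Z"
)

# Accepted although the domain contains an empty label (two dots in a row).
print(bool(EMAIL.match("user@example..com")))

# Rejected although an apostrophe is allowed in the local part.
print(bool(EMAIL.match("o'brien@example.com")))

# Accepted: six characters that tick every box.
print(bool(PASSWORD.match("Pa5$wd")))

# Rejected purely for being longer than 20 characters.
print(bool(PASSWORD.match("Long-Phrase-With-7-Words-@-Home")))
```

The email pattern is both too loose and too strict, and the password pattern prefers a six-character string over a 31-character passphrase.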

boughtonp · 03-17-2020, 08:41 AM

Quote:

Originally Posted by Turbocapitalist

It is because I intend to try to explain how to use them and wish to provide better examples than I can come up with on my own as well as provide examples from outside my areas of interest or specialty. I notice that authors and even whole projects tend to use the patterns in a rather narrow way and I am no exception to that limitation. Thus I would prefer to collect a broader range of examples in order to try to make it easier to understand the principles involved.

Excellent - that's a good reason to be asking.

I have some longer patterns that should be helpful for that, but limited time and a very flakey connection right now, so I'll come back later to share them.

Turbocapitalist · 03-17-2020, 08:41 AM

Quote:

Originally Posted by boughtonp

Please do not use either of those patterns, they are both broken.

I've added footnotes explaining that they are flawed and only an illustration. I guess part of this will be explaining where regex is not a good fit, or not a good fit by itself.

boughtonp · 03-17-2020, 05:02 PM

Quote:

Originally Posted by Turbocapitalist

I guess part of this will be explaining where regex is not a good fit, or not a good fit by itself.

Yep, and that's an important thing to stress - regex can be fun, but there's a lot of situations where it's not the right tool, or only part of the answer.

Email address is a good example - there's an RFC compliant regex somewhere that is 5000+ characters long, and is ridiculous.

Before I realised parsing email addresses with regex wasn't useful, I read the RFC and wrote a far simpler version:

Code:

(?x)
^
# Mailbox - the first part of an email can contain almost any character.
(?:
   # Quoted
   "[^"]++"
|
   # Atoms
   (?:[^@ ]+|(?<!\\)(?:\\\\)*+\\[@ ])+
)
@
# Host - the second part will be a domain or IPv4 or IPv6.
(?:
   # Domain or IPv4
   [\w\-]++(?:\.[\w\-]++)*+
|
   # IPv6
   [a-fA-F0-9:]{3,39}+
)
$

That's great to use if someone tries using the stupid 5k char regex as an example of regex being a bad tool - this version is readable and will accept any RFC compliant address (and a few others), but who cares whether the syntax is valid if the domain itself is incorrect - and who's going to enter the (technically valid) address of "@"\@@::1 and expect to get a message?

Feature-wise, the most obvious/interesting thing there is the use of comment mode, enabled with the x flag or with the (?x) directive, which means whitespace is ignored (unless escaped or in a character class) and unescaped hashes begin line comments.
It helps a lot to make patterns readable.
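
As a small illustration of the same idea in another engine, here is a sketch using Python's re.VERBOSE flag, which follows the same whitespace and hash rules. The pattern is the comma-insertion one from the first post, and the names COMMAS and LITERAL are made up for this example.

```python
import re

COMMAS = re.compile(
    r"""
    (\d)            # a digit that should be followed by a comma ...
    (?=             # ... when, looking ahead without consuming,
        (?:\d{3})+  #     one or more complete groups of three digits
        $           #     run up to the end of the string
    )
    """,
    re.VERBOSE,
)

# An escaped space and a hash inside a character class stay literal;
# the bare spaces around them are ignored.
LITERAL = re.compile(r"a\ b [#] c", re.VERBOSE)

print(COMMAS.sub(r"\1,", "9876543210"))
print(bool(LITERAL.fullmatch("a b#c")))
```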

Another uncommonly used feature demonstrated there is possessive quantifiers - obtained by adding a + to an existing quantifier - they are almost the same as standard (greedy) quantifiers, but also prevent backtracking within their match (which is vital in some situations). I did go through a phase of using them even when they weren't strictly necessary, which has the downside of making it less obvious when preventing backtracking is necessary.
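
A tiny sketch of the difference, assuming an engine that supports the ++ syntax. Python only gained it in version 3.11, so the hypothetical helper full_match below returns None instead of failing on older versions.

```python
import re

def full_match(pattern: str, text: str):
    """Return True/False for a full match, or None if the syntax is rejected."""
    try:
        return re.fullmatch(pattern, text) is not None
    except re.error:
        return None  # e.g. possessive quantifiers before Python 3.11

# Greedy: \d+ first takes all five digits, then gives back the final 5.
greedy = full_match(r"\d+5", "12345")

# Possessive: \d++ keeps all five digits, so the literal 5 has nothing left.
possessive = full_match(r"\d++5", "12345")

print(greedy, possessive)
```

The greedy version succeeds only because of backtracking; the possessive version refuses to backtrack and so fails fast, which is the behaviour you want when a retry could never be correct.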

Here's another pattern that uses comment mode - but I've also swapped comments for named groups to demonstrate another under-used feature. Named groups can have other syntaxes in different engines, but most support this (?<name>group) variant.

It came from a function for checking whether a string was a correctly formatted locale/language tag, according to RFC 5646, and is a good example of breaking up what could otherwise be a dense mass of characters, into labelled parts, with the added advantage that each group will identify the relevant parts of the tag, so they can be individually validated.

Code:

(?xi)

(?<language> # ISO 639
   [a-z]{2,3}
)
(?<extlang> # ISO 639
   (?:-[a-z]{3}){0,3}?
)
(?<script> # ISO 15924
   (?:-[a-z]{4})?
)
(?<region> # ISO 3166-1
   (?:-[a-z]{2})?
)
(?<variants>
   (?:-[a-z0-9]{5,8}|-[0-9][a-z0-9]{3})*
)
(?<extensions>
   (?:-(?!x)[a-z0-9](?:-[a-z0-9]{2,8})+)*
)
(?<privateuse>
   (?:-x(?:-[a-z0-9]{1,8})+)?
)

See how much nicer the above is compared to how a lot of people might have written it (before going on to complain about lack of readability)...

Code:

(?i)[a-z]{2,3}(?:-[a-z]{3}){0,3}?(?:-[a-z]{4})?(?:-[a-z]{2})?(?:-[a-z0-9]{5,8}|-[0-9][a-z0-9]{3})*(?:-(?!x)[a-z0-9](?:-[a-z0-9]{2,8})+)*(?:-x(?:-[a-z0-9]{1,8})+)?
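
For readers who want to see the named groups pay off, here is a sketch of the labelled version in Python, which is one of the engines that spells named groups differently: (?P<name>group). The names LANGUAGE_TAG and parse_tag are invented here, flags are passed to re.compile rather than inline, and fullmatch is an assumption about how the original function anchored the pattern.

```python
import re

LANGUAGE_TAG = re.compile(
    r"""
    (?P<language>   [a-z]{2,3} )                                # ISO 639
    (?P<extlang>    (?:-[a-z]{3}){0,3}? )                       # ISO 639
    (?P<script>     (?:-[a-z]{4})? )                            # ISO 15924
    (?P<region>     (?:-[a-z]{2})? )                            # ISO 3166-1
    (?P<variants>   (?:-[a-z0-9]{5,8}|-[0-9][a-z0-9]{3})* )
    (?P<extensions> (?:-(?!x)[a-z0-9](?:-[a-z0-9]{2,8})+)* )
    (?P<privateuse> (?:-x(?:-[a-z0-9]{1,8})+)? )
    """,
    re.VERBOSE | re.IGNORECASE,
)

def parse_tag(tag: str):
    """Return the named parts of a tag, or None if the whole string does not fit."""
    match = LANGUAGE_TAG.fullmatch(tag)
    return match.groupdict() if match else None

parts = parse_tag("zh-Hant-CN-x-private1")
print(parts["script"], parts["region"], parts["privateuse"])
```

Each group keeps its leading hyphen, exactly as in the pattern above, so a caller validating the region against ISO 3166-1 would strip that first.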

Anyway, that'll do for now - hope it's interesting and not boring.

Turbocapitalist · 03-18-2020, 04:59 AM

Thanks. These are all great. The comment modifier (/x) makes longer expressions much more approachable. Nice point about the named capture groups, they are quite underrated.

chrism01 · 03-30-2020, 05:24 AM

Here is the book on Regex engines and languages/implementations (there is more than one): http://regex.info/book.html.
Highly recommended for the content and amazingly easy to read, especially when you consider it can be a very dry subject.

business_kid · 03-31-2020, 04:13 AM

Ah, if C. Northcote Parkinson (of Parkinson's Law) had lived to see Unix, there would certainly be a Law about Regexes.

Turbocapitalist · 03-31-2020, 04:22 AM

I'm in the process of mining some more source:

Code:

mkdir /tmp/Src/
cd /tmp/Src/

apt-cache search '[a-z]' \
 | awk '{print $1}' \
 | xargs apt-cache dotty \
 | awk '$3~/^"python3/&&$1!~/^"python/{gsub(/"/,"",$1); print $1;}' \
 | sort -u \
 | xargs -I{} apt-get source {} 2>/dev/null

apt-cache search '[a-z]' \
 | awk '{print $1}' \
 | xargs apt-cache dotty \
 | awk '$3~/^"perl/&&$1!~/^"perl/{gsub(/"/,"",$1); print $1;}' \
 | sort -u \
 | xargs -I{} apt-get source {} 2>/dev/null

find /tmp/Src/ -type f -not -name '*.py' -not -name '*.pl' -delete

It's a little harder than expected to just grab patterns from the source. I'm becoming a fan of the m// syntax in Perl, since the m helps spot the pattern. That matters because some Perl code uses unusual delimiters, which makes the patterns very hard to identify except by context. Others, in both Perl and Python, are spread out over multiple lines. That might be addressable, but it is not yet entertaining enough to pursue.
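
For the Python side at least, the multi-line case could be handled by parsing the source instead of searching it line by line. This is only a rough sketch using the standard ast module, a different technique from grep; extract_patterns, RE_FUNCS and SAMPLE are names invented here, and it only sees string literals passed directly as the first argument to re.sub or re.match.

```python
import ast

RE_FUNCS = {"sub", "match"}  # assumption: the same two functions as a first pass

def extract_patterns(source: str) -> list:
    """Return the sorted, de-duplicated literal patterns given to re.<func>(...)."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "re"
            and node.func.attr in RE_FUNCS
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            found.add(node.args[0].value)
    return sorted(found)

# A call spread over several lines, which a line-based search would miss.
SAMPLE = r'''
import re
cleaned = re.sub(
    r"\s+",
    " ",
    text,
)
if re.match(r"^\d{4}-\d{2}$", stamp):
    pass
'''

print(extract_patterns(SAMPLE))
```

Patterns built at run time, bytes patterns, and calls made through an alias such as "from re import sub" are all skipped by this sketch.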

Edit: Note the patterns in this particular post are not interesting. They are a means to an end and I want to remember them for later. Instead, the ones which they dig out of the mass of source code are what is interesting.

Code:

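# Pass each file name to sh as "$1" instead of splicing {} into the script,
# so paths with spaces or quotes cannot break or alter the command.
time find /tmp/Src/ -type f -name '*.py' \
	-exec grep -q -m 1 -E 'import\s+re' {} \; \
	-exec sh -c 'grep -h -w -E "re\.(sub|match)" "$1" \
	| sed -r -e "s/^[^#]*re\.(sub|match)/re.\1/g;"' sh {} \; \
	| sort -u \
	> /tmp/python-patterns.txt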

pan64 · 03-31-2020, 04:41 AM

Just an off-topic comment: "interesting" is a personal preference, so I do not think there can be a general answer to this question.
Another one: personally, I do not really find the posted regex strings useful. One person uses a web server, another an SQL engine ...
And a final one: in our company, in my daily work, I (we) use regexes very often, but you will not find them interesting.