Old 03-17-2020, 04:33 AM   #1
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,386
Blog Entries: 3

Rep: Reputation: 3776
Seeking interesting regex samples


I am looking to collect samples of interesting or particularly useful regular expressions. POSIX, Extended, or PCRE are all good. I've seen RexEgg, Ryan's Tutorial's Regular Expressions!, and the Regular Expressions Tutorial, to name three. I've also cruised through a bit of code, but the syntax is not something that can easily be found generically with grep. So I am asking for what you recall:

Which interesting and/or useful patterns have you seen, made, and/or used?

Myself, I have recently dealt with this one to place commas in a rather long number:

Code:
s/(\d)(?=(\d{3})+$)/$1,/g;
Strip leading and trailing white space:

Code:
s/^\s+//; s/\s+$//;
Strip hidden Unicode word joiner:

Code:
s/\x{2060}//g;
However, it has not occurred to me to keep note of them over the years, and most have been one-offs. It would be interesting to know which regular expressions have been particularly useful or interesting, especially the longer, more complex ones.
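
For anyone trying these outside Perl, here is a rough Python sketch of the same three substitutions. The function names are made up for illustration, and Python's re module writes the replacement group as \1 rather than $1.

```python
import re

def add_commas(digits: str) -> str:
    # Insert a comma after any digit that is followed by one or more
    # complete groups of three digits running to the end of the string.
    return re.sub(r"(\d)(?=(\d{3})+$)", r"\1,", digits)

def strip_outer_whitespace(text: str) -> str:
    # Same effect as the two Perl substitutions, done in a single call.
    return re.sub(r"^\s+|\s+$", "", text)

def drop_word_joiner(text: str) -> str:
    # U+2060 WORD JOINER is invisible but still counts as a character.
    return re.sub("\u2060", "", text)

print(add_commas("1234567"))  # 1,234,567
```

The comma pattern assumes the string holds nothing but the number, since the lookahead is anchored to the end of the string.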
 
Old 03-17-2020, 05:02 AM   #2
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,517

Rep: Reputation: 2377
Chase up Spamassassin rules for good examples of PCREs. IMHO, Posix REs vary a lot in how they're interpreted, and are less useful as a result. You can't do something clever, send it out into the world, and realistically expect it to do the same clever thing everywhere, sadly.

https://spamassassin.apache.org/

Spamassassin has extra rulesets and they were more useful than the standard rules back when I was using it. Of course that landscape changes rapidly, as spammers adapt. For instance, when I was using it, 75% of spam was composed using M$ FrontPage. I didn't want to see anything composed with FrontPage, so I could rule out 75% of spam on that alone. I'm sure people have wised up in 15 years.
 
1 members found this post helpful.
Old 03-17-2020, 05:15 AM   #3
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,386

Original Poster
Blog Entries: 3

Rep: Reputation: 3776
Thanks. That's the kind of trove I am looking for.
 
Old 03-17-2020, 06:58 AM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194
Whilst doing a rails tutorial, way back when, I found these 2 quite interesting:
Code:
# check an email address (this comes with the caveat that it is nowhere near exhaustive)
VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i

# password checker - between 6 & 20 chars long, at least 1 digit, lower case letter, upper case letter and symbol, no 2 same characters side by side, e.g. NN no, nN yes
VALID_PASSWORD_REGEX = /\A((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%])(?!.*(.)\2).{6,20})\z/
You can test them here
 
1 members found this post helpful.
Old 03-17-2020, 08:23 AM   #5
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,653

Rep: Reputation: 2580
Quote:
Originally Posted by Turbocapitalist View Post
I am looking to collect samples of interesting or particularly useful regular expressions.
Why?


I'd consider most of the patterns I write to be useful (or I wouldn't write them), but interesting is definitely subjective.


 
Old 03-17-2020, 08:31 AM   #6
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,386

Original Poster
Blog Entries: 3

Rep: Reputation: 3776
Yes, they are all useful, or at least most of them are. However, some end up saving more effort than others just because they happen to be a better fit for a particular task.

Why am I looking? It is because I intend to try to explain how to use them and wish to provide better examples than I can come up with on my own as well as provide examples from outside my areas of interest or specialty. I notice that authors and even whole projects tend to use the patterns in a rather narrow way and I am no exception to that limitation. Thus I would prefer to collect a broader range of examples in order to try to make it easier to understand the principles involved.
 
1 members found this post helpful.
Old 03-17-2020, 08:31 AM   #7
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,653

Rep: Reputation: 2580
Quote:
Originally Posted by grail View Post
Whilst doing a rails tutorial, way back when, I found these 2 quite interesting:
Code:
# check an email address (this comes with the caveat that it is nowhere near exhaustive)
VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i

# password checker - between 6 & 20 chars long, at least 1 digit, lower case letter, upper case letter and symbol, no 2 same characters side by side, e.g. NN no, nN yes
VALID_PASSWORD_REGEX = /\A((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%])(?!.*(.)\2).{6,20})\z/
You can test them here
Please do not use either of those patterns, they are both broken.

If you need a valid email, send a message with a verification code.
If you need a secure password, use a library like zxcvbn which performs actual strength checks and does not set insecure length limits.
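
To make "broken" concrete, here is a small Python sketch that ports the two quoted patterns and feeds them a few inputs. It is only an approximation of the Ruby originals: re.fullmatch stands in for the \A ... \z anchors, re.ASCII is an assumption made to keep \w and \d ASCII-only, and the helper names are invented for the example.

```python
import re

# Ports of the two quoted patterns (anchors replaced by fullmatch).
EMAIL = re.compile(r"[\w+\-.]+@[a-z\d\-.]+\.[a-z]+", re.IGNORECASE | re.ASCII)
PASSWORD = re.compile(
    r"((?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%])(?!.*(.)\2).{6,20})"
)

def email_ok(address: str) -> bool:
    return EMAIL.fullmatch(address) is not None

def password_ok(password: str) -> bool:
    return PASSWORD.fullmatch(password) is not None

# Accepted although the domain has an empty label.
print(email_ok("user@example..com"))
# Rejected although an apostrophe is allowed in the local part.
print(email_ok("o'brien@example.com"))
# A six-character password passes every rule...
print(password_ok("Aa1@Bb"))
# ...while a long passphrase fails the character-class and length rules.
print(password_ok("correct horse battery staple"))
```

So the email pattern both lets through addresses that cannot work and turns away ones that can, and the password pattern rewards short strings that tick boxes over long ones.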

 
1 members found this post helpful.
Old 03-17-2020, 08:41 AM   #8
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,653

Rep: Reputation: 2580
Quote:
Originally Posted by Turbocapitalist View Post
It is because I intend to try to explain how to use them and wish to provide better examples than I can come up with on my own as well as provide examples from outside my areas of interest or specialty. I notice that authors and even whole projects tend to use the patterns in a rather narrow way and I am no exception to that limitation. Thus I would prefer to collect a broader range of examples in order to try to make it easier to understand the principles involved.
Excellent - that's a good reason to be asking.

I have some longer patterns that should be helpful for that, but limited time and a very flakey connection right now, so I'll come back later to share them.

 
Old 03-17-2020, 08:41 AM   #9
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,386

Original Poster
Blog Entries: 3

Rep: Reputation: 3776
Quote:
Originally Posted by boughtonp View Post
Please do not use either of those patterns, they are both broken.
I've added footnotes explaining that they are flawed and only an illustration. I guess part of this will be explaining where regex is not a good fit. Or not a good fit by itself.

Last edited by Turbocapitalist; 03-17-2020 at 08:44 AM.
 
Old 03-17-2020, 05:02 PM   #10
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,653

Rep: Reputation: 2580
Quote:
Originally Posted by Turbocapitalist View Post
I guess part of this will be explaining where regex is not a good fit. Or not a good fit by itself.
Yep, and that's an important thing to stress - regex can be fun, but there are a lot of situations where it's not the right tool, or only part of the answer.

Email address is a good example - there's an RFC compliant regex somewhere that is 5000+ characters long, and is ridiculous.

Before I realised parsing email addresses with regex wasn't useful, I read the RFC and wrote a far simpler version:

Code:
(?x)
^
# Mailbox - the first part of an email can contain almost any character.
(?:
   # Quoted
   "[^"]++"
|
   # Atoms
   (?:[^@ ]+|(?<!\\)(?:\\\\)*+\\[@ ])+
)
@
# Host - the second part will be a domain or IPv4 or IPv6.
(?:
   # Domain or IPv4
   [\w\-]++(?:\.[\w\-]++)*+
|
   # IPv6
   [a-fA-F0-9:]{3,39}+
)
$
That's great to use if someone tries using the stupid 5k char regex as an example of regex being a bad tool. This version is readable and will accept any RFC-compliant address (and a few others). But who cares whether the syntax is valid if the domain itself is incorrect, and who's going to enter the (technically valid) address of "@"\@@::1 and expect to get a message?

Feature-wise, the most obvious/interesting thing there is the use of comment mode, enabled with the x flag or with the (?x) directive, which means whitespace is ignored (unless escaped or in a character class) and unescaped hashes begin line comments.
It helps a lot to make patterns readable.
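
As a small illustration of comment mode outside PCRE, here is a sketch using Python's re module, which accepts the same (?x) directive at the start of a pattern. The date and "issue" patterns are invented for the example.

```python
import re

# The dense form and the commented form compile to equivalent patterns.
dense = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

commented = re.compile(r"""(?x)
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
""")

# A literal space or hash has to be escaped or placed in a character class,
# otherwise comment mode discards it.
tagged = re.compile(r"""(?x)
    issue [ ] [#] (\d+)   # the word, one space, a hash, then the number
""")

print(commented.fullmatch("2020-03-17").groups())
print(tagged.search("see issue #42 for details").group(1))
```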

Another uncommonly used feature demonstrated there is possessive quantifiers - obtained by adding a + to an existing quantifier - they are almost the same as standard (greedy) quantifiers, but also prevent backtracking within their match (which is vital in some situations). I did go through a phase of using them even when they weren't strictly necessary, which has the downside of making it less obvious when preventing backtracking is necessary.
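
A minimal sketch of the difference, again in Python. This assumes Python 3.11 or later for the a*+ syntax (older versions of re reject it), so the possessive case is guarded by a version check. The lookahead-plus-backreference form is a commonly used emulation that works on older versions too.

```python
import re
import sys

text = "aaa"

# Greedy: a* first takes all three letters, then gives one back so the
# final a can match.
greedy = re.fullmatch(r"a*a", text)

# Possessive: a*+ takes all three letters and never gives any back, so
# the final a has nothing left to match.
if sys.version_info >= (3, 11):
    possessive = re.fullmatch(r"a*+a", text)
else:
    possessive = None  # a*+ is rejected by older versions of re

# Emulation: the lookahead captures the greedy run, and the backreference
# then consumes it as one unit that cannot be shortened.
emulated = re.fullmatch(r"(?=(a*))\1a", text)

print(greedy is not None, possessive is not None, emulated is not None)
```

On 3.11 or later this should print True False False: only the greedy version can back off far enough to match.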


Here's another pattern that uses comment mode - but I've also swapped comments for named groups to demonstrate another under-used feature. Named groups can have other syntaxes in different engines, but most support this (?<name>group) variant.

It came from a function for checking whether a string was a correctly formatted locale/language tag, according to RFC 5646, and is a good example of breaking up what could otherwise be a dense mass of characters, into labelled parts, with the added advantage that each group will identify the relevant parts of the tag, so they can be individually validated.

Code:
(?xi)

(?<language> # ISO 639
   [a-z]{2,3}
)
(?<extlang> # ISO 639
   (?:-[a-z]{3}){0,3}?
)
(?<script> # ISO 15924
   (?:-[a-z]{4})?
)
(?<region> # ISO 3166-1
   (?:-[a-z]{2})?
)
(?<variants>
   (?:-[a-z0-9]{5,8}|-[0-9][a-z0-9]{3})*
)
(?<extensions>
   (?:-(?!x)[a-z0-9](?:-[a-z0-9]{2,8})+)*
)
(?<privateuse>
   (?:-x(?:-[a-z0-9]{1,8})+)?
)
See how much nicer the above is compared to how a lot of people might have written it (before going on to complain about lack of readability)...
Code:
(?i)[a-z]{2,3}(?:-[a-z]{3}){0,3}?(?:-[a-z]{4})?(?:-[a-z]{2})?(?:-[a-z0-9]{5,8}|-[0-9][a-z0-9]{3})*(?:-(?!x)[a-z0-9](?:-[a-z0-9]{2,8})+)*(?:-x(?:-[a-z0-9]{1,8})+)?
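
To show the "each group identifies its part" advantage in action, here is a sketch of the same pattern ported to Python. Python is one of the engines with a different named-group syntax, (?P<name>group), and split_tag is a hypothetical helper name, not part of the original function.

```python
import re

LANGUAGE_TAG = re.compile(r"""(?xi)
(?P<language>   [a-z]{2,3} )                               # ISO 639
(?P<extlang>    (?:-[a-z]{3}){0,3}? )                      # ISO 639
(?P<script>     (?:-[a-z]{4})? )                           # ISO 15924
(?P<region>     (?:-[a-z]{2})? )                           # ISO 3166-1
(?P<variants>   (?:-[a-z0-9]{5,8}|-[0-9][a-z0-9]{3})* )
(?P<extensions> (?:-(?!x)[a-z0-9](?:-[a-z0-9]{2,8})+)* )
(?P<privateuse> (?:-x(?:-[a-z0-9]{1,8})+)? )
""")

def split_tag(tag):
    # Return a dict of the named parts, or None if the tag does not match.
    match = LANGUAGE_TAG.fullmatch(tag)
    return match.groupdict() if match else None

print(split_tag("zh-Hant-TW"))
print(split_tag("de-CH-1996"))
```

Each part keeps its leading hyphen, and parts that are absent come back as empty strings, so the caller can validate, say, the region against ISO 3166-1 separately.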
Anyway, that'll do for now - hope it's interesting and not boring.

 
2 members found this post helpful.
Old 03-18-2020, 04:59 AM   #11
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,386

Original Poster
Blog Entries: 3

Rep: Reputation: 3776
Thanks. These are all great. The comment modifier (/x) makes longer expressions much more approachable. Nice point about the named capture groups, they are quite underrated.
 
Old 03-30-2020, 05:24 AM   #12
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,375

Rep: Reputation: 2755
Here is the book on Regex engines and languages/implementations (there is more than one): http://regex.info/book.html.
Highly recommended for the content and amazingly easy to read, especially when you consider it can be a very dry subject.
 
1 members found this post helpful.
Old 03-31-2020, 04:13 AM   #13
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,517

Rep: Reputation: 2377
Ah, if C. Northcote Parkinson (of Parkinson's Law) had lived to see Unix, there would certainly be a Law about Regexes.
 
Old 03-31-2020, 04:22 AM   #14
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,386

Original Poster
Blog Entries: 3

Rep: Reputation: 3776
I'm in the process of mining some more source:

Code:
mkdir -p /tmp/Src/
cd /tmp/Src/

# Fetch the source of every package that depends on a python3* package,
# skipping packages whose own names start with "python".
apt-cache search '[a-z]' \
 | awk '{print $1}' \
 | xargs apt-cache dotty \
 | awk '$3~/^"python3/&&$1!~/^"python/{gsub(/"/,"",$1); print $1;}' \
 | sort -u \
 | xargs -I{} apt-get source {} 2>/dev/null

# Same again for packages that depend on perl*.
apt-cache search '[a-z]' \
 | awk '{print $1}' \
 | xargs apt-cache dotty \
 | awk '$3~/^"perl/&&$1!~/^"perl/{gsub(/"/,"",$1); print $1;}' \
 | sort -u \
 | xargs -I{} apt-get source {} 2>/dev/null

# Keep only the Python and Perl files.
find /tmp/Src/ -type f -not -name '*.py' -not -name '*.pl' -delete
It's a little harder than expected to just grab patterns from the source. I'm becoming a fan of the m// syntax in Perl, since the m helps spot the pattern. That is because some, in Perl, use unusual delimiters and are thus very hard to identify except by context. Others, in both Perl and Python, are spread out over multiple lines. That might be addressable, but it is not yet entertaining enough to pursue.

Edit: Note the patterns in this particular post are not interesting. They are a means to an end and I want to remember them for later. Instead, the ones which they dig out of the mass of source code are what is interesting.

Code:
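# Pass each file name to sh as $1 rather than splicing {} into the script,
# so paths containing spaces or quotes are handled safely.
time find /tmp/Src/ -type f -name '*.py' \
	-exec grep -q -m 1 -E 'import\s+re' {} \; \
	-exec sh -c 'grep -h -w -E "re\.(sub|match)" "$1" \
	| sed -r -e "s/^[^#]*re\.(sub|match)/re.\1/g;"' sh {} \; \
	| sort -u \
	> /tmp/python-patterns.txt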
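
For the Python sources, the patterns spread over several lines could probably be picked up by parsing the files instead of grepping them. The following is only a sketch under a few assumptions: regex_literals is a made-up helper, it only sees calls written literally as re.sub(...) or re.match(...) with a string literal as the first argument, and files that are not valid Python 3 would raise SyntaxError in ast.parse and need to be skipped.

```python
import ast

def regex_literals(source, functions=("sub", "match")):
    """Yield (function name, pattern) for calls such as re.sub("...", ...)."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id == "re"
                and func.attr in functions
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            yield func.attr, node.args[0].value

# Sample source with one call split over lines and one pattern built from
# adjacent string literals; the parser joins those into a single constant.
sample = r'''
import re

cleaned = re.sub(
    r"\s+",
    " ",
    text,
)
m = re.match(r"(\d+)-"
             r"(\d+)", line)
'''

for name, pattern in regex_literals(sample):
    print(name, repr(pattern))
```

The source is only parsed, never executed, so undefined names in the scanned files do no harm.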

Last edited by Turbocapitalist; 03-31-2020 at 08:17 AM. Reason: grep -> awk
 
Old 03-31-2020, 04:41 AM   #15
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 22,129

Rep: Reputation: 7374
Just an off-topic comment: "interesting" is just a personal preference, so I do not think there can be a general answer to this question.
Another one: personally, I do not really find the posted regex strings useful. Someone uses a web server, another one an SQL engine ...
And a final one: in our company and in my daily work I (we) use regex very often, but you will not find them interesting.
 
  

