Complex regular expression with look arounds

FranekW · 07-17-2017, 08:06 AM

Hello,

I have to implement a complex regular expression to extract codes from a huge text. I have worked through one tutorial, which helped me understand a thing or two. I would say I understand how a regular expression engine handles a simple pattern, repetition, backtracking, etc. I know the tokens in a pattern are compared with the string. If the pattern fails at the first character, the engine moves on to the second character and starts again with the first token. This is repeated until the string or the pattern is exhausted.

However, I find lookarounds quite confusing. I would be grateful if someone could provide an explanation and perhaps an example. Tutorials are good, but not everything is clear! I do know that lookarounds work like assertions, though.

Let's say I have an arbitrarily complex pattern to find, with optional items and repetitions:

Code:

(?=pattern1+)(?pattern2*)[pattern-to-extract1](?=pattern3)(?=patternN)[pattern-to-extract2]

etc.

How is it handled by the regular expression engine?

Does it treat everything like AND or OR? When the first assertion fails, does it proceed to the next one? E.g., something similar to C++ if A && B && C, where B is skipped when A fails.
If one of the assertions is true, does the engine check the remaining expressions as well?
Zero-length assertions: I think this is the most confusing part. Also, when the first lookahead is checked, does it move the position? What happens to the current token and the position in the string?
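
A minimal Python sketch of the questions above, using the standard re module. The pattern and the sample strings are made up for illustration and are not from the real text. It suggests that consecutive lookaheads behave like AND: they are all tested at the same position, none of them consumes characters, and a backtracking engine gives up on that position as soon as one of them fails.

```python
import re

# Hypothetical pattern: a word that contains both a digit and a lowercase letter.
# Both lookaheads are tested at the same starting position (logical AND).
BOTH = re.compile(r"(?=\w*\d)(?=\w*[a-z])\w+")

def first_match(text):
    """Return the first match of BOTH in text, or None."""
    m = BOTH.search(text)
    return m.group(0) if m else None

# A lookahead is zero-length: it matches without moving the position.
probe = re.match(r"(?=abc)", "abc")
print(probe.span())                    # (0, 0)

print(first_match("abc 123 a1b2"))     # a1b2
print(first_match("abc 123"))          # None
```

In the first string, "abc" fails the digit assertion and "123" fails the letter assertion, so only "a1b2" satisfies both.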

pan64 · 07-17-2017, 08:29 AM

probably this helps: http://www.regular-expressions.info/lookaround.html
in general 1: you need to keep it simple, 2: the whole regexp must match.
And finally, you ought to show some examples of what you found confusing.

FranekW · 07-17-2017, 11:36 AM

Quote:

Originally Posted by pan64

probably this helps: http://www.regular-expressions.info/lookaround.html
in general 1: you need to keep it simple, 2: the whole regexp must match.
And finally, you ought to show some examples of what you found confusing.

Re 2: to me this means that each lookaround group must yield a true assertion.

I have this regular expression:

Code:

(?=[a-z0-9\/]*\d)(?<=[^:.-]\b)[\w\/]+(?=\b[^:.-])

and I can't figure out how exactly it works. (link to RE Online).

The middle part, "[\w\/]+", where the string is actually captured (the non-zero-length part), is not a problem at all. The first group, "(?=[a-z0-9\/]*\d)", searches the entire line for substrings that start with any characters from the class, followed by a digit.

The confusing part is the lookbehind and the second lookahead with "\b[^:.-]". I can't understand what is happening there. This is quite funny, because it gives exactly what I want but I can't figure out why it works. I got help from the SO community. The first version of this regular expression was suggested by someone else, so it is not entirely mine. It was not capturing two-digit codes, so I changed that. I found the pattern by a sort of trial and error -.-

Thanks

pan64 · 07-18-2017, 01:57 AM

you have positive lookahead and lookbehind, and based on the link I posted that means they should match too, but they will not be part of the result.
So splitting this expression into 3 parts:

Code:

(?=[a-z0-9\/]*\d)(?<=[^:.-]\b)

or if you allow me to exchange those:

Code:

(?<=[^:.-]\b)(?=[a-z0-9\/]*\d)

will look for (zero-length) locations preceded by [^:.-]\b and followed by [a-z0-9\/]*\d

The next part is easy, it is your [\w\/]+

Finally you are looking for a match which is followed by \b[^:.-]

If I understand it well the first lookahead [a-z0-9\/]*\d is actually part of the string you are looking for.

I still do not really understand what are you looking for, but I think there should be a not-so-difficult regexp.
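
To see the three parts working together, here is a small Python sketch that runs the expression from this thread with the standard re module. The sample lines are invented (the real text was not posted), and the helper name find_codes is only for illustration.

```python
import re

# The expression from the thread, unchanged.
CODE = re.compile(r"(?=[a-z0-9\/]*\d)(?<=[^:.-]\b)[\w\/]+(?=\b[^:.-])")

def find_codes(text):
    """Return all non-overlapping matches of CODE in text."""
    return CODE.findall(text)

# Hypothetical sample line, not the real data.
sample = "id: ab12 and x/7, then 99 end"
print(find_codes(sample))                  # ['ab12', 'x/7', '99']

# 'k9' is followed by ':' and 'z8' is preceded by '-', so both are rejected.
print(find_codes("see k9: -z8 q5 ok"))     # ['q5']
```

Note that both the lookbehind and the final lookahead need an actual character to test, so with this expression a code at the very start or very end of the text would not match.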

FranekW · 07-18-2017, 02:31 AM

Quote:

Originally Posted by pan64

If I understand it well the first lookahead [a-z0-9\/]*\d is actually part of the string you are looking for.

That's correct. This extracts parts of a string I am interested in.

Quote:

Originally Posted by pan64

I still do not really understand what are you looking for, but I think there should be a not-so-difficult regexp.

I know, but I could not share the real text here and decided to provide only a representative example.

I read through the tutorial you linked to and it's been very helpful, but I also found another one, Mastering Lookahead and Lookbehind and Reducing (? … ) Syntax Confusion, which is really good complementary information with multiple examples. I found out that in Python, which I am using for this, we can install an additional regular expression module, regex, which is much better than Python's standard re.

josephj · 07-18-2017, 05:02 AM

While regular expressions, especially the type you are using, are extremely powerful (and I don't understand the fancy stuff you're doing - I just know back references), sometimes, it's much quicker to write and debug in awk.

awk is fast (usually not as fast as sed), a whole lot more human-readable, and you can do things in steps instead of having to get the whole process down in one swallow. It's easy to add print statements for debugging to make sure each part of what you are doing is performing as expected.

You can also add comments for each separate part, making it easier to understand. (Of course, you can do that for a regular expression in your code, but it's a bit harder to do so clearly and readably.)

If you are having any difficulty understanding it now, that is likely to recur when you come back to it after time. And if someone else has to maintain your code, it may take them even longer because they also have to understand what you were trying to do in addition to how you did it.

When a regex fails, it just fails, usually giving you no clue as to why it didn't do what you intended. Many typographic errors in a regular expression, which may drastically alter its behavior, are almost invisible. Sometimes regular expressions are referred to as write-only code.

I'm not against regular expressions. I use them all the time, but when they start to get complex, I try to find another way to solve the problem.

syg00 · 07-18-2017, 05:23 AM

Isn't this why Larry insisted Perl 6 was needed - ok, one of the reasons.
Rules (in 6) look useful, but I've never gotten around to putting any time into Perl 6 I must admit.

FranekW · 07-18-2017, 05:44 AM

@josephj
Thanks for your comments. They're very useful, and I agree about the difficulty of reading code after some time.

I hardly know awk apart from one or two tries. It looks like it works like a regular expression engine? I started with Linux a couple of months ago and have come across "awk" a couple of times. I was planning to start learning it at some point. That "point" keeps getting pushed further and further back due to work. I would be extremely happy to have anything that can debug regular expression patterns. If awk can do that, I am definitely going to give it a go ASAP.

Thanks.

pan64 · 07-18-2017, 06:26 AM

no, awk is not able to debug regexp. Also awk is not fully compatible with PCRE, just similar.
As far as I know awk cannot handle lookaround type regexps at all. But most probably you can implement your query in awk quite easily, without that.

josephj · 07-19-2017, 02:05 AM

As @pan64 says, awk isn't another way to do regular expressions (although it does support a dialect of them).

awk has an implied loop such that (by default - there are a bunch of ways to modify this) the whole program is applied to each succeeding line of input.

The following is more of a hypothetical scenario than a general description of awk. awk is a full programming language which is geared toward text/string manipulation. This makes it ideal for filtering/checking/transforming input in various ways.

You can have a clause that says, "does this line have (part of) what I'm looking for in it". If it doesn't you can just output the line as is or skip to the next line.

A subsequent clause could start with the assumption that the first clause already succeeded - or you wouldn't have gotten here.
For the purposes of this discussion, this is the nice part: your first clause solved one small piece of the problem, and now your second clause can solve another small piece, thus breaking a complex problem into smaller, more manageable and debuggable sub-tasks.

Here, you don't need forward and back references because you can just find things and store them in simple variables and examine them at will.
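
The same step-by-step idea can be sketched in Python (used here only because the regex in this thread is run from Python; this is not awk). The helper name extract_codes and the sample line are hypothetical, and the checks only approximate the lookaround expression posted earlier: unlike that expression, this version accepts tokens at the very start or end of a line and any token that contains a digit.

```python
import re

TOKEN = re.compile(r"[\w/]+")   # candidate tokens: word characters and slashes
EXCLUDED = set(":.-")           # neighbours that disqualify a token

def extract_codes(line):
    """Collect codes from one line in small, separately checkable steps."""
    codes = []
    for m in TOKEN.finditer(line):
        token = m.group(0)
        # Step 1: keep only tokens that contain at least one digit.
        if not any(ch.isdigit() for ch in token):
            continue
        # Step 2: store the neighbouring characters in plain variables.
        before = line[m.start() - 1] if m.start() > 0 else ""
        after = line[m.end()] if m.end() < len(line) else ""
        # Step 3: reject tokens touching ':', '.' or '-'.
        if before in EXCLUDED or after in EXCLUDED:
            continue
        codes.append(token)
    return codes

print(extract_codes("id: ab12 and x/7, then 99 end"))   # ['ab12', 'x/7', '99']
```

Each step can get its own print statement while debugging, which is hard to do inside a single regular expression.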

Because awk is a line-at-a-time language, you have to do a bit more work if the thing you're working with spans more than one line, but it's just a matter of saving a few more things in variables.

You don't need it here, but one of my favorite things about awk is its associative arrays. (It might have been the first language to offer these. [ducking and covering]) You can do things like my_table["grapes"] = 10 . You can put in whatever you want and get them back out in various ways.

The main point of all of this rambling is that, IMHO, awk programming is easy and quick compared to many other languages.

It's not intended for writing huge programs like compilers (although I believe it's been done), but for short to medium programs for processing text, it's hard to beat.

FranekW · 07-19-2017, 02:14 PM

I understand now. Thanks for clarifying this for me.

As soon as I have a little more time, I am going to try awk.