LinuxQuestions.org
Old 07-17-2017, 08:06 AM   #1
FranekW
Member
 
Registered: Apr 2017
Distribution: Manjaro and Ubuntu
Posts: 30

Rep: Reputation: Disabled
Complex regular expression with look arounds


Hello,

I have to implement a complex regular expression to extract codes from a huge text. I have worked through one tutorial, which helped me understand a thing or two. I would say I understand how a regex engine handles a simple pattern, repetition, backtracking etc. I know the tokens in a pattern are compared against the string. If the pattern fails at the first character, the engine moves on to the second character and starts again with the first token. This is repeated until the pattern matches or the string is exhausted.

However, I find lookarounds quite confusing. I would be grateful if someone could provide an explanation and perhaps an example. Tutorials are good, but not everything is clear! I do know that lookarounds work like assertions, though.

Let's say I have an arbitrarily complex pattern to find, with optional items and repetitions:

Code:
(?=pattern1+)(?pattern2*)[pattern-to-extract1](?=pattern3)(?=patternN)[pattern-to-extract2]
etc.

How is it handled by the regex engine?
  • Does it treat everything like AND or like OR? When the first assertion fails, does it proceed to the next one? E.g., something similar to C++ if A && B && C, where B is skipped when A fails.
  • If one of the assertions is true, does the engine check the remaining expressions as well?
  • Zero-length assertions: I think this is the most confusing part. When the first lookahead is checked, does the position in the string move? What happens to the current token in the pattern?
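
On the three questions above, a minimal sketch using Python's standard re module may help (the sample string "abc123" and the patterns are made up for illustration). Consecutive lookarounds behave like a short-circuit AND tested at the same position: each one must succeed, none of them consumes characters, and if any fails the engine abandons that start position and retries one character further along.

```python
import re

text = "abc123"  # made-up sample string

# A lookahead is zero-length: it succeeds here but consumes nothing,
# so the match starts and ends at position 0.
peek = re.match(r"(?=abc)", text)
zero_width_span = peek.span()

# Two lookaheads in a row are both tested at the same position (like A && B),
# and only then does \w+ start consuming characters from that position.
both = re.match(r"(?=[a-z]+\d)(?=\w{6})\w+", text)
both_result = (both.group(), both.span())

# If the first assertion fails, the attempt at this position fails as a whole;
# re.match only tries position 0, so there is no match at all.
anchored = re.match(r"(?=\d)(?=\w{6})\w+", text)

# re.search retries at positions 1, 2, 3, ... until the assertion holds.
retried = re.search(r"(?=\d)\w+", text)
retried_result = (retried.group(), retried.span())

print(zero_width_span)   # (0, 0)
print(both_result)       # ('abc123', (0, 6))
print(anchored)          # None
print(retried_result)    # ('123', (3, 6))
```

So the assertions are neither skipped nor ORed: all of them have to hold at the current position before the consuming part of the pattern gets a chance.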
 
Old 07-17-2017, 08:29 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,804

Rep: Reputation: 7306
This will probably help: http://www.regular-expressions.info/lookaround.html
In general: 1. you need to keep it simple; 2. the whole regexp must match.
And finally, you ought to show some examples of what you found confusing.
 
Old 07-17-2017, 11:36 AM   #3
FranekW
Member
 
Registered: Apr 2017
Distribution: Manjaro and Ubuntu
Posts: 30

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by pan64 View Post
This will probably help: http://www.regular-expressions.info/lookaround.html
In general: 1. you need to keep it simple; 2. the whole regexp must match.
And finally, you ought to show some examples of what you found confusing.
Regarding 2: to me that means every lookaround group must succeed as an assertion.

I have this regular expression:

Code:
(?=[a-z0-9\/]*\d)(?<=[^:.-]\b)[\w\/]+(?=\b[^:.-])
and I can't figure out how exactly it works. (link to RE Online).

The middle part, "[\w\/]+", where the string is actually captured (the part that is not a zero-length assertion), is not a problem at all. As I understand it, the first group "(?=[a-z0-9\/]*\d)" searches the line for a substring that starts with any number of characters from the class, followed by a digit.

The confusing part is the lookbehind and the second lookahead with "\b[^:.-]". I can't understand what is happening there. This is quite funny, because the expression gives exactly what I want but I can't figure out why it works. I got help from the SO community: the first version of this regex was suggested by someone else, so it is not entirely mine. It did not capture two-digit codes, so I changed that. I found the current pattern by trial and error -.-

Thanks
 
Old 07-18-2017, 01:57 AM   #4
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,804

Rep: Reputation: 7306
You have a positive lookahead and a positive lookbehind; based on the link I posted, that means they must match too, but what they match will not be part of the result.
So, splitting this expression into 3 parts:
Code:
(?=[a-z0-9\/]*\d)(?<=[^:.-]\b)
or, if you allow me to swap those two (the order does not matter, since neither of them consumes any characters):
Code:
(?<=[^:.-]\b)(?=[a-z0-9\/]*\d)
will look for (zero-length) locations preceded by [^:.-]\b and followed by [a-z0-9\/]*\d

The next part is easy, it is your [\w\/]+

Finally you are looking for a match which is followed by \b[^:.-]
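
The three-part split can be checked directly in Python's re module. This is only a sketch on a made-up sample string (not the poster's data): it lists the zero-length positions accepted by the two leading lookarounds, then runs the full expression in both orders.

```python
import re

sample = "see ab12 and x:cd34 end"  # made-up text, not the poster's data

# Parts 1 and 2 on their own: zero-length positions that are preceded by
# [^:.-]\b and followed by [a-z0-9\/]*\d.
start_ok = re.compile(r"(?<=[^:.-]\b)(?=[a-z0-9\/]*\d)")
positions = [m.start() for m in start_ok.finditer(sample)]

# The full expression, in the original order and with the two leading
# lookarounds swapped.
original = re.compile(r"(?=[a-z0-9\/]*\d)(?<=[^:.-]\b)[\w\/]+(?=\b[^:.-])")
swapped = re.compile(r"(?<=[^:.-]\b)(?=[a-z0-9\/]*\d)[\w\/]+(?=\b[^:.-])")

original_hits = original.findall(sample)
swapped_hits = swapped.findall(sample)

print(positions)       # [4]  (the position just before "ab12")
print(original_hits)   # ['ab12']
print(swapped_hits)    # ['ab12']
```

In this sample "cd34" is rejected because the character before it is ":", which the lookbehind excludes. Note also that the final lookahead needs one more character after the match, so a code at the very end of the text would not be found.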

If I understand correctly, what the first lookahead [a-z0-9\/]*\d matches is actually part of the string you are looking for.

I still do not really understand what you are looking for, but I think there should be a not-so-difficult regexp for it.
 
Old 07-18-2017, 02:31 AM   #5
FranekW
Member
 
Registered: Apr 2017
Distribution: Manjaro and Ubuntu
Posts: 30

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by pan64 View Post
If I understand correctly, what the first lookahead [a-z0-9\/]*\d matches is actually part of the string you are looking for.
That's correct. This extracts parts of a string I am interested in.

Quote:
Originally Posted by pan64 View Post
I still do not really understand what you are looking for, but I think there should be a not-so-difficult regexp for it.
I know, but I could not share the real text here, so I decided to provide only a representative example.

I read through the tutorial you linked to and it's been very helpful, but I also found another one, Mastering Lookahead and Lookbehind and Reducing (? … ) Syntax Confusion, which is really good complementary information with multiple examples. I found out that in Python, which I am using for this, we can install an additional regular expression module, regex, which is much better than Python's standard re.
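
One concrete difference that matters for lookarounds, sketched with the standard library only: re requires the pattern inside a lookbehind to have a fixed width and rejects anything else at compile time, whereas the third-party regex module documents support for variable-length lookbehind. The patterns and the string "ab42" below are made up for illustration.

```python
import re

# Fixed-width lookbehind: accepted by the standard re module.
fixed = re.compile(r"(?<=ab)\d+")
fixed_hit = fixed.search("ab42").group()

# Variable-width lookbehind (a+ can be any length): re refuses to compile it.
try:
    re.compile(r"(?<=a+)\d+")
    variable_ok = True
except re.error as exc:
    variable_ok = False
    print("re rejected the pattern:", exc)

print(fixed_hit)      # 42
print(variable_ok)    # False
```

The lookbehind used in this thread, (?<=[^:.-]\b), is one character wide plus a zero-length \b, so it compiles with re as well.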
 
Old 07-18-2017, 05:02 AM   #6
josephj
Member
 
Registered: Nov 2007
Location: Northeastern USA
Distribution: kubuntu
Posts: 214

Rep: Reputation: 112
When you can no longer tell the difference between your regex and line noise

While regular expressions, especially the type you are using, are extremely powerful (and I don't understand the fancy stuff you're doing - I just know back references), sometimes, it's much quicker to write and debug in awk.

awk is fast (usually not as fast as sed), a whole lot more human-readable, and you can do things in steps instead of having to get the whole process down in one swallow. It's easy to add print statements for debugging to make sure each part of what you are doing is performing as expected.

You can also add comments for each separate part, making it easier to understand. (Of course, you can do that for a regular expression in your code, but it's a bit harder to do so clearly and readably.)
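
For the regex side of that comparison, Python's re.VERBOSE flag allows whitespace and # comments inside a pattern, which goes some way towards per-part comments. A sketch using the expression from earlier in the thread, run on a made-up sample string:

```python
import re

code_pattern = re.compile(
    r"""
    (?=[a-z0-9\/]*\d)   # ahead: lowercase letters, digits or slashes, then a digit
    (?<=[^:.-]\b)       # behind: a character other than : . - and a word boundary here
    [\w\/]+             # the text that is actually extracted
    (?=\b[^:.-])        # ahead: word boundary, then a character other than : . -
    """,
    re.VERBOSE,
)

sample = "see ab12 and x:cd34 end"  # made-up text for illustration
codes = code_pattern.findall(sample)
print(codes)  # ['ab12']
```

The pattern is still evaluated in one go, so this helps readability rather than step-by-step debugging.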

If you are having any difficulty understanding it now, that is likely to recur when you come back to it after time. And if someone else has to maintain your code, it may take them even longer because they also have to understand what you were trying to do in addition to how you did it.

When a regex fails, it just fails, usually giving you no clue as to why it didn't do what you intended. Many typographic errors in a regular expression, which may drastically alter its behavior, are almost invisible. Complex regular expressions are sometimes referred to as write-only code.

I'm not against regular expressions. I use them all the time, but when they start to get complex, I try to find another way to solve the problem.
 
1 members found this post helpful.
Old 07-18-2017, 05:23 AM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
Isn't this why Larry insisted Perl 6 was needed - ok, one of the reasons.
Rules (in 6) look useful, but I've never gotten around to putting any time into Perl 6 I must admit.
 
Old 07-18-2017, 05:44 AM   #8
FranekW
Member
 
Registered: Apr 2017
Distribution: Manjaro and Ubuntu
Posts: 30

Original Poster
Rep: Reputation: Disabled
@josephj
Thanks for your comments. They are very useful, and I agree about the difficulty of reading code after some time.

I hardly know awk apart from one or two tries. It looks like it works like a regular expression engine? I started with Linux a couple of months ago and have come across "awk" a couple of times. I was planning to start learning it at some point, but that "point" keeps getting pushed further and further back due to work. I would be extremely happy to have anything that can debug regex patterns. If awk can do that, I am definitely going to give it a go ASAP.

Thanks.
 
Old 07-18-2017, 06:26 AM   #9
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,804

Rep: Reputation: 7306
No, awk is not able to debug a regexp. Also, awk's regular expressions are not fully compatible with PCRE, just similar.
As far as I know, awk cannot handle lookaround-type regexps at all. But you can most probably implement your query in awk quite easily without them.
 
1 members found this post helpful.
Old 07-19-2017, 02:05 AM   #10
josephj
Member
 
Registered: Nov 2007
Location: Northeastern USA
Distribution: kubuntu
Posts: 214

Rep: Reputation: 112
As @pan64 says, awk isn't another way to do regular expressions (although it does support a dialect of them).

awk has an implied loop such that (by default - there are a bunch of ways to modify this) the whole program is applied to each succeeding line of input.

The following is more of a hypothetical scenario than a general description of awk. awk is a full programming language which is geared toward text/string manipulation. This makes it ideal for filtering/checking/transforming input in various ways.

You can have a clause that says, "does this line have (part of) what I'm looking for in it". If it doesn't you can just output the line as is or skip to the next line.

A subsequent clause could start with the assumption that the first clause already succeeded - or you wouldn't have gotten here.
For the purposes of this discussion, this is the nice part: your first clause solved one small piece of the problem and now your second clause can solve another small piece, thus breaking a complex problem into smaller, more manageable and debuggable sub-tasks.

Here, you don't need forward and back references because you can just find things and store them in simple variables and examine them at will.
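
The same divide-and-check idea can be sketched in Python, the language the original poster is using, rather than awk. This is only an approximation of the lookaround expression from earlier in the thread, not an exact equivalent; extract_codes, TOKEN, FORBIDDEN and the sample line are names and data invented for illustration.

```python
import re

TOKEN = re.compile(r"[\w/]+")   # candidate pieces: word characters and slashes
FORBIDDEN = set(":.-")          # neighbours that disqualify a candidate

def extract_codes(line):
    """Return candidates that contain a digit and have allowed neighbours."""
    codes = []
    for m in TOKEN.finditer(line):
        token = m.group()
        # Step 1: a code must contain at least one digit.
        if not any(ch.isdigit() for ch in token):
            continue
        # Step 2: there must be a character on each side of the candidate.
        if m.start() == 0 or m.end() == len(line):
            continue
        # Step 3: neither neighbour may be one of : . -
        before, after = line[m.start() - 1], line[m.end()]
        if before in FORBIDDEN or after in FORBIDDEN:
            continue
        codes.append(token)
    return codes

print(extract_codes("see ab12 and x:cd34 end"))  # ['ab12']
```

Each step can be given its own print statement while debugging, and a candidate that is dropped can be traced to the exact check that rejected it.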

Because awk is a line-at-a-time language, you have to do a bit more work if the thing you're working with spans more than one line, but it's just a matter of saving a few more things in variables.

You don't need it here, but one of my favorite things about awk is its associative arrays. (It might have been the first language to offer these. [ducking and covering]) You can do things like my_table["grapes"] = 10 . You can put in whatever you want and get them back out in various ways.

The main point of all of this rambling is that, IMHO, awk programming is easy and quick compared to many other languages.

It's not intended for writing huge programs like compilers (although I believe it's been done), but for short to medium programs for processing text, it's hard to beat.
 
1 members found this post helpful.
Old 07-19-2017, 02:14 PM   #11
FranekW
Member
 
Registered: Apr 2017
Distribution: Manjaro and Ubuntu
Posts: 30

Original Poster
Rep: Reputation: Disabled
I understand now. Thanks for clarifying this for me

As soon as I have a little more time, I am going to try awk.
 
  


All times are GMT -5.