LinuxQuestions.org


donnied 03-01-2009 05:02 PM

Python scriptutil for patterns that don't match
 
I was trying to think of a way to use scriptutil.freplace to delete lines that don't match a pattern. For the moment I've had to settle for a short if/then statement. However, I feel I would be a lot better off if I knew how to do a 'does not match' search.
I'm having a small problem with parentheses and brackets.
I have: ([0-9]{6}),([0-9]{4}[A-Z]{0,2}),([0-9]{1,3}),(.*?),(.*?),([0-9]{1,3}),(.*?),(.*)
I would like to eliminate lines, for example those that don't have all eight comma-delimited fields.
What about something like
[^(.*?,)(.*?,)(.*?,)(.*?,)(.*?,)(.*?,)(.*?,)(.*?,)(.*?)] ?
Or how would I include the fields I've already specified:
([0-9]{6}),([0-9]{4}[A-Z]{0,2}),([0-9]{1,3}) etc.?
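
A character class such as [^...] negates single characters, not whole patterns, so it cannot express "a line that does not match". One common alternative in plain Python is to keep only the lines where the full pattern does match. The following is a minimal sketch using the standard re module rather than scriptutil; the helper name keep_matching and the sample lines are made up for illustration.

```python
import re

# The eight-field pattern from the post, anchored so the whole line must match.
FIELDS = re.compile(
    r'^([0-9]{6}),([0-9]{4}[A-Z]{0,2}),([0-9]{1,3}),'
    r'(.*?),(.*?),([0-9]{1,3}),(.*?),(.*)$'
)

def keep_matching(lines):
    """Return only the lines that match FIELDS (hypothetical helper)."""
    return [line for line in lines if FIELDS.match(line)]

# Invented sample data, not taken from the original thread.
sample = [
    "123456,1234AB,12,foo,bar,7,baz,qux",   # eight fields, well formed
    "123456,1234AB,12,foo,bar",             # too few fields
    "123456,1234AB,12,foo,bar,x,baz,qux",   # sixth field is not a number
]
kept = keep_matching(sample)
print(kept)
```

Because . also matches commas, the (.*?) groups can stretch across several fields when a line has extra commas; replacing them with ([^,]*) would be stricter.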

ghostdog74 03-03-2009 02:19 AM

You don't have to make it that complicated if you are using Python. Show samples of the text you want to match, and describe more clearly what you want to get.

donnied 03-03-2009 06:29 PM

Here I'm thinking specifically that I don't want lines that don't have eight different fields separated by commas, or possibly lines where the sixth field is not a number.
I think knowing how to state [^foo] could be helpful.

I'm also curious why, if I specify seven fields
(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?) and replace with \1\2\3\4\5\6, the 7th field is tacked on. To delete the 7th field I used the back references to insert foo between \6 and \7 and deleted what came after foo. This seems a bit unnecessary, and I don't remember scriptutil always behaving this way.
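
A likely explanation, assuming scriptutil.freplace substitutes the way re.sub does (an assumption, not checked against scriptutil itself): a trailing lazy group with nothing after it matches the empty string, so the match stops right after the sixth comma and the seventh field is left in place untouched. Anchoring the pattern with $ forces the last group to consume the rest of the line. A small sketch with an invented sample line:

```python
import re

line = "a,b,c,d,e,f,g"  # invented sample line

# Unanchored: the last lazy group matches the empty string, so the match
# ends right after the sixth comma and "g" is never part of it.
loose = r'(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?)'
loose_result = re.sub(loose, r'\1\2\3\4\5\6', line)

# Anchored with $: the seventh group must now reach the end of the line,
# so the whole line is replaced and the seventh field is dropped.
anchored = r'^(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?)$'
anchored_result = re.sub(anchored, r'\1\2\3\4\5\6', line)

print(loose_result)     # abcdefg
print(anchored_result)  # abcdef
```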

beiller 03-03-2009 09:03 PM

Hey There
 
Dunno why you would use (.*?),(.*?)

* and ? are both modifiers. (.*),(.*) is more accurate, as the * is the Kleene star ;) It means 0 or more, which kind of implies it's optional (?). Correct me if I am wrong...

yassen 03-04-2009 04:10 AM

donnied:
Here's some advice (not a direct reply to your question, but you might find this VERY useful, as I did):

Download this regexp editor-tester app (QuickREx):

http://sourceforge.net/projects/quickrex/

If you do not have/use Eclipse, download the stand-alone application. You need to have Java installed to get that running, and possibly the Java ./bin/ directory added to your PATH.

If you get it running, there comes the fun part: paste some test lines of your input data and write a regular expression to the corresponding text field; it will immediately show you in real time if it matches, which are the groups, etc. Works great for me, and the "JDK regexp" seems to completely match the Python regexp behavior.

Hope this will help you,
Cheers!
yassen

yassen 03-04-2009 04:35 AM

And also, how about skipping ('continue' in the loop) lines that have line.count(',') != 8?
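
A minimal sketch of this count-based filter in plain Python. It assumes that eight fields are separated by seven commas, so the count it compares against is 7; adjust the number if your lines end with a trailing comma. The helper names and sample lines are invented for illustration.

```python
def has_eight_fields(line, sep=","):
    """True if the line has exactly seven separators, i.e. eight fields."""
    return line.count(sep) == 7

def sixth_is_number(line, sep=","):
    """True if the line has eight fields and the sixth one is all digits."""
    fields = line.rstrip("\n").split(sep)
    return len(fields) == 8 and fields[5].isdigit()

# Invented sample data.
lines = [
    "123456,1234AB,12,foo,bar,7,baz,qux\n",
    "123456,1234AB,12,foo,bar\n",
    "123456,1234AB,12,foo,bar,x,baz,qux\n",
]

good = []
for line in lines:
    if not has_eight_fields(line):
        continue  # skip lines with the wrong number of fields
    if not sixth_is_number(line):
        continue  # skip lines whose sixth field is not numeric
    good.append(line)

print(good)
```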

donnied 03-04-2009 05:52 PM

Quote:

Originally Posted by beiller (Post 3464030)
Dunno why you would use (.*?),(.*?)

* and ? are both modifiers. (.*),(.*) is more accurate, as the * is the Kleene star ;) It means 0 or more, which kind of implies it's optional (?). Correct me if I am wrong...

It's my understanding that the '?' makes the match 'non-greedy', so it limits itself to the first occurrence of a pattern. If there is a comma, it will stop at the first instance and not go beyond, whereas .* could include anything that goes up to a comma (including other commas).
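
The difference can be seen with Python's re module on a short, invented string: the greedy group takes as much as it can and stops at the last comma, while the lazy group stops at the first one.

```python
import re

text = "a,b,c"  # invented sample

# Greedy: the first group grabs as much as possible, up to the last comma.
greedy = re.match(r'(.*),(.*)', text).groups()

# Non-greedy: the first group grabs as little as possible, up to the first comma.
lazy = re.match(r'(.*?),(.*)', text).groups()

print(greedy)  # ('a,b', 'c')
print(lazy)    # ('a', 'b,c')
```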

donnied 03-04-2009 05:53 PM

Quote:

Originally Posted by yassen (Post 3464362)
And also, how about skipping ('continue' in the loop) lines that have line.count(',') != 8?

Yeah, that's pretty much what I did. I was hoping to skip the 'if,then' loop.

beiller 03-05-2009 12:06 PM

Match all 8 comma separated fields
 
Yes, I realized that ? is non-greedy. Doh.

Maybe ^.*?(,.*?){7}$
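
One caveat with this pattern, checked here with Python's re module: because . also matches commas, it accepts any line with seven or more commas, not exactly seven. A variant that uses [^,]* in place of .*? only accepts exactly eight fields. This is a sketch with invented sample lines; the variable names are made up.

```python
import re

at_least = re.compile(r'^.*?(,.*?){7}$')      # pattern from the post
exactly = re.compile(r'^[^,]*(,[^,]*){7}$')   # stricter variant

seven = "a,b,c,d,e,f,g"      # seven fields
eight = "a,b,c,d,e,f,g,h"    # eight fields
nine = "a,b,c,d,e,f,g,h,i"   # nine fields

for name, line in [("seven", seven), ("eight", eight), ("nine", nine)]:
    print(name, bool(at_least.match(line)), bool(exactly.match(line)))
```

To drop the non-matching lines, keep only those for which exactly.match(line) returns a match object.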


All times are GMT -5.