LinuxQuestions.org
Old 11-18-2009, 03:15 AM   #1
Wim Sturkenboom
Senior Member
 
Registered: Jan 2005
Location: Roodepoort, South Africa
Distribution: Slackware 10.1/10.2/12, Ubuntu 12.04, Crunchbang Statler
Posts: 3,786

Regular Expressions


I'm trying to find a regular expression that can validate and parse a string that does not have a fixed number of fields.
Code:
Bruce Willis,Richard Gere
Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana
While typing this message I managed to get the validation working with the following RE:
Code:
^([A-Za-z ]+)([,]([A-Za-z ]+))*$
Unfortunately this RE does not parse properly and I don't know how to get that right.
The current result is (for the second string)
Code:
Total match:  'Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana'
First group:  'Pink Floyd'
Second group: ',Santana'
Third group:  'Santana'
What I want to get out (for the second string) is
Code:
Total match:  'Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana'
First group:  'Pink Floyd'
Second group: 'Deep Purple'
Third group:  'Uriah Heep'
Fourth group: 'Ten Years After'
Fifth group:  'Santana'
1) Is there any way to achieve this using regular expressions?

And another question. I use Tcl and Java (just started) and found out that the result of a regexp can differ significantly between the two.

Tcl returns 'Pink Floyd' for the match and Java returns 'Santana' when using ([A-Za-z ]+) as a regular expression on the second string.

Tcl code
Code:
set match [regexp $regexp $text matchstr group1 group2 group3 group4 group5 group6 group7 group8 group9 group10]
Java code
Code:
        Pattern p;
        Matcher m;
        try {
             p = Pattern.compile(regexp);
        }
        catch (PatternSyntaxException ePatternSyntaxException) {
            String Error = "" + ePatternSyntaxException;
            JOptionPane.showMessageDialog(null, Error, "Regular expression", JOptionPane.ERROR_MESSAGE);
            return;
        }

        m = p.matcher(text);
        int start = 0;
        while (m.find(start) == true)
        {
            resultTextArea.setText("Group cnt : " + Integer.toString(m.groupCount()) + "\n");
            for (int i=0; i<=m.groupCount(); i++) {
                if (m.group(i) != null) {
                    resultTextArea.append("Group " + Integer.toString(i) + " : '" + m.group(i) + "' (" + m.start(i) + "," + m.end(i) + ")\n");
                }
            }
            start = m.end();
            resultTextArea.append("----\n");
        }
2) Is this a coding issue in my code or a difference in the regex implementations of the two languages? (I'm aware that there are POSIX and Perl flavours.)
 
Old 11-18-2009, 03:31 AM   #2
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Code:
Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana
If you have structured data like that and all you ever want is the words separated by ",", use the fields/delimiter method, NOT a regular expression. Depending on the language you are using, there will be string splitting methods that split a string into tokens on a delimiter. Check your language documentation.
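For the Java side of the question, a minimal sketch of the split approach could look like the following. The class and method names are made up for illustration; `String.split` is the standard library call, and the `-1` limit is an assumption about what you want (it keeps trailing empty fields instead of silently dropping them).

```java
import java.util.Arrays;
import java.util.List;

class SplitFieldsSketch {
    // Split a comma-separated line into its fields.
    // The -1 limit keeps trailing empty fields, so "a,b," yields three entries.
    static List<String> splitFields(String line) {
        return Arrays.asList(line.split(",", -1));
    }

    public static void main(String[] args) {
        String text = "Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana";
        List<String> fields = splitFields(text);
        for (int i = 0; i < fields.size(); i++) {
            System.out.println("item " + (i + 1) + " is: " + fields.get(i));
        }
    }
}
```

Note that this only parses. It does not check that each field consists of letters and spaces, so validation would still be a separate step.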
 
Old 11-18-2009, 04:11 AM   #3
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: FreeBSD, Debian, Mint, Puppy
Posts: 3,283

oh my god isn't java UGLY!
ghostdog is right, use split which splits into a tcl list.


Code:
#!/usr/bin/env tclsh

set n 0

while { [gets stdin line] >= 0 }  {
    set n 0
    set list [ split $line ,]
    puts "Total match:$line"
    foreach name $list {
        incr n
        puts stdout "\titem $n is: $name"
    }
    puts ""
}
Code:
$ ./1.tcl < 1
Total match:Bruce Willis,Richard Gere
        item 1 is: Bruce Willis
        item 2 is: Richard Gere

Total match:Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana
        item 1 is: Pink Floyd
        item 2 is: Deep Purple
        item 3 is: Uriah Heep
        item 4 is: Ten Years After
        item 5 is: Santana
I like tcl

p.s. showing your age with that music

Last edited by bigearsbilly; 11-18-2009 at 04:12 AM.
 
Old 11-18-2009, 04:42 AM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 12,127

Yeah - the kids of today ...
 
Old 11-18-2009, 09:04 AM   #5
Wim Sturkenboom
Senior Member
 
Registered: Jan 2005
Location: Roodepoort, South Africa
Distribution: Slackware 10.1/10.2/12, Ubuntu 12.04, Crunchbang Statler
Posts: 3,786

Original Poster
Thanks for the replies. The regular expression implementations in Tcl and Java and possibly in other languages make it possible to parse the data into individual 'blobs'. So why not use it if it's possible? After all, I'm a lazy guy

@bigearsbilly
We (or at least I) know you like Tcl, and you're not the only one. Until now I have managed to write all my applications that needed a GUI in Tcl/Tk. Unfortunately I'm now forced to look at Java.

Last edited by Wim Sturkenboom; 11-18-2009 at 09:05 AM.
 
Old 11-18-2009, 09:25 AM   #6
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: FreeBSD, Debian, Mint, Puppy
Posts: 3,283

that's progress (haha)

at least you have a job I guess.
I ain't had anything since february
:-(
 
Old 11-18-2009, 09:55 AM   #7
Wim Sturkenboom
Senior Member
 
Registered: Jan 2005
Location: Roodepoort, South Africa
Distribution: Slackware 10.1/10.2/12, Ubuntu 12.04, Crunchbang Statler
Posts: 3,786

Original Poster
Sorry to hear about the job. I have one but can't be paid; living on my savings for the last 4 months.
 
Old 11-18-2009, 04:38 PM   #8
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,396

The Java API can't give a variable number of blobs from a single match; see "Groups and capturing" in the java.util.regex.Pattern documentation.

Quote:
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails.
Quote:
ghostdog is right, use split which splits into a tcl list.
Java has split also.
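To see the quoted rule in action, here is a small sketch that runs the original poster's pattern against the band list. The class and method names are invented for this example; the pattern is the one from post #1. However many times the starred group repeats, the pattern still has exactly three groups, and groups 2 and 3 hold only what the last repetition matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupSketch {
    // The validating pattern from post #1.
    static final Pattern PATTERN =
            Pattern.compile("^([A-Za-z ]+)([,]([A-Za-z ]+))*$");

    // Returns what group 3 holds after a full match, or null if the input
    // does not match or the repeated group never took part in the match.
    static String lastRepeatedField(String text) {
        Matcher m = PATTERN.matcher(text);
        return m.matches() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        Matcher m = PATTERN.matcher(
                "Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana");
        if (m.matches()) {
            // groupCount() is fixed by the pattern, not by the input.
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println("Group " + i + " : '" + m.group(i) + "'");
            }
        }
    }
}
```

The fields in between ('Deep Purple', 'Uriah Heep', 'Ten Years After') were matched along the way but are overwritten, which is why a single match cannot hand back all five.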
 
Old 11-18-2009, 07:54 PM   #9
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,242

1. regex variations; it is indeed true that different langs/tools often have slightly different regex engines. The best book that explains regexes and differences is here http://regex.info//
2. interesting music
3. BB you have my sympathy, had 8 mths out during the GFC
 
Old 11-19-2009, 12:30 AM   #10
Wim Sturkenboom
Senior Member
 
Registered: Jan 2005
Location: Roodepoort, South Africa
Distribution: Slackware 10.1/10.2/12, Ubuntu 12.04, Crunchbang Statler
Posts: 3,786

Original Poster
(Part of) the problem is my Java implementation: resultTextArea.setText clears the text area on every pass, which hides the data from previous iterations of the while loop. I finally figured that out when an 'incorrect' but valid regular expression made the program unresponsive (it ended up in an endless loop); I added a loop counter, and with that I still only saw the last result.

The revised code
Code:
        // clear textarea
        resultTextArea.setText(null);

        m = p.matcher(text);
        int start = 0;
        int loopcnt=1;
        while (m.find(start) == true)
        {
            resultTextArea.append("Loop : " + Integer.toString(loopcnt) + "\n");
            resultTextArea.append("Group cnt : " + Integer.toString(m.groupCount()) + "\n");
            for (int i=0; i<=m.groupCount(); i++) {
                if (m.group(i) != null) {
                    resultTextArea.append("Group " + Integer.toString(i) + " : '" + m.group(i) + "' (" + m.start(i) + "," + m.end(i) + ")\n");
                }
            }
            start = m.end();
            resultTextArea.append("----\n");
            loopcnt++;
            // safety net: stop after 1000 iterations so a bad pattern cannot loop forever
            if (loopcnt>1000) {
                resultTextArea.append("Aborting ... ");
                break;
            }
        }
        resultTextArea.append("DONE\n");
Using ([A-Za-z ]+),* as the regular expression now gives the following result for the bands:
Code:
Loop : 1
Group cnt : 1
Group 0 : 'Pink Floyd,' (0,11)
Group 1 : 'Pink Floyd' (0,10)
----
Loop : 2
Group cnt : 1
Group 0 : 'Deep Purple,' (11,23)
Group 1 : 'Deep Purple' (11,22)
----
Loop : 3
Group cnt : 1
Group 0 : 'Uriah Heep,' (23,34)
Group 1 : 'Uriah Heep' (23,33)
----
Loop : 4
Group cnt : 1
Group 0 : 'Ten Years After,' (34,50)
Group 1 : 'Ten Years After' (34,49)
----
Loop : 5
Group cnt : 1
Group 0 : 'Santana' (50,57)
Group 1 : 'Santana' (50,57)
----
DONE
Knowing that group 0 is always the actual match and group 1 (and higher) are the groups, I think that this issue is solvable in Java for my purposes.
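A more compact version of the same loop, as a sketch only: it collects group 1 of every match into a list instead of writing to a text area. The class and method names are invented here; the pattern is the one used above. It calls `find()` without a start index, which continues after the previous match, so the `start` bookkeeping is not needed. By contrast, `find(start)` resets the matcher each time, and if a pattern can match the empty string then `m.end()` equals `start` and the loop never advances, which may well be what caused the endless loop mentioned earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopSketch {
    // Collect group 1 of every successive match of ([A-Za-z ]+),*
    static List<String> parseFields(String text) {
        Pattern p = Pattern.compile("([A-Za-z ]+),*");
        Matcher m = p.matcher(text);
        List<String> fields = new ArrayList<>();
        // find() with no argument resumes after the previous match.
        while (m.find()) {
            fields.add(m.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "Pink Floyd,Deep Purple,Uriah Heep,Ten Years After,Santana";
        for (String field : parseFields(text)) {
            System.out.println("'" + field + "'");
        }
    }
}
```

Because `find()` simply skips over anything the pattern cannot match, this loop accepts input such as "Pink Floyd,123,Santana" and quietly returns two fields. It parses, but it does not validate.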

I'd like to thank everybody for their replies.

A possibly useful link: Regular Expression Playground

And the lesson learned: what you see is not what you get.
 
Old 11-19-2009, 01:21 AM   #11
Wim Sturkenboom
Senior Member
 
Registered: Jan 2005
Location: Roodepoort, South Africa
Distribution: Slackware 10.1/10.2/12, Ubuntu 12.04, Crunchbang Statler
Posts: 3,786

Original Poster
OOPS, spoke slightly too early. It works as a parser but no longer as a validator.
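One possible way to get both behaviours back, sketched under the assumption that two passes over the string are acceptable: validate the whole line with the anchored pattern from post #1 first, and only then pull out the fields with a find loop. The class name, method name, and the choice to return an empty list for invalid input are all made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ValidateThenParseSketch {
    // Whole-line validator: the pattern from post #1.
    private static final Pattern VALID =
            Pattern.compile("^([A-Za-z ]+)([,]([A-Za-z ]+))*$");
    // Single-field extractor, used only after validation succeeded.
    private static final Pattern FIELD = Pattern.compile("[A-Za-z ]+");

    // Returns the fields of a valid line, or an empty list if the line is invalid.
    // (A valid line always has at least one field, so empty means "rejected".)
    static List<String> validateAndParse(String text) {
        if (!VALID.matcher(text).matches()) {
            return Collections.emptyList();
        }
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.add(m.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(validateAndParse("Bruce Willis,Richard Gere"));
        System.out.println(validateAndParse("Pink Floyd,,Santana"));
    }
}
```

Once the line has passed validation, `text.split(",")` would give the same fields; the find loop is kept here only to stay with regular expressions throughout.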