LinuxQuestions.org

dsayars 07-21-2010 03:17 PM

How do I search for text between difficult strings, like "yellow"/></w:rP..."
 
I am trying to capture and change text between two keyword strings in Perl, but the keyword strings are fragments of XML, mostly non-word characters. Thanks to replies to earlier posts, I managed to do it using real words, like "start" and "end". For example, this works:

while ($text=~ s/\bstart\b (.*?) \bend\b//) {
print $1, "\n";
$term = "term_$1_term";
}

However, my actual start keyword string is "yellow"/></w:rPr><w:t>VAR</w:t></w:r>" and the closing string is "<". Using lots of escape characters, I can find the two keywords like this:

$text=~m/yellow\"\/\>\<\/w:rPr\>\<w:t\>/;
$text=~m/\>/;

But substituting these strings for "start" and "end" in the while statement (and removing the \b character) doesn't work. In other words, this doesn't work:

while ($text=~s /yellow\"\/\>\<\/w:rPr\>\<w:t\>(.*?) \<//) {
print $1, "\n";
$term = "term_$1_term";
}

Can anyone tell me how to put these resistant strings into the while statement?
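
One likely reason the loop above never matches (an assumption, since the actual input was not posted): without the /x modifier, the space before \< in the pattern is a literal space, just like the spaces around (.*?) in the "start"/"end" version. Word XML normally has no space between the text and the next tag. A minimal sketch, using a made-up sample string:

```perl
use strict;
use warnings;

# Hypothetical sample; the real Word XML will differ.
my $sample = 'w:val="yellow"/></w:rPr><w:t>Name</w:t>';

# Pattern as posted: the space before \< must match a real space in the input.
my $with_space = $sample =~ m/yellow\"\/\>\<\/w:rPr\>\<w:t\>(.*?) \</ ? 1 : 0;

# Same pattern with the literal space removed.
my $without_space = $sample =~ m/yellow\"\/\>\<\/w:rPr\>\<w:t\>(.*?)\</ ? 1 : 0;
my $captured = $1;

print "with space: $with_space\n";
print "without space: $without_space, captured '$captured'\n";
```

If the input really does contain a space before the closing "<", the posted pattern would match and the cause lies elsewhere.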

theNbomr 07-21-2010 05:18 PM

The number of meta-characters to quote is large, and the result is difficult to read. Consider something like (untested):
Code:

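    # quotemeta escapes every non-word character, so the XML fragments match literally
    my $regexStart = quotemeta '"yellow"/></w:rPr><w:t>VAR</w:t></w:r>"';  # not sure which double-quotes you actually need, here
    my $regexEnd  = quotemeta '"<"';  # same caveat about the double-quotes

    while( $text =~ s/$regexStart(.*?)$regexEnd// ){
        print $1,"\n";
        my $term = "term_$1_term";  # only visible inside this loop body
    }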

Also, I'm not sure your regex '(.*?)' really does what you want (because I'm not sure what you are trying to match). Perhaps you really want '(.+)'. I can't actually think of how three meta-characters like .*? match anything.

--- rod
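
A runnable sketch of the quotemeta approach, under assumptions: the sample text is made up, and the delimiters are taken from the regex in the question (start fragment ending at <w:t>, closing string a bare <) rather than from the quoted strings above. The second part is an optional variant that rewrites the captured text in place instead of deleting it.

```perl
use strict;
use warnings;

# Hypothetical sample input; the real Word XML will differ.
my $original = '<w:highlight w:val="yellow"/></w:rPr><w:t>first</w:t>'
             . '<w:highlight w:val="yellow"/></w:rPr><w:t>second</w:t>';

# Assumed delimiters, escaped so every character matches literally.
my $start = quotemeta 'yellow"/></w:rPr><w:t>';
my $end   = quotemeta '<';

# Destructive loop, as in the thread: each pass removes one match from $text.
my $text = $original;
my @found;
while ( $text =~ s/$start(.*?)$end// ) {
    push @found, $1;
}
print "$_\n" for @found;    # prints "first", then "second"

# Variant: keep the delimiters and wrap the captured text instead.
# (?=$end) is a lookahead, so the closing "<" is checked but not consumed.
(my $changed = $original) =~ s/($start)(.*?)(?=$end)/${1}term_${2}_term/g;
print "$changed\n";
```

Note that the destructive loop also removes the delimiters themselves, so $text is no longer valid XML afterwards; the variant avoids that.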

Sergei Steshenko 07-21-2010 06:39 PM

I don't remember - have I already given the http://docstore.mik.ua/orelly/perl/cookbook/ch06_09.htm link?

Anyway, if you are parsing HTML - just don't. I.e. use an existing parser: http://search.cpan.org/search?query=...arser&mode=all .

And, I think, I've already given http://search.cpan.org/~adamk/Text-B...xt/Balanced.pm - this module is a generic solution for whatever BEGIN .. END markers - not just plain words

dsayars 07-22-2010 01:13 AM

Thanks. Problem now solved (this one at least) as per my update to my original question. Regarding (.*?), I got this from a general solution I found in another thread on this site, which was

while ($text=~ s/\bstart\b (.*?) \bend\b//)

Don't have a lot of insight into why this works, but it does.
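
For what it's worth, the ? after .* makes the quantifier non-greedy: it matches as few characters as possible, so the capture stops at the first "end" rather than the last. A small sketch with a made-up sample line:

```perl
use strict;
use warnings;

# Hypothetical sample with two start/end pairs on one line.
my $line = 'start one end start two end';

# Non-greedy: stops at the first " end" that lets the match succeed.
my ($lazy)   = $line =~ /\bstart\b (.*?) \bend\b/;

# Greedy: runs to the last " end" on the line.
my ($greedy) = $line =~ /\bstart\b (.*) \bend\b/;

print "lazy:   $lazy\n";      # lazy:   one
print "greedy: $greedy\n";    # greedy: one end start two
```

That is why the while loop picks up one pair per pass instead of swallowing everything between the first "start" and the last "end".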

dsayars 07-22-2010 01:19 AM

Thanks. I got confused by the references you gave me before, in my earlier post. This is Word XML, not HTML, but I'll see if these modules help.

Sergei Steshenko 07-22-2010 01:30 AM

http://search.cpan.org/search?query=xml+parser&mode=all - just don't reinvent the wheel.


All times are GMT -5.