LinuxQuestions.org


buldir 05-06-2006 06:54 PM

egrep/grep regex question
 
I used a program called Expresso to build a regex that extracts the text between XML tags:
Code:

(?<=<title>).*(?=</title>)
which works in the program. But when I use the regex in a Unix script, grep and egrep give no results. The line I use is:
Code:

egrep '(?<=<title>).*(?=</title>)' test.xml
with a line in the XML file being:
Code:

<title>Some Title Here</title>
When I lop off the lookahead and lookbehind, however:
Code:

egrep '(<title>).*(</title>)' test.xml
it finds:
Code:

<title>Some Title Here</title>
It seems that egrep and grep do not support a full regular expression inside a lookbehind. A workaround I have come up with is:
Code:

egrep -m 1 '<title>' test.xml | sed 's/<[^<]*>//g'
which grabs the first instance of title (the file contains many) with "egrep -m 1" and strips the XML tags with sed. The problem is that not all versions of egrep support the -m flag, and I would like this to work on Solaris 9 machines as well. Can anyone suggest a better regex or egrep command to extract the text between given XML tags?

jschiwal 05-06-2006 08:19 PM

I don't know what you mean by lookbehinds. There is a regex(7) manpage ("man 7 regex") that covers POSIX regular expressions. One thing that tripped me up once is that the locale setting can alter regular expression matches. I once had '[[:upper:]]' matching the same characters as '[[:lower:]]'.
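
To illustrate the locale point, here is a minimal sketch that pins the locale for a single grep call. It assumes that forcing the C locale with LC_ALL=C is acceptable for your data (that is, plain ASCII input); the sample input and the variable name upper_only are made up for the example.

```shell
# Two sample lines: one lowercase, one uppercase.
# LC_ALL=C applies only to this grep invocation and makes
# [[:upper:]] mean the ASCII letters A-Z.
upper_only=$(printf 'abc\nABC\n' | LC_ALL=C grep '[[:upper:]]')

echo "$upper_only"
```

Setting the variable on the command itself, rather than exporting it, leaves the rest of the script running in the user's normal locale.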

I use sed to extract the names of files that I have backed up before I delete them to free up more space.
The *.k3b file is a zip archive of two files, maindata.xml and mimetype.
I use the following to extract the filenames from lines that match a <url>.*<\/url> pattern:
Code:

sed -e '/^<url>/!d' \
    -e 's/^<url>\(.*\)<\/url>/\1/' maindata.xml | tr '\n' '\000' | xargs -0 rm

In your case, you need to verify that the '<title>.*</title>' pattern is always on a single line. Sed programs get more complicated if it isn't, and you would be better off writing a sed script instead of using a one-liner.
Code:

sed -n '/<title>.*<\/title>/s/^.*<title>\(.*\)<\/title>.*$/\1/p' file.xml
The -n option only outputs lines if there is a print command. I use this option when I just want to extract certain information from a file.
The '/<title>.*<\/title>/' part selects just the lines that contain your title information.
The "\(" "\)" parts save the information in between, and the "<title>" and "<\/title>" parts serve as anchors so that the correct part of the line is saved. The "\1" part in the replacement replaces the entire line with what was saved earlier.
If the title comes at the beginning of large documents, and you want to process a number of them in a loop, consider using the q (quit) command to stop after the pattern is found. The sed info pages will have examples.
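
As a rough sketch of stopping after the first match without relying on "egrep -m", the substitution can be combined with sed's q command. The sample file, its path and the variable name first_title are made up for illustration, and the sketch assumes each title sits on a single line with at most one title per line.

```shell
# Hypothetical sample file standing in for the real XML document.
sample=/tmp/first_title.$$.xml
cat > "$sample" <<'EOF'
<doc>
<title>Some Title Here</title>
<title>Another Title</title>
</doc>
EOF

# On the first line holding a complete <title>...</title> pair:
# replace the line with the captured text, print it, then quit
# so the rest of the file is never read.
first_title=$(sed -n '/<title>.*<\/title>/{
s/^.*<title>\(.*\)<\/title>.*$/\1/p
q
}' "$sample")

echo "$first_title"
rm -f "$sample"
```

The commands inside the braces are written on separate lines because older sed implementations tend to be stricter than GNU sed about what may follow a command on the same line.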

For the sed program, I would highly recommend downloading the source and using it to produce a PDF or PS version of the manual. The source for gawk also contains a book: "Gawk: Effective Awk Programming". Normally there is a make target to produce this documentation: "make pdf" or "make ps".

jschiwal 05-06-2006 08:47 PM

Correction: The gawk.ps and gawk.dvi came from a "gawk-doc" package.
The "dvi2pdf" program could be used if you want to produce a pdf document.

jim mcnamara 05-06-2006 10:51 PM

Lookahead and lookbehind (the "(?" syntax) are supported by some regex libraries.

There is no "standard" for REGEX other than the POSIX standard.
Even then, almost every regex engine has unique differences. This stems from the fact that everybody in the past wrote improvements into whatever they worked on.
They got to define what improvement meant. Like C compiler writers adding non-standard C library routines as you see in MS C++.

In general, you can expect added syntax and behaviors for a lot of newer programs. grep and egrep are older. You've gotta know ahead of time what flavor of regex you've got.

jschiwal 05-07-2006 12:35 AM

The regular expressions in Perl are more comprehensive. The documentation for GNU sed points out when a feature might not be compatible with older sed programs. And of course, when it comes to processing patterns, there's LISP!

buldir 05-07-2006 02:34 PM

Thanks
 
Thanks for your comments and suggestions, jschiwal and jim. I think I'll look into using Perl to handle the regex parsing. Most of the Google results for "parse text between xml tags" included references to Perl.

bigearsbilly 05-08-2006 08:19 AM

Isn't it an XSLT job?

buldir 05-08-2006 07:17 PM

Quote:

Originally Posted by bigearsbilly
Isn't it an XSLT job?

Yes, I think you're right. I need to look into obtaining the libxml and libxslt binaries for Solaris 9/10.


All times are GMT -5.