egrep/grep regex question
I used a program called Expresso to build a regex to parse out the text between xml tags:
Code:
(?<=<title>).*(?=</title>)
Code:
egrep '(?<=<title>).*(?=</title>)' test.xml
Code:
<title>Some Title Here</title>
Code:
egrep '(<title>).*(</title>)' test.xml
Code:
<title>Some Title Here</title>
Code:
egrep -m 1 '<title>' test.xml | sed 's/<[^<]*>//g'
I don't know what you mean by look behinds. There is a regex(7) manpage (man 7 regex) that covers POSIX regular expressions. One thing that tripped me up once is that the locale setting can alter regular expression matches. I once had '[[:upper:]]' matching the same characters as '[[:lower:]]'.
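To rule out the locale as the cause of a surprising match, one option is to force the POSIX locale for a single command. This is only a minimal sketch: the sample input and the variable names `sample` and `upper_only` are made up for illustration.

```shell
# Hypothetical sample input: one lowercase line, one uppercase line.
sample=$(printf 'abc\nABC\n')

# LC_ALL=C applies the POSIX locale to this one grep invocation only,
# so [[:upper:]] matches just the ASCII letters A-Z.
upper_only=$(printf '%s\n' "$sample" | LC_ALL=C grep '[[:upper:]]')

printf '%s\n' "$upper_only"
```

If the result changes when LC_ALL=C is removed, the locale is the likely culprit.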
I use sed to extract the names of files that I have backed up before I delete them to free up more space. The *.k3b file is a zip archive of two files, maindata.xml and mimetype. I use the following to extract the filenames from lines that match a <url>.*<\/url> pattern:
Code:
sed -e '/^<url>/!d' \
Code:
sed -n '/<title>.*<\/title>/s/^.*<title>\(.*\)<\/title>.*$/\1/p' file.xml
The '/<title>.*<\/title>/' part selects just the lines that contain your title information. The "\(" and "\)" parts save the text in between, and the "<title>" and "<\/title>" parts serve as anchors so that the correct part of the line is saved. The "\1" in the replacement replaces the entire line with what was saved earlier.
If the title comes at the beginning of large documents, and you want to process a number of them in a loop, consider using the q command to stop after the pattern is found. The sed info pages will have examples.
For the sed program, I would highly recommend downloading the source and using it to produce a PDF or PS version of the manual. The source for gawk also contains a book: "Gawk: Effective Awk Programming". Normally there is a make target to produce this documentation: "make pdf" or "make ps".
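The idea of quitting after the first match, mentioned above, could look roughly like this. It is a sketch under assumptions: the temporary file and its contents are invented stand-ins for file.xml, and the variable names are hypothetical.

```shell
# Hypothetical sample file standing in for file.xml.
tmpfile=$(mktemp)
printf '%s\n' '<doc>' \
  '<title>Some Title Here</title>' \
  '<title>Second Title</title>' \
  '</doc>' > "$tmpfile"

# On the first line that contains a title: keep only the text between
# the tags, print it, then quit so the rest of the file is never read.
first_title=$(sed -n '/<title>.*<\/title>/{
s/^.*<title>\(.*\)<\/title>.*$/\1/p
q
}' "$tmpfile")

printf '%s\n' "$first_title"
rm -f "$tmpfile"
```

The commands inside the braces are written on separate lines because older sed implementations can be picky about semicolons before a closing brace.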
Correction: The gawk.ps and gawk.dvi came from a "gawk-doc" package.
A program such as "dvipdf" could be used if you want to produce a PDF document.
Lookahead and lookbehind (the "(?" syntax) are supported by some regex libraries.
There is no "standard" for regex other than the POSIX standard, and even then almost every regex engine has its own differences. This stems from the fact that everybody in the past wrote improvements into whatever they worked on, and they got to define what "improvement" meant, much like C compiler writers adding non-standard C library routines, as you see in MS C++. In general, you can expect added syntax and behaviors in a lot of newer programs. grep and egrep are older. You've gotta know ahead of time what flavor of regex you've got.
The regular expressions in perl are more comprehensive. The documentation for gnu's sed points out when a feature might not be compatible with older sed programs. And of course, when it comes to processing patterns, there's LISP!
Thanks
Thanks for your comments and suggestions, jschiwal and jim. I think I'll look into using perl to handle the regex parsing. Most of the Google results for "parse text between xml tags" included references to perl.
Isn't it an XSLT job?