LinuxQuestions.org
Old 05-06-2006, 06:54 PM   #1
buldir
Member
 
Registered: Mar 2004
Location: Fairbanks, AK USA
Posts: 135

Rep: Reputation: 15
egrep/grep regex question


I used a program called Expresso to build a regex to parse out the text between xml tags:
Code:
(?<=<title>).*(?=</title>)
which works in the program. But when I use the regex in a Unix script, grep and egrep give no results. The line I use is:
Code:
egrep '(?<=<title>).*(?=</title>)' test.xml
with a line in the xml being:
Code:
<title>Some Title Here</title>
When I lop off the lookahead and lookbehind, however:
Code:
egrep '(<title>).*(</title>)' test.xml
it finds:
Code:
<title>Some Title Here</title>
It seems that egrep and grep do not support a full regular expression inside a lookbehind. A workaround I have come up with is:
Code:
egrep -m 1 '<title>' test.xml | sed 's/<[^<]*>//g'
which grabs the first instance of title (the file contains many) with "egrep -m 1" and chops off the xml tags with sed. The problem is not all versions of egrep support the -m flag and I would like to have this work on Solaris 9 machines as well. Can anyone suggest a better regex or egrep code to parse out the text between given xml tags?
 
Old 05-06-2006, 08:19 PM   #2
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
I don't know what you mean by look behinds. There is a man 7 regex manpage that covers posix regular expressions. One thing that tripped me up once is that the locale setting can alter regular expression matches. I was having '[[:upper:]]' being equivalent to '[[:lower:]]'.
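
A small sketch of the locale point, assuming a POSIX shell and grep: pinning the locale to C with LC_ALL=C makes the character classes match plain ASCII case. The sample input and the variable names are made up for illustration.

```shell
# Hypothetical sample input: one lowercase line, one uppercase line.
sample=$(printf 'abc\nABC\n')

# Count the lines that contain an uppercase letter, with the locale pinned
# to C so that [[:upper:]] matches only A-Z.
upper_count=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[[:upper:]]')
echo "$upper_count"
```

If the count differs when LC_ALL=C is left out, the locale is the likely cause.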

I use sed to extract the names of files that I have backed up before I delete them to free up more space.
The *.k3b file is a zip archive of two files, maindata.xml and mimetype.
I use the following to extract the filenames from lines that match a <url>.*<\/url> pattern:
Code:
sed -e '/^<url>/!d' \
    -e 's/^<url>\(.*\)<\/url>/\1/' maindata.xml | tr '\n' '\000' | xargs -0 rm
In your case, you need to verify that the '<title>.*</title>' pattern always is on the same line. Sed programs get more complicated if they don't and you would be better off writing a sed script instead of using a one-liner.
Code:
sed -n '/<title>.*<\/title>/s/^.*<title>\(.*\)<\/title>.*$/\1/p' file.xml
The -n option only outputs lines if there is a print command. I use this option when I just want to extract certain information from a file.
The '/<title>.*<\/title>/' part selects just the lines that contain your title information.
The "\(" "\)" parts save the information in between, and the "<title>" and "<\/title>" parts serve as anchors so that the correct part of the line is saved. The "\1" part in the replacement replaces the entire line with what was saved earlier.
If the title comes at the beginning of large documents, and you want to process a number of them in a loop, consider using the q command to stop after the pattern is found. The sed info pages will have examples.
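
A minimal sketch of the quit idea, assuming the title sits on one line as in the original question. It prints only the first title and then stops reading, so it does not need "egrep -m 1". The temporary file name and the variable names are made up for illustration.

```shell
# Hypothetical test file with two title lines; only the first should print.
xml_file="${TMPDIR:-/tmp}/lq_titles_$$.xml"
cat > "$xml_file" <<'EOF'
<feed>
<title>Some Title Here</title>
<title>Another Title</title>
</feed>
EOF

# -n suppresses the default output. On the first matching line, print the
# captured text and quit. Separate -e fragments avoid relying on ";" and "}"
# sharing one line, which some older sed versions reject.
first_title=$(sed -n -e '/<title>.*<\/title>/{' \
                     -e 's/^.*<title>\(.*\)<\/title>.*$/\1/p' \
                     -e 'q' \
                     -e '}' "$xml_file")
echo "$first_title"
rm -f "$xml_file"
```

Because the q sits inside the braces, sed stops at the first line that matches the address, so the rest of a large file is never read.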

For the sed program, I would highly recommend downloading the source and using it to produce a PDF or PS version of the manual. The source for gawk also contains a book: "Gawk: Effective Awk Programming". Normally there is a make target to produce this documentation: "make pdf" or "make ps".
 
Old 05-06-2006, 08:47 PM   #3
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
Correction: The gawk.ps and gawk.dvi came from a "gawk-doc" package.
A DVI-to-PDF converter such as "dvipdf" could be used if you want to produce a PDF document.

Last edited by jschiwal; 05-06-2006 at 08:53 PM.
 
Old 05-06-2006, 10:51 PM   #4
jim mcnamara
Member
 
Registered: May 2002
Posts: 964

Rep: Reputation: 34
Lookahead and lookbehind (the "(?" syntax) are supported by some regex libraries, Perl's for example, but they are not part of the POSIX syntax that grep and egrep implement.

There is no "standard" for REGEX other than the POSIX standard.
Even then, almost every regex engine has unique differences. This stems from the fact that everybody in the past wrote improvements into whatever they worked on.
They got to define what improvement meant. Like C compiler writers adding non-standard C library routines as you see in MS C++.

In general, you can expect added syntax and behaviors for a lot of newer programs. grep and egrep are older. You've gotta know ahead of time what flavor of regex you've got.
 
Old 05-07-2006, 12:35 AM   #5
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
The regular expressions in perl are more comprehensive. The documentation for GNU sed points out when a feature might not be compatible with older sed programs. And of course, when it comes to processing patterns, there's LISP!

Last edited by jschiwal; 05-07-2006 at 12:38 AM.
 
Old 05-07-2006, 02:34 PM   #6
buldir
Member
 
Registered: Mar 2004
Location: Fairbanks, AK USA
Posts: 135

Original Poster
Rep: Reputation: 15
Thanks

Thanks for your comments and suggestions, jschiwal and jim. I think I'll look into using perl to handle the regex parsing. Most of the Google results for "parse text between xml tags" included references to perl.
 
Old 05-08-2006, 08:19 AM   #7
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: FreeBSD, Debian, Mint, Puppy
Posts: 3,276

Rep: Reputation: 170
Isn't it an XSLT job?
 
Old 05-08-2006, 07:17 PM   #8
buldir
Member
 
Registered: Mar 2004
Location: Fairbanks, AK USA
Posts: 135

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by bigearsbilly
Isn't it an XSLT job?
Yes, I think you're right. I need to look into obtaining the libxml and libxslt binaries for Solaris 9/10.
 
  

