I don't know what you mean by look-behinds. The regex(7) manpage (man 7 regex) covers POSIX regular expressions. One thing that tripped me up once is that the locale setting can alter what a regular expression matches. At one point '[[:upper:]]' was behaving the same as '[[:lower:]]' for me.
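If you suspect the locale is interfering, one way to rule it out is to pin the locale to C for the one command. This is only a sketch: the function name only_upper is made up for illustration, and it assumes you are happy with plain ASCII classes for that match.

```shell
# Print only the lines that consist entirely of upper-case letters.
# LC_ALL=C pins the locale for this single sed invocation, so
# [[:upper:]] means A-Z regardless of the session's locale.
only_upper() {
    LC_ALL=C sed -n '/^[[:upper:]][[:upper:]]*$/p'
}

printf 'abc\nABC\n' | only_upper
```

Setting the variable on the command itself leaves the rest of your session's locale alone.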
I use sed to extract the names of files that I have backed up before I delete them to free up more space.
The *.k3b file is a zip archive of two files, maindata.xml and mimetype.
I use the following to extract the filenames from lines that match a <url>.*<\/url> pattern:
Code:
sed -e '/^<url>/!d' \
-e 's/^<url>\(.*\)<\/url>/\1/' maindata.xml | tr '\n' '\000' | xargs -0 rm
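Before letting rm loose, it may be worth listing what would be deleted. A rough sketch of that, with the same two sed expressions wrapped in a function that reads maindata.xml on standard input. The function name and the project.k3b file name are placeholders I made up, and the unzip line assumes unzip is installed:

```shell
# Print the contents of every line that starts with <url>,
# dropping all other lines. Reads maindata.xml on stdin.
list_backed_up_files() {
    sed -e '/^<url>/!d' \
        -e 's/^<url>\(.*\)<\/url>/\1/'
}

# Hypothetical usage straight from the archive (project.k3b is a placeholder):
#   unzip -p project.k3b maindata.xml | list_backed_up_files

# Small self-contained demonstration with made-up input:
printf '<url>/home/me/a.iso</url>\n<other>x</other>\n' | list_backed_up_files
```

Once the list looks right, the same output can be piped through tr and xargs -0 rm as above.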
In your case, you need to verify that the '<title>.*</title>' pattern is always on a single line. Sed programs get more complicated if it isn't, and you would be better off writing a sed script instead of using a one-liner.
Code:
sed -n '/<title>.*<\/title>/s/^.*<title>\(.*\)<\/title>.*$/\1/p' file.xml
The -n option only outputs lines if there is a print command. I use this option when I just want to extract certain information from a file.
The '/<title>.*<\/title>/' part is an address: it selects just the lines that contain your title information, and the substitute command runs only on those lines.
The "\(" "\)" parts save the information in between, and the "<title>" and "<\/title>" parts serve as anchors so that the correct part of the line is saved. The "\1" part in the replacement replaces the entire line with what was saved earlier.
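To see the pieces working together, here is a small sketch that feeds a few made-up lines through the same command. The function name extract_title and the sample input are only for illustration:

```shell
# Print the text between <title> and </title> for each line that has both.
# -n suppresses the default output; the trailing p prints only the lines
# where the substitution was made.
extract_title() {
    sed -n '/<title>.*<\/title>/s/^.*<title>\(.*\)<\/title>.*$/\1/p'
}

printf '%s\n' '<head>' '  <title>My Backup</title>' '</head>' | extract_title
```

Because the pattern starts with "^.*" and ends with ".*$", it matches the whole line, which is why nothing but the saved text is left after the substitution.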
If the title comes at the beginning of large documents, and you want to process a number of them in a loop, consider using the q command to quit as soon as the pattern is found, so sed doesn't read the rest of the file. The sed info pages will have examples.
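A minimal sketch of that idea follows. The function name first_title is made up, and the loop in the comment assumes your documents match *.xml in the current directory:

```shell
# Print the first title found, then quit without reading further.
# With -n, q exits without printing the pattern space a second time.
first_title() {
    sed -n '/<title>.*<\/title>/{
        s/^.*<title>\(.*\)<\/title>.*$/\1/p
        q
    }' "$@"
}

# Hypothetical loop over several documents:
#   for f in *.xml; do first_title "$f"; done

# Self-contained demonstration with made-up input:
printf '%s\n' '<title>First</title>' '<title>Second</title>' | first_title
```

The commands inside the braces are on separate lines because that form is accepted by every sed I know of. GNU sed also lets you write them on one line separated by semicolons.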
For the sed program, I would highly recommend downloading the source and using it to produce a PDF or PS version of the manual. The source for gawk also contains a book: "GAWK: Effective AWK Programming". Normally there is a make target to produce this documentation: "make pdf" or "make ps".