Please use ***[code][/code]*** tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques.
Line- and regex-based tools like grep/sed/awk are not well suited to parsing XML's nested, free-form structure. It's usually better to use a tool that has a dedicated XML parser.
Perl, which you mentioned, has XML modules available, but I'm not that familiar with it. I can only show you the tool I know, which is xmlstarlet.
I got the following to work on the above example (after closing out the catalog tag):
Code:
$ xmlstarlet sel -T -t -f -v 'concat(":<author>",//book[1]/author,"</author>~")' -v 'concat("<title>",//book[1]/title,"</title>")' -n infile.xml
infile.xml:<author>Gambardella, Matthew</author>~<title>XML Developer's Guide</title>
To break it down, sel is the command for extraction. -T outputs plain text, and -t starts the template command. Inside the template, -f prints the filename, the two -v commands print the extracted values, and -n adds a newline at the end.
concat is an XPath function that joins text strings together. "//book[1]/author" extracts the value of the author tag inside the first book tag. The same goes for the title. The text strings on either side reconstruct the tag brackets and the delimiters around them.
There may be a way to print the whole entry directly, but I'm not familiar enough with it myself to know how. Also, xmlstarlet insists on well-formed XML input, so you may need to clean up the formatting first.
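For what it's worth, here is a rough sketch of two other sel options that might help: -m loops over every matching node instead of only the first book, and -c copies a matched node with its tags intact. The sample file is made up for illustration (its contents, including the second book, are an assumption and not taken from your data), so adjust the element names to match your input.

```shell
# Hypothetical sample file; element names are assumed to match the catalog example.
cat > infile.xml <<'EOF'
<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
  </book>
</catalog>
EOF

# -m iterates over every book node; the -v paths are relative to the
# matched node, and -o prints a literal string between the two values.
all_books=$(xmlstarlet sel -T -t -m '//book' -v 'author' -o '~' -v 'title' -n infile.xml)
printf '%s\n' "$all_books"

# -c copies the matched node as XML, tags included (no -T here, so the
# output stays as markup).
first_book=$(xmlstarlet sel -t -c '//book[1]' -n infile.xml)
printf '%s\n' "$first_book"
```

The first command should print one author~title line per book. The second prints the first book element as it appears in the file.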
Or, as another option, try using the pyx command, which converts the XML into a line-based representation that you can more safely parse with sed or awk.
There's also a tool in the html-xml-utils package called hxpipe, which can print out a similar line-based format and is a bit more robust about the input it can handle.
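To give an idea of why the line-based form is easier to handle, here is a minimal awk sketch working on PYX notation, where "(" opens a tag, ")" closes it, "A" marks an attribute, and "-" marks text. The books.pyx file is a hand-written sample that I'm assuming resembles what pyx would print for a catalog like yours. It is not real output from your file, and hxpipe's format differs in some details.

```shell
# Hand-written PYX sample (an assumption, not captured tool output).
cat > books.pyx <<'EOF'
(catalog
(book
Aid bk101
(author
-Gambardella, Matthew
)author
(title
-XML Developer's Guide
)title
)book
(book
Aid bk102
(author
-Ralls, Kim
)author
(title
-Midnight Rain
)title
)book
)catalog
EOF

# Remember which tag is open, collect the author and title text,
# and print one delimited line each time a book element closes.
pyx_result=$(awk '
  /^\(/      { tag = substr($0, 2); next }   # start tag: remember its name
  /^\)book$/ { print author "~" title; author = title = ""; next }
  /^\)/      { tag = ""; next }              # any other end tag
  /^-/ && tag == "author" { author = author substr($0, 2) }
  /^-/ && tag == "title"  { title  = title  substr($0, 2) }
' books.pyx)
printf '%s\n' "$pyx_result"
```

Because every tag and text chunk sits on its own line, the script never has to guess where an element starts or ends, which is the part that tends to break regex-based parsing of raw XML.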