Please use [code][/code] tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques.
Line- and regex-based tools like grep are not well suited for parsing xml's nested, free-form structure. It's usually better to use a tool that has a dedicated xml parser.
Perl, which you mentioned, has xml modules available, but I'm not that familiar with it. I can only show you the tool I know, which is xmlstarlet.
I got the following to work on the above example (after closing out the catalog tag):
$ xmlstarlet sel -T -t -f -v 'concat(":<author>",//book/author,"</author>~")' -v 'concat("<title>",//book/title,"</title>")' -n infile.xml
infile.xml:<author>Gambardella, Matthew</author>~<title>XML Developer's Guide</title>
To break it down, sel is the command for extraction. -T outputs plain text, and -t starts the template command. Inside the template, -f prints the filename, the two -v commands print the extracted values, and -n adds a newline at the end.
concat() is an xpath function that combines text strings together. "//book/author" matches the author tags inside book tags, and concat() uses the value of the first match only. The same goes for the title. The text strings on either side reconstruct the tag brackets and the delimiters around them.
There may be a way to print the whole entry directly, but I'm not familiar enough with it myself to know how. Also, xmlstarlet insists on well-formed xml input, so you may need to clean up the formatting first.
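If xmlstarlet isn't an option, roughly the same idea can be sketched in Python with the xml.etree.ElementTree parser from its standard library. This is only an illustration, not a drop-in solution: the sample data, the function names and the "infile.xml" label are placeholders made up for the example, and it assumes a catalog of book tags that each contain an author and a title. It loops over every book rather than only the first match, and it also shows one possible way to print a whole entry. Like xmlstarlet, this parser rejects input that is not well-formed (it raises ParseError).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, modelled on the entry shown in the output above.
SAMPLE = """<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
  </book>
</catalog>"""


def author_title_lines(xml_text, label="infile.xml"):
    """Return one 'label:<author>..</author>~<title>..</title>' line per book."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed xml
    lines = []
    for book in root.iter("book"):
        # findtext returns the text of the first matching child, or the default.
        author = book.findtext("author", default="")
        title = book.findtext("title", default="")
        lines.append(f"{label}:<author>{author}</author>~<title>{title}</title>")
    return lines


def whole_entries(xml_text, tag="book"):
    """Return each matching element serialized back to an xml string."""
    root = ET.fromstring(xml_text)
    # strip() drops the whitespace that follows each element in the source.
    return [ET.tostring(el, encoding="unicode").strip() for el in root.iter(tag)]


if __name__ == "__main__":
    for line in author_title_lines(SAMPLE):
        print(line)
    for entry in whole_entries(SAMPLE):
        print(entry)
```

Note that the extracted text is printed as parsed, so an entity such as &amp; in the source would come out as a plain & in the first function's output.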
Or, as another option, try using xmlstarlet's pyx command, which converts the xml into a line-based representation that you can more safely parse with sed.
There's also a tool in the html-xml-utils package called hxpipe, which can print out a similar line-based format, and it's a bit more robust about the input it can handle.