Extract tags with multiline values from XML file using sed/awk

gbms · 03-27-2012, 03:39 AM

Hi,

I have an XML file that holds key-value pairs (basically a Java properties file in XML), as shown below.
The file contains both single-line and multiline tags.

<entry key="KEY1"> tag1 value </entry>
<entry key="KEY2" > hello
world. This is multiline tag example.
blahh blah blah...
</entry>

I want to extract a tag's value by passing the tag name from a bash script.
Could somebody give me some pointers on extracting the multiline value of a tag?

Thanks,
gbms

grail · 03-27-2012, 04:57 AM

This might get you going:

Code:

awk '{print "|"$0"|"}' RS="[<>\n]+" file

Generally, though, you're probably better off with Perl or Ruby, as they have XML parsers you can use.
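
If the file really is as uniform as the sample in the first post, a line-based lookup by key might be sketched as follows. This is only an illustration, not a general XML parser: the function name get_entry is made up here, and it assumes at most one <entry ...> opening tag per line, no nested entries, and a key that appears literally as key="..." in the file. Entities such as &amp; are not decoded.

```shell
#!/bin/sh
# get_entry KEY FILE  (hypothetical helper, not from the thread)
# Print the text between <entry key="KEY"> and </entry>, one line or many.
# Assumes at most one <entry ...> opening tag per line and no nested entries.
get_entry() {
    awk -v key="$1" '
        # Opening tag for the requested key: start collecting and strip the tag.
        !inside && index($0, "key=\"" key "\"") {
            inside = 1
            sub(/^.*<entry[^>]*>/, "")
        }
        inside {
            # Closing tag: strip it, print whatever text is left, and stop reading.
            if (index($0, "</entry>")) {
                sub(/<\/entry>.*$/, "")
                if ($0 != "") print
                exit
            }
            print
        }
    ' "$2"
}
```

Called as get_entry KEY2 file, it prints the lines of the multiline value; leading and trailing spaces inside the tag are kept as they are.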

catkin · 03-27-2012, 05:37 AM

XMLStarlet has been recommended on LQ. I haven't needed to use it yet so cannot say how good it is etc.

David the H. · 03-27-2012, 10:18 AM

XML and HTML data structures are (generally) free-form in terms of whitespace and can contain nested values, both of which are difficult-to-impossible for regex-based, line-oriented programs like sed or awk to parse reliably.

So unless your extraction requirements are trivial and the input is guaranteed to be well-formed and uniform, you're much better off working with tools specifically designed for those languages, as suggested above.

xmlstarlet is probably a good place to start. Like catkin, I don't know much about it personally, but it has a good set of documentation here:

http://xmlstar.sourceforge.net/docs.php
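
As a rough sketch of what that could look like for the file in the first post: the sel subcommand with -t -v prints the value of an XPath expression. The function name get_entry_xml is made up for illustration, the file is assumed to be well-formed (the entries wrapped in a single root element, as in a Java properties XML file), and the key is assumed not to contain a single quote, since it is pasted straight into the XPath. On some distributions the binary is installed as xml rather than xmlstarlet.

```shell
#!/bin/sh
# get_entry_xml KEY FILE  (hypothetical helper, not from the thread)
# Print the text of the <entry> element whose key attribute equals KEY.
# Assumes KEY contains no single quote, because it is inserted into the XPath.
get_entry_xml() {
    xmlstarlet sel -t -v "//entry[@key='$1']" "$2"
}
```

Called as get_entry_xml KEY2 file.xml, it should print the whole multiline value, whitespace included, without depending on how the tags are split across lines.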

Also, please use [code][/code] tags around your code and data, to preserve formatting and to improve readability. Please do not use quote tags, colors, or other fancy formatting.