LinuxQuestions.org


gbms 03-27-2012 03:39 AM

Extract tags with multiline values from XML file using sed/awk
 
Hi,

I have an XML file that holds key-value pairs (basically, a Java properties file in XML), as shown below.
The file contains both single-line and multiline tags.

<entry key="KEY1"> tag1 value </entry>
<entry key="KEY2" > hello
world. This is multiline tag example.
blahh blah blah...
</entry>

I want to extract a tag's value by passing the tag name from a bash script.
Could somebody give me some pointers on extracting the multiline value of a tag?



Thanks,
gbms

grail 03-27-2012 04:57 AM

This might get you going:
Code:

awk '{print "|"$0"|"}' RS="[<>\n]+" file
Generally, though, you're probably better off with Perl or Ruby, as they have XML parsers you can use.
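
Building on the record-separator idea above, here is a rough sketch of how a single entry could be pulled out by key. The function name get_entry is made up for illustration. It assumes an awk that accepts a multi-character RS (gawk or mawk), that entries are never nested, and that the key="..." attribute is written with double quotes exactly as in the sample. It is not a real XML parser and will break on anything less uniform.

```shell
# Hypothetical helper: get_entry KEY FILE
# Prints the text between <entry key="KEY"> and </entry>, trimmed of
# leading and trailing whitespace. Assumes gawk or mawk (multi-character RS).
get_entry() {
    local key=$1 file=$2
    awk -v key="$key" '
        BEGIN { RS = "</entry>" }              # one record per entry
        match($0, /<entry[^>]*>/) {            # locate the opening tag
            tag   = substr($0, RSTART, RLENGTH)
            value = substr($0, RSTART + RLENGTH)
            # Only handle the entry whose key attribute matches
            if (index(tag, "key=\"" key "\"")) {
                sub(/^[ \t\r\n]+/, "", value)  # trim leading whitespace
                sub(/[ \t\r\n]+$/, "", value)  # trim trailing whitespace
                print value
                exit
            }
        }
    ' "$file"
}
```

From a script it would be called along the lines of value=$(get_entry KEY2 file), which keeps the line breaks inside the value.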

catkin 03-27-2012 05:37 AM

XMLStarlet has been recommended on LQ. I haven't needed to use it yet, so I cannot say how good it is.

David the H. 03-27-2012 10:18 AM

XML and HTML data structures are (generally) free-form in terms of whitespace and can contain nested values, both of which are difficult-to-impossible for regular-expression and line-based programs like sed or awk to parse reliably.

So unless your extraction requirements are trivial and the input is guaranteed to be well-formed and uniform, you're much better off working with tools specifically designed for those languages, as suggested above.

xmlstarlet is probably a good place to start. Like catkin, I don't know much about it personally, but it has a good set of documentation here:

http://xmlstar.sourceforge.net/docs.php
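
For reference, a minimal sketch of what such a lookup might look like with xmlstarlet's sel (select) mode and an XPath expression. The function name get_entry_xml is made up for illustration. It assumes xmlstarlet is installed, that the file is well-formed XML with a single root element (Java's XML properties files wrap the entries in <properties>), and that the key contains no single quotes.

```shell
# Hypothetical helper: get_entry_xml KEY FILE
# Prints the text content of the <entry> element whose key attribute is KEY.
# Assumes xmlstarlet is installed and FILE is well-formed XML.
get_entry_xml() {
    local key=$1 file=$2
    # sel = query mode, -t = start a template,
    # -v = print the string value of the XPath expression
    xmlstarlet sel -t -v "//entry[@key='$key']" "$file"
}
```

Called as value=$(get_entry_xml KEY2 file.xml), this should return the multiline value with its original whitespace and line breaks, since the parser rather than a regular expression decides where the element ends.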

Also, please use [code][/code] tags around your code and data, to preserve formatting and to improve readability. Please do not use quote tags, colors, or other fancy formatting.

