LinuxQuestions.org - Extract Data between XML tags

aharrison

11-12-2010 01:40 PM

Extract Data between XML tags

Hi All

I was wondering if someone could help me out. I've been trying to use various commands like sed, awk and grep in a shell script, but haven't had any luck. I'm trying to extract the data between XML tags such as <BELNR>4797413</BELNR>, but the data between the tags can be of variable length.
Any help would be great.

aharrison

GrapefruiTgirl

11-12-2010 01:48 PM

Code:

root@reactor: echo "<BELNR>4797413</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'

4797413

You could use sed as illustrated here; but if doing any significant amount of parsing of markup languages like XML, you might want to look into tools that are specifically targeted at this sort of parsing, such as `xmlgawk` or Perl (which has a library for this as I recall).

PS - Note that the code I gave here, assumes that this tag is the only thing on a given line of the file (I've anchored it to the start of a line and end of line). You'll need to tune the regex a little if there is (or can be) other stuff on the line where the tag is.
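One way that tuning might look, as a minimal sketch: it assumes at most one <BELNR> element per line and a value that contains no '<'. The sample line and the variable names are made up for illustration.

```shell
# Sample line with other elements around the tag (illustrative data).
line='<QUALF>001</QUALF><BELNR>4797413</BELNR><DATUM>20101103</DATUM>'

# No ^ or $ anchors: the leading and trailing .* swallow the rest of the line.
# [^<]* stops the capture at the first '<', which is the closing tag.
belnr=$(printf '%s\n' "$line" | sed -n 's|.*<BELNR>\([^<]*\)</BELNR>.*|\1|p')

echo "$belnr"
```

With -n and the p flag, lines that do not contain the element print nothing.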

aharrison

11-16-2010 12:36 PM

Thank you but the problem is the data will be different between the <BELNR> tags

<E1EDK02 SEGMENT="1">
<QUALF>001</QUALF>
<BELNR>4797413</BELNR>
<DATUM>20101103</DATUM>

GrapefruiTgirl

11-16-2010 02:32 PM

Quote:

Originally Posted by aharrison (Post 4160954)

Thank you but the problem is the data will be different between the <BELNR> tags

<E1EDK02 SEGMENT="1">
<QUALF>001</QUALF>
<BELNR>4797413</BELNR>
<DATUM>20101103</DATUM>

OK, then what is the problem? The code I showed you will return anything between <BELNR> and </BELNR>.

Perhaps you meant to word the problem differently?

Hangdog42

11-16-2010 04:26 PM

Quote:

Originally Posted by GrapefruiTgirl

but if doing any significant amount of parsing of markup languages like XML, you might want to look into tools that are specifically targeted at this sort of parsing, such as `xmlgawk` or Perl (which has a library for this as I recall).

Perl is actually pretty good in dealing with XML. There is the basic XML::Parser library and there are variations such as XML::DOM or PerlSAX. Using these sorts of libraries makes dealing with XML pretty trivial, and worth the time needed to learn.

chrism01

11-16-2010 08:00 PM

See also XML::Twig, XML::Simple (Perl).
As originally mentioned by GrapefruiTgirl, unless it's a trivial xml file, do use a proper parser, otherwise you'll end up tearing your hair out.
:)

aharrison

11-17-2010 10:54 AM

In the echo command you listed I won't know the value between the BELNR values, the number will constantly change.

echo "<BELNR>4797413</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'

GrapefruiTgirl

11-17-2010 11:38 AM

Yes, well, again, I fail to see a problem.. Watch:

Code:

echo "<BELNR>4797413</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'

4797413



echo "<BELNR>Happy Birthday.</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'

Happy Birthday.



echo "<BELNR>Big piles of numbers: 3473278483749623746782364</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'

Big piles of numbers: 3473278483749623746782364

So, each time, the data changed, but its value was still returned successfully. It doesn't matter that the data has changed. Whatever the data is between the tags, it will be returned.

Maybe you wish to save this value in a variable?

Code:

shell$ VARIABLE=$(echo "<BELNR>Big piles of numbers: 3473278483749623746782364</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|')

shell$ echo "$VARIABLE"

Big piles of numbers: 3473278483749623746782364

shell$

Hangdog42

11-17-2010 12:15 PM

Quote:

Originally Posted by GrapefruiTgirl

Yes, well, again, I fail to see a problem.. Watch:

I'm guessing here, but I suspect what aharrison is saying is that the values between the tags may not be known prior to runtime. Essentially all you know prior to runtime is that the info you need is between <BELNR> and </BELNR>. So to pass the right stuff to sed, you need to spend a bit of time parsing the file.

Which brings me back to the point that if you're dealing with XML, bash is almost certainly not the way to go. Using a language that has a decent set of XML libraries is going to save tons of headaches, particularly since they make parsing XML so simple.

GrapefruiTgirl

11-17-2010 12:25 PM

Hangdog,

thanks for trying to clarify this for me. :scratch: unfortunately (for me!) your attempt did not make it any more clear to me what is wrong here.

What I would like to see, is for the OP to show us several examples of the input data, and demonstrate on that data, what the problem is with the code that's been offered so far, and how this "different data between the tags" affects program operation...

Maybe it's just me being very dense, but I haven't a clue here, if I'm missing something very simple or what? :scratch: :confused:

Hangdog42

11-17-2010 01:04 PM

Quote:

Originally Posted by GrapefruiTgirl

Maybe it's just me being very dense, but I haven't a clue here, if I'm missing something very simple or what?

It's equally likely I'm making mistaken assumptions too. I think the disconnect may actually be before your echo statement. In other words, how do you pull the lines with <BELNR> out of the larger file and feed them into sed? In my experience with XML, frequently the only thing I have to go off of is the XML schema, which will tell you what tags you have, and what relationships those tags have, but says nothing about the information contained either between the tags or as attributes. So in this case, pretty much all we would know would be something like this:

Code:

<E1EDK02 SEGMENT>

<QUALF></QUALF>

<BELNR></BELNR>

<DATUM></DATUM>

So we know there are four different tags, and one of those can have an attribute. In bash, if the file actually looked like I have it above, it would be pretty easy to pull out any line with the <BELNR> tag, in which case your code works great. Where I think your echo approach falls apart is if we're dealing with a file that looks like this:

Code:

<E1EDK02 SEGMENT><QUALF></QUALF><BELNR></BELNR><DATUM></DATUM>

Which (potentially) is valid XML (or at least I've had to deal with files like this). In this case being able to echo <BELNR>...</BELNR> is going to take a bit of work. It certainly can be done in bash, but it is pretty trivial to do it in a proper XML parser.
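For what it's worth, here is a rough sketch of the "bit of work" for the one-line case, staying in the shell. It assumes grep supports -o (a GNU/BSD extension, not POSIX) and that the values contain no '<'. The sample document, including the second BELNR and the closing </E1EDK02>, is invented for illustration.

```shell
# One-line document modeled on the example above, with made-up values.
xml='<E1EDK02 SEGMENT="1"><QUALF>001</QUALF><BELNR>4797413</BELNR><BELNR>4797414</BELNR></E1EDK02>'

# grep -o prints each matching element on its own line;
# sed then deletes the opening and closing tags, leaving the values.
values=$(printf '%s\n' "$xml" | grep -o '<BELNR>[^<]*</BELNR>' | sed 's|<[^>]*>||g')

echo "$values"
```

This copes with several BELNR elements on one line, but it still knows nothing about nesting, attributes, CDATA or entities, which is where a real parser earns its keep.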

So that is my guess, and I think you're right, the OP probably needs to add a bit. I know I would choose the perl route, but that may also be because I'm a lot better with perl than I am with bash.

martinbc

11-17-2010 02:20 PM

Whatever the source of the data, it can be piped into GrapefruiTgirl's code. For example:

Code:

cat file | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'

If other lines just need to be ignored completely with no output a slight change should work

Code:

source | sed -n 's|^<BELNR>\(.*\)</BELNR>$|\1|p'
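As a sketch of how that behaves on the sample aharrison posted, with the four lines written to a temporary file standing in for the real input (the file and variable names are only for illustration):

```shell
# Write the sample lines from the thread to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<E1EDK02 SEGMENT="1">
<QUALF>001</QUALF>
<BELNR>4797413</BELNR>
<DATUM>20101103</DATUM>
EOF

# -n suppresses sed's default output; the p flag prints only the lines
# where the substitution succeeded, so the other three lines are dropped.
result=$(sed -n 's|^<BELNR>\(.*\)</BELNR>$|\1|p' "$tmp")
rm -f "$tmp"

echo "$result"
```

sed reads the file directly here, so neither echo nor cat is needed.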

Code:

<E1EDK02 SEGMENT><QUALF></QUALF><BELNR></BELNR><DATUM></DATUM>

Admittedly this is harder to parse in bash but that wasn't how aharrison's data looked.

Martin

GrapefruiTgirl

11-17-2010 02:26 PM

Plus, we could grep the file first if we wanted, to filter out all but the <BELNR> lines..

Code:

grep '<BELNR>' input_file | sed ...

So that would use grep to find only lines with the BELNR tag, and stuff that data into the sed, which would return just the stuff between the tags.

Plus, remember that I put the ^ (anchor) in my regex, so the tag must be found at the beginning of a line - I mentioned this earlier, but felt it worth mentioning again, in case this is contributing in any way to the disconnect here.
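If the real file is indented, the ^ anchor would stop the match. A possible adjustment, sketched here on made-up indented input, is to allow optional whitespace on either side of the element:

```shell
# Indented input, as pretty-printed XML often is (illustrative data).
input='  <QUALF>001</QUALF>
    <BELNR>4797413</BELNR>
  <DATUM>20101103</DATUM>'

# grep keeps only the BELNR lines; sed tolerates whitespace before the
# opening tag and after the closing tag while keeping both anchors.
indented=$(printf '%s\n' "$input" | grep '<BELNR>' |
  sed 's|^[[:space:]]*<BELNR>\(.*\)</BELNR>[[:space:]]*$|\1|')

echo "$indented"
```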

Anyhow.. Interested in hearing from OP again..

theNbomr

11-17-2010 07:28 PM

Since the OP says the data is XML, there should be some expectation that the data may include newlines. By default, sed works on a line-at-a-time basis. Hangdog42's advice to use a full-on XML parser seems prudent to me.
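To make the line-at-a-time point concrete, here is a small sketch on invented data where the element is split across lines. The workaround shown (joining the lines before matching) is only an assumption about what might be acceptable: it also removes any newlines inside the value, and it is no substitute for a parser.

```shell
# An element whose content is split across lines (illustrative data).
doc='<E1EDK02 SEGMENT="1">
<BELNR>
4797413
</BELNR>
</E1EDK02>'

# The line-based pattern finds nothing: no single line holds both tags.
per_line=$(printf '%s\n' "$doc" | sed -n 's|^<BELNR>\(.*\)</BELNR>$|\1|p')

# Workaround sketch: delete the newlines, add one back at the end so sed
# sees a complete line, then match on the joined text.
joined=$({ printf '%s\n' "$doc" | tr -d '\n'; echo; } |
  sed -n 's|.*<BELNR>\([^<]*\)</BELNR>.*|\1|p')

echo "per line: [$per_line]"
echo "joined:   [$joined]"
```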

I think, too, that aharrison didn't understand that the example using 'echo' was simply to illustrate that the sed script actually worked. In practice, the sed script would read from the XML file directly. At least that was how I interpreted it.

--- rod.

All times are GMT -5.