Extract Data between XML tags
Hi All
I was wondering if someone could help me out. I've been trying various commands like sed, awk and grep in a shell script but haven't had any luck. I'm trying to extract the data between the tags in <BELNR>4797413</BELNR>, but the data inside the tag can be of variable length. Any help would be great. aharrison |
Code:
root@reactor: echo "<BELNR>4797413</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'
PS - Note that the code I gave here assumes that this tag is the only thing on a given line of the file (I've anchored it to the start and end of the line). You'll need to tune the regex a little if there is (or can be) other stuff on the line where the tag is. |
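If the tag can be indented or share a line with other elements, one way to tune the pattern is to drop the ^ and $ anchors and let .* absorb whatever surrounds the tag. This is only a minimal sketch: it assumes at most one <BELNR> element per line and a value that contains no < character. The sample line and the variable name belnr are made up for illustration.

```shell
# Sample line with indentation and neighbouring tags (made up for illustration).
line='  <QUALF>001</QUALF> <BELNR>4797413</BELNR> <DATUM>20101103</DATUM>'

# -n suppresses sed's default output; the trailing p prints only lines where
# the substitution succeeded. [^<]* keeps the capture inside one element.
belnr=$(printf '%s\n' "$line" | sed -n 's|.*<BELNR>\([^<]*\)</BELNR>.*|\1|p')

echo "$belnr"
```

Lines without a <BELNR> element produce no output at all, which is usually what is wanted when pulling a single field out of a larger file.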
Thank you, but the problem is that the data between the <BELNR> tags will be different each time.
<E1EDK02 SEGMENT="1"> <QUALF>001</QUALF> <BELNR>4797413</BELNR> <DATUM>20101103</DATUM> |
Quote:
Perhaps you meant to word the problem differently? |
See also XML::Twig, XML::Simple (Perl).
As originally mentioned by GrapefruiTgirl, unless it's a trivial XML file, do use a proper parser; otherwise you'll end up tearing your hair out. :) |
In the echo command you listed, I won't know the value between the BELNR tags; the number will constantly change.
echo "<BELNR>4797413</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|' |
Yes, well, again, I fail to see a problem.. Watch:
Code:
echo "<BELNR>4797413</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'
Maybe you wish to save this value in a variable?
Code:
shell$ VARIABLE=$(echo "<BELNR>Big piles of numbers: 3473278483749623746782364</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|') |
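To make it concrete that the pattern never depends on the particular number, here is a small sketch that reads from a file instead of echo. The file contents and the names input, values and value are hypothetical; the sed expression is the one from the posts above, with -n and p added so that only matching lines are printed.

```shell
# Hypothetical input file with a different value in each <BELNR> element.
input=$(mktemp)
printf '%s\n' '<BELNR>4797413</BELNR>' '<BELNR>12</BELNR>' '<BELNR>ABC-000123</BELNR>' > "$input"

# sed reads the file directly; the pattern never mentions a specific value.
values=$(sed -n 's|^<BELNR>\(.*\)</BELNR>$|\1|p' "$input")

# Handle each extracted value in turn.
printf '%s\n' "$values" | while IFS= read -r value; do
    echo "Found BELNR: $value"
done

rm -f "$input"
```

Whatever sits between the tags ends up in the capture group \(.*\), so the same command works for any value, provided each element is alone on its line.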
Quote:
Which brings me back to the point that if you're dealing with XML, bash is almost certainly not the way to go. Using a language that has a decent set of XML libraries is going to save tons of headaches, particularly since they make parsing XML so simple. |
Hangdog,
thanks for trying to clarify this for me. Unfortunately (for me!), your attempt did not make it any clearer what is wrong here. What I would like to see is for the OP to show us several examples of the input data and demonstrate, on that data, what the problem is with the code that's been offered so far, and how this "different data between the tags" affects program operation. Maybe it's just me being very dense, but I haven't a clue whether I'm missing something very simple here. |
Quote:
Code:
<E1EDK02 SEGMENT>
Code:
<E1EDK02 SEGMENT><QUALF></QUALF><BELNR></BELNR><DATUM></DATUM>
So that is my guess, and I think you're right, the OP probably needs to add a bit. I know I would choose the perl route, but that may also be because I'm a lot better with perl than I am with bash. |
Whatever the source of the data, it can be piped into GrapefruiTgirl's code. For example:
Code:
cat file | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'
Code:
source | sed -n 's|^<BELNR>\(.*\)</BELNR>$|\1|p'
Code:
<E1EDK02 SEGMENT><QUALF></QUALF><BELNR></BELNR><DATUM></DATUM>
Martin |
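The two sed commands above behave differently on lines that do not match, which may be worth seeing side by side. A small sketch, with a made-up three-line input file and hypothetical variable names:

```shell
# Made-up input: one element per line, only one of them is <BELNR>.
input=$(mktemp)
printf '%s\n' '<QUALF>001</QUALF>' '<BELNR>4797413</BELNR>' '<DATUM>20101103</DATUM>' > "$input"

# Without -n: every line is printed; only the matching line is rewritten.
all_lines=$(sed 's|^<BELNR>\(.*\)</BELNR>$|\1|' "$input")

# With -n and the p flag: only the rewritten line is printed.
only_match=$(sed -n 's|^<BELNR>\(.*\)</BELNR>$|\1|p' "$input")

echo "$all_lines"
echo "---"
echo "$only_match"

rm -f "$input"
```

The first form passes the <QUALF> and <DATUM> lines through untouched, so the -n ... p form is probably the one to use when only the BELNR value is wanted.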
Plus, we could grep the file first if we wanted, to filter out all but the <BELNR> lines:
Code:
grep '<BELNR>' input_file | sed ...
Plus, remember that I put the ^ (anchor) in my regex, so the tag must be found at the beginning of a line. I mentioned this earlier, but felt it worth mentioning again, in case this is contributing in any way to the disconnect here. Anyhow, I'm interested in hearing from the OP again. |
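A variation on the grep idea that also copes with the anchor problem: grep -o prints only the matching part of each line, so the element reaches sed alone on its line and the anchored pattern matches again. Note that -o is not part of POSIX grep, although GNU, BSD and BusyBox grep all provide it. The file contents and variable names below are made up for illustration.

```shell
# Made-up input where <BELNR> is indented and shares its line with another tag.
input=$(mktemp)
printf '%s\n' '<E1EDK02 SEGMENT="1">' '  <QUALF>001</QUALF> <BELNR>4797413</BELNR>' '  <DATUM>20101103</DATUM>' > "$input"

# grep -o emits just "<BELNR>...</BELNR>", one match per output line,
# so the ^ and $ anchors in the sed pattern line up with the tags.
belnr=$(grep -o '<BELNR>[^<]*</BELNR>' "$input" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|')

echo "$belnr"

rm -f "$input"
```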
Since the OP says the data is XML, there should be some expectation that the data may include newlines. By default, sed works on a line-at-a-time basis. Hangdog42's advice to use a full-on XML parser seems prudent to me.
I think, too, that aharrison didn't understand that the example using 'echo' was simply to illustrate that the sed script actually worked. In practice, the sed script would read from the XML file directly. At least that was how I interpreted it. --- rod. |
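The line-at-a-time point can be shown with a record whose element is split across lines. The sketch below is only a workaround, not a substitute for a parser: it joins all lines before matching, which assumes the file is small, that line breaks inside values can be discarded, and that there are no comments, CDATA sections or attributes on the tag. The input and variable names are hypothetical, and grep -o is a non-POSIX option.

```shell
# Hypothetical record in which the element is split across lines.
input=$(mktemp)
printf '%s\n' '<E1EDK02 SEGMENT="1">' '<BELNR>' '4797413' '</BELNR>' '</E1EDK02>' > "$input"

# Line-at-a-time matching finds nothing: no single line holds both tags.
per_line=$(sed -n 's|.*<BELNR>\([^<]*\)</BELNR>.*|\1|p' "$input")

# Workaround: delete the newlines first, match on the joined text,
# then strip the tags from the match.
joined=$(tr -d '\n' < "$input" | grep -o '<BELNR>[^<]*</BELNR>' | sed 's|<[^>]*>||g')

echo "per line: '$per_line'"
echo "joined:   '$joined'"

rm -f "$input"
```

Anything beyond this simple case is where the advice to use a real XML parser applies.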