LinuxQuestions.org
Old 11-12-2010, 02:40 PM   #1
aharrison
LQ Newbie
 
Registered: Nov 2010
Posts: 3

Rep: Reputation: 0
Extract Data between XML tags


Hi All

I was wondering if someone could help me out. I've been trying various commands like sed, awk and grep (in a shell script) but haven't had any luck. I'm trying to extract the data between XML tags such as <BELNR>4797413</BELNR>, but the data inside the tag can be of variable length.
Any help would be great.

aharrison
 
Old 11-12-2010, 02:48 PM   #2
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 551
Code:
root@reactor: echo "<BELNR>4797413</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'
4797413
You could use sed as illustrated here; but if doing any significant amount of parsing of markup languages like XML, you might want to look into tools that are specifically targeted at this sort of parsing, such as `xmlgawk` or Perl (which has a library for this as I recall).

PS - Note that the code I gave here assumes that this tag is the only thing on a given line of the file (I've anchored it to the start and end of the line). You'll need to tune the regex a little if there is (or can be) other stuff on the line where the tag is.
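
If the tag can be indented or share a line with other text, one way to tune the regex is to drop the anchors and match around the tag. A minimal sketch, assuming at most one BELNR element per line and a value that contains no '<'; the function name extract_belnr is made up for illustration:

```shell
# Hypothetical helper: print the text between <BELNR> and </BELNR>
# for every input line that contains the pair, wherever it sits on the line.
extract_belnr() {
    # -n suppresses the default output; the trailing p prints only lines
    # where the substitution matched. [^<]* keeps the match from running
    # past the closing tag.
    sed -n 's|.*<BELNR>\([^<]*\)</BELNR>.*|\1|p'
}

indented=$(printf '   <BELNR>4797413</BELNR>\n' | extract_belnr)
inline=$(printf '<QUALF>001</QUALF><BELNR>4797413</BELNR><DATUM>20101103</DATUM>\n' | extract_belnr)
echo "$indented"
echo "$inline"
```

Lines without the tag produce no output at all, so the same command can be pointed at a whole file.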

Last edited by GrapefruiTgirl; 11-12-2010 at 02:54 PM.
 
Old 11-16-2010, 01:36 PM   #3
aharrison
LQ Newbie
 
Registered: Nov 2010
Posts: 3

Original Poster
Rep: Reputation: 0
Thank you but the problem is the data will be different between the <BELNR> tags


<E1EDK02 SEGMENT="1">
<QUALF>001</QUALF>
<BELNR>4797413</BELNR>
<DATUM>20101103</DATUM>
 
Old 11-16-2010, 03:32 PM   #4
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 551
Quote:
Originally Posted by aharrison
Thank you but the problem is the data will be different between the <BELNR> tags

<E1EDK02 SEGMENT="1">
<QUALF>001</QUALF>
<BELNR>4797413</BELNR>
<DATUM>20101103</DATUM>
OK, then what is the problem? The code I showed you will return anything between <BELNR> and </BELNR>.

Perhaps you meant to word the problem differently?
 
Old 11-16-2010, 05:26 PM   #5
Hangdog42
LQ Veteran
 
Registered: Feb 2003
Location: Maryland
Distribution: Slackware
Posts: 7,803
Blog Entries: 1

Rep: Reputation: 416
Quote:
Originally Posted by GrapefruiTgirl
but if doing any significant amount of parsing of markup languages like XML, you might want to look into tools that are specifically targeted at this sort of parsing, such as `xmlgawk` or Perl (which has a library for this as I recall).
Perl is actually pretty good at dealing with XML. There is the basic XML::Parser library, and there are variations such as XML::DOM or PerlSAX. Using these sorts of libraries makes dealing with XML pretty trivial, and they are worth the time needed to learn.
 
Old 11-16-2010, 09:00 PM   #6
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,240

Rep: Reputation: 2324
See also XML::Twig, XML::Simple (Perl).
As originally mentioned by GrapefruiTgirl, unless it's a trivial XML file, do use a proper parser; otherwise you'll end up tearing your hair out.
 
Old 11-17-2010, 11:54 AM   #7
aharrison
LQ Newbie
 
Registered: Nov 2010
Posts: 3

Original Poster
Rep: Reputation: 0
In the echo command you listed, I won't know the value between the BELNR tags; the number will constantly change.

echo "<BELNR>4797413</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'
 
Old 11-17-2010, 12:38 PM   #8
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 551
Yes, well, again, I fail to see a problem. Watch:
Code:
echo "<BELNR>4797413</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'
4797413

echo "<BELNR>Happy Birthday.</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'
Happy Birthday.

echo "<BELNR>Big piles of numbers: 3473278483749623746782364</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'
Big piles of numbers: 3473278483749623746782364
So, each time, the data changed, but its value was still returned successfully. It doesn't matter that the data has changed. Whatever the data is between the tags, it will be returned.

Maybe you wish to save this value in a variable?
Code:
shell$ VARIABLE=$(echo "<BELNR>Big piles of numbers: 3473278483749623746782364</BELNR>" | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|')
shell$ echo "$VARIABLE"
Big piles of numbers: 3473278483749623746782364
shell$
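
The same idea works when the value is not known in advance and comes from a file. A rough sketch, assuming the data looks like the fragment posted earlier with each tag on its own line; the temporary file and the variable name belnr are only for illustration:

```shell
# Build a throwaway sample file shaped like the posted fragment
# (the closing </E1EDK02> tag is an assumption; it was not in the post).
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<E1EDK02 SEGMENT="1">
<QUALF>001</QUALF>
<BELNR>4797413</BELNR>
<DATUM>20101103</DATUM>
</E1EDK02>
EOF

# sed reads the file itself; -n plus the p flag prints only the matching
# line, so whatever sits between the tags at run time lands in the variable.
belnr=$(sed -n 's|^<BELNR>\(.*\)</BELNR>$|\1|p' "$xml_file")
echo "BELNR is: $belnr"

rm -f "$xml_file"
```

Nothing in the script refers to 4797413 itself; the number only appears in the sample data.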
 
Old 11-17-2010, 01:15 PM   #9
Hangdog42
LQ Veteran
 
Registered: Feb 2003
Location: Maryland
Distribution: Slackware
Posts: 7,803
Blog Entries: 1

Rep: Reputation: 416
Quote:
Originally Posted by GrapefruiTgirl
Yes, well, again, I fail to see a problem.. Watch:
I'm guessing here, but I suspect what aharrison is saying is that the values between the tags may not be known prior to runtime. Essentially, all you know prior to runtime is that the info you need is between <BELNR> and </BELNR>. So to pass the right stuff to sed, you need to spend a bit of time parsing the file.

Which brings me back to the point that if you're dealing with XML, bash is almost certainly not the way to go. Using a language that has a decent set of XML libraries is going to save tons of headaches, particularly since they make parsing XML so simple.
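
For comparison, a command-line XML parser can also be called from a shell script. This is only a sketch, and it swaps in xmllint (part of libxml2) rather than one of the Perl modules mentioned in this thread; it assumes xmllint is installed with --xpath support and that the fragment is completed into well-formed XML with a closing </E1EDK02> tag:

```shell
# Sample file shaped like the posted fragment, closed off so it parses.
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<E1EDK02 SEGMENT="1">
<QUALF>001</QUALF>
<BELNR>4797413</BELNR>
<DATUM>20101103</DATUM>
</E1EDK02>
EOF

if command -v xmllint >/dev/null 2>&1; then
    # string(//BELNR) yields the text of the first BELNR element,
    # regardless of how the file is broken into lines.
    belnr=$(xmllint --xpath 'string(//BELNR)' "$xml_file")
    echo "$belnr"
else
    echo "xmllint not found; skipping" >&2
fi

rm -f "$xml_file"
```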
 
Old 11-17-2010, 01:25 PM   #10
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 551
Hangdog,

Thanks for trying to clarify this for me. Unfortunately (for me!), your attempt did not make it any clearer to me what is wrong here.

What I would like to see, is for the OP to show us several examples of the input data, and demonstrate on that data, what the problem is with the code that's been offered so far, and how this "different data between the tags" affects program operation...

Maybe it's just me being very dense, but I haven't a clue here, if I'm missing something very simple or what?
 
Old 11-17-2010, 02:04 PM   #11
Hangdog42
LQ Veteran
 
Registered: Feb 2003
Location: Maryland
Distribution: Slackware
Posts: 7,803
Blog Entries: 1

Rep: Reputation: 416
Quote:
Originally Posted by GrapefruiTgirl
Maybe it's just me being very dense, but I haven't a clue here, if I'm missing something very simple or what?
It's equally likely that I'm making mistaken assumptions too. I think the disconnect may actually be before your echo statement. In other words, how do you pull the lines with <BELNR> out of the larger file and feed them into sed? In my experience with XML, frequently the only thing I have to go on is the XML schema, which tells you what tags you have and how those tags relate to each other, but says nothing about the information contained either between the tags or in attributes. So in this case, pretty much all we would know would be something like this:

Code:
<E1EDK02 SEGMENT>
<QUALF></QUALF>
<BELNR></BELNR>
<DATUM></DATUM>
So we know there are four different tags, and one of those can have an attribute. In bash, if the file actually looked like I have it above, it would be pretty easy to pull out any line with the <BELNR> tag, in which case your code works great. Where I think your echo approach falls apart is if we're dealing with a file that looks like this:

Code:
<E1EDK02 SEGMENT><QUALF></QUALF><BELNR></BELNR><DATUM></DATUM>
Which (potentially) is valid XML (or at least I've had to deal with files like this). In this case being able to echo <BELNR>...</BELNR> is going to take a bit of work. It certainly can be done in bash, but it is pretty trivial to do it in a proper XML parser.

So that is my guess, and I think you're right, the OP probably needs to add a bit. I know I would choose the perl route, but that may also be because I'm a lot better with perl than I am with bash.
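
For the single-line layout, one rough way to do it with the standard tools is to let grep print only the matching piece and then strip the tags. A sketch, assuming a grep that supports -o (GNU and BSD grep do; it is not in POSIX) and values that contain no '<':

```shell
# One line holding several elements, as in the single-line example,
# with made-up sample values and a closing tag filled in.
line='<E1EDK02 SEGMENT="1"><QUALF>001</QUALF><BELNR>4797413</BELNR><DATUM>20101103</DATUM></E1EDK02>'

# grep -o prints each <BELNR>...</BELNR> match on its own line;
# sed then removes the opening and closing tags.
belnr=$(printf '%s\n' "$line" | grep -o '<BELNR>[^<]*</BELNR>' | sed 's|</*BELNR>||g')
echo "$belnr"
```

This still treats XML as plain text, so it breaks on things like nested markup or CDATA sections, which is where a real parser earns its keep.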
 
Old 11-17-2010, 03:20 PM   #12
martinbc
Member
 
Registered: Jun 2010
Distribution: Ubuntu, played with Puppy Slitaz & OpenSUSE
Posts: 40

Rep: Reputation: 4
Whatever the source of the data, it can be piped into GrapefruiTgirl's code. For example:
Code:
cat file | sed 's|^<BELNR>\(.*\)</BELNR>$|\1|'
If other lines just need to be ignored completely, with no output, a slight change should work:
Code:
source | sed -n 's|^<BELNR>\(.*\)</BELNR>$|\1|p'

Code:
<E1EDK02 SEGMENT><QUALF></QUALF><BELNR></BELNR><DATUM></DATUM>
Admittedly this is harder to parse in bash but that wasn't how aharrison's data looked.

Martin
 
Old 11-17-2010, 03:26 PM   #13
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 551
Plus, we could grep the file first if we wanted, to filter out all but the <BELNR> lines:
Code:
grep '<BELNR>' input_file | sed ...
So that would use grep to find only lines with the BELNR tag, and stuff that data into the sed, which would return just the stuff between the tags.

Plus, remember that I put the ^ (anchor) in my regex, so the tag must be found at the beginning of a line - I mentioned this earlier, but felt it worth mentioning again, in case this is contributing in any way to the disconnect here.

Anyhow.. Interested in hearing from OP again..

Last edited by GrapefruiTgirl; 11-17-2010 at 03:28 PM.
 
Old 11-17-2010, 08:28 PM   #14
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,396
Blog Entries: 2

Rep: Reputation: 908
Since the OP says the data is XML, there should be some expectation that the data may include newlines. By default, sed works on a line-at-a-time basis. Hangdog42's advice to use a full-on XML parser seems prudent to me.
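
To illustrate the newline issue: if a value is wrapped across lines, the line-based sed command prints nothing. A crude workaround, sketched here assuming grep supports -o and that joining lines with spaces is acceptable for the data, is to flatten the input first:

```shell
# A BELNR element whose content is split across two lines (made-up sample).
wrapped='<BELNR>4797413
</BELNR>'

# Line-at-a-time sed finds no single line holding both tags,
# so nothing is printed.
per_line=$(printf '%s\n' "$wrapped" | sed -n 's|.*<BELNR>\(.*\)</BELNR>.*|\1|p')

# Flatten newlines to spaces, pull out the element, then strip the tags
# and any leading or trailing spaces left over from the join.
flattened=$(printf '%s\n' "$wrapped" | tr '\n' ' ' |
    grep -o '<BELNR>[^<]*</BELNR>' | sed 's|</*BELNR>||g; s|^ *||; s| *$||')

echo "per line: '$per_line'"
echo "flattened: '$flattened'"
```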

I think, too, that aharrison didn't understand that the example using 'echo' was simply to illustrate that the sed script actually worked. In practice, the sed script would read from the XML file directly. At least that was how I interpreted it.

--- rod.
 
  


All times are GMT -5.