LinuxQuestions.org
Old 01-23-2007, 07:09 PM   #1
epoo
Member
 
Registered: Aug 2003
Distribution: slackware 11, ubuntu 7.04
Posts: 165

Rep: Reputation: 30
awk question - parsing xml file


This probably isn't too hard, but I've searched and for the life of me can't find the answer.
I'm trying to parse an XML file to create an RSS feed on a webpage. Right now I'm just downloading the file, and I want to pull out the lines with title and link. The problem is that each link has to follow its title, so the output looks like:
title
link
title
link

as opposed to:
title
title
link
link

I'm using awk, and I've tried 'awk /title/ || /link/ index.xml'
and all the variants I can think of, but it just hangs.
Any suggestions?
BTW, I've also tried awk '/title/ || /link/' index.xml but that just returns the whole file.

Last edited by epoo; 01-23-2007 at 07:17 PM.
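
A note on the hang, assuming the first command was typed with no quotes around the awk program: the shell then splits the line at ||, so awk gets only /title/ as its program, has no file operand, and waits on standard input. The sketch below pipes in some made-up input so it returns instead of hanging; the function names are only for illustration.

```shell
#!/bin/sh
# Made-up three-line input that stands in for the terminal.
sample() { printf 'a title line\na link line\nother\n'; }

# Unquoted: the shell parses this as `awk /title/` OR-ed with a second
# command `/link/ index.xml`. awk never sees index.xml; it filters stdin,
# and because it succeeds the second command is never attempted.
unquoted() { sample | awk /title/ || /link/ index.xml; }

# Quoted: the whole expression reaches awk as a single program.
quoted() { sample | awk '/title/ || /link/'; }

echo "unquoted:"; unquoted
echo "quoted:";   quoted
```

With a real terminal on stdin instead of the pipe, the unquoted form sits waiting for input, which looks like a hang.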
 
Old 01-23-2007, 09:05 PM   #2
jlinkels
Senior Member
 
Registered: Oct 2003
Location: Bonaire
Distribution: Debian Lenny/Squeeze/Wheezy/Sid
Posts: 4,103

Rep: Reputation: 494
It appears that if you want to look at a second line only if there has been a match on a first line, you cannot remain stateless. That is, you have to set a flag on the occurrence of the first line, and then test the second line.

Right now I can't write and test something for you, but you might want to look at this post:

http://www.linuxquestions.org/questi...d.php?t=519099

It triggers on the occurrence of a certain line, and then prints out a number of lines. In your case, you should test for the occurrence of a second match (link). If it occurs, do something; if not, reset the flag and continue.

jlinkels
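
The flag idea described above might look something like this in awk. This is only a sketch, not the code from the linked post: the patterns /title/ and /link/ come from the question, while the function name and sample data are invented.

```shell
#!/bin/sh
# Print a title line only when the very next line is a link line.
title_then_link() {
    awk '
        /title/        { held = $0; have = 1; next }   # remember the title, set the flag
        have && /link/ { print held; print }           # link follows: emit the pair
                       { have = 0 }                    # any other line resets the flag
    '
}

# "title B" is followed by junk, so it is dropped.
printf 'title A\nlink A\njunk\ntitle B\njunk\ntitle C\nlink C\n' | title_then_link
```

A title with no link directly after it is silently discarded, which may or may not be what the feed needs.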
 
Old 01-23-2007, 10:37 PM   #3
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Rep: Reputation: 72
Hi.

If you wish to match the line with title, and also extract the immediately following line, you can use grep:
Code:
#!/bin/sh

# @(#) s1       Demonstrate grep + context.

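# Input file: first argument, defaulting to data1.
F=${1-data1}

# Print each line matching "title" plus the one line after it, then
# drop the "--" separator lines that grep puts between groups.
grep --after-context=1 title "$F" |
grep -v -e '^--$'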
when run on a data file data1:
Code:
title
link
some stuff intermixed
more junk
title
link
junk1
junk2
title
link
will produce:
Code:
% ./s1
title
link
title
link
title
link
See man grep for details ... cheers, makyo
 
Old 01-23-2007, 11:19 PM   #4
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
Although it's usual for XML to have a lot of line breaks, making it possible to select specific data by matching for a string on the line, it's not always the case, and you shouldn't count on it. Probably 95% of the time, the grep approach will work. Sadly, the other 5% will strike when you are too busy to fix it, and your inbox will fill up with angry messages.

You'd do a lot better to use a proper XML parsing library. This needn't be hard work. There's a Perl module called XML::Feed which works a treat:

Code:
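#!/usr/bin/perl -w

use strict;

use URI;          # URI->new is called below
use XML::Feed;

my $feed_url = "http://www.linuxquestions.org/syndicate/lqlatest.xml";

# Given a URI object, parse() retrieves the feed from that address.
my $feed = XML::Feed->parse(URI->new($feed_url)) or die XML::Feed->errstr;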

print "The feed title is:  " . $feed->title . "\n";
foreach my $entry ($feed->entries) {
        print "Feed entry:\n";
        print "  title: " . $entry->title . "\n";
        print "  link:  " . $entry->link .  "\n\n";
}
 
Old 01-24-2007, 07:15 AM   #5
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Rep: Reputation: 72
Hi.

I found that I did not have the XML module for perl, so I sought a different solution. I cannot compare the output to that of matthewg42, but I did test it for more than one line of a title item. The format may need extra massaging depending on the final use:
Code:
#!/bin/sh

# @(#) s4       Demonstrate xml2 to text, plus grep + context.

URL="http://www.linuxquestions.org/syndicate/lqlatest.xml"

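# Fetch the feed, flatten it with xml2, then keep each title line
# plus the line after it and drop grep's "--" group separators.
wget -q -O - "$URL" |
xml2 |
grep --after-context=1 title |
grep -v -e '^--$'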
Although the output looks correct, I have not used xml2 except this once, so I am not completely confident that this will do what is required. You can test and see. I also tried it with a tighter grep pattern: 'item/(title|link)=', but it is unclear whether this produces anything better or worse.

I prefer generality (when the cost is not too high). So if you have perl + XML, the solution of matthewg42 can be used to get that nasty 5% ... cheers, makyo

( edit 1: addition )

Last edited by makyo; 01-24-2007 at 08:20 AM.
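
For anyone trying the tighter pattern mentioned above: the (title|link) alternation needs extended regular expressions, so grep wants -E. A rough sketch follows. The sample lines are invented and only assume the flat path=value style that xml2 prints; the real feed will differ.

```shell
#!/bin/sh
# Invented sample in the flat path=value style of xml2 output.
xml2_sample() {
    cat <<'EOF'
/rss/channel/title=Example feed
/rss/channel/link=http://example.com/
/rss/channel/item/title=First thread
/rss/channel/item/link=http://example.com/1
/rss/channel/item/description=some text mentioning a title
/rss/channel/item/title=Second thread
/rss/channel/item/link=http://example.com/2
EOF
}

# -E turns on the (a|b) alternation; plain grep would treat ( | ) literally.
item_titles_and_links() { grep -E 'item/(title|link)='; }

xml2_sample | item_titles_and_links
```

Anchoring on item/ skips the channel-level title and link, and the trailing = keeps the description line out even though it mentions a title.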
 
Old 01-24-2007, 09:59 AM   #6
epoo
Member
 
Registered: Aug 2003
Distribution: slackware 11, ubuntu 7.04
Posts: 165

Original Poster
Rep: Reputation: 30
Thanks for all the replies. I finally figured it out; it was just a syntax problem that took me forever to find.
Code:
awk '/<title>/ || /<link>/' index.xml
is what I was looking for. Anyway, I got my script working: it pulls the link and title lines from an XML file, removes the first two lines (which are junk in this particular file), and drops everything after the first 20 lines because I only want 10 listings. Then it swaps them around and adds the necessary tags to turn the result into a working .html file.
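
The actual script was not posted, so the following is only a guess at what such a pipeline could look like. It assumes one tag per line, each title directly followed by its link, and that "swaps them around" means putting the link ahead of the title inside an anchor tag. The function names and sample feed are made up.

```shell
#!/bin/sh
# Tiny invented feed: a channel-level title/link, then two items.
sample_feed() {
    cat <<'EOF'
<rss>
<channel>
<title>Feed</title>
<link>http://example.com/</link>
<item>
  <title>First</title>
  <link>http://example.com/1</link>
</item>
<item>
  <title>Second</title>
  <link>http://example.com/2</link>
</item>
</channel>
</rss>
EOF
}

# Read XML on stdin, write an HTML list on stdout.
feed_to_html() {
    awk '/<title>/ || /<link>/' |   # keep only title and link lines
    sed '1,2d' |                     # drop the two channel-level lines
    head -n 20 |                     # 10 listings = 20 lines
    sed -e 's/^[[:space:]]*//' -e 's/<[^>]*>//g' |   # strip indentation and tags
    awk '
        BEGIN { print "<ul>" }
        NR % 2 == 1 { title = $0; next }                          # odd lines: title
        { printf "<li><a href=\"%s\">%s</a></li>\n", $0, title }  # even lines: link
        END { print "</ul>" }
    '
}

sample_feed | feed_to_html
```

This breaks as soon as the feed puts several tags on one line, which is the weakness the XML-parser suggestion earlier in the thread is about.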
 
Old 01-24-2007, 12:52 PM   #7
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Rep: Reputation: 72
Hi, epoo.

A few observations.

1) I ran the two commands:
Code:
awk '/title/ || /link/' filename
awk '/<title>/ || /<link>/' filename
on the file from the URL that matthewg42 mentioned, and only one line differed. Of course, your URL may be considerably different.

2) It is always good to post exactly the command and the data-file (or a pointer to it, like the URL) with which you are having trouble. Entering it from memory is prone to error.

3) I found an interesting feature of GNU grep (the grep on Linux) with the after-context option. grep does not appear to blindly take the next line after every match; it takes the next line as context only when that line is not itself a match. That means there can be several consecutive matching lines (going by my grep solution above), and only after the last of them does grep gobble up the following line. I tested this with a different data file, and it seemed to work. I did not see this mentioned in the man page, but it is very useful. If someone spots that piece of information, let me know; otherwise I will continue ranting about the poor state of most Linux man pages (compared to, say, Solaris man pages).

I'm glad you got it working; best wishes ... cheers, makyo
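
A small way to see the after-context behaviour from observation 3, using invented data with two matching lines in a row. -A 1 is the short form of --after-context=1.

```shell
#!/bin/sh
# Two consecutive matches followed by two non-matching lines (made-up data).
demo_data() { printf 'title 1\ntitle 2\nlink\njunk\n'; }

# Each match is printed as a match; the single context line is only
# taken after the last match in the run.
after_context_demo() { demo_data | grep -A 1 title; }

after_context_demo
```

This prints title 1, title 2 and link: the second title is shown because it matches, not because it is context, and the one line of context is counted from the last match.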
 
Old 01-24-2007, 02:13 PM   #8
epoo
Member
 
Registered: Aug 2003
Distribution: slackware 11, ubuntu 7.04
Posts: 165

Original Poster
Rep: Reputation: 30
I was actually looking at the code both times I posted. I assume you mention errors from posting code from memory because the line I tried and the line that fixed the problem are the same.

I used wget in my script to download the .xml file. The first time I did it, the XML file looked normal, with all the proper line breaks. The wget line ran again the second time I ran my script, and after all my problems with awk I looked at the XML file: while the first one was fine, the second one, which was actually the one I was working from, was all one line with no breaks. So I'm assuming awk saw it as a single line and returned the whole file. I should've checked the XML file again before posting, but since the first one appeared fine I didn't think the second would be any different.
anyway - thanks for the reply.
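
If the download does arrive as one long line, one possible workaround is to put the line breaks back before filtering. This is a sketch with an invented one-line feed, not what the original script did; it only splits where one tag directly follows another.

```shell
#!/bin/sh
# A feed with no line breaks at all (invented content).
one_line_feed() {
    printf '%s\n' '<rss><channel><title>Feed</title><link>http://example.com/</link><item><title>A</title><link>http://example.com/a</link></item></channel></rss>'
}

# Insert a newline between adjacent tags, then filter as before.
split_and_filter() {
    awk '{ gsub(/></, ">\n<"); print }' |
    awk '/<title>/ || /<link>/'
}

# Without the split, the single line matches and the whole file comes back.
one_line_feed | awk '/<title>/ || /<link>/' | wc -l
# With the split, only the title and link lines remain.
one_line_feed | split_and_filter
```

On input that already has proper line breaks the gsub step changes little, so it should be safe to leave in the pipeline either way.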
 
  

