LinuxQuestions.org


epoo 01-23-2007 07:09 PM

awk question - parsing xml file
 
This probably isn't too hard, but I've searched and for the life of me can't find the answer.
I'm trying to parse an XML file to create an RSS feed on a webpage. Right now I'm just downloading the file, and I want to pull out the lines with title and link. The problem is that each link has to follow its title, so the output looks like:
title
link
title
link

as opposed to:
title
title
link
link

I'm using awk, and I've tried 'awk /title/ || /link/ index.xml
and all the variants I can think of, but it just hangs.
Any suggestions?
BTW, I've also tried awk '/title/ || /link/' index.xml but that just returns the whole thing.

jlinkels 01-23-2007 09:05 PM

It appears that if you want to look at a second line only if there has been a match on a first line, you cannot remain stateless. That is, you have to set a flag on the occurrence of the first line, and then test the second line.

Right now I can't write and test something for you, but you might want to look at this post:

http://www.linuxquestions.org/questi...d.php?t=519099

It triggers on the occurrence of a certain line, and then prints out a number of lines. In your case, you should test for the occurrence of a second match (link). If it occurs, do something; if not, reset the flag and continue.

jlinkels
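A minimal sketch of the flag approach described above, written in awk. The function name title_then_link and the patterns <title> and <link> are assumptions for illustration, and the sketch assumes each element sits on its own line. A title line sets the flag; the next line is printed together with the title only if it is a link, and any other line resets the flag.

```shell
#!/bin/sh

# Hypothetical helper: print a title line only when the very next line is a link.
# Reads the files given as arguments, or standard input if there are none.
title_then_link() {
    awk '
        # A title line: remember it and set the flag, then go to the next line.
        /<title>/            { held = $0; have_title = 1; next }
        # A link line directly after a title: print the pair.
        have_title && /<link>/ { print held; print $0 }
        # Any non-title line clears the flag, whether or not it was a link.
                             { have_title = 0 }
    ' "$@"
}

# Usage example on inline data: only the A and C pairs are printed.
printf '%s\n' '<title>A</title>' '<link>1</link>' 'junk' \
    '<link>orphan</link>' '<title>B</title>' 'junk' \
    '<title>C</title>' '<link>3</link>' | title_then_link
```

The orphan link and the title B (which is followed by junk rather than a link) are both dropped, which is the "reset the flag and continue" behaviour.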

makyo 01-23-2007 10:37 PM

Hi.

If you wish to match the line with title, and also extract the immediately following line, you can use grep:
Code:

#!/bin/sh

# @(#) s1      Demonstrate grep + context.

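# Input file: the first argument, or data1 if none is given.
F=${1-data1}

# Print each line matching "title" plus the line after it, then drop the
# "--" separator lines that grep prints between non-adjacent groups.
grep --after-context=1 title "$F" |
grep -v -e '^--$'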

when run on a data file data1:
Code:

title
link
some stuff intermixed
more junk
title
link
junk1
junk2
title
link

will produce:
Code:

% ./s1
title
link
title
link
title
link

See man grep for details ... cheers, makyo

matthewg42 01-23-2007 11:19 PM

Although it's usual for XML to have a lot of line breaks, making it possible to select specific data by matching for a string on the line, it's not always the case, and you shouldn't count on it. Probably 95% of the time, the grep approach will work. Sadly, the other 5% will strike when you are too busy to fix it, and your inbox will fill up with angry messages.

You'd do a lot better to use a proper XML parsing library. This needn't be hard work. There's a Perl module called XML::Feed which works a treat:

Code:

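#!/usr/bin/perl -w

use strict;

use URI;        # loaded explicitly because URI->new is called below
use XML::Feed;

my $feed_url = "http://www.linuxquestions.org/syndicate/lqlatest.xml";

# Given a URI object, parse() fetches and parses the feed; it returns undef on failure.
my $feed = XML::Feed->parse(URI->new($feed_url)) or die XML::Feed->errstr;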

print "The feed title is:  " . $feed->title . "\n";
foreach my $entry ($feed->entries) {
        print "Feed entry:\n";
        print "  title: " . $entry->title . "\n";
        print "  link:  " . $entry->link .  "\n\n";
}


makyo 01-24-2007 07:15 AM

Hi.

I found that I did not have the XML module for perl, so I sought a different solution. I cannot compare the output to that of matthewg42, but I did test it for more than one line of a title item. The format may need extra massaging depending on the final use:
Code:

#!/bin/sh

# @(#) s4      Demonstrate xml2 to text, plus grep + context.

URL="http://www.linuxquestions.org/syndicate/lqlatest.xml"

wget -q -O - "$URL" |
xml2 |
grep --after-context=1 title |
grep -v -e '^--$'

Although the output looks correct, I have not used xml2 except this once, so I am not completely confident that this will do what is required. You can test and see. I also tried it with a tighter grep pattern: 'item/(title|link)=', but it is unclear whether this produces anything better or worse.
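A small sketch of the tighter pattern mentioned above. The alternation (title|link) needs extended regular expressions, so grep is given -E. The sample lines are an assumption about what xml2 prints for an RSS feed (one path=value line per text node); they are made up for illustration and are not output captured from the real feed. The function name pick_items is also hypothetical.

```shell
#!/bin/sh

# Hypothetical helper: keep only per-item title and link lines from
# xml2-style "path=value" text read on standard input.
pick_items() {
    grep -E 'item/(title|link)='
}

# Usage example on assumed xml2-style lines: the channel title and the
# item description are filtered out, the item title and link are kept.
printf '%s\n' \
    '/rss/channel/title=Example feed' \
    '/rss/channel/item/title=First post' \
    '/rss/channel/item/link=http://example.com/1' \
    '/rss/channel/item/description=Some text' | pick_items
```

Because the pattern selects the link lines directly, this variant does not depend on the link being the line immediately after the title.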

I prefer generality (when the cost is not too high). So if you have perl + XML, the solution of matthewg42 can be used to get that nasty 5% ... cheers, makyo

( edit 1: addition )

epoo 01-24-2007 09:59 AM

Thanks for all the replies. I finally figured it out; it was just a syntax problem that took me forever to find.
Code:

awk '/<title>/ || /<link>/' index.xml
is what I was looking for. Anyway, I got my script working: it pulls the link and title lines from an XML file, removes the first two lines (which are junk in this particular file), and drops everything after the first 20 lines because I only want 10 listings. Then it swaps them around and adds the necessary tags to make it into a working .html file.
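The script itself was not posted, so the following is only a rough sketch of the steps described, under several assumptions: every element is on its own line, each title is followed by its link, the first two matching lines are the feed's own title and link, and the output markup (a list item with an anchor) is invented for illustration. The function name feed_to_html is hypothetical.

```shell
#!/bin/sh

# Hypothetical helper: turn line-per-element feed XML on standard input
# into HTML list items, one per title/link pair.
feed_to_html() {
    # 1. keep only title and link lines
    # 2. drop the first two of them (assumed to be the feed's own title/link)
    # 3. keep the next 20 lines, i.e. 10 title/link pairs
    # 4. strip leading whitespace and all tags, leaving just the text
    # 5. pair each title (odd line) with its link (even line)
    awk '/<title>/ || /<link>/' |
    tail -n +3 |
    head -n 20 |
    sed -e 's/^[[:space:]]*//' -e 's/<[^>]*>//g' |
    awk 'NR % 2 == 1 { title = $0; next }
         { printf "<li><a href=\"%s\">%s</a></li>\n", $0, title }'
}

# Usage example on a tiny inline feed.
printf '%s\n' '<rss>' '<channel>' '<title>Feed</title>' \
    '<link>http://example.com/</link>' '<item>' '<title>One</title>' \
    '<link>http://example.com/1</link>' '</item>' '</channel>' '</rss>' |
    feed_to_html
```

Titles wrapped in CDATA sections or containing markup would need more care than the simple tag-stripping sed expression gives them.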

makyo 01-24-2007 12:52 PM

Hi, epoo.

A few observations.

1) I ran the two commands:
Code:

awk '/title/ || /link/' filename
awk '/<title>/ || /<link>/' filename

on the file from the URL that matthewg42 mentioned, and only one line differed. Of course, your URL may be considerably different.

2) It is always good to post exactly the command and the data-file (or a pointer to it, like the URL) with which you are having trouble. Entering it from memory is prone to error.

3) I found an interesting feature of Linux grep with the after-context option. The grep does not appear to blindly take the next line after a match, but only when there is not a further match. That means that there could be several "line" matches (going by my grep solution above), and only when there is not another match does it gobble up the next line. I tested this with a different data file, and it seemed to work. I did not see this mentioned in the man page, but it is very useful -- if someone spots that piece of information, let me know, otherwise I will continue ranting about the poor state of most Linux man pages (compared to, say, Solaris man pages).
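The behaviour in observation 3 can be reproduced with a few lines of inline data. In this sketch (the function name title_context is made up for illustration), two matching lines are adjacent: grep prints the second one as a match in its own right and starts the one-line context count again from it, so the context line comes after the last match of the run. The short option -A 1 is used in place of --after-context=1.

```shell
#!/bin/sh

# Hypothetical helper: print lines matching "title" plus one line of
# trailing context after each match.
title_context() {
    grep -A 1 title
}

# Usage example: two adjacent matches, then a link, then an unrelated line.
printf '%s\n' 'title' 'title' 'link' 'junk' | title_context
```

This prints title, title, link: both matches appear, followed by the single line after the second one, and junk is left out.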

I'm glad you got it working; best wishes ... cheers, makyo

epoo 01-24-2007 02:13 PM

I was actually looking at the code both times I posted. I assume you say there can be errors when posting code from memory because the line I tried and the line that fixed the problem are the same. I used wget in my script to download the .xml file. The first time I ran it, the XML file looked normal, with all the proper line breaks. The wget line ran again the second time I ran my script, and after all my problems with awk I looked at the XML file again: while the first download was fine, the second one, which was the one I was actually working from, was all on one line with no breaks. So I'm assuming awk saw it as a single line and returned the whole file. I should've checked the XML file again before posting, but since the first one appeared fine I didn't think the second would be any different.
Anyway, thanks for the reply.
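For the case described here, where the whole file arrives as one line, one possible awk workaround is to stop relying on newlines and split records on the "<" character instead. This is only a sketch, not a substitute for a real XML parser: it assumes plain RSS-style <title> and <link> elements with no attributes and no CDATA sections, and the function name one_line_titles_links is made up for illustration.

```shell
#!/bin/sh

# Hypothetical helper: print the text of every <title> and <link> element,
# regardless of where (or whether) the input has line breaks.
one_line_titles_links() {
    awk '
        # Split records at each "<", so a record looks like "title>Some text".
        BEGIN { RS = "<" }
        # For opening title/link tags, remove the tag name and print the text.
        /^(title|link)>/ { sub(/^[^>]*>/, ""); print }
    ' "$@"
}

# Usage example: a feed fragment squeezed onto a single line.
printf '%s\n' '<rss><channel><item><title>A</title><link>http://example.com/1</link></item><item><title>B</title><link>http://example.com/2</link></item></channel></rss>' |
    one_line_titles_links
```

The output is one value per line (A, its link, B, its link), so the title/link ordering survives even when the download has no line breaks.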


All times are GMT -5.