awk question - parsing xml file
This probably isn't too hard, but I've searched and for the life of me can't find the answer.
I'm trying to parse an XML file to create an RSS feed on a webpage. Right now I'm just downloading the file, and I want to pull out the lines with title and link. The problem is that the link has to follow the title, so it looks like: title link title link, as opposed to: title title link link. I'm using awk, and I've tried awk /title/ || /link/ index.xml and all the variants I can think of, but it just hangs. Any suggestions? BTW, I've also tried awk '/title/ || /link/' index.xml but that just returns the whole thing. |
It appears that if you want to look at a second line only when there has been a match on a first line, you cannot remain stateless. That is, you have to set a flag on the occurrence of the first line, and then test the second line.
Right now I can't write and test something for you, but you might want to look at this post: http://www.linuxquestions.org/questi...d.php?t=519099 It triggers on the occurrence of a certain line, and then prints out a number of lines. In your case, you should test for the occurrence of a second match (link). If it occurs, do something; if not, reset the flag and continue. jlinkels |
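The flag idea described above could look roughly like this in awk. This is only a minimal sketch: the sample feed, the function name print_pairs and the variable have_title are made up for illustration, since the real index.xml is not shown in the thread.

```shell
#!/bin/sh
# Hypothetical sample feed; the real index.xml is not shown in the thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<item>
<title>First post</title>
<link>http://example.com/1</link>
<description>some text</description>
</item>
<item>
<title>Second post</title>
<pubDate>today</pubDate>
<link>http://example.com/2</link>
</item>
EOF

# print_pairs FILE (hypothetical helper): print a <title> line only when
# the very next line is a <link> line.
print_pairs() {
    awk '
        # Remember the title, set the flag, and move on to the next line.
        /<title>/ { title = $0; have_title = 1; next }
        # The link directly follows a title: print the pair.
        have_title && /<link>/ { print title; print $0 }
        # Every non-title line clears the flag.
        { have_title = 0 }
    ' "$1"
}

print_pairs "$sample"
```

With this sample only the first item is printed, because in the second item another line sits between the title and the link.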
Hi.
If you wish to match the line with title, and also extract the immediately following line, you can use grep:
Code:
#!/bin/sh
Code:
title
Code:
% ./s1 |
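The script itself did not survive in this copy of the thread, so what follows is only a minimal sketch of the grep idea and not the original code. It assumes a grep that supports the after-context option -A (GNU grep does) and uses a made-up sample file and helper name.

```shell
#!/bin/sh
# Hypothetical sample feed, for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<item>
<title>First post</title>
<link>http://example.com/1</link>
<description>text</description>
</item>
<item>
<title>Second post</title>
<link>http://example.com/2</link>
</item>
EOF

# title_and_next FILE (hypothetical helper): print each <title> line plus
# the one line after it. grep prints "--" between separate groups of
# output, so the second grep filters those separator lines out.
title_and_next() {
    grep -A 1 '<title>' "$1" | grep -v '^--$'
}

title_and_next "$sample"
```

Note that this takes whatever line follows the title, whether or not it is a link.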
Although it's usual for XML to have a lot of line breaks, making it possible to select specific data by matching for a string on the line, it's not always the case, and you shouldn't count on it. Probably 95% of the time, the grep approach will work. Sadly, the other 5% will strike when you are too busy to fix it, and your inbox will fill up with angry messages.
You'd do a lot better to use a proper XML parsing library. This needn't be hard work. There's a Perl module called XML::Feed which works a treat: Code:
#!/usr/bin/perl -w |
Hi.
I found that I did not have the XML module for perl, so I sought a different solution. I cannot compare the output to that of matthewg42, but I did test it for more than one line of a title item. The format may need extra massaging depending on the final use: Code:
#!/bin/sh
I prefer generality (when the cost is not too high). So if you have perl + XML, the solution of matthewg42 can be used to get that nasty 5% ... cheers, makyo ( edit 1: addition ) |
Thanks for all the replies. I finally figured it out; it was just a syntax problem that took me forever to find.
Code:
awk '/<title>/ || /<link>/' index.xml |
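A note on why the quoting and the angle brackets both matter, with a small sketch. Without the single quotes, the shell treats || as its own OR operator, so awk receives only /title/ and no file name and waits on standard input, which would explain a hang. With quotes but bare words, any line that merely mentions "title" or "link" matches. The sample data below is invented for illustration.

```shell
#!/bin/sh
# Hypothetical sample; the description line mentions both words.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<item>
<title>First post</title>
<link>http://example.com/1</link>
<description>a link to the title page</description>
</item>
EOF

# Bare words: matches every line containing "title" or "link" anywhere.
loose=$(awk '/title/ || /link/' "$sample" | wc -l)

# Tag patterns: matches only lines containing the opening tags.
strict=$(awk '/<title>/ || /<link>/' "$sample" | wc -l)

echo "lines matched by bare words: $loose"
echo "lines matched by tag patterns: $strict"
```

On this sample the bare-word pattern also picks up the description line, while the tag pattern does not.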
Hi, epoo.
A few observations.
1) I ran the two commands: Code:
awk '/title/ || /link/' filename
2) It is always good to post exactly the command and the data file (or a pointer to it, like the URL) with which you are having trouble. Entering it from memory is prone to error.
3) I found an interesting feature of Linux grep with the after-context option. The grep does not appear to blindly take the next line after a match, but only when there is not a further match. That means that there could be several "line" matches (going by my grep solution above), and only when there is not another match does it gobble up the next line. I tested this with a different data file, and it seemed to work. I did not see this mentioned in the man page, but it is very useful -- if someone spots that piece of information, let me know, otherwise I will continue ranting about the poor state of most Linux man pages (compared to, say, Solaris man pages).
I'm glad you got it working; best wishes ... cheers, makyo |
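The after-context behaviour in observation 3 can be checked with a tiny made-up file. This is a sketch that assumes a grep with the -A option (GNU grep has it); the file contents and the helper name after_context are invented.

```shell
#!/bin/sh
# Hypothetical data: two matching lines in a row, then two non-matching lines.
sample=$(mktemp)
printf '%s\n' 'title one' 'title two' 'link for two' 'unrelated' > "$sample"

# after_context FILE (hypothetical helper): matching lines plus one line
# of trailing context.
after_context() {
    grep -A 1 'title' "$1"
}

after_context "$sample"
```

Here "title two" is printed because it matches in its own right, and the single context line is then taken after it, so "link for two" appears and "unrelated" does not.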
I was actually looking at the code both times I posted. I assume you say that there can be errors when posting code from memory because the line I tried and the line that fixed the problem are the same. I used wget in my script to download the .xml file. The first time I did it, the XML file looked normal, with all the proper line breaks. The wget line ran again the second time I ran my script, but after all my problems with awk I looked at the XML file, and while the first download was fine, the second one, which was actually the one I was working from, was all one line with no breaks. So I'm assuming awk saw it as a single line and returned the whole file. I should've checked the XML file again before posting, but since the first one appeared fine I didn't think the second one would be any different.
Anyway, thanks for the reply. |
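For a download that arrives as one long line, one possible workaround (an assumption, not something tried in the thread) is to make awk split records on "<" instead of on newlines, so each tag starts its own record. The sample feed and the helper name extract are invented. The sketch does not handle tags with attributes, CDATA sections or nested markup, so a real XML parser as suggested earlier remains the more robust route.

```shell
#!/bin/sh
# Hypothetical single-line feed, for illustration only.
sample=$(mktemp)
printf '%s\n' '<rss><channel><item><title>First post</title><link>http://example.com/1</link></item><item><title>Second post</title><link>http://example.com/2</link></item></channel></rss>' > "$sample"

# extract FILE (hypothetical helper): with RS set to "<", every record
# begins with a tag name, so an opening title or link tag shows up as a
# record starting with "title>" or "link>".
extract() {
    awk '
        BEGIN { RS = "<" }
        /^title>/ { sub(/^title>/, ""); print "title: " $0 }
        /^link>/  { sub(/^link>/, "");  print "link: " $0 }
    ' "$1"
}

extract "$sample"
```

Because records are processed in file order, each link is printed right after its title, which is the title link title link layout asked for at the top of the thread.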