this probably isn't too hard, but i've searched and for the life of me can't find the answer.
i'm trying to parse an xml file to create an rss feed on a webpage. right now i'm just downloading the file, and i want to pull out the lines with title and link. the problem is that the link has to follow the title, so it looks like:
title
link
title
link
as opposed to:
title
title
link
link
i'm using awk, and i've tried 'awk /title/ || /link/ index.xml
and all variants i can think of, but it just hangs.
any suggestions?
btw - i've tried awk '/title/ || /link/' index.xml but that just returns the whole thing.
It appears that if you want to look at a second line only if there has been a match on a first line, you cannot remain stateless. That is, you have to set a flag on the occurrence of the first line, and then test the second line.
Right now I can't write and test something for you, but you might want to look at this post:
It triggers on the occurrence of a certain line, and then prints out a number of lines. In your case, you should test for the occurrence of a second match (link). If it occurs, do something; if not, reset the flag and continue.
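A minimal sketch of that flag idea in awk, untested against your actual feed. The function name pair_title_link is made up for illustration, and it assumes each title and link sits on its own line, with the link on the line directly after the title:

```shell
#!/bin/sh
# Hypothetical helper: print each <title> line together with the <link> line
# that immediately follows it. A title with no link right after it is dropped.
pair_title_link() {
    awk '
        /<title>/ { held = $0; have_title = 1; next }    # remember the title and set the flag
        have_title && /<link>/ { print held; print $0 }  # link right after a title: emit the pair
        { have_title = 0 }                               # any non-title line resets the flag
    ' "$@"
}

# Example input: the second title is followed by another element, not a link.
printf '%s\n' \
    '<title>A</title>' \
    '<link>http://example.com/a</link>' \
    '<title>B</title>' \
    '<description>x</description>' \
    '<link>http://example.com/b</link>' |
pair_title_link
```

With this sample input only the A pair is printed, because the flag set by the B title is reset by the description line before the link arrives. To run it on a file instead, pass the file name, e.g. pair_title_link index.xml.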
Although it's usual for XML to have a lot of line breaks, making it possible to select specific data by matching for a string on the line, it's not always the case, and you shouldn't count on it. Probably 95% of the time, the grep approach will work. Sadly, the other 5% will strike when you are too busy to fix it, and your inbox will fill up with angry messages.
You'd do a lot better to use a proper XML parsing library. This needn't be hard work. There's a Perl module called XML::Feed which works a treat:
Code:
#!/usr/bin/perl
use strict;
use warnings;
use URI;        # explicit import for URI->new below
use XML::Feed;
my $feed_url = "http://www.linuxquestions.org/syndicate/lqlatest.xml";
# Passing a URI object makes XML::Feed fetch and parse the remote feed.
my $feed = XML::Feed->parse(URI->new($feed_url)) or die XML::Feed->errstr;
print "The feed title is: " . $feed->title . "\n";
# Print the title and link of every entry, in feed order.
foreach my $entry ($feed->entries) {
    print "Feed entry:\n";
    print " title: " . $entry->title . "\n";
    print " link: " . $entry->link . "\n\n";
}
I found that I did not have the XML module for perl, so I sought a different solution. I cannot compare the output to that of matthewg42, but I did test it for more than one line of a title item. The format may need extra massaging depending on the final use:
Code:
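#!/bin/sh
# @(#) s4 Demonstrate xml2 to text, plus grep + context.
URL="http://www.linuxquestions.org/syndicate/lqlatest.xml"
# Fetch the feed, flatten it with xml2, keep each title line plus the line after it.
wget -q -O - "$URL" |
xml2 |
grep --after-context=1 title |
grep -v -e '^--$'    # drop only grep's "--" group separators, not titles containing "--"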
Although the output looks correct, I have not used xml2 except this once, so I am not completely confident that this will do what is required. You can test and see. I also tried it with a tighter grep pattern: 'item/(title|link)=', but it is unclear whether this produces anything better or worse.
I prefer generality (when the cost is not too high). So if you have perl + XML, the solution of matthewg42 can be used to get that nasty 5% ... cheers, makyo
thanks for all the replies. i finally figured it out, it was just a syntax problem that took me forever to find.
Code:
awk '/<title>/ || /<link>/' index.xml
is what i was looking for. anyways, i got my script working so it pulls the link and title lines from an xml file, removes the first two lines (which are junk in this particular file) and everything after the first 20 lines because i only want 10 listings. then it swaps them around and adds the necessary tags to make it into a working .html file.
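For anyone landing here later, a rough sketch of what such a pipeline could look like. This is not the poster's actual script: the function name feed_to_html and the HTML markup are assumptions, and it assumes the titles and links strictly alternate, one per line, with no CDATA wrappers:

```shell
#!/bin/sh
# Hypothetical sketch: turn title/link line pairs from a feed into an HTML list.
feed_to_html() {
    awk '/<title>/ || /<link>/' "$1" |   # keep only the title and link lines
    tail -n +3 |                          # drop the first two lines (channel-level title/link)
    head -n 20 |                          # 20 lines = 10 title/link pairs
    sed -e 's/^[[:space:]]*//' -e 's/<[^>]*>//g' |   # trim indentation, strip the XML tags
    awk '
        BEGIN { print "<html><body><ul>" }
        NR % 2 == 1 { title = $0; next }                          # odd lines are titles
        { printf "<li><a href=\"%s\">%s</a></li>\n", $0, title }  # even lines are links
        END { print "</ul></body></html>" }
    '
}
```

Usage would be something like feed_to_html index.xml > index.html. If a title is ever missing its link, the odd/even pairing goes out of step, so this is only safe on feeds where every item has both.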
on the file from the URL that matthewg42 mentioned, and only one line differed. Of course, your URL may be considerably different.
2) It is always good to post exactly the command and the data-file (or a pointer to it, like the URL) with which you are having trouble. Entering it from memory is prone to error.
3) I found an interesting feature of Linux grep with the after-context option. The grep does not appear to blindly take the next line after a match, but only when there is not a further match. That means that there could be several "line" matches (going by my grep solution above), and only when there is not another match does it gobble up the next line. I tested this with a different data file, and it seemed to work. I did not see this mentioned in the man page, but it is very useful -- if someone spots that piece of information, let me know, otherwise I will continue ranting about the poor state of most Linux man pages (compared to, say, Solaris man pages).
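A small way to see this behaviour for yourself, using made-up sample data (the sample function and its contents are purely illustrative):

```shell
#!/bin/sh
# Demo data: two consecutive matching lines, one context line,
# one unrelated line, then another match with its context line.
sample() {
    printf '%s\n' 'title one' 'title two' 'link two' 'other' 'title three' 'link three'
}

# Ask for one line of trailing context after each match.
sample | grep --after-context=1 title
```

On GNU grep, 'title two' is printed as a match in its own right, and the one line of context is then counted from that last match, so 'link two' still appears. The unrelated 'other' line is skipped and a "--" separator marks the gap before 'title three'.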
I'm glad you got it working; best wishes ... cheers, makyo
i was actually looking at the code both times i posted. i assume you say that there can be errors when posting code from memory because the line i tried and the line that fixed the problem are the same. i used wget in my script to download the .xml file - the first time i did it the xml file looked normal, with all the proper line breaks. the wget line ran the second time i ran my script, but after all my problems with awk, i looked at the xml file and while the first one was fine, the second one, which was actually the one i was working off of, was all just one line with no breaks. so i'm assuming awk saw it as all one line and returned the whole file. i should've checked the xml file again before posting, but since the first one appeared fine i didn't think the second one would be any different.
anyway - thanks for the reply.
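One possible guard against the single-line download, as a sketch only. The helper name split_tags is invented here, and it assumes adjacent tags touch each other (">" directly followed by "<") with no whitespace in between:

```shell
#!/bin/sh
# Hypothetical helper: break the input wherever one tag ends and the next
# begins, so line-based tools like awk and grep see one element per line.
split_tags() {
    awk '{ gsub(/></, ">\n<"); print }' "$@"
}

# Example: a feed delivered as a single line still yields separate lines.
printf '%s\n' '<rss><item><title>A</title><link>http://example.com/a</link></item></rss>' |
split_tags |
awk '/<title>/ || /<link>/'
```

On a file that already has line breaks this mostly leaves things alone, so it could sit in front of the awk filter either way. A quick wc -l < index.xml after the download is another cheap way to spot the one-line case.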