As said, I need to read an RSS feed, and summarize it into an HTML file.
I was thinking of downloading the file with wget, adding an HTML header and footer, and using grep+sed to pull the relevant lines from the input file into a bullet list, but I'm struggling a bit when building the hyperlinks:
You need a proper parser to deal with XML; trying to fit it into regex won't work out.
If you would like to use Perl, then you can try the module XML::TreeBuilder to parse the XML and to generate your HTML. There are also corresponding Python modules.
Perl is probably easier than Python and aside from general XML parsing there is a module for RSS and Atom feeds specifically. From the manual page, loosely:
Code:
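#!/usr/bin/perl
use strict;
use warnings;
use URI;          # URI->new is called below
use XML::Feed;

# Fetch and parse the feed; XML::Feed handles both RSS and Atom.
my $feed = XML::Feed->parse(URI->new('https://example.com/feed/'))
    or die XML::Feed->errstr;

# Print the feed title, then the title of each entry.
print $feed->title, "\n";
for my $entry ($feed->entries) {
    print $entry->title, "\n";
}
exit(0);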
Untested. There should also be a Python feed library somewhere I expect.
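For the Python route, here is a minimal sketch of the original task (feed in, HTML bullet list with hyperlinks out) using only the standard library's xml.etree.ElementTree. The sample feed, the example.com URLs and the function name rss_to_html are made up for illustration, and it assumes a plain RSS 2.0 layout (channel/item/title/link); Atom feeds use namespaces and would need different element lookups.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com/</link>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Q&amp;A: second post</title>
      <link>https://example.com/second?a=1&amp;b=2</link>
    </item>
  </channel>
</rss>
"""


def rss_to_html(rss_text):
    """Return a small HTML page listing every <item> of an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")  # assumes RSS 2.0: <rss><channel>...
    page_title = html.escape(channel.findtext("title", default="Feed"))
    lines = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        f"<title>{page_title}</title></head><body>",
        f"<h1>{page_title}</h1>",
        "<ul>",
    ]
    for item in channel.findall("item"):
        # The parser has already decoded entities, so re-escape for HTML.
        title = html.escape(item.findtext("title", default="(untitled)"))
        link = html.escape(item.findtext("link", default="#"), quote=True)
        lines.append(f'<li><a href="{link}">{title}</a></li>')
    lines += ["</ul>", "</body></html>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(rss_to_html(SAMPLE_RSS))
```

To use it on a real feed, the text downloaded with wget could be read from the file and passed to rss_to_html in place of SAMPLE_RSS. Escaping the title and link on the way out is what keeps an ampersand in a URL from producing broken HTML.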
It can. See "man perlre", in the section "Modifiers", for the m option to m// and s///. It is needed for multi-line matching. Most feeds will have only \n and not \r\n, so the pattern would have to take that into account. But once more for emphasis, XML data requires a proper parser and cannot be managed with regex.
Also, -p and -n read in one line at a time, as delimited by \n, anyway. You'd have to set -0 to have the record separator be a null or something other than \n. See "man perlrun".
It's easier with a parser.
Last edited by Turbocapitalist; 11-17-2022 at 10:30 AM.
Note the s///m and -0 mentioned earlier. Again, that is a very brittle approach and neither portable nor enduring. XML and SGML are not to be parsed with regex. You might get away with it with just the one feed, for a limited time, but in the long run it will break.
See instead the example in post #7 above for a method which will work with all Atom or RSS feeds.
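To make the brittleness concrete, here is a small Python sketch comparing a line-at-a-time regex (roughly what grep/sed do) with a real parser on the same input. The sample fragment and both function names are invented for the illustration; the feed is deliberately laid out with a title wrapped across two lines and one in CDATA, both of which are legal XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: the second item title is wrapped across lines
# and the third uses CDATA.
SAMPLE = """<rss version="2.0"><channel>
<title>Example Feed</title>
<item><title>One-line title</title></item>
<item><title>Title split
across two lines</title></item>
<item><title><![CDATA[Title with <b>markup</b>]]></title></item>
</channel></rss>
"""


def titles_by_line_regex(text):
    """Mimic grep/sed: apply the pattern to one line at a time."""
    pattern = re.compile(r"<title>(.*?)</title>")
    found = []
    for line in text.splitlines():
        found.extend(pattern.findall(line))
    return found


def titles_by_parser(text):
    """Let the XML parser find the item titles, whatever the layout."""
    root = ET.fromstring(text)
    return [item.findtext("title") for item in root.iter("item")]


if __name__ == "__main__":
    print("regex :", titles_by_line_regex(SAMPLE))
    print("parser:", titles_by_parser(SAMPLE))
```

The line-based version picks up the channel title as if it were an item, misses the wrapped title entirely, and returns the raw CDATA wrapper for the third. The parser returns the three item titles as text, with the line break preserved inside the second one.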
I have to agree with "But trying to write your own RSS parser using line-based regex tools is the wrong approach."
Even if you can get it to work once, it IS brittle.
Definitely use a proper parser module in e.g. Perl - you'll thank us later.