LinuxQuestions.org
Old 11-17-2022, 06:07 AM   #1
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 658

Rep: Reputation: 35
Linux tool to turn RSS into HTML page?


Hello,

As the title says, I need to read an RSS feed and summarize it into an HTML file.

I was thinking of downloading the file with wget, adding an HTML header and footer, and using grep+sed to pull the relevant lines from the input file into a bullet list, but I'm struggling a bit when building the hyperlinks:

Code:
<!doctype html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<link rel="shortcut icon" href="https://www.acme.com/favicon.ico" title="Favicon" />
</head>
<body>
<h1>My feed</h1>
<ul>
<li><a href="LINK">TITLE - PUBDATE</a></li>
etc.
</ul>
</body>
</html>

wget -O feed.xml https://www.acme.com/feed/

grep -Poha "^\t{2}<(title|link|pubDate)>.+</(title|link|pubDate)>$" feed.xml > input.xml

echo "<html>…<body><ul>" > feed.html

# HOW TO BUILD LIST?
grep -Poha "^\t{2}<link>(.+)</link>$" input.xml | sed -r "s@^\t{2}<link>(.+)</link>$@LINK=\1@"
grep -Poha "^\t{2}<title>(.+)</title>$" input.xml | sed -r "s@^\t{2}<title>(.+)</title>$@<li>\1</li>@"
grep -Poha "^\t{2}<pubDate>(.+)</pubDate>$" input.xml | sed -r "s@^\t{2}<pubDate>(.+)</pubDate>$@PUBDATE=\1@"

echo "</ul></body></html>" >> feed.html
Is there a simpler solution?

Thank you.
 
Old 11-17-2022, 09:16 AM   #2
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,599

Rep: Reputation: 2546

RSS is XML, and thus you can use XSLT to convert it.

Using XMLStarlet to do so looks like this:
Code:
xml tr template.xslt feed.xml > feed.html

Where template.xslt looks something like:
Code:
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="html" doctype-system="about:legacy-compat" />

   <xsl:template match="/rss/channel">
      <html>
         <head>
            <title><xsl:value-of select="title" /></title>
         </head>
         <body>
            <h1><xsl:value-of select="title" /></h1>
            <p><xsl:value-of select="description" /></p>
            <ul>
               <xsl:for-each select="./item">
                  <li>
                     <h2><a href="{ link }"><xsl:value-of select="title" /></a></h2>
                     <p><xsl:value-of select="pubDate" /></p>
                     <p><xsl:value-of select="description" /></p>
                  </li>
               </xsl:for-each>
            </ul>
         </body>
      </html>
   </xsl:template>

</xsl:stylesheet>
(You can get fancier examples if you search for "rss xslt", or read some XSLT docs and edit the above however you want.)

 
Old 11-17-2022, 09:24 AM   #3
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 658

Original Poster
Rep: Reputation: 35
Thanks, I don't know the first thing about XSLT.

For some reason, only the first Perl regex matches:

Code:
#OK
perl -pe 's@<title>(.+?)</title>\n@TITLE=$1\n@' input2.xml > output.xml

#NOK
perl -pe 's@<title>(.+?)</title>\n<link>(.+?)</link>\n@TITLE=$1 LINK=$2@' input2.xml > output.xml

#NOK
perl -pe 's@<title>(.+?)</title>\n<link>(.+?)</link>\n@TITLE=$1 LINK=$2\n@gsi' input2.xml > output.xml

#NOK
perl -pe 's@<title>(.+?)</title>\r\n<link>(.+?)</link>\r\n@TITLE=$1 LINK=$2\n@gsi' input2.xml > output.xml
 
Old 11-17-2022, 09:38 AM   #4
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
You need a proper parser to deal with XML; trying to fit it into regex won't work out.

If you would like to use Perl, then you can try the module XML::TreeBuilder to parse the XML and to generate your HTML. There are also corresponding Python modules.

XSLT is the other option.
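
For the Python route, a minimal sketch using only the standard library might look like the following. The sample feed, its URLs, and the function name rss_to_html are made up for illustration, and the code assumes a plain RSS 2.0 layout (rss > channel > item); it is not a complete feed reader.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample feed standing in for the downloaded feed.xml.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>My feed</title>
    <link>https://www.acme.com/</link>
    <item>
      <title>First &amp; foremost</title>
      <link>https://www.acme.com/first</link>
      <pubDate>Thu, 17 Nov 2022 06:07:00 +0000</pubDate>
    </item>
    <item>
      <title><![CDATA[Second <post>]]></title>
      <link>https://www.acme.com/second</link>
      <pubDate>Fri, 18 Nov 2022 04:37:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def rss_to_html(rss_text):
    """Return an HTML page with one bullet per <item> in an RSS 2.0 feed."""
    channel = ET.fromstring(rss_text).find("channel")
    heading = html.escape(channel.findtext("title", default=""))
    lines = [
        "<!doctype html>",
        '<html><head><meta charset="utf-8"></head>',
        f"<body><h1>{heading}</h1><ul>",
    ]
    for item in channel.findall("item"):
        # findtext() returns the default when an element is missing.
        title = html.escape(item.findtext("title", default=""))
        link = html.escape(item.findtext("link", default=""), quote=True)
        date = html.escape(item.findtext("pubDate", default=""))
        lines.append(f'<li><a href="{link}">{title} - {date}</a></li>')
    lines.append("</ul></body></html>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rss_to_html(SAMPLE_RSS))
```

The parser decodes entities and CDATA sections itself, so the only escaping left to do is on the way out to HTML.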
 
Old 11-17-2022, 09:51 AM   #5
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,223

Rep: Reputation: 5320
Converting RSS to HTML is a large part of what my podcast client does.

https://github.com/duganchen/podfeeds

I wrote the code to do it myself because I couldn’t find anything prebuilt.
 
Old 11-17-2022, 09:51 AM   #6
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 658

Original Poster
Rep: Reputation: 35
Thanks. Before writing a Python script, I wanted to see if I could get it to work with just the usual suspects (wget, grep, sed, and Perl).

How come even Perl can't find carriage returns?

Code:
perl -pe 's@<title>(.+?)</title>\r\n<link>(.+?)</link>\r\n@TITLE=$1 LINK=$2\n@gsi' input2.xml > output.xml

perl -pe 's@<title>(.+?)</title>\r\n<link>@TITLE=$1 LINK=@gsi' input2.xml > output.xml

perl -pe 's@<title>(.+?)</title>\n<link>@TITLE=$1 LINK=@gsi' input2.xml > output.xml

perl -pe 's@<title>(.+?)</title>\r<link>@TITLE=$1 LINK=@gsi' input2.xml > output.xml

perl -pe 's@<title>(.+?)</title>\R<link>@TITLE=$1 LINK=@gsi' input2.xml > output.xml

Last edited by littlebigman; 11-17-2022 at 09:54 AM.
 
Old 11-17-2022, 09:59 AM   #7
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
Perl is probably easier than Python, and aside from general XML parsing there is a module specifically for RSS and Atom feeds. From the manual page, loosely:

Code:
#!/usr/bin/perl

use strict;
use warnings;
use URI;         # URI->new is used below
use XML::Feed;   # parses both RSS and Atom

# Fetch and parse the feed
my $feed = XML::Feed->parse(URI->new('https://example.com/feed/'))
    or die XML::Feed->errstr;

print $feed->title, "\n";

# One line per entry
for my $entry ($feed->entries) {
    print $entry->title, "\n";
}

exit(0);
Untested. There should also be a Python feed library somewhere I expect.
 
Old 11-17-2022, 10:22 AM   #8
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 658

Original Poster
Rep: Reputation: 35
Thank you. It's more involved than I thought.

I guess Perl eats carriage returns, hence the failed hits.
 
Old 11-17-2022, 10:27 AM   #9
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
Quote:
Originally Posted by littlebigman
How come even Perl can't find carriage returns?
It can. See the "Modifiers" section of "man perlre" for the m option to m// and s///, which lets ^ and $ match at the start and end of each line inside a multi-line string. Most feeds will have only \n and not \r\n, so the pattern would have to take that into account. But once more for emphasis: XML data requires a proper parser and cannot be reliably managed with regex.

Also, -p and -n read the input one line at a time, as delimited by \n, so a pattern containing \n followed by more text can never match. You'd have to set -0 to make the record separator a null or something other than \n (-0777 reads the whole file at once). See "man perlrun".

It's easier with a parser.
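
To illustrate the line-at-a-time point outside Perl, here is a small Python sketch. The two-line sample text and the helper names per_line and whole_text are made up for the example; it only mimics how -p hands the substitution one line at a time, versus slurping the whole input first.

```python
import re

# Hypothetical two-line excerpt, as grep would leave it in input.xml.
TEXT = "<title>Hello</title>\n<link>https://www.acme.com/a</link>\n"

# A pattern that spans a line break, like the failed perl -pe attempts.
PATTERN = re.compile(r"<title>(.+?)</title>\n<link>(.+?)</link>\n")
REPLACEMENT = r"TITLE=\1 LINK=\2\n"


def per_line(text):
    """Apply the substitution to each line separately (what perl -p does)."""
    return "".join(PATTERN.sub(REPLACEMENT, line)
                   for line in text.splitlines(keepends=True))


def whole_text(text):
    """Apply the substitution to the full text (what perl -0777 -p does)."""
    return PATTERN.sub(REPLACEMENT, text)


if __name__ == "__main__":
    print(per_line(TEXT) == TEXT)  # True: no single line contains the pattern
    print(whole_text(TEXT), end="")
```

The carriage returns are not being eaten; each line simply never contains the text that follows its own newline.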

Last edited by Turbocapitalist; 11-17-2022 at 10:30 AM.
 
Old 11-17-2022, 10:32 AM   #10
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,599

Rep: Reputation: 2546
Quote:
Originally Posted by littlebigman
Thanks, I don't know the first thing about XSLT.
You don't need to!

XSLT is really simple if you want to learn it, but you can also just search for a template you like, save it, then run that one-line command.

Using an existing RSS library in Python/Perl/whatever is another valid approach.


But trying to write your own RSS parser using line-based regex tools is the wrong approach.

 
Old 11-17-2022, 10:42 AM   #11
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 658

Original Poster
Rep: Reputation: 35
This works:

Code:
wget -O feed.xml https://www.acme.com/feed/

#Find relevant lines, and remove leading tabs
grep -Poha "^\t{2}<(title|link|pubDate)>.+</(title|link|pubDate)>$" feed.xml |
grep -Poha "[^\t].+" > input.xml

#Build HTML file
echo '<!doctype html>' > feed.html
echo '<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8">' >> feed.html
echo '<link rel="shortcut icon" href="https://www.acme.com/favicon.ico" title="Favicon" /></head>' >> feed.html
echo '<body><h1>My feed</h1><ul>' >> feed.html

#Replace newlines with #, turn into hyperlink
cat input.xml | tr '\n' '#' | perl -pe 's@<title>(.+?)</title>#<link>(.+?)</link>#<pubDate>(.+?)</pubDate>#@<li><a href="$2">$1 - $3</a></li>\n@gsi' >> feed.html

echo '</ul></body></html>' >> feed.html

rm input.xml
 
Old 11-17-2022, 10:54 AM   #12
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,307
Blog Entries: 3

Rep: Reputation: 3721
Quote:
Originally Posted by littlebigman
This works:

Code:
perl -p -0 -e 's@<title>(.+?)</title>\n<link>(.+?)</link>\n<pubDate>(.+?)</pubDate>\n@<li><a href="$2">$1 - $3</a></li>\n@mgsi' input.xml > output.html
Note the s///m and -0 mentioned earlier. Again, that is a very brittle approach and neither portable nor enduring. XML and SGML are not to be parsed with regex. You might get away with it with just the one feed, for a limited time, but in the long run it will break.

See instead the example in post #7 above for a method which will work with all Atom or RSS feeds.
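
As a rough Python illustration of how that breakage tends to happen, the sketch below feeds the same item to a regex and to a parser. The item text and the function names with_regex and with_parser are made up for the example; it assumes the feed generator starts emitting link before title and wraps the title in CDATA, both of which are still valid XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical item: same data, but <link> now comes before <title>
# and the title is wrapped in a CDATA section.
ITEM = """<item>
<link>https://www.acme.com/a</link>
<title><![CDATA[Hello & welcome]]></title>
<pubDate>Thu, 17 Nov 2022 06:07:00 +0000</pubDate>
</item>"""

# The fixed title/link/pubDate sequence the regex approach relies on.
PATTERN = re.compile(
    r"<title>(.+?)</title>\n<link>(.+?)</link>\n<pubDate>(.+?)</pubDate>")


def with_regex(xml_text):
    """Return (title, link, pubDate), or None if the layout changed."""
    match = PATTERN.search(xml_text)
    return match.groups() if match else None


def with_parser(xml_text):
    """Return (title, link, pubDate) regardless of element order."""
    item = ET.fromstring(xml_text)
    return tuple(item.findtext(tag) for tag in ("title", "link", "pubDate"))


if __name__ == "__main__":
    print(with_regex(ITEM))
    print(with_parser(ITEM))
```

The regex silently finds nothing, while the parser still returns all three fields with the CDATA wrapper already removed.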
 
Old 11-17-2022, 10:58 AM   #13
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 658

Original Poster
Rep: Reputation: 35
Thanks. I didn't see it because I didn't hit refresh before appending.
 
Old 11-18-2022, 04:37 AM   #14
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 658

Original Poster
Rep: Reputation: 35
A bit simplified with Perl:

Code:
#Find relevant lines, and remove leading tabs
wget -qO - https://www.acme.com/feed/ | grep -Poha "^\t{2}<(title|link|pubDate)>.+</(title|link|pubDate)>$" | sed -r "s@\t@@g" > input.xml

#Build HTML file; header
echo '<!doctype html>' > feed.html
echo '<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8">' >> feed.html
echo '<link rel="shortcut icon" href="https://www.acme.com/favicon.ico" title="Favicon" /></head>' >> feed.html
echo '<body><h1>My feed</h1><ul>' >> feed.html

#Build list of hyperlinks
perl -0777 -pe 's@^<title>(.+)</title>\R<link>(.+?)</link>\R<pubDate>(.+?)</pubDate>\R@<li><a href="$2">$1 - $3</a></li>\r\n@mg' input.xml >> feed.html

#footer
echo '</ul></body></html>' >> feed.html

#clean
rm input.xml

Last edited by littlebigman; 11-18-2022 at 05:27 AM.
 
Old 11-21-2022, 11:42 PM   #15
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
I have to agree with "But trying to write your own RSS parser using line-based regex tools is the wrong approach."
Even if you can get it to work once, it IS brittle.
Definitely use a proper parser module in e.g. Perl - you'll thank us later.
 
  

