[SOLVED] Linux tool to turn RSS into HTML page?

littlebigman · 11-17-2022, 06:07 AM

Hello,

As the title says, I need to read an RSS feed and summarize it in an HTML file.

I was thinking of downloading the file with wget, adding an HTML header and footer, and using grep+sed to pull the relevant lines from the input file into a bullet list, but I'm struggling a bit with building the hyperlinks:

Code:

<!doctype html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<link rel="shortcut icon" href="https://www.acme.com/favicon.ico" title="Favicon" />
</head>
<body>
<h1>My feed</h1>
<ul>
<li><a href="LINK">TITLE - PUBDATE</a></li>
etc.
</ul>
</body>
</html>

wget -O feed.xml https://www.acme.com/feed/

grep -Poha "^\t{2}<(title|link|pubDate)>.+</(title|link|pubDate)>$" feed.xml > input.xml

echo "<html>…<body><ul>" > feed.html

# HOW TO BUILD LIST?
grep -Poha "^\t{2}<link>(.+)</link>$" input.xml | sed -r "s@^\t{2}<link>(.+)</link>$@LINK=\1@"
grep -Poha "^\t{2}<title>(.+)</title>$" input.xml | sed -r "s@^\t{2}<title>(.+)</title>$@<li>\1</li>@"
grep -Poha "^\t{2}<pubDate>(.+)</pubDate>$" input.xml | sed -r "s@^\t{2}<pubDate>(.+)</pubDate>$@PUBDATE=\1@"

echo "</ul></body></html>" >> feed.html

Is there a simpler solution?

Thank you.

boughtonp · 11-17-2022, 09:16 AM

RSS is XML, and thus you can use XSLT to convert it.

Using XMLStarlet to do so looks like this:

Code:

xml tr template.xslt feed.xml > feed.html

Where template.xslt looks something like:

Code:

<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="html" doctype-system="about:legacy-compat" />

   <xsl:template match="/rss/channel">
      <html>
         <head>
            <title><xsl:value-of select="title" /></title>
         </head>
         <body>
            <h1><xsl:value-of select="title" /></h1>
            <p><xsl:value-of select="description" /></p>
            <ul>
               <xsl:for-each select="./item">
                  <li>
                     <h2><a href="{ link }"><xsl:value-of select="title" /></a></h2>
                     <p><xsl:value-of select="pubDate" /></p>
                     <p><xsl:value-of select="description" /></p>
                  </li>
               </xsl:for-each>
            </ul>
         </body>
      </html>
   </xsl:template>

</xsl:stylesheet>

(You can get fancier examples if you search for "rss xslt", or read some XSLT docs and edit the above however you want.)

littlebigman · 11-17-2022, 09:24 AM

Thanks, I don't know the first thing about XSLT.

For some reason, Perl only finds the first regex:

Code:

#OK
perl -pe 's@<title>(.+?)</title>\n@TITLE=$1\n@' input2.xml > output.xml

#NOK
perl -pe 's@<title>(.+?)</title>\n<link>(.+?)</link>\n@TITLE=$1 LINK=$2@' input2.xml > output.xml

#NOK
perl -pe 's@<title>(.+?)</title>\n<link>(.+?)</link>\n@TITLE=$1 LINK=$2\n@gsi' input2.xml > output.xml

#NOK
perl -pe 's@<title>(.+?)</title>\r\n<link>(.+?)</link>\r\n@TITLE=$1 LINK=$2\n@gsi' input2.xml > output.xml

Turbocapitalist · 11-17-2022, 09:38 AM

You need a proper parser to deal with XML; trying to fit it into regex won't work out.

If you would like to use Perl, then you can try the module XML::TreeBuilder to parse the XML and to generate your HTML. There are also corresponding Python modules.

XSLT is the other option.
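
For the Python route, a minimal sketch using only the standard library's xml.etree.ElementTree might look like the following. The sample feed, the SAMPLE_RSS name and the rss_to_html function are made up for illustration, and it assumes a plain RSS 2.0 layout (rss/channel/item) without namespaces; a real script would read the downloaded feed.xml instead.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for the downloaded feed.xml.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>My feed</title>
    <link>https://www.acme.com/</link>
    <item>
      <title>First post &amp; more</title>
      <link>https://www.acme.com/first?a=1&amp;b=2</link>
      <pubDate>Thu, 17 Nov 2022 06:07:00 +0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>https://www.acme.com/second</link>
      <pubDate>Fri, 18 Nov 2022 04:37:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def rss_to_html(rss_text):
    """Return an HTML page listing every <item> as 'TITLE - PUBDATE'."""
    channel = ET.fromstring(rss_text).find("channel")
    heading = html.escape(channel.findtext("title", default="Feed"))
    lines = [
        "<!doctype html>",
        '<html><head><meta charset="utf-8"><title>%s</title></head>' % heading,
        "<body><h1>%s</h1><ul>" % heading,
    ]
    for item in channel.findall("item"):
        # The parser has already decoded entities, so re-escape for HTML output.
        title = html.escape(item.findtext("title", default=""))
        link = html.escape(item.findtext("link", default=""), quote=True)
        date = html.escape(item.findtext("pubDate", default=""))
        lines.append('<li><a href="%s">%s - %s</a></li>' % (link, title, date))
    lines.append("</ul></body></html>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rss_to_html(SAMPLE_RSS))
```

Because the parser hands back decoded text, the script has to escape it again before writing HTML; that is the kind of detail a regex pipeline tends to skip.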

dugan · 11-17-2022, 09:51 AM

Converting RSS to HTML is a large part of what my podcast client does.

https://github.com/duganchen/podfeeds

I wrote the code to do it myself because I couldn’t find anything prebuilt.

littlebigman · 11-17-2022, 09:51 AM

Thanks. Before writing a Python script, I wanted to see if I could get it to work with just the usual suspects (wget, grep, sed, and Perl).

How come even Perl can't find carriage returns?

Code:

perl -pe 's@<title>(.+?)</title>\r\n<link>(.+?)</link>\r\n@TITLE=$1 LINK=$2\n@gsi' input2.xml > output.xml

perl -pe 's@<title>(.+?)</title>\r\n<link>@TITLE=$1 LINK=@gsi' input2.xml > output.xml

perl -pe 's@<title>(.+?)</title>\n<link>@TITLE=$1 LINK=@gsi' input2.xml > output.xml

perl -pe 's@<title>(.+?)</title>\r<link>@TITLE=$1 LINK=@gsi' input2.xml > output.xml

perl -pe 's@<title>(.+?)</title>\R<link>@TITLE=$1 LINK=@gsi' input2.xml > output.xml

Turbocapitalist · 11-17-2022, 09:59 AM

Perl is probably easier than Python and aside from general XML parsing there is a module for RSS and Atom feeds specifically. From the manual page, loosely:

Code:

#!/usr/bin/perl

use XML::Feed;
use URI;
use strict;
use warnings;

my $feed = XML::Feed->parse(URI->new('https://example.com/feed/'))
    or die XML::Feed->errstr;

print $feed->title, "\n";

for my $entry ($feed->entries) {
    print $entry->title,"\n";
}

exit(0);

Untested. There should also be a Python feed library somewhere I expect.

littlebigman · 11-17-2022, 10:22 AM

Thank you. It's more involved than I thought.

I guess Perl eats carriage returns, hence the failed hits.

Turbocapitalist · 11-17-2022, 10:27 AM

Quote:

Originally Posted by littlebigman

How come even Perl can't find carriage returns?

It can. See the "Modifiers" section of "man perlre" for the m option to m// and s///, which makes ^ and $ match at every line boundary inside a multi-line string. Most feeds will have only \n and not \r\n, so the pattern would have to take that into account. But once more for emphasis, XML data requires a proper parser and cannot be managed with regex.

Also, -p and -n read in one line at a time, as delimited by \n, so a pattern that spans two lines never sees both of them at once. You'd have to set -0 to have the record separator be a null or something other than \n. See "man perlrun".

It's easier with a parser.
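
The line-at-a-time behaviour described above is not specific to Perl. As a rough illustration in Python (the TEXT sample and both function names are invented for the example), the same two-line pattern finds nothing when applied per line and matches once the whole input is treated as a single record:

```python
import re

# Hypothetical two-line record, as it might appear in input2.xml.
TEXT = "<title>First post</title>\n<link>https://www.acme.com/first</link>\n"

PATTERN = re.compile(r"<title>(.+?)</title>\n<link>(.+?)</link>\n")
REPLACEMENT = r"TITLE=\1 LINK=\2\n"


def line_by_line(text):
    """Mimic 'perl -pe': run the substitution on one line at a time."""
    return "".join(
        PATTERN.sub(REPLACEMENT, line) for line in text.splitlines(keepends=True)
    )


def whole_input(text):
    """Mimic slurping the file (as -0 does when there are no null bytes)."""
    return PATTERN.sub(REPLACEMENT, text)


if __name__ == "__main__":
    print(repr(line_by_line(TEXT)))   # input comes back unchanged
    print(repr(whole_input(TEXT)))    # title and link are joined on one line
```

Each line on its own contains only one of the two tags, so the per-line version can never satisfy a pattern that needs both.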

boughtonp · 11-17-2022, 10:32 AM

Quote:

Originally Posted by littlebigman

Thanks, I don't know the first thing about XSLT.

You don't need to!

XSLT is really simple if you want to learn it, but you can also just search for a template you like, save it, then run that one-line command.

Or using an existing RSS library in Python/Perl/whatever is another valid approach.

But trying to write your own RSS parser using line-based regex tools is the wrong approach.

littlebigman · 11-17-2022, 10:42 AM

This works:

Code:

wget -O feed.xml https://www.acme.com/feed/

#Find relevant lines, and remove leading tabs
grep -Poha "^\t{2}<(title|link|pubDate)>.+</(title|link|pubDate)>$" feed.xml |
grep -Poha "[^\t].+" > input.xml

#Build HTML file
echo '<!doctype html>' > feed.html
echo '<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8">' >> feed.html
echo '<link rel="shortcut icon" href="https://www.acme.com/favicon.ico" title="Favicon" /></head>' >> feed.html
echo '<body><h1>My feed</h1><ul>' >> feed.html

#Replace newlines with #, turn into hyperlinks
cat input.xml | tr '\n' '#' | perl -pe 's@<title>(.+?)</title>#<link>(.+?)</link>#<pubDate>(.+?)</pubDate>#@<li><a href="$2">$1 - $3</a></li>\n@gsi' >> feed.html

echo '</ul></body></html>' >> feed.html

rm input.xml

Turbocapitalist · 11-17-2022, 10:54 AM

Quote:

Originally Posted by littlebigman

This works:

Code:

perl -p -0 -e 's@<title>(.+?)</title>\n<link>(.+?)</link>\n<pubDate>(.+?)</pubDate>\n@<li><a href="$2">$1 - $3</a></li>\n@mgsi' input.xml > output.html

Note the s///m and -0 mentioned earlier. Again, that is a very brittle approach and neither portable nor enduring. XML and SGML are not to be parsed with regex. You might get away with it with just the one feed, for a limited time, but in the long run it will break.

See instead the example in post #7 above for a method which will work with all Atom or RSS feeds.

littlebigman · 11-17-2022, 10:58 AM

Thanks. I didn't see it because I didn't hit refresh before appending.

littlebigman · 11-18-2022, 04:37 AM

A bit simplified with Perl:

Code:

#Find relevant lines, and remove leading tabs
wget -qO - https://www.acme.com/feed/ | grep -Poha "^\t{2}<(title|link|pubDate)>.+</(title|link|pubDate)>$" | sed -r "s@\t@@g" > input.xml

#Build HTML file; header
echo '<!doctype html>' > feed.html
echo '<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8">' >> feed.html
echo '<link rel="shortcut icon" href="https://www.acme.com/favicon.ico" title="Favicon" /></head>' >> feed.html
echo '<body><h1>My feed</h1><ul>' >> feed.html

#Build list of hyperlinks
perl -0777 -pe 's@^<title>(.+)</title>\R<link>(.+?)</link>\R<pubDate>(.+?)</pubDate>\R@<li><a href="$2">$1 - $3</a></li>\r\n@mg' input.xml >> feed.html

#footer
echo '</ul></body></html>' >> feed.html

#clean
rm input.xml

chrism01 · 11-21-2022, 11:42 PM

I have to agree with "But trying to write your own RSS parser using line-based regex tools is the wrong approach."
Even if you can get it to work once, it IS brittle.
Definitely use a proper parser module in, e.g., Perl - you'll thank us later.
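
To make the brittleness concrete, here is a small Python sketch. The two sample feeds and the function names are hypothetical; the regex is written in the spirit of the one used earlier in the thread (fixed tag order, one tag per line). Both feeds carry the same item, but the second one reorders the elements and wraps the title in CDATA, which is perfectly valid XML:

```python
import re
import xml.etree.ElementTree as ET

# Two hypothetical feeds carrying the same item; only the layout differs.
FEED_A = """<rss version="2.0"><channel>
<item>
<title>Hello</title>
<link>https://www.acme.com/hello</link>
<pubDate>Thu, 17 Nov 2022 06:07:00 +0000</pubDate>
</item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel>
<item><link>https://www.acme.com/hello</link>
<pubDate>Thu, 17 Nov 2022 06:07:00 +0000</pubDate>
<title><![CDATA[Hello]]></title></item>
</channel></rss>"""

ITEM_RE = re.compile(
    r"^<title>(.+)</title>\n<link>(.+?)</link>\n<pubDate>(.+?)</pubDate>\n",
    re.MULTILINE,
)


def items_by_regex(text):
    """Regex approach: expects title, link, pubDate in that order, one per line."""
    return [(m.group(1), m.group(2), m.group(3)) for m in ITEM_RE.finditer(text)]


def items_by_parser(text):
    """Parser approach: element order, whitespace and CDATA do not matter."""
    root = ET.fromstring(text)
    return [
        (item.findtext("title"), item.findtext("link"), item.findtext("pubDate"))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for name, feed in (("FEED_A", FEED_A), ("FEED_B", FEED_B)):
        print(name, "regex :", items_by_regex(feed))
        print(name, "parser:", items_by_parser(feed))
```

The regex version returns the item for the first feed and nothing at all for the second, while the parser version returns the same tuple for both.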