[SOLVED] HTML parsing problems: Is this a valid HTML mail?
I am having a hard time parsing mail content as soon as it arrives in “HTML format”. Since the mails I program against reflect a great diversity of opinions about what “HTML mail” should look like, my program can only handle a subset of them, and I no longer try to prepare it for everything.
However, sometimes I am undecided and see a chance that the error is not on my side.
In the example that I show you below this post, it is the <a/> tag (the HTML link) which is not recognized. I wonder whether I handle the combination of Content-Type and Content-Transfer-Encoding correctly, and I always wonder why an HTML mail has to be encoded as quoted-printable in the first place...
Can you tell me whether the <a/> tags must be coded this way? If so, I have to diversify my (already numerous) approaches to recognizing individual tags in the mess, pardon: mail. There is no plain-text alternative attached, of course. Who needs standards.
I attach below the PDF output of my program cremefraiche, which converts eml files to PDF (directly from my mailer Mutt, in case you have not guessed).
Here is the example-mail with the important headers:
Code:
(...)
Content-Type: text/html; charset=utf-8
Subject: =?UTF-8?Q?R=C3=A9ception?= de votre commande de Manuel d'Economie Critique
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME::Lite 3.028 (F2.82; T1.35; A2.09; B3.13; Q3.13)
Date: Mon, 5 Sep 2016 14:01:03 +0200
Content-Length: 2213
Lines: 58
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.=
w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
=09
<head>
<title>Le Monde diplomatique</title>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dutf-8" />
</head>
=09
<body bgcolor=3D"#FFFFFF"> <table style=3D"font-size: 14px; width: 100%;m=
argin:auto;text-align:center;font-family:Lucida Grande,Verdana,sans-serif;"=
><tr><td>Si ce message ne s'affiche pas correctement, <a href=3D"http://t.n=
ews.mindbaz.com/c/?t=3Dfb739fc-i0-43i3-9ma-s0ex" target=3D"_blank">cliquez-=
ici</a></td></tr></table>
<table align=3D"center" border=3D"0" cellpadding=3D"0" cellspacing=3D"0" w=
idth=3D"650">
<tbody>
=09=09
<tr>
<td>
<img src=3D"http://imgrp.news.mindbaz.com/390/1510/logoMD_OK.jpg" alt=
=3D"" height=3D"57" width=3D"211"/></td>
</tr>
=09=09
<tr>
<td>
<p align=3D"justify">
<br />
<br />
<font face=3D"Verdana, Arial, Helvetica, sans-serif"><font size=3D"2">=
Numéro d'abonné : 0010309295<br />
<br />
Madame, Monsieur,<br />
<br />
<br />
Vous avez commandé le Manuel d'Economie Critique et nous vous =
remercions de votre confiance.<br />
Nous souhaitons vous informer que l'ouvrage vous parviendra à =
compter du 9 septembre au lieu du 8.<br />
Si vous résidez à l'étranger, ce délai pe=
ut être rallongé de quelques jours. Nous vous remercions pour =
votre compréhension.<br />
<br />
Nous restons à votre écoute pour toute information comp=
lémentaire et vous souhaitons une bonne lecture.<br />
<br />
L'équipe du <i>Monde diplomatique</i><br />
</font> </font></p>
</td>
</tr>
</tbody>
</table>
<table style=3D"font-size: 0px; color =3D "#ffffff"; width: 100%;margin:au=
to;text-align:center;font-family:Lucida Grande,Verdana,sans-serif;"><tr><td=
><a href=3D"http://t.news.mindbaz.com/c/?t=3Dfb739fc-i0-43iw-9ma-s0ex" targ=
et=3D"_blank"> </a></td></tr></table> <img src=3D"http://t.news.mindbaz.co=
m/o/?t=3Di0-9ma-s0ex" width=3D"1" height=3D"1" alt=3D""/> </body>
</html>=
Last edited by Michael Uplawski; 09-05-2016 at 12:12 PM.
Reason: Forgot the question.
the message seems to have some formatting errors.
i cleaned it up a little, but even so, the second <a></a> remains invisible, because there's no text.
i inserted the text "XxXxXxXxXxXx" so you can see what i mean. also the font size for that <table> was set to 0; i set it to 10.
this looks really fishy, inserting invisible links and images into emails.
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Le Monde diplomatique</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body bgcolor="#FFFFFF"> <table style="font-size: 14px; width: 100%;margin:auto;text-align:center;font-family:Lucida Grande,Verdana,sans-serif;"><tr><td>Si ce message ne s'affiche pas correctement, <a href="http://t.news.mindbaz.com/c/?t=fb739fc-i0-43i3-9ma-s0ex" target="_blank">cliquez-ici</a></td></tr></table>
<table align="center" border="0" cellpadding="0" cellspacing="0" width="650">
<tbody>
<tr>
<td>
<img src="http://imgrp.news.mindbaz.com/390/1510/logoMD_OK.jpg" alt="" height="57" width="211"/></td>
</tr>
<tr>
<td>
<p align="justify">
<br />
<br />
<font face="Verdana, Arial, Helvetica, sans-serif"><font size="2">
Numéro d'abonné : 0010309295<br />
<br />
Madame, Monsieur,<br />
<br />
<br />
Vous avez commandé le Manuel d'Economie Critique et nous vous
remercions de votre confiance.<br />
Nous souhaitons vous informer que l'ouvrage vous parviendra à
compter du 9 septembre au lieu du 8.<br />
Si vous résidez à l'étranger, ce délai peut être rallongé de quelques jours. Nous vous remercions pour
votre compréhension.<br />
<br />
Nous restons à votre écoute pour toute information complémentaire et vous souhaitons une bonne lecture.<br />
<br />
L'équipe du <i>Monde diplomatique</i><br />
</font> </font></p>
</td>
</tr>
</tbody>
</table>
<table style="font-size: 10px; color = #ffffff; width: 100%;margin:auto;text-align:center;font-family:Lucida Grande,Verdana,sans-serif;"><tr><td><a href="http://t.news.mindbaz.com/c/?t=fb739fc-i0-43iw-9ma-s0ex" target="_blank"> XxXxXxXxXxXx</a></td></tr></table> <img src="http://t.news.mindbaz.com/o/?t=i0-9ma-s0ex" width="1" height="1" alt=""/> </body>
</html>
Thank you ondoho, I appreciate the trouble you have taken to clean up my example mail.
However, most of the quoted-printable code is not a big problem, as I systematically decode the mail content before the PDF generator in my program processes it as either an HTML or a plain-text mail.
But it is indeed the quoted-printable encoding of the links which I suspect to be either faulty (or superficial) or badly handled by myself. The tag handlers are called recursively, so the natural, nested structure of tags should not in itself cause a problem.
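For readers following along, the decoding step described above can be sketched with core Ruby only. This is a minimal sketch and not the code from cremefraiche; the helper name `decode_qp_body` is made up for illustration, and the sample string is an excerpt of the first link in the example mail. `String#unpack1("M")` undoes quoted-printable, including the soft line breaks (`=` at the end of a line) that split the `<a>` tag across lines.

```ruby
# Hypothetical helper (not taken from cremefraiche): decode a
# quoted-printable body and label it with the charset from Content-Type.
def decode_qp_body(raw, charset = "UTF-8")
  # "M" is the quoted-printable directive: "=3D" becomes "=", and a
  # trailing "=" before a newline (soft line break) is removed.
  raw.unpack1("M").force_encoding(charset)
end

# Excerpt from the example mail, with the soft line breaks intact.
raw = "<a href=3D\"http://t.n=\n" \
      "ews.mindbaz.com/c/?t=3Dfb739fc-i0-43i3-9ma-s0ex\" target=3D\"_blank\">cliquez-=\n" \
      "ici</a>"

decoded = decode_qp_body(raw)
puts decoded
```

After decoding, the link is an ordinary `<a href="..." target="_blank">` element on one line, which suggests the quoted-printable layer by itself is not what hides the tag.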
What I am hoping for is that someone spots something that I have overlooked so far or tells me: “Do not bother with this kind of messy broken mail.”
but why would someone include invisible links in their mail?
i think this is fishy, a scam, maybe even potentially malicious.
Don't worry.
It is a response to an order that I have made via Internet and all that the message says is that the ordered book will be delivered one day later than planned.
Most of the formatting trouble is, however, certainly due to the mail and its HTML code being generated by yet another dumb web application of some service provider. But it is relatively short, and that is why I chose it as an example.
Last edited by Michael Uplawski; 09-05-2016 at 02:33 PM.
Reason: short
... on a side note: Is <br />, with the space before the slash, valid HTML? As the so-called HTML5 “standard” has been awaited long enough for me to lose interest, I deem it possible that we are about to relive the old days of “do as you please and be your own standard”.
In the example mail above, only the br tag includes such a space; not even the <img/> is formatted that way... Maybe my tag handlers must be coded for sender addresses more than for tags.
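As an aside for readers: `<br>`, `<br/>` and `<br />` all occur in real mail, and the space-before-slash spelling comes from XHTML compatibility practice. If tags are matched textually, one tolerant pattern can stand in for one handler per spelling. The following is only a sketch under that assumption; `VOID_TAG` and `normalize_void_tags` are illustrative names, not part of cremefraiche, and the pattern would break on attribute values that contain `>`.

```ruby
# Illustrative names; assumes tags are matched textually, which a real
# HTML parser would make unnecessary.
VOID_TAG = %r{<(br|img|hr)\b([^>]*?)\s*/?>}i

# Rewrite every spelling of a void tag to one canonical "<tag attrs/>" form.
def normalize_void_tags(html)
  html.gsub(VOID_TAG) do
    name  = Regexp.last_match(1).downcase
    attrs = Regexp.last_match(2).strip
    attrs.empty? ? "<#{name}/>" : "<#{name} #{attrs}/>"
  end
end

puts normalize_void_tags('a<br>b<br/>c<BR />d<img src="x.jpg" alt=""/>')
```

Normalizing first means the later tag handlers only ever see a single form, whatever the sender's generator emitted.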
Yes, HTML5 is a lot more forgiving. In the old days, I remember Netscape would simply crash when the HTML was fuzzy. Then came XHTML, and we were supposed to start writing XML instead. I guess it has some use: we have better tools and the standard is a lot stricter. But what should a browser do with some incorrect XML? Simply saying that the web page is not valid XML is not a good solution. Also, HTML can include stuff like JavaScript, so it becomes very tedious to do it properly.
“Did you find this post helpful”. I'd say the “[extremely]” button is missing.
Thank you Guttorm. Your information explains not only how they alternate lightheartedly between the closed-tag formats, but also why, in the future, I will just try to detect the difficult parsing tasks and give out a warning, rather than try to handle each possible variation in my ... “software”... whatever.
On the libxml2 page I read that its HTML parser copes with "real world" HTML, even if severely broken from a specification point of view.
Even if I locate Ruby-bindings to libxml2, this looks a lot like capitulation to the stupidity of the world. I am still not sure what to do. Maybe the pain is not yet big enough.
Edit: This is a better place for my “Maybe HTML” icon.
Last edited by Michael Uplawski; 09-06-2016 at 04:25 AM.
Reason: Maybe HTML, can be (who knows) displayed in one browser or another
Since my last post, above, this is what has happened:
I compared the Ruby bindings for libxml2 to nokogiri.
I noticed (after years of using nokogiri) that nokogiri is itself based on libxml2.
While trying to make better use of the capabilities of the libxml2 library, I noticed that I cannot set the caption of the XML tag <link/>, which the PDF generator (Prawn) needs to render hyperlinks.
Resorting to replacing the <link/> node by a string “<link href="...">caption</link>” solved the problem that I had with the example-mail and others.
Feeling inclined to finally give my Ruby gem “Crème Fraîche” a version 1.0, I thoroughly tested all functionality and discovered a glitch in the graphical user interface, where a function has been deprecated in the current GTK3 bindings for Ruby.
Instead of version 1.0, I released a new gem, version 0.8.4. I must drill some holes in walls, pull some cables, harvest some vegetables and am currently too frustrated to work on the GUI problem. It currently concerns only the configuration dialog where some options for the creation of the PDF file are set. But it looks ugly. I'll write a blog post when this problem is fixed.
Here is the current gem: Crème Fraîche. (WIP)
And below you find a corrected version of the PDF file that I just created with cremefraiche 0.8.4 from the exact same mail, shown above. Note the link.
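The string replacement described above (“<link href="...">caption</link>”) might look roughly like the following in plain Ruby. This is a sketch, not the code in the gem: `link_markup` is a hypothetical name, and falling back to the URL when the caption is empty (as with the second, text-less link in the example mail) is an assumption of this sketch, not something stated in the thread.

```ruby
require "cgi"

# Hypothetical helper: build the inline markup string that stands in for
# an <a> element. The href is inserted verbatim, so this assumes it
# contains no double quote.
def link_markup(href, caption)
  text = caption.to_s.strip
  # Assumption of this sketch: an empty caption falls back to the URL,
  # so the link stays visible in the PDF.
  text = href if text.empty?
  %(<link href="#{href}">#{CGI.escapeHTML(text)}</link>)
end

puts link_markup("http://t.news.mindbaz.com/c/?t=fb739fc-i0-43i3-9ma-s0ex", "cliquez-ici")
puts link_markup("http://t.news.mindbaz.com/c/?t=fb739fc-i0-43iw-9ma-s0ex", " ")
```

Escaping only the caption keeps stray `<` or `&` in the visible text from being read as markup, while the URL is passed through unchanged.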
Last edited by Michael Uplawski; 09-07-2016 at 02:41 AM.