LinuxQuestions.org
Old 09-05-2016, 12:05 PM   #1
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,620
Blog Entries: 40

Rep: Reputation: Disabled
HTML parsing problems: Is this a valid HTML mail?


Good evening.

I am having a hard time parsing mail content as soon as it comes in “HTML format”. The mails my program has to handle reflect a great diversity of opinions about what an “HTML mail” should look like, so it can only handle a subset of them. And I no longer try to prepare it for everything.

However. Sometimes I am undecided and see a chance that an error is not on my side.

In the example that I show below, it is the <a/> tag (the HTML link) which is not recognized. I wonder whether I handle the combination of Content-Type and Content-Transfer-Encoding correctly, and I always wonder why an HTML mail has to be encoded as Quoted-Printable in the first place...
Can you tell me whether the <a/> tags must be coded this way? If so, I have to diversify my (already numerous) approaches to recognizing individual tags in the mess-, pardon: mail. There is no plain-text alternative attached, of course. Who needs standards.

I attach below the PDF-output of my program cremefraiche, which converts eml-files to PDF (directly from my mailer Mutt, if you have not guessed).

Here is the example-mail with the important headers:

Code:
(...)
Content-Type: text/html; charset=utf-8
Subject: =?UTF-8?Q?R=C3=A9ception?= de votre commande de Manuel d'Economie Critique
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME::Lite 3.028 (F2.82; T1.35; A2.09; B3.13; Q3.13)
Date: Mon, 5 Sep 2016 14:01:03 +0200
Content-Length: 2213
Lines: 58

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.=
w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
=09
	<head>
	<title>Le Monde diplomatique</title>
	<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dutf-8" />
	</head>
=09
	<body bgcolor=3D"#FFFFFF">  <table style=3D"font-size: 14px; width: 100%;m=
argin:auto;text-align:center;font-family:Lucida Grande,Verdana,sans-serif;"=
><tr><td>Si ce message ne s'affiche pas correctement, <a href=3D"http://t.n=
ews.mindbaz.com/c/?t=3Dfb739fc-i0-43i3-9ma-s0ex" target=3D"_blank">cliquez-=
ici</a></td></tr></table>
	<table align=3D"center" border=3D"0" cellpadding=3D"0" cellspacing=3D"0" w=
idth=3D"650">
		<tbody>
=09=09
		<tr>
			<td>
				<img src=3D"http://imgrp.news.mindbaz.com/390/1510/logoMD_OK.jpg" alt=
=3D"" height=3D"57" width=3D"211"/></td>
		</tr>
=09=09
		<tr>
			<td>
				<p align=3D"justify">
					<br />
					<br />
					<font face=3D"Verdana, Arial, Helvetica, sans-serif"><font size=3D"2">=
Num&eacute;ro d'abonn&eacute; : 0010309295<br />
						<br />
						Madame, Monsieur,<br />
						<br />
						<br />
						Vous avez command&eacute; le Manuel d'Economie Critique et nous vous =
remercions de votre confiance.<br />
						Nous souhaitons vous informer que l'ouvrage vous parviendra &agrave; =
compter du 9&nbsp;septembre au lieu du 8.<br />
						Si vous r&eacute;sidez &agrave; l'&eacute;tranger, ce d&eacute;lai pe=
ut &ecirc;tre rallong&eacute; de quelques jours. Nous vous remercions pour =
votre compr&eacute;hension.<br />
						<br />
						Nous restons &agrave; votre &eacute;coute pour toute information comp=
l&eacute;mentaire et vous souhaitons une bonne lecture.<br />
						<br />
						L'&eacute;quipe du <i>Monde diplomatique</i><br />
					</font> </font></p>
				</td>
		</tr>
	</tbody>
</table>
 <table style=3D"font-size: 0px; color =3D "#ffffff"; width: 100%;margin:au=
to;text-align:center;font-family:Lucida Grande,Verdana,sans-serif;"><tr><td=
><a href=3D"http://t.news.mindbaz.com/c/?t=3Dfb739fc-i0-43iw-9ma-s0ex" targ=
et=3D"_blank"> </a></td></tr></table>  <img src=3D"http://t.news.mindbaz.co=
m/o/?t=3Di0-9ma-s0ex" width=3D"1" height=3D"1" alt=3D""/> </body>
</html>=
Attached Files
File Type: pdf msg001_test.emlmsg.pdf (20.5 KB, 17 views)

Last edited by Michael Uplawski; 09-05-2016 at 12:12 PM. Reason: Forgot the question.
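
One way to check the Content-Type / Content-Transfer-Encoding handling asked about here is to let a MIME library do the decoding and compare its result with one's own. The following is a minimal sketch in Python (not the Ruby code of cremefraiche), using only the standard `email` package. The message is a shortened copy of the example mail: only the first link is kept and the Subject is cut short, both assumptions made for brevity.

```python
from email import message_from_string
from email.policy import default

# Shortened copy of the example mail: the headers that matter plus the
# first link, still Quoted-Printable encoded (soft line breaks end in "=").
raw = (
    "Content-Type: text/html; charset=utf-8\n"
    "Subject: =?UTF-8?Q?R=C3=A9ception?= de votre commande\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "MIME-Version: 1.0\n"
    "\n"
    '<a href=3D"http://t.n=\n'
    'ews.mindbaz.com/c/?t=3Dfb739fc-i0-43i3-9ma-s0ex" target=3D"_blank">cliquez-=\n'
    "ici</a>\n"
)

# The "default" policy decodes encoded-word headers and applies the
# Content-Transfer-Encoding and charset when the content is requested.
msg = message_from_string(raw, policy=default)
subject = str(msg["Subject"])
html_body = msg.get_content()

print(msg.get_content_type())
print(subject)
print(html_body)
```

After decoding, the soft line breaks are gone and every `=3D` has become a plain `=`, so the `<a>` tag is ordinary HTML again. If a parser still misses the tag at that point, the encoding is probably not the cause.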
 
Old 09-05-2016, 12:25 PM   #2
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
the message seems to have some formatting errors.
i cleaned it up a little, but even so, the second <a></a> remains invisible, because there's no text.
i inserted the text "XxXxXxXxXxXx" so you can see what i mean. also the font size for that <table> was set to 0; i set it to 10.
this looks really fishy, inserting invisible links and images into emails.

Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>

	<head>
	<title>Le Monde diplomatique</title>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
	</head>

	<body bgcolor="#FFFFFF">  <table style="font-size: 14px; width: 100%;margin:auto;text-align:center;font-family:Lucida Grande,Verdana,sans-serif;"><tr><td>Si ce message ne s'affiche pas correctement, <a href="http://t.news.mindbaz.com/c/?t=fb739fc-i0-43i3-9ma-s0ex" target="_blank">cliquez-ici</a></td></tr></table>
	<table align="center" border="0" cellpadding="0" cellspacing="0" width="650">
		<tbody>

		<tr>
			<td>
				<img src="http://imgrp.news.mindbaz.com/390/1510/logoMD_OK.jpg" alt
="" height="57" width="211"/></td>
		</tr>

		<tr>
			<td>
				<p align="justify">
					<br />
					<br />
					<font face="Verdana, Arial, Helvetica, sans-serif"><font size="2">
Num&eacute;ro d'abonn&eacute; : 0010309295<br />
						<br />
						Madame, Monsieur,<br />
						<br />
						<br />
						Vous avez command&eacute; le Manuel d'Economie Critique et nous vous 
remercions de votre confiance.<br />
						Nous souhaitons vous informer que l'ouvrage vous parviendra &agrave; 
compter du 9&nbsp;septembre au lieu du 8.<br />
						Si vous r&eacute;sidez &agrave; l'&eacute;tranger, ce d&eacute;lai peut &ecirc;tre rallong&eacute; de quelques jours. Nous vous remercions pour 
votre compr&eacute;hension.<br />
						<br />
						Nous restons &agrave; votre &eacute;coute pour toute information compl&eacute;mentaire et vous souhaitons une bonne lecture.<br />
						<br />
						L'&eacute;quipe du <i>Monde diplomatique</i><br />
					</font> </font></p>
				</td>
		</tr>
	</tbody>
</table>
 <table style="font-size: 10px; color = #ffffff; width: 100%;margin:auto;text-align:center;font-family:Lucida Grande,Verdana,sans-serif;"><tr><td><a href="http://t.news.mindbaz.com/c/?t=fb739fc-i0-43iw-9ma-s0ex" target="_blank"> XxXxXxXxXxXx</a></td></tr></table>  <img src="http://t.news.mindbaz.com/o/?t=i0-9ma-s0ex" width="1" height="1" alt=""/> </body>
</html>
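
The point about the second link having no text can also be checked mechanically. This is a rough sketch using Python's standard `html.parser`; the class name `LinkCollector` is made up for illustration, and the input is just the two `<a>` elements of the decoded mail, not the whole message.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, caption) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            caption = "".join(self._text).strip()
            self.links.append((self._href, caption))
            self._href = None


# The two links of the mail, already decoded from Quoted-Printable.
html_text = (
    '<td><a href="http://t.news.mindbaz.com/c/?t=fb739fc-i0-43i3-9ma-s0ex" '
    'target="_blank">cliquez-ici</a></td>'
    '<td><a href="http://t.news.mindbaz.com/c/?t=fb739fc-i0-43iw-9ma-s0ex" '
    'target="_blank"> </a></td>'
)

collector = LinkCollector()
collector.feed(html_text)
collector.close()

for href, caption in collector.links:
    print(repr(caption), href)
```

The second pair comes back with an empty caption, so a converter has nothing to display for it unless it falls back to something else, such as the URL itself.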
 
Old 09-05-2016, 12:43 PM   #3
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,620

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
Quote:
Originally Posted by ondoho
the message seems to have some formatting errors.
i cleaned it up a little, but even so, the second <a></a> remains invisible, because there's no text.
i inserted the text "XxXxXxXxXxXx" so you can see what i mean. also the font size for that <table> was set to 0; i set it to 10.
this looks really fishy, inserting invisible links and images into emails.
Thank you ondoho, I appreciate the trouble you have taken to clean up my example mail.
However, most Quoted-Printable code is not a big problem, as I decode mail-content systematically before the PDF-generator in my program processes it as either an HTML or a plain-text mail.

But it is indeed the Quoted-Printable encoding of the links which I suspect to be either faulty (or superficial) or badly handled by myself. The tag-handlers are called recursively, so that the natural, nested structure of tags should in itself not cause a problem.

What I am hoping for is that someone spots something that I have overlooked so far or tells me: “Do not bother with this kind of messy broken mail.”
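
One thing that might be worth ruling out for these particular links is decoding the Quoted-Printable text more than once. This is only a guess at a possible cause, not something the thread establishes. The tracking URLs contain `?t=3Dfb739fc`, which decodes to `?t=fb739fc`, and a second decoding pass would read `=fb` as another escape sequence. A small sketch with Python's standard `quopri` module, using the second link of the mail:

```python
import quopri

# Second link of the example mail, still Quoted-Printable encoded.
encoded = (
    b'<a href=3D"http://t.news.mindbaz.com/c/?t=3Dfb739fc-i0-43iw-9ma-s0ex" targ=\n'
    b'et=3D"_blank"> </a>'
)

# A single pass removes the soft line break and turns "=3D" into "=".
once = quopri.decodestring(encoded)

# A second pass treats "=fb" in the already decoded URL as a hex escape
# and damages the link.
twice = quopri.decodestring(once)

print(once)
print(twice)
```

If the href survives a single pass intact, as it does here, the encoding of the links themselves is valid, and any remaining problem is more likely in the markup around them.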
 
Old 09-05-2016, 01:12 PM   #4
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
but why would someone include invisible links in their mail.
i think this is fishy, a scam, maybe even potentially mal.
 
Old 09-05-2016, 02:31 PM   #5
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,620

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
Quote:
Originally Posted by ondoho
but why would someone include invisible links in their mail.
i think this is fishy, a scam, maybe even potentially mal.
Don't worry.
It is a response to an order that I have made via Internet and all that the message says is that the ordered book will be delivered one day later than planned.

Most of the format trouble, though, is certainly due to the mail and its HTML code being generated by yet another dumb web application of some service provider. But it is relatively short, and that is why I chose it as an example.

Last edited by Michael Uplawski; 09-05-2016 at 02:33 PM. Reason: short
 
Old 09-06-2016, 12:04 AM   #6
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,620

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
... on a side-note: Is <br />, with the space before the slash, valid HTML? As the so-called HTML5 “standard” has been awaited long enough for me to lose interest, I deem it possible that we are about to re-live the old days of “do as you please and be your own standard”.

In the example mail, above, only the br-tag includes such a space, not even the <img/> is formatted that way... Maybe my tag-handlers must be coded for sender-addresses, more than for tags.
 
Old 09-06-2016, 03:34 AM   #7
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Rep: Reputation: 446
Yes, HTML5 is a lot more forgiving. In the old days, I remember Netscape would simply crash when the HTML was fuzzy. Then came XHTML. We should start writing XML instead. I guess it has some use. We have better tools and the standard is a lot more strict. But what should a browser do with some incorrect XML? Simply saying that the web page is not valid XML is not a good solution. Also, HTML can include stuff like javascript, so it becomes very tedious to do it properly.

Code:
<script type="text/javascript">
//<![CDATA[
alert("Hi");
//]]>
</script>
With HTML5, the / character inside tags is ignored if it's not the first character. You can actually write this

<br a/b=3>

Because the / is ignored, a and b are read as two separate attributes, so it's the same as this:

<br a="" b="3">

With libxml2, you can parse "crazy" html so you shouldn't have to worry about it.

http://xmlsoft.org/html/libxml-HTMLparser.html
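
To illustrate how a forgiving parser takes the burden off the tag-handlers, here is a small sketch in Python using the standard `html.parser` module. That module is lenient but is not the HTML5 tokenizer and not libxml2, so it only stands in for the general idea. The class name `TagLogger` is made up for this example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


logger = TagLogger()
# Three spellings of the same line break, as found in real-world mails.
logger.feed("a<br>b<br/>c<br />d")
logger.close()

print(logger.start_tags)
```

All three spellings are reported as the same `br` start tag, so code built on top of such a parser does not need one handler per variant.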
 
1 members found this post helpful.
Old 09-06-2016, 04:18 AM   #8
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,620

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
“Did you find this post helpful”. I'd say the “[extremely]” button is missing.

Thank you Guttorm. Your information does not only explain how they alternate lightheartedly between the closed-tag formats, but also why I will just try to detect the difficult parsing-tasks and give out a warning, in the future, rather than try to handle each possible variation in my ... “software”... whatever.

I read on the libxml2 page: "real world" HTML, even if severely broken from a specification point of view.
Even if I locate Ruby-bindings to libxml2, this looks a lot like capitulation to the stupidity of the world. I am still not sure what to do. Maybe the pain is not yet big enough.

Edit: This is a better place for my “Maybe HTML” icon.
Attached Thumbnails
maybe_html5.png (2.5 KB, 3 views)

Last edited by Michael Uplawski; 09-06-2016 at 04:25 AM. Reason: Maybe HTML, can be (who knows) displayed in one browser or another
 
Old 09-07-2016, 02:35 AM   #9
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,620

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
Since my last post, above, this is what has happened:
  1. I compared the Ruby-bindings for libxml2 to nokogiri
  2. I noticed (after years of using nokogiri) that nokogiri is itself based on libxml2
  3. Trying to exploit a little better the capacities of the libxml2 library, I noticed that I cannot set the caption of an XML-tag <link/> which is needed by the PDF-generator (Prawn) to render hyperlinks.
  4. Resorting to replacing the <link/> node by a string “<link href="...">caption</link>” solved the problem that I had with the example-mail and others.
  5. Feeling inclined to finally give my ruby-gem “Crème Fraîche” a version 1.0 I thoroughly tested all functionality and discovered a glitch in the graphical user-interface, where a function has been deprecated in the current GTK3 bindings for Ruby.
  6. Instead of version 1.0, I released a new Gem, version 0.8.4. I must drill some holes in walls, pull some cables, harvest some vegetables and am currently too frustrated to work on the GUI-problem. It currently concerns only the configuration-dialog where some options for the creation of the PDF-file are set. But it looks ugly. I'll write a Blog-post, when this problem is fixed.
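
The string replacement from step 4 could look roughly like the sketch below. It is written in Python rather than Ruby and is not the code of the gem. The helper name `to_link_markup`, the escaping of special characters, and the fallback to the URL when a link has no caption are all assumptions added for illustration.

```python
from html import escape


def to_link_markup(href, caption):
    """Build a '<link href="...">caption</link>' string for one hyperlink.

    Hypothetical helper: falls back to the URL when the caption is empty
    or whitespace only, and escapes characters that would break the markup.
    """
    text = caption.strip() or href
    return '<link href="{}">{}</link>'.format(
        escape(href, quote=True), escape(text, quote=False)
    )


# The invisible link from the example mail gets its URL as caption.
print(to_link_markup("http://t.news.mindbaz.com/c/?t=fb739fc-i0-43iw-9ma-s0ex", " "))
# A normal link keeps its caption.
print(to_link_markup("http://example.org/?a=1&b=2", "A & B"))
```

Whether the PDF renderer expects ampersands to be escaped in this markup is not confirmed by the thread and would need to be checked against its documentation.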

Here is the current gem: Crème Fraîche. (WIP)
And below you find a corrected version of the PDF file that I just created with cremefraiche 0.8.4 from the exact same mail, shown above. Note the link.
Attached Files
File Type: pdf 001_Le_Monde_diplomatique.eml.pdf (5.1 KB, 12 views)

Last edited by Michael Uplawski; 09-07-2016 at 02:41 AM.
 
Old 09-08-2016, 02:05 PM   #10
Michael Uplawski
Senior Member
 
Registered: Dec 2015
Posts: 1,620

Original Poster
Blog Entries: 40

Rep: Reputation: Disabled
A new Blog-entry presents Crème Fraîche as a command-line utility: Transforminator
 
  

