Hi,
I've been programming in Perl for a while now but still haven't found the best way to solve this problem.
I am writing a POP3 client and fetch program that gets the email from the POP3Client and inserts it into a MySQL database. The main problem is that the email is not readable when it is pulled back out of the database.
The emails initially displayed like this:
Code:
This is a multi-part message in MIME format.
------=_NextPart_000_0001_01C2E11D.5C73CF70
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
mr han man
------=_NextPart_000_0001_01C2E11D.5C73CF70
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
<html xmlns:o=3D"urn:schemas-microsoft-com:office:office" =
xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
xmlns=3D"http://www.w3.org/TR/REC-html40">
<head>
<META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; =
charset=3Diso-8859-1">
<meta name=3DProgId content=3DWord.Document>
<meta name=3DGenerator content=3D"Microsoft Word 10">
<meta name=3DOriginator content=3D"Microsoft Word 10">
<link rel=3DFile-List href=3D"cid:filelist.xml@01C2E11D.5C1079C0">
<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:DoNotRelyOnCSS/>
</o:OfficeDocumentSettings>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:WordDocument>
<w:SpellingState>Clean</w:SpellingState>
<w:GrammarState>Clean</w:GrammarState>
<w:DocumentKind>DocumentEmail</w:DocumentKind>
<w:EnvelopeVis/>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
</w:Compatibility>
<w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]-->
<style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0cm;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";}
a:link, span.MsoHyperlink
{color:blue;
text-decoration:underline;
text-underline:single;}
a:visited, span.MsoHyperlinkFollowed
{color:purple;
text-decoration:underline;
text-underline:single;}
span.EmailStyle17
{mso-style-type:personal-compose;
mso-style-noshow:yes;
mso-ansi-font-size:10.0pt;
mso-bidi-font-size:10.0pt;
font-family:Arial;
mso-ascii-font-family:Arial;
mso-hansi-font-family:Arial;
mso-bidi-font-family:Arial;
color:windowtext;}
span.SpellE
{mso-style-name:"";
mso-spl-e:yes;}
@page Section1
{size:612.0pt 792.0pt;
margin:72.0pt 90.0pt 72.0pt 90.0pt;
mso-header-margin:35.4pt;
mso-footer-margin:35.4pt;
mso-paper-source:0;}
div.Section1
{page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
/* Style Definitions */=20
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
mso-para-margin:0cm;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:"Times New Roman";}
</style>
<![endif]-->
</head>
<body lang=3DEN-US link=3Dblue vlink=3Dpurple =
style=3D"tab-interval:36.0pt">
<div class=3DSection1>
<p class=3DMsoNormal><font size=3D3 face=3D"Times New Roman"><span =
lang=3DEN-GB
style=3D"font-size:12.0pt;mso-ansi-language:EN-GB"><span =
class=3DSpellE>mr</span>
<span class=3DSpellE>han</span> man<o:p></o:p></span></font></p>
</div>
</body>
</html>
------=_NextPart_000_0001_01C2E11D.5C73CF70--
This is obviously a problem, as the user can't read the message. It is demonstrated here:
http://www.unixshak.org.uk:8080/hlpd....php?ticket=34
I then tried a very simple regular expression, s/<(.*?)>//sg, on the string containing the mail before it is inserted. This worked to a point, but still left most of the mail unreadable. I then added the regular expression recommended by perldoc -q remove.html, and that works fine :D
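One thing I noticed is that the HTML part is quoted-printable encoded (the =3D sequences and the trailing = at line ends in the dump above), so any tag stripping presumably has to happen after decoding. Here is a rough sketch of that order of operations. It uses the core MIME::QuotedPrint module; the helper name html_part_to_text and the sample string are made up for illustration, and the regexes are a naive stand-in rather than the perldoc one:

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical helper: turn one quoted-printable HTML part into rough plain text.
sub html_part_to_text {
    my ($qp_html) = @_;

    # Undo quoted-printable first: "=3D" becomes "=", and a trailing "="
    # (soft line break) rejoins the wrapped line.
    my $html = decode_qp($qp_html);

    # Drop comments (which takes the <!--[if ...]> ... <![endif]--> blocks
    # seen in the dump with it), then <style> elements, then remaining tags.
    $html =~ s/<!--.*?-->//sg;
    $html =~ s/<style\b.*?<\/style>//sig;
    $html =~ s/<[^>]*>//sg;

    # Collapse the whitespace left behind by the removed markup.
    $html =~ s/\s+/ /g;
    $html =~ s/^\s+|\s+$//g;
    return $html;
}

# Made-up sample modelled on the last lines of the HTML part above.
my $sample = qq{<p class=3DMsoNormal><span =\nclass=3DSpellE>mr</span>\n}
           . qq{<span class=3DSpellE>han</span> man</p>\n};
print html_part_to_text($sample), "\n";
```

This only handles a part that has already been cut out of the message; it does nothing about the MIME headers and boundary lines themselves.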
Really the aim of the whole exercise is to _completely_ obliterate any other features that are left within the email :|
If you look here:
http://www.unixshak.org.uk:8080/hlpdsk/ then you can see that the system works with all plaintext email:
http://www.unixshak.org.uk:8080/hlpd....php?ticket=43
It then struggles with the other HTML and MIME encoded parts of the emails :(
Does anyone have the ultimate solution for getting rid of all of the formatting text? I played with MIME::Parser and MIME::Body last night, but it didn't really get me what I needed :/
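To make clear what I'm after: since the message already carries a text/plain alternative, ideally I'd just store that part and ignore the HTML. Below is a hand-rolled sketch of the idea using only core modules. It is not how MIME::Parser works and it surely misses cases (nested multiparts, odd headers). plain_text_part and the toy message are made up, and it assumes the raw message still has the blank line between each part's headers and body, which the paste above has lost:

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);
use MIME::Base64 qw(decode_base64);

# Hypothetical helper: return the decoded text/plain part of a flat
# (non-nested) multipart message, or undef if there is none.
sub plain_text_part {
    my ($body) = @_;

    # Assumption: the first line starting with "--" is the boundary delimiter.
    my ($delim) = $body =~ /^(--\S+)\s*$/m or return;

    # Split on delimiter lines, including the closing "--boundary--" one.
    for my $part (split /^\Q$delim\E(?:--)?[ \t]*\r?\n?/m, $body) {
        # Part headers end at the first blank line.
        my ($head, $content) = split /\r?\n\r?\n/, $part, 2;
        next unless defined $content;

        # Unfold continuation lines such as the indented charset=... line.
        $head =~ s/\r?\n[ \t]+/ /g;
        next unless $head =~ /^Content-Type:\s*text\/plain\b/mi;

        my ($enc) = $head =~ /^Content-Transfer-Encoding:\s*(\S+)/mi;
        $enc = defined $enc ? lc $enc : '7bit';
        return decode_qp($content)     if $enc eq 'quoted-printable';
        return decode_base64($content) if $enc eq 'base64';
        return $content;    # 7bit, 8bit, binary: stored as-is
    }
    return;
}

# Toy message with an invented boundary, shaped like the dump above.
my $raw = join "\n",
    'This is a multi-part message in MIME format.',
    '',
    '--BOUNDARY',
    'Content-Type: text/plain;',
    "\tcharset=\"iso-8859-1\"",
    'Content-Transfer-Encoding: 7bit',
    '',
    'mr han man',
    '',
    '--BOUNDARY',
    'Content-Type: text/html',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    '<p class=3DMsoNormal>mr han man</p>',
    '',
    '--BOUNDARY--',
    '';

my $text = plain_text_part($raw);
print defined $text ? $text : "(no text/plain part found)\n";
```

This gets the plain part out of the toy message, but I wouldn't trust it on real mail, hence the question.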
Any help would be much appreciated.
Thanks in advance,
Shak