LinuxQuestions.org


Shak 03-06-2003 06:58 AM

Perl - MIME/HTML mail
 
Hi,

I've been programming in Perl for a while now but still haven't found the best way to solve this problem.

I am writing a POP3 client and fetch program that gets the e-mail through POP3Client and inserts it into a MySQL database. The main problem is that the email is not readable when it is pulled from the database.

The emails initially displayed like this:

Code:


This is a multi-part message in MIME format.

------=_NextPart_000_0001_01C2E11D.5C73CF70
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

mr han man

------=_NextPart_000_0001_01C2E11D.5C73CF70
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<html xmlns:o=3D"urn:schemas-microsoft-com:office:office" =
xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
xmlns=3D"http://www.w3.org/TR/REC-html40">

<head>
<META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; =
charset=3Diso-8859-1">


<meta name=3DProgId content=3DWord.Document>
<meta name=3DGenerator content=3D"Microsoft Word 10">
<meta name=3DOriginator content=3D"Microsoft Word 10">
<link rel=3DFile-List href=3D"cid:filelist.xml@01C2E11D.5C1079C0">
<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:DoNotRelyOnCSS/>
</o:OfficeDocumentSettings>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:WordDocument>
<w:SpellingState>Clean</w:SpellingState>
<w:GrammarState>Clean</w:GrammarState>
<w:DocumentKind>DocumentEmail</w:DocumentKind>
<w:EnvelopeVis/>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
</w:Compatibility>
<w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]-->
<style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0cm;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";}
a:link, span.MsoHyperlink
{color:blue;
text-decoration:underline;
text-underline:single;}
a:visited, span.MsoHyperlinkFollowed
{color:purple;
text-decoration:underline;
text-underline:single;}
span.EmailStyle17
{mso-style-type:personal-compose;
mso-style-noshow:yes;
mso-ansi-font-size:10.0pt;
mso-bidi-font-size:10.0pt;
font-family:Arial;
mso-ascii-font-family:Arial;
mso-hansi-font-family:Arial;
mso-bidi-font-family:Arial;
color:windowtext;}
span.SpellE
{mso-style-name:"";
mso-spl-e:yes;}
@page Section1
{size:612.0pt 792.0pt;
margin:72.0pt 90.0pt 72.0pt 90.0pt;
mso-header-margin:35.4pt;
mso-footer-margin:35.4pt;
mso-paper-source:0;}
div.Section1
{page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
/* Style Definitions */=20
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
mso-para-margin:0cm;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:"Times New Roman";}
</style>
<![endif]-->
</head>

<body lang=3DEN-US link=3Dblue vlink=3Dpurple =
style=3D"tab-interval:36.0pt">

<div class=3DSection1>

<p class=3DMsoNormal><font size=3D3 face=3D"Times New Roman"><span =
lang=3DEN-GB
style=3D"font-size:12.0pt;mso-ansi-language:EN-GB"><span =
class=3DSpellE>mr</span>
<span class=3DSpellE>han</span> man<o:p></o:p></span></font></p>

</div>

</body>

</html>

------=_NextPart_000_0001_01C2E11D.5C73CF70--

This obviously is a problem as the user can't read the file. This is demonstrated here: http://www.unixshak.org.uk:8080/hlpd....php?ticket=34

I then tried a very simple regular expression, s/<(?.*)>//sg, on the string containing the mail before it is inserted. This worked to a point, but still left most of the mail unreadable. I then added the regular expression that is recommended by perldoc -q remove.html, and that works fine :D
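
For anyone following along, here is a minimal sketch of why a greedy tag-stripping pattern eats the text along with the tags, next to the pattern from perlfaq9 (quoted from memory, so check it against your own perldoc). The sub names and the sample string are made up for illustration, and as the FAQ itself warns, no single regex copes with every HTML document.

```perl
use strict;
use warnings;

# Greedy: .* runs from the first '<' to the last '>' in the string,
# so everything between the outermost tags is deleted as well.
sub strip_greedy {
    my ($html) = @_;
    $html =~ s/<.*>//sg;
    return $html;
}

# perlfaq9-style pattern: removes one tag at a time and tolerates
# a '>' that appears inside a quoted attribute value.
sub strip_faq {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my $sample = '<p class="MsoNormal"><span title="a > b">mr han man</span></p>';
print "greedy: [", strip_greedy($sample), "]\n";
print "faq:    [", strip_faq($sample), "]\n";
```

Neither version touches the MIME boundaries or the quoted-printable encoding, which is why stripping tags alone still leaves the mail unreadable.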

Really the aim of the whole exercise is to _completely_ obliterate any other features that are left within the email :|

If you look here: http://www.unixshak.org.uk:8080/hlpdsk/ then you can see that the system works with all plaintext email: http://www.unixshak.org.uk:8080/hlpd....php?ticket=43

It then struggles with the other HTML and MIME encoded parts of the emails :(
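
The =3D sequences and the trailing = signs in the HTML part of the message above are quoted-printable encoding, not corruption. A minimal sketch of decoding it with the core MIME::QuotedPrint module follows; the sample string is a copy of two lines from that message, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# "=3D" encodes a literal "=", and a "=" at the very end of a line
# is a soft line break that joins the line to the one after it.
my $encoded = qq{<body lang=3DEN-US link=3Dblue vlink=3Dpurple =\n}
            . qq{style=3D"tab-interval:36.0pt">\n};

my $decoded = decode_qp($encoded);
print $decoded;
```

A MIME parser applies this decoding for you according to each part's Content-Transfer-Encoding header, so doing it by hand is mostly useful for inspecting a raw message.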

Does anyone have the ultimate solution for getting rid of all of the formatting text? I played with MIME::Parser and MIME::Body last night, and it didn't really get me what I needed :/

Any help would be much appreciated.

Thanks in Advance,

Shak

joesbox 03-07-2003 09:48 PM

sorry to interrupt, but i am working on a program that needs to parse through html files, delete all the html coding and leave the text. i see that you may have something that i need. if i am reading your post correctly, you have a regular expression to remove all html??
Quote:

I then added the regular expression that is recommended by perldoc -q remove.html and that works fine :D
am i right or just hoping??? btw sorry for interrupting your thread.

Shak 03-09-2003 01:42 PM

Run that perldoc -q query and you get not only a regular expression but also a link to a Perl script that will remove _all_ HTML from a document. Unfortunately that does not suffice for my problem.

Shak

Shak 03-09-2003 07:43 PM

OK, I solved the problem. I did some research and found that the Outlook mail is really just a souped-up two-part MIME message. Perl has an array of modules (available from CPAN) for MIME; if you're interested, check out MIME::Tools. I used MIME::Parser, MIME::Entity and MIME::Body. Below is the code I used to solve the problem (I added some extra comments as it's out of context):

Code:

    # Mail::POP3Client: fetch the headers and body of POP3 message $i
    $body = $connection->HeadAndBody($i);
    # Parse the message with MIME::Parser; the result is a MIME::Entity
    $msg = $parser->parse_data($body);
    # Count the parts: 0 means plain text, anything else is multipart MIME
    $num_parts = $msg->parts;
    if ($num_parts == 0) {
        # Plain text: get the body straight from POP3Client
        $message = $connection->Body($i);
        # Escape angle brackets and strip single quotes so the text
        # is safe to insert into MySQL
        $message =~ s/</&lt;/g;
        $message =~ s/>/&gt;/g;
        $message =~ s/'//g;
    }
    else {
        # Multipart MIME: take the first part (the plain text) as a string
        $message = $msg->parts(0)->bodyhandle->as_string;
    }

Simple really, it just took some working out :)

Shak
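
A hedged follow-up sketch for readers: taking parts(0) works for the two-part Outlook mail shown earlier, but it assumes the first part is a text/plain leaf. If the first part is itself multipart (a message with attachments, for instance), it has no bodyhandle. The version below walks the entity tree and returns the first text/plain body it finds. The sub name first_plain_text and the sample message are invented for illustration and are not part of the code above; it needs MIME-tools from CPAN.

```perl
use strict;
use warnings;
use MIME::Parser;

# Depth-first search for the first text/plain leaf in a MIME::Entity tree.
# Returns the decoded body as a string, or undef if there is none.
sub first_plain_text {
    my ($entity) = @_;
    my @parts = $entity->parts;
    if (@parts) {
        for my $part (@parts) {
            my $text = first_plain_text($part);
            return $text if defined $text;
        }
        return undef;
    }
    return undef unless $entity->effective_type eq 'text/plain';
    my $bh = $entity->bodyhandle or return undef;
    return $bh->as_string;
}

my $parser = MIME::Parser->new;
$parser->output_to_core(1);    # keep decoded bodies in memory, not in files

# Hypothetical sample: a cut-down two-part message like the one in the thread.
my $raw = join "\n",
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="XX"',
    '',
    '--XX',
    'Content-Type: text/plain; charset="iso-8859-1"',
    'Content-Transfer-Encoding: 7bit',
    '',
    'mr han man',
    '--XX',
    'Content-Type: text/html; charset="iso-8859-1"',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    '<p class=3DMsoNormal>mr han man</p>',
    '--XX--',
    '';

my $entity = $parser->parse_data($raw);
my $text   = first_plain_text($entity);
print defined $text ? "$text\n" : "(no text/plain part)\n";
```

The same sub also covers plain, non-multipart mail, since such a message parses to a single entity with no parts and a text/plain type.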


All times are GMT -5.