Old 03-06-2003, 06:58 AM   #1
Shak
Member
 
Registered: May 2002
Location: Huddersfield
Distribution: Redhat (7.2, 7.3, 8.0), Debian, Slackware, Gentoo, FreeBSD
Posts: 169

Rep: Reputation: 30
Perl - MIME/HTML mail


Hi,

I've been programming in Perl for a while now but still haven't found the best way to solve this problem.

I am writing a POP3 Client and fetch program to get the E-Mail from the POP3Client and insert it into a MySQL database. The main problem is that email is not readable when it is pulled from the database.

The emails initially displayed like this:

Code:
 
This is a multi-part message in MIME format.

------=_NextPart_000_0001_01C2E11D.5C73CF70
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

mr han man

------=_NextPart_000_0001_01C2E11D.5C73CF70
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<html xmlns:o=3D"urn:schemas-microsoft-com:office:office" =
xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
xmlns=3D"http://www.w3.org/TR/REC-html40">

<head>
<META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; =
charset=3Diso-8859-1">


<meta name=3DProgId content=3DWord.Document>
<meta name=3DGenerator content=3D"Microsoft Word 10">
<meta name=3DOriginator content=3D"Microsoft Word 10">
<link rel=3DFile-List href=3D"cid:filelist.xml@01C2E11D.5C1079C0">
<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:DoNotRelyOnCSS/>
</o:OfficeDocumentSettings>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:WordDocument>
<w:SpellingState>Clean</w:SpellingState>
<w:GrammarState>Clean</w:GrammarState>
<w:DocumentKind>DocumentEmail</w:DocumentKind>
<w:EnvelopeVis/>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
</w:Compatibility>
<w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]-->
<style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0cm;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";}
a:link, span.MsoHyperlink
{color:blue;
text-decoration:underline;
text-underline:single;}
a:visited, span.MsoHyperlinkFollowed
{color:purple;
text-decoration:underline;
text-underline:single;}
span.EmailStyle17
{mso-style-type:personal-compose;
mso-style-noshow:yes;
mso-ansi-font-size:10.0pt;
mso-bidi-font-size:10.0pt;
font-family:Arial;
mso-ascii-font-family:Arial;
mso-hansi-font-family:Arial;
mso-bidi-font-family:Arial;
color:windowtext;}
span.SpellE
{mso-style-name:"";
mso-spl-e:yes;}
@page Section1
{size:612.0pt 792.0pt;
margin:72.0pt 90.0pt 72.0pt 90.0pt;
mso-header-margin:35.4pt;
mso-footer-margin:35.4pt;
mso-paper-source:0;}
div.Section1
{page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
/* Style Definitions */=20
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
mso-para-margin:0cm;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:"Times New Roman";}
</style>
<![endif]-->
</head>

<body lang=3DEN-US link=3Dblue vlink=3Dpurple =
style=3D"tab-interval:36.0pt">

<div class=3DSection1>

<p class=3DMsoNormal><font size=3D3 face=3D"Times New Roman"><span =
lang=3DEN-GB
style=3D"font-size:12.0pt;mso-ansi-language:EN-GB"><span =
class=3DSpellE>mr</span>
<span class=3DSpellE>han</span> man<o:p></o:p></span></font></p>

</div>

</body>

</html>

------=_NextPart_000_0001_01C2E11D.5C73CF70--
This obviously is a problem as the user can't read the file. This is demonstrated here: http://www.unixshak.org.uk:8080/hlpd....php?ticket=34

I then tried implementing a very simple regular expression, s/<(?.*)>//sg, on the string containing the mail before it is inserted. This worked to a point, but still left most of the mail unreadable. I then added the regular expression that is recommended by perldoc -q remove.html and that works fine :D
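For readers wondering why a simple tag-stripping substitution leaves so much behind, here is a minimal sketch. It assumes patterns along the lines of the one described above (the exact regex in the post is not reproduced), and the HTML fragment is a hypothetical one modelled on the Word-generated mail. It shows that a greedy match deletes everything, while a non-greedy match removes the tags but keeps the CSS between <style> and </style>, because that CSS is ordinary text as far as the regex is concerned.

```perl
use strict;
use warnings;

# Hypothetical fragment modelled on the Word-generated HTML above.
my $html = '<style>p.MsoNormal {margin:0cm;}</style>'
         . '<p class="MsoNormal"><span>mr han man</span></p>';

# Greedy: .* runs from the first "<" to the last ">", so everything is deleted.
(my $greedy = $html) =~ s/<.*>//sg;

# Non-greedy: each tag is removed on its own, but the CSS that sits
# between <style> and </style> is ordinary text and survives.
(my $lazy = $html) =~ s/<.*?>//sg;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The second result still contains the style rule in front of the real text, which is the kind of leftover that makes the stored mail hard to read.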

Really the aim of the whole exercise is to _completely_ obliterate any other features that are left within the email :|

If you look here: http://www.unixshak.org.uk:8080/hlpdsk/ then you can see that the system works with all plaintext email: http://www.unixshak.org.uk:8080/hlpd....php?ticket=43

It then struggles with the other HTML and MIME encoded parts of the emails :(
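Part of what makes the HTML section unreadable is the quoted-printable transfer encoding: "=3D" is an encoded "=" and a trailing "=" is a soft line break. As a small sketch, assuming the core MIME::QuotedPrint module is available, two lines copied from the sample above can be decoded like this:

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Two lines copied from the quoted-printable HTML part shown earlier.
my $encoded = qq{<body lang=3DEN-US link=3Dblue vlink=3Dpurple =\n}
            . qq{style=3D"tab-interval:36.0pt">\n};

# decode_qp turns =3D back into "=" and joins lines split by a soft break.
my $decoded = decode_qp($encoded);
print $decoded;
```

Decoding only undoes the transfer encoding; the result is still HTML, so it does not by itself produce readable plain text.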

Does anyone have the ultimate solution for getting rid of all of the formatting text? I played with MIME::Parser and MIME::Body last night, and they didn't really give me what I needed :/

Any help would be much appreciated.

Thanks in Advance,

Shak
 
Old 03-07-2003, 09:48 PM   #2
joesbox
Member
 
Registered: Feb 2003
Location: hampton va
Distribution: ubuntu
Posts: 502

Rep: Reputation: 30
Sorry to interrupt, but I am working on a program that needs to parse through HTML files, delete all the HTML coding and leave the text. I see that you may have something that I need. If I am reading your post correctly, you have a regular expression to remove all HTML??
Quote:
I then added the regular expression that is recommended by perldoc -q remove.html and that works fine
Am I right or just hoping??? BTW, sorry for interrupting your thread.
 
Old 03-09-2003, 01:42 PM   #3
Shak
Member
 
Registered: May 2002
Location: Huddersfield
Distribution: Redhat (7.2, 7.3, 8.0), Debian, Slackware, Gentoo, FreeBSD
Posts: 169

Original Poster
Rep: Reputation: 30
Run that perldoc -q query and you will find not only a regular expression but also a link to a Perl script that will remove _all_ HTML from a document. Unfortunately that does not suffice for my problem.

Shak
 
Old 03-09-2003, 07:43 PM   #4
Shak
Member
 
Registered: May 2002
Location: Huddersfield
Distribution: Redhat (7.2, 7.3, 8.0), Debian, Slackware, Gentoo, FreeBSD
Posts: 169

Original Poster
Rep: Reputation: 30
OK, I solved the problem. I did some research and found that the Outlook mail is really just a souped-up two-part MIME message. Perl has an array of modules (available from CPAN) for MIME; if you're interested, check out MIME::Tools. I used MIME::Parser, MIME::Entity and MIME::Body. Below is the code I used to solve the problem (I added some extra comments as it's out of context):

Code:
# Fragment: $connection (Mail::POP3Client), $parser (MIME::Parser) and $i are set up elsewhere
# Use Mail::POP3Client to get the headers and body of the POP3 mail in question
$body = $connection->HeadAndBody($i);
# Parse the message with MIME::Parser; the result is a MIME::Entity
$msg = $parser->parse_data($body);
# Count the parts to find out if this is a multipart MIME message or just plain text
$num_parts = $msg->parts;
# If it has 0 parts, it is plain text
if ($num_parts == 0) {
    # Get the message body via POP3Client
    $message = $connection->Body($i);
    # Use this series of substitutions to make the text OK for MySQL
    $message =~ s/</&lt;/g;
    $message =~ s/>/&gt;/g;
    $message =~ s/'//g;
}
else {
    # If it is MIME, take the first part (the plain text) as a string
    $message = $msg->parts(0)->bodyhandle->as_string;
}
Simple really, it just took some working out.

Shak

Last edited by Shak; 03-09-2003 at 07:44 PM.
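The code above takes the first part of a multipart message and treats it as the plain text. As a hedged variation on the same idea (not the poster's method), the sketch below walks the parts and picks the first one whose type is text/plain, so it still works if the HTML part happens to come first or the parts are nested. The sample message and the helper name first_plain_text are hypothetical; the calls used (output_to_core, parse_data, parts, effective_type, bodyhandle, as_string) come from the MIME-tools distribution on CPAN, which must be installed.

```perl
use strict;
use warnings;
use MIME::Parser;

# Hypothetical two-part message standing in for what HeadAndBody() returns.
# The HTML part is deliberately placed first.
my $raw = join "\n",
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="demo"',
    '',
    'This is a multi-part message in MIME format.',
    '',
    '--demo',
    'Content-Type: text/html; charset="iso-8859-1"',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    '<p class=3DMsoNormal>mr han man</p>',
    '',
    '--demo',
    'Content-Type: text/plain; charset="iso-8859-1"',
    'Content-Transfer-Encoding: 7bit',
    '',
    'mr han man',
    '',
    '--demo--',
    '';

my $parser = MIME::Parser->new;
$parser->output_to_core(1);    # keep decoded bodies in memory, not in files

my $entity = $parser->parse_data($raw);

# Hypothetical helper: return the decoded text of the first text/plain
# leaf found by a depth-first walk, or undef if there is none.
sub first_plain_text {
    my ($ent) = @_;
    if ($ent->parts) {
        for my $part ($ent->parts) {
            my $text = first_plain_text($part);
            return $text if defined $text;
        }
        return undef;
    }
    return undef unless $ent->effective_type eq 'text/plain';
    my $body = $ent->bodyhandle or return undef;
    return $body->as_string;
}

my $message = first_plain_text($entity);
print defined $message ? $message : "(no text/plain part found)\n";
```

Because a message without MIME headers is treated as text/plain by default, the same helper should also cover the single-part case, so the separate branch on the number of parts may not be needed.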
 
  

