LinuxQuestions.org
Old 03-27-2007, 07:57 AM   #1
Andrew_OC
LQ Newbie
 
Registered: Nov 2006
Posts: 26

Rep: Reputation: 15
Strip Mime & HTML from MBOX files


OK, this is quite a tricky request, but I'm trying to convert the mail files from a legacy reader called Ameol2 to work with Thunderbird. I'm 80% there but have a snag I'm trying to resolve.

I can convert the mail folder files and import them into Thunderbird and 90% of the messages work AOK (even displaying the HTML version of a message!), but messages which had attachments don't display.

I have traced this down to what happens when Ameol2 decodes the attachment.

When you decode a message in A2, it saves the attachment to a directory and then strips only the trailing MIME part from the message, leaving the "There's a mime attachment to this message" note in the header. Thunderbird then reads the header but can't find the attachment (because Ameol2 has stripped it out), assumes the message is malformed and doesn't display it.

If I use grep -v to strip out all of the "There's a mime attachment to this message" header info, then the HTML messages stop displaying and it all looks a bit of a mess.

Is there a program that can process my half-converted mbox files and strip out all of the MIME & HTML portions, leaving only the plain text versions of the messages (which I can live with)?


Thanks in advance!

Andrew.
 
Old 03-27-2007, 08:32 AM   #2
Nick_Battle
Member
 
Registered: Dec 2006
Location: Bracknell, UK
Distribution: SUSE 13.1
Posts: 159

Rep: Reputation: 33
I'm not aware of a program to do this - from the sound of it, the MIME remaining in the mbox files is corrupt, which would make it difficult.

But you might be able to salvage quite a lot if you understand the MIME layout. First of all, read through the MIME RFC, RFC 1521. Basically, the text parts will be topped and tailed by a line starting with "--", and the sub-headers within those parts will include a Content-Type of "text/plain", possibly followed by other options.

You should be able to use awk to locate such sections of text and spit them out without the surrounding MIMEery, but it will be tricky to do a perfect job.

HTH,
-nick
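
The layout described above can be sketched in Python rather than awk. This is only a rough illustration, assuming the boundary string is already known (it normally comes from the boundary= parameter of the top-level Content-Type header) and that parts are not nested. The function name plain_text_parts is made up for this example.

```python
import re

def plain_text_parts(body: str, boundary: str) -> list[str]:
    """Return the content of each text/plain part delimited by --boundary lines."""
    delimiter = "--" + boundary
    closing = delimiter + "--"
    parts, current = [], None
    for line in body.splitlines():
        stripped = line.rstrip()
        if stripped in (delimiter, closing):
            # A delimiter ends the part being collected, if any.
            if current is not None:
                parts.append(current)
            # An opening delimiter starts a new part; the closing one does not.
            current = [] if stripped == delimiter else None
        elif current is not None:
            current.append(line)
    # A mangled message may lack the closing delimiter: keep what was collected.
    if current is not None:
        parts.append(current)

    result = []
    for lines in parts:
        # Sub-headers end at the first blank line of the part.
        blank = next((i for i, l in enumerate(lines) if not l.strip()), None)
        if blank is None:
            continue
        headers = "\n".join(lines[:blank]).lower()
        content = "\n".join(lines[blank + 1:])
        # A part without a Content-Type sub-header defaults to text/plain.
        if "content-type:" not in headers or re.search(r"content-type:\s*text/plain", headers):
            result.append(content)
    return result
```

Text before the first delimiter (the preamble) is ignored, and encoded parts (base64, quoted-printable) are returned undecoded, so this is a starting point rather than a complete extractor.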
 
Old 03-27-2007, 10:16 AM   #3
Nick_Battle
Member
 
Registered: Dec 2006
Location: Bracknell, UK
Distribution: SUSE 13.1
Posts: 159

Rep: Reputation: 33
Can you post a (short!) complete message that has had an attachment removed? I'm thinking it might be easier to inject a dummy attachment than to extract the text... but it depends on exactly how the MIME is mangled.
 
Old 03-27-2007, 11:01 AM   #4
Andrew_OC
LQ Newbie
 
Registered: Nov 2006
Posts: 26

Original Poster
Rep: Reputation: 15
Ideas

Hi Nick,

Thanks for taking the time to think about my mail problem. I appreciate it.

I'll try and find some suitable messages to post as examples

It's quite tricky to isolate it down to a uniform set of if...then expressions.


The idea I had was to find a way to completely remove all the HTML & MIME sections from the mbox file (hopefully leaving only the plain text messages) and then perhaps replace the MIME note in the header with one that specifies plain text rather than "look, there's an attachment".

e.g. replace Content-Type: multipart/alternative; with Content-Type: text/plain; charset="us-ascii"

I'm thinking this is quite a tricky suck-and-see problem/solution.

I wonder if formail would help?

Andrew.
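
The header swap suggested above could look roughly like this with Python's standard email package. It is a sketch under assumptions, not a tested conversion tool: force_plain_text is a hypothetical name, and the function deliberately leaves alone any message in which the parser finds real MIME parts.

```python
from email import message_from_string

PLAIN = 'text/plain; charset="us-ascii"'

def force_plain_text(raw_message: str) -> str:
    """Relabel one message as plain text, leaving its body untouched."""
    msg = message_from_string(raw_message)
    if msg.is_multipart():
        # The parser found real MIME parts, so the header is correct: leave it.
        return raw_message
    if "Content-Type" in msg:
        msg.replace_header("Content-Type", PLAIN)
    else:
        msg["Content-Type"] = PLAIN
    return msg.as_string()
```

Going through the parser instead of a plain text substitution means a Content-Type header folded across several lines is replaced as a whole, and matching text in the body is never touched.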
 
Old 03-27-2007, 11:49 AM   #5
Andrew_OC
LQ Newbie
 
Registered: Nov 2006
Posts: 26

Original Poster
Rep: Reputation: 15
here's a sample message:
http://www.pastebin.ca/412209

Paste this into a file and put it in Thunderbird's mail/inbox.sbd folder.


Things get more complex when you have HTML parts in the message and undecoded attachments...
 
Old 03-27-2007, 12:15 PM   #6
Nick_Battle
Member
 
Registered: Dec 2006
Location: Bracknell, UK
Distribution: SUSE 13.1
Posts: 159

Rep: Reputation: 33
OK. That example is perfectly well formed, apart from the Content-Type, which as you say should be text/plain. If you changed that, the message should be acceptable to TB, and would include the text substituted for the PDF.

But presumably a naive substitution of all the Multipart/Mixed headers for text/plain would also zap perfectly good multipart/mixed messages that you have in there too?

I'm not familiar with formail.

Cheers,
-nick
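
One way to avoid zapping the good multipart messages is to test each message before touching it. The sketch below assumes that a message whose attachment was stripped no longer contains any "--boundary" line in its body, which matches the sample discussed here but has not been checked against the whole mailbox. The name has_dangling_multipart_header is invented for the example.

```python
from email import message_from_string

def has_dangling_multipart_header(raw_message: str) -> bool:
    """True if the header says multipart but the body has no opening delimiter."""
    msg = message_from_string(raw_message)
    if msg.get_content_maintype() != "multipart":
        return False
    boundary = msg.get_boundary()
    if boundary is None:
        # Declared multipart without a boundary parameter: cannot be valid.
        return True
    delimiter = "--" + boundary
    # The body starts after the first blank line.
    header_end = raw_message.find("\n\n")
    body = raw_message[header_end + 2:] if header_end != -1 else ""
    return not any(line.rstrip() == delimiter for line in body.splitlines())
```

Only messages for which this returns True would be candidates for a Content-Type rewrite; everything else is left as it is.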
 
Old 03-27-2007, 12:25 PM   #7
Andrew_OC
LQ Newbie
 
Registered: Nov 2006
Posts: 26

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by Nick_Battle
But presumably a naive substitution of all the Multipart/Mixed headers for text/plain would also zap perfectly good multipart/mixed messages that you have in there too?
That's the problem. If I have messages that have HTML parts or valid attachments that haven't been decoded, then it screws up all messages after the first one with a valid attachment, if I remember my tests.
 
Old 03-27-2007, 01:38 PM   #8
Nick_Battle
Member
 
Registered: Dec 2006
Location: Bracknell, UK
Distribution: SUSE 13.1
Posts: 159

Rep: Reputation: 33
Quote:
Originally Posted by Andrew_OC
That's the problem. If I have messages that have HTML parts or valid attachments that haven't been decoded, then it screws up all messages after the first one with a valid attachment, if I remember my tests.
OK. I need to see what one of these more complicated messages looks like. Can you post another, or mail me one directly (nick.battle@gmail.com)? These were presumably a mixture of decoded/removed attachments and other MIME parts which weren't mangled?

Cheers,
-nick
 
Old 03-28-2007, 04:18 AM   #9
Nick_Battle
Member
 
Registered: Dec 2006
Location: Bracknell, UK
Distribution: SUSE 13.1
Posts: 159

Rep: Reputation: 33
Andrew, I looked at the mbox you sent. I've modified it and mailed it back to you.

For some reason, the file had short groups of headers, each with a valid "From" prefix. So firstly, these were being interpreted as separate messages with no content (and few headers!). Then, some of the messages that had had attachments removed still had Content-Type headers for a multi-part message, even though they were actually text/plain. Fixing that for the few cases that remained seemed to produce a working mailbox.

The mbox you sent only had 15 messages (not 17 as I said in my mail!), so manual repair was easy. If you need to perform this on a larger number of messages, it will be trickier, but awk should be able to cope.

HTH,
-nick

Last edited by Nick_Battle; 03-28-2007 at 04:31 AM.
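
For a larger mailbox, the two problems described in this post could be surveyed automatically before any repair is attempted. The following is a minimal sketch using Python's standard mailbox module, not the procedure actually used on the file above; report_suspect_messages and its labels are hypothetical, and it only reports, it does not modify the mbox.

```python
import mailbox

def report_suspect_messages(mbox_path: str) -> list[tuple[int, str]]:
    """List (index, reason) for messages that look like the two failure cases."""
    suspects = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key, msg in box.iteritems():
            payload = msg.get_payload()
            if isinstance(payload, str) and not payload.strip():
                # A "From " line followed only by headers: a stray header group.
                suspects.append((key, "empty body"))
            elif msg.get_content_maintype() == "multipart" and not msg.is_multipart():
                # Header claims multipart, but the parser found no parts.
                suspects.append((key, "multipart header, no parts"))
    finally:
        box.close()
    return suspects
```

Running it on a copy of the mailbox gives a list to check by hand first; a message with a deliberately empty body would also be flagged, so the output is a hint rather than a verdict.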
 
  

