LinuxQuestions.org


Andrew_OC 03-27-2007 06:57 AM

Strip Mime & HTML from MBOX files
 
OK, this is quite a tricky request, but I'm trying to convert the mail files from a legacy reader called Ameol2 to work with Thunderbird. I'm 80% there but have a snag I'm trying to resolve.

I can convert the mail folder files and import them into Thunderbird and 90% of the messages work AOK (even displaying the HTML version of a message!), but messages which had attachments don't display.

I have traced this down to what happens when Ameol2 decodes the attachment.

When you decode a message in Ameol2 (A2), it saves the attachment to a directory and then strips only the trailing MIME part from the message, leaving the "There's a MIME attachment to this message" note in the header. Thunderbird then reads this but can't find the attachment (because Ameol2 has stripped it out), assumes it's a malformed message and doesn't display it :(

If I use grep -v to strip out all of the "There's a MIME attachment to this message" header info, then the HTML messages stop displaying and it looks a bit of a mess :(

Is there a program that can process my half-converted mbox files and strip out all of the MIME & HTML portions, leaving only the plain text message versions (which I can live with)?


Thanks in advance!

Andrew.

Nick_Battle 03-27-2007 07:32 AM

I'm not aware of a program to do this - from the sound of it, the MIME remaining in the mbox files is corrupt, which would make it difficult.

But you might be able to salvage quite a lot if you understand the MIME layout. First of all, read through the RFC: RFC 1521 (since superseded by RFC 2045-2049). Basically, each part is topped and tailed by a boundary line starting with "--", and the sub-headers within a text part will include a Content-Type of "text/plain", possibly with other options following.

You should be able to use awk to locate such sections of text and spit them out without the surrounding MIMEery, but it will be tricky to do a perfect job.
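
As a rough illustration of the layout described above, here is a minimal sketch that pulls the text/plain parts out of one well-formed multipart message. It uses Python's standard email module rather than awk, purely as an assumption for illustration; the function name plain_text_parts and the sample message are made up, and mangled messages may well need more care than this.

```python
import email


def plain_text_parts(raw_message: str) -> list[str]:
    """Return the decoded body of every text/plain part in one message."""
    msg = email.message_from_string(raw_message)
    texts = []
    for part in msg.walk():
        # multipart/* containers only hold other parts; skip them.
        if part.is_multipart():
            continue
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes, transfer-decoded
            charset = part.get_content_charset() or "us-ascii"
            texts.append(payload.decode(charset, errors="replace"))
    return texts


# Hypothetical sample: one plain part and one HTML part between "--" boundaries.
SAMPLE = """\
From: someone@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="us-ascii"

Hello in plain text.
--XYZ
Content-Type: text/html; charset="us-ascii"

<p>Hello in HTML.</p>
--XYZ--
"""

parts = plain_text_parts(SAMPLE)
```

Here parts should hold just the plain-text body, with the HTML alternative left behind.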

HTH,
-nick

Nick_Battle 03-27-2007 09:16 AM

Can you post a (short!) complete message that has had an attachment removed? I'm thinking it might be easier to inject a dummy attachment than to extract the text... but it depends on exactly how the MIME is mangled.

Andrew_OC 03-27-2007 10:01 AM

Ideas
 
Hi Nick,

Thanks for taking the time to think about my mail problem. I appreciate it.

I'll try and find some suitable messages to post as examples.

It's quite tricky to isolate it down to a uniform set of if...then expressions.


The idea I had was to find a way to completely remove all the HTML & MIME sections from the mbox file (hopefully leaving only the plain text messages) and then perhaps replace the MIME note in the header with one that specifies plain text rather than "look, there's an attachment".

e.g. replace Content-Type: multipart/alternative; with Content-Type: text/plain; charset="us-ascii"
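
A minimal sketch of that header swap for a single message, assuming Python's standard email package and assuming the body is already plain text (i.e. the MIME parts really have been stripped). The function name retag_as_plain and the sample message are hypothetical, and the message is taken without its mbox "From " separator line.

```python
from email.parser import HeaderParser

PLAIN = 'text/plain; charset="us-ascii"'


def retag_as_plain(raw_message: str) -> str:
    """Rewrite the top-level Content-Type to text/plain, leaving the body alone."""
    # HeaderParser reads the headers only and keeps the body as one
    # unparsed string, so a bogus multipart declaration is not followed.
    msg = HeaderParser().parsestr(raw_message)
    if "Content-Type" in msg:
        msg.replace_header("Content-Type", PLAIN)
    else:
        msg["Content-Type"] = PLAIN
    return msg.as_string()


# Hypothetical message whose attachment part has already been removed.
SAMPLE = """\
From: someone@example.com
Subject: report
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

Just the text survives.
"""

fixed = retag_as_plain(SAMPLE)
```

This only changes the label on the message. Applied blindly to a message that still has real MIME parts, it would make the boundaries and encoded attachments show up as raw text.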

I'm thinking this is quite a tricky suck-and-see problem/solution.

I wonder if formail would help?

Andrew.

Andrew_OC 03-27-2007 10:49 AM

here's a sample message:
http://www.pastebin.ca/412209

Paste this into a file and put it in Thunderbird's mail/inbox.sbd folder.


Things get more complex when you have HTML parts in the message and undecoded attachments...

Nick_Battle 03-27-2007 11:15 AM

OK. That example is perfectly well formed, apart from the Content-Type, which as you say should be text/plain. If you changed that, the message should be acceptable to TB, and would include the text substituted for the PDF.

But presumably a naive substitution of all the Multipart/Mixed headers for text/plain would also zap perfectly good multipart/mixed messages that you have in there too?
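
One way to avoid zapping the good ones might be to test each message first. The sketch below is only a heuristic under a loud assumption: that a stripped message contains no boundary lines at all, so Python's standard email parser finds no sub-parts in it. If a stripped message still carries some boundary lines, this check will not catch it. The function name and both sample messages are made up.

```python
import email


def is_falsely_multipart(raw_message: str) -> bool:
    """True if the header claims multipart/* but the body has no MIME parts."""
    msg = email.message_from_string(raw_message)
    # With no boundary line in the body, the parser stores the body as a
    # plain string, so is_multipart() is False despite the header.
    return msg.get_content_maintype() == "multipart" and not msg.is_multipart()


# Hypothetical stripped message: multipart header, plain body.
STRIPPED = """\
Subject: stripped
Content-Type: multipart/mixed; boundary="XYZ"

Only text is left here.
"""

# Hypothetical intact message: the declared boundary really is used.
INTACT = """\
Subject: intact
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain

Text part.
--XYZ--
"""

stripped_flag = is_falsely_multipart(STRIPPED)
intact_flag = is_falsely_multipart(INTACT)
```

Only messages flagged this way would then be candidates for a text/plain Content-Type; intact multipart messages would be left untouched.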

I'm not familiar with formail.

Cheers,
-nick

Andrew_OC 03-27-2007 11:25 AM

Quote:

Originally Posted by Nick_Battle
But presumably a naive substitution of all the Multipart/Mixed headers for text/plain would also zap perfectly good multipart/mixed messages that you have in there too?

That's the problem. If I have messages that have HTML parts or valid attachments that haven't been decoded, then it screws up all messages after the first one with a valid attachment, if I remember my tests.

Nick_Battle 03-27-2007 12:38 PM

Quote:

Originally Posted by Andrew_OC
That's the problem. If I have messages that have HTML parts or valid attachments that haven't been decoded, then it screws up all messages after the first one with a valid attachment, if I remember my tests.

OK. I need to see what one of these more complicated messages looks like. Can you post another - or mail me one directly (nick.battle@gmail.com). These were presumably a mixture of decoded/removed attachments and other MIME parts which weren't mangled?

Cheers,
-nick

Nick_Battle 03-28-2007 03:18 AM

Andrew, I looked at the mbox you sent. I've modified it and mailed it back to you.

For some reason, the file had short groups of headers, each with a valid "From" prefix. So firstly, these were being interpreted as separate messages with no content (and few headers!). Then, some of the messages that had had attachments removed still had Content-Type headers for a multi-part message, even though they were actually text/plain. Fixing that for the few cases that remained seemed to produce a working mailbox.

The mbox you sent only had 15 messages (not 17 as I said in my mail!), so manual repair was easy. If you need to perform this on a larger number of messages, it will be trickier, but awk should be able to cope.
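
For a bigger mailbox, the first of those repairs (spotting the stray header groups) could perhaps be automated along these lines. This is a minimal sketch in Python rather than awk, using the standard mailbox module; the function name header_only_keys and the tiny demo mailbox are made up for illustration, and "no body text" is only an assumed test for what a stray header group looks like.

```python
import mailbox
import os
import tempfile


def header_only_keys(mbox_path: str) -> list:
    """Return the keys of mbox entries that have headers but no body text."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        empty = []
        for key, msg in box.iteritems():
            if msg.is_multipart():
                continue  # has real MIME parts, so not a stray header group
            body = msg.get_payload() or ""
            if not body.strip():
                empty.append(key)
        return empty
    finally:
        box.close()


# Made-up mailbox: one stray header group, then one real message.
DEMO = (
    "From andrew@example.com Tue Mar 27 06:57:00 2007\n"
    "Subject: stray headers\n"
    "\n"
    "From andrew@example.com Tue Mar 27 07:00:00 2007\n"
    "Subject: real message\n"
    "\n"
    "Body text here.\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".mbox", delete=False) as tmp:
    tmp.write(DEMO)

stray = header_only_keys(tmp.name)
os.remove(tmp.name)
```

The keys returned are positions in the mailbox, so they only identify which entries to inspect or merge by hand; nothing is modified.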

HTH,
-nick


All times are GMT -5.