LinuxQuestions.org

andredude 02-25-2004 03:12 AM

convert html emails to plain text emails
 
How can I convert html emails (fetched from an exchange server using fetchmail) to plain text emails before they are processed by procmail? I need to do this to properly implement our bugzilla bugmail system where users send their problems to a normal email address.

Thanks!

Andre

bruce1271 02-25-2004 11:00 PM

write a perl script.

andredude 03-01-2004 09:29 AM

Right... erm, well, I was actually wondering whether this kind of functionality is already provided by something standard like fetchmail or procmail, since I'm sure I'm not the first person to want all emails converted to plain text.

meldar 03-01-2004 09:42 AM

I can't help you find a standard feature, but maybe http://userpage.fu-berlin.de/~mbayer...html2text.html could be handy? At least it is easier than writing a !"#¤%perl script :)

andredude 03-03-2004 03:03 AM

Thanks! This looks like it could help, I'll try it out.

andredude 04-02-2004 09:48 AM

OK... for anyone still interested in getting this right, this is how I finally got it to work. First, download html2text. Then create a script containing this:

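#!/bin/sh
# Join each line ending in " =" with the line that follows it, dropping the "=".
awk '{ if (substr($0, length($0)-1, 2) == " =") printf "%s", substr($0, 1, length($0)-1); else print }' "$1" > temp_clean_file.txt
# Line numbers of the first opening and the first closing HTML tag.
x=$(grep -Eni '^<!DOCTYPE|^<HTML' temp_clean_file.txt | head -n 1 | cut -d: -f1)
y=$(grep -ni '^</html>' temp_clean_file.txt | head -n 1 | cut -d: -f1)
# No HTML part found: pass the message through unchanged.
if [ -z "$x" ] || [ -z "$y" ]; then cp "$1" "$2"; exit 0; fi
# Headers are copied as-is, the HTML part goes through html2text, the rest is copied as-is.
head -n $((x-1)) temp_clean_file.txt > "$2"
tail -n +"$x" temp_clean_file.txt | head -n $((y-x+1)) | html2text -nobs >> "$2"
tail -n +$((y+1)) temp_clean_file.txt >> "$2"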


This writes a plain text version of your message (first parameter) to the output file (second parameter). The steps are basically: take every line ending with " =" and append the following line to it. Then get the line number of the first <html> or <!doctype> tag and the line number of the first </html> tag; the lines between them should be the HTML part of the message. You need these line numbers because the first part is header info, which you should not mess with, and the last part could be attachments or other messages (which I don't bother to convert here), which should also be left alone. Then run the HTML lines through html2text and put the result in place of the original lines. temp_clean_file.txt is only an intermediate file holding the message with the split lines rejoined; the converted message ends up in the file named by the second parameter.
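
The first two steps described above can be tried in isolation before wiring anything into procmail. This is a minimal sketch, not the script from the post: the function names join_soft_breaks and html_bounds and the sample message are made up for illustration, and it only shows the line joining and the line-number lookup, not the html2text call.

```shell
#!/bin/sh
# Join each line ending in " =" with the following line (the "=" is dropped).
join_soft_breaks() {
    awk '{
        if (substr($0, length($0) - 1, 2) == " =")
            printf "%s", substr($0, 1, length($0) - 1)
        else
            print
    }' "$1"
}

# Print "START END": line numbers of the first opening and closing HTML tags.
html_bounds() {
    start=$(grep -Eni '^<!DOCTYPE|^<HTML' "$1" | head -n 1 | cut -d: -f1)
    end=$(grep -ni '^</html>' "$1" | head -n 1 | cut -d: -f1)
    echo "$start $end"
}

# Hypothetical sample message, for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Subject: test

<html>
<p>one long =
line</p>
</html>
trailing part
EOF

joined=$(mktemp)
join_soft_breaks "$sample" > "$joined"
html_bounds "$joined"
rm -f "$sample" "$joined"
```

For the sample message this should print 3 5: lines 1-2 are left alone as headers, lines 3-5 would go through html2text, and everything from line 6 on is copied unchanged.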

Then put this into your .procmailrc to do whatever you want with your plain text email:
:0
RESULT=| cat > $MY_HOME/mfile && $MY_HOME/clean-html.sh $MY_HOME/mfile $MY_HOME/outfile && cat $MY_HOME/outfile | (cd $BUGZILLA_HOME && ./bugzilla_email_append.pl)

I piped it into the Bugzilla email gateway here, but you can change that to whatever you need. So I cat the message into a file called mfile, run the script above (which I've put into $MY_HOME/clean-html.sh) on this file with $MY_HOME/outfile as the output file, and then cat $MY_HOME/outfile into the Bugzilla email gateway.

I know the script and the procmail recipe are dirty, but this way I have lots of scattered files lying around that I can look at to see what happened.
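
One thing to watch with fixed file names like mfile and outfile is that two messages delivered at the same moment could overwrite each other's files. A possible variation, sketched here under assumptions, keeps the "files lying around" idea but gives every message its own directory. The function name convert_and_forward and its argument order are hypothetical; the converter stands in for clean-html.sh and the trailing command stands in for the Bugzilla gateway.

```shell
#!/bin/sh
# Usage (hypothetical): convert_and_forward KEEP_DIR CONVERTER COMMAND [ARGS...]
# Reads one message on stdin, saves it, converts it, and feeds the result
# to COMMAND. The per-message directory is kept for later inspection.
convert_and_forward() {
    keep_dir=$1
    converter=$2
    shift 2
    # One fresh directory per message, e.g. KEEP_DIR/msg.a1B2c3
    workdir=$(mktemp -d "$keep_dir/msg.XXXXXX") || return 1
    # Save the raw message exactly as it arrived.
    cat > "$workdir/mfile" || return 1
    # CONVERTER is called as: CONVERTER input-file output-file
    "$converter" "$workdir/mfile" "$workdir/outfile" || return 1
    # Hand the converted message to the downstream command on stdin.
    "$@" < "$workdir/outfile"
}
```

Called from a wrapper script, this would take the place of the cat/convert/cat chain in the recipe, with $MY_HOME/clean-html.sh as the converter.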

osueerower 03-20-2005 12:33 PM

Thanks for the insights provided here. My approach to converting HTML-only email to text also uses html2text. It saves a copy of the original email in an htmlOnly maildir folder (since the conversion is lossy) and uses only procmailrc filtering to convert the body and change the Content-Type header:

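## Change html email to text
:0
* ^Content-Type: text/html;
{
    # Keep a copy of the original, since the conversion is lossy.
    :0c
    $MAILDIR/htmlOnly/

    # Filter the body through html2text.
    :0fwb
    | `which html2text`

    # Add the new Content-Type header for the converted body.
    :0fwh
    | `which formail` -i "Content-Type: text/plain; charset=us-ascii"

    LOG="HTML message found and converted...
"
}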

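To check by hand which saved messages a condition like the one above would pick up, a small shell test can help. This is only a sketch: the function name is_html_only is made up, and it approximates the procmail condition by looking at the header block (everything before the first blank line) for a Content-Type of text/html, ignoring case. It does not handle a Content-Type header folded across several lines.

```shell
#!/bin/sh
# Succeeds if the message's own headers declare Content-Type: text/html;
# Body parts of multipart messages are deliberately not inspected.
is_html_only() {
    # sed prints up to and including the first blank line, then quits,
    # so grep only ever sees the header block.
    sed '/^$/q' "$1" | grep -qi '^Content-Type: text/html;'
}

# Example use (message.eml is a placeholder file name):
#   is_html_only message.eml && echo "would be converted"
```

A multipart/alternative message with an HTML part does not match, which mirrors the recipe: it only fires for HTML-only mail.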

All times are GMT -5.