Quote:
Originally Posted by hsocasnavarro
I wrote the text filter in C using the following basic assumptions:
1)Any string that has a =? contains encoded content
2)Anything between =? and ?= is an encoded string. It is parsed to find charset and encoding. If these are known, then the perl script is executed along with iconv to do the conversion to ascii//translit
I don’t think those assumptions are 100% correct. If you want the technical answer, you can look at section 2 of the relevant RFC (RFC 2047).
Basically the rules say this: any string that begins with “=?”, ends with “?=”, contains exactly two other instances of “?” in between, and is no more than 75 characters long in total (including the delimiters) is a candidate for being an encoded word. The problem is that there are more rules (such as the prohibition on spaces), some of which are ignored by message generators. In your parsing, you have to strike a balance between accepting only strict encoded words and also accepting the malformed ones generated by buggy software.
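To make the candidate rules above concrete, here is a minimal sketch of a strict check. The helper name is_candidate is my own invention for illustration (it is not part of any module), and it covers only the rules listed above: the delimiters, the two inner question marks, the 75-character limit and the ban on whitespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $word is a candidate encoded word
# under the rules described above, 0 otherwise.
sub is_candidate {
    my ($word) = @_;
    return 0 if length($word) > 75;    # 75 characters total, delimiters included
    return 0 if $word =~ /\s/;         # strict encoded words contain no whitespace
    # "=?", then three non-empty fields separated by "?", then "?="
    return $word =~ /\A=\?[^?]+\?[^?]+\?[^?]+\?=\z/ ? 1 : 0;
}

for my $word ('=?iso-8859-1?Q?David=20P=E9rez?=', '=?iso-8859-1?Q?David P=E9rez?=') {
    printf "%-36s %s\n", $word, is_candidate($word) ? 'candidate' : 'rejected';
}
```

A lenient parser would relax the whitespace test (and perhaps the length test) so that output from buggy generators is still decoded.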
As an example, your original header is not a strict encoded word (since it contains a space). E.g., this:
Code:
From: =?iso-8859-1?Q?David P=E9rez?= <address@nowhere.com>
should ideally be this:
Code:
From: =?iso-8859-1?Q?David=20P=E9rez?= <address@nowhere.com>
this:
Code:
From: =?iso-8859-1?Q?David_P=E9rez?= <address@nowhere.com>
or even this:
Code:
From: David =?iso-8859-1?Q?P=E9rez?= <address@nowhere.com>
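The “=20” and “_” forms above are interchangeable because, in the Q encoding, an underscore stands for a space. A small sketch to show it, assuming the core MIME::QuotedPrint module is available (decode_q is a name I made up for this example):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical helper: decode the encoded-text part of a Q-encoded word.
# The result is a string of raw bytes in the word's charset.
sub decode_q {
    my ($text) = @_;
    $text =~ s/_/=20/g;    # "_" is shorthand for an encoded space
    return decode_qp($text);
}

my $with_hex        = decode_q('David=20P=E9rez');
my $with_underscore = decode_q('David_P=E9rez');
print $with_hex eq $with_underscore ? "same bytes\n" : "different bytes\n";
```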
Anyway, some other characters are forbidden as well (for example, the first field, which names the charset, cannot contain certain “especials”).
The whole thing screams regular expressions, so I think it would be somewhat easier in Perl than in C. For example, here is a text filter in Perl:
Code:
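#!/usr/bin/perl
use strict;
use warnings;
use MIME::Base64;
use MIME::QuotedPrint;
use Text::Iconv;

sub decode;

# Replace every encoded word on each input line with its decoded form.
while (<>) {
    s/=\?([^?]+)\?([qb])\?([^?]+)\?=/${\decode($1, $2, $3)}/ig;
    # s/=\?([^? \(\)<>\@,;:\.\/\[\][:cntrl:]=]+)\?([qb])\?([^? [:^graph:]]+)\?=/${\decode($1, $2, $3)}/ig;
    print;
}

# Decode one encoded word and transliterate the result to ASCII.
sub decode {
    my ($charset, $encoding, $message) = @_;
    my ($decoded, $normalized);
    my $original = "=?$charset?$encoding?$message?=";
    # new() fails for a charset iconv does not know; leave such words as they are.
    my $converter = eval { Text::Iconv->new($charset, "ASCII//TRANSLIT") }
        or return $original;
    if ($encoding =~ /q/i) {
        $message =~ s/_/=20/g;    # in Q encoding, "_" stands for a space
        $decoded = decode_qp($message);
    } else {
        $decoded = decode_base64($message);
    }
    $normalized = $converter->convert($decoded);
    # convert() returns undef when the conversion fails.
    return defined $normalized ? $normalized : $original;
}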
The commented-out line contains the stricter version of the regular expression (although it doesn’t do length checking). Let me explain the first, more lenient version: we are looking for a string that consists of “=?”, followed by one or more non-question-mark characters, followed by a question mark, followed by the letter ‘q’ or ‘b’ (case-insensitive), followed by a question mark, followed by a sequence of one or more non-question-mark characters, followed by “?=”. The first sequence of non-question-mark characters is put into a grouping, the letter ‘q’ or ‘b’ is put into a grouping, and the final sequence of non-question-mark characters is put into a grouping. Backreferences to those groupings are used as the arguments to a subroutine which decodes and normalizes each encoded word.
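If it helps, here is the same lenient pattern spelled out with the /x modifier so that each piece can carry a comment, applied to a sample Subject line. This is only a sketch to show what lands in each grouping; find_encoded_words is a hypothetical name and nothing gets decoded here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The lenient pattern, one piece per line.
my $encoded_word = qr{
    =\?            # opening delimiter
    ([^?]+) \?     # grouping 1: charset
    ([qb])  \?     # grouping 2: encoding letter
    ([^?]+)        # grouping 3: encoded text
    \?=            # closing delimiter
}ix;

# Hypothetical helper: return [charset, encoding, text] for every match.
sub find_encoded_words {
    my ($line) = @_;
    my @found;
    while ($line =~ /$encoded_word/g) {
        push @found, [$1, $2, $3];
    }
    return @found;
}

my $subject = 'Subject: =?iso-8859-1?q?David=20P=E9rez?= says =?utf-8?B?T3NvciBpcyBuYcOvdmU=?=';
for my $word (find_encoded_words($subject)) {
    print "charset=$word->[0] encoding=$word->[1] text=$word->[2]\n";
}
```

For that Subject line it prints:

```
charset=iso-8859-1 encoding=q text=David=20P=E9rez
charset=utf-8 encoding=B text=T3NvciBpcyBuYcOvdmU=
```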
As for the normalization, I still use iconv (though it’s through the perl module Text::Iconv). Although iconv is very convenient, it is also very system-dependent (i.e., non-portable), especially when dealing with //TRANSLIT. If you want a more robust (and more portable) normalization, I suggest using something other than //TRANSLIT (for example, you might do iconv to get the string into utf-8 and then normalize it into NFKD, then strip away any combining and other non-ASCII characters).
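Here is a rough sketch of that alternative normalization. It is not what the filter above does: I am assuming the core Encode and Unicode::Normalize modules in place of iconv, and to_ascii is a name I made up. Note that Encode::decode dies on a charset it does not recognize, so a real filter would want to trap that.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(NFKD);

# Hypothetical replacement for the iconv step: decode the raw bytes from
# the given charset, decompose to NFKD, then strip combining marks and
# anything else that is not ASCII.
sub to_ascii {
    my ($charset, $bytes) = @_;
    my $text = Encode::decode($charset, $bytes);    # bytes -> characters
    $text = NFKD($text);              # U+00E9 becomes "e" followed by U+0301
    $text =~ s/\p{M}//g;              # strip combining marks
    $text =~ s/[^\x00-\x7F]//g;       # strip any other non-ASCII characters
    return $text;
}

print to_ascii('iso-8859-1', "David P\xE9rez"), "\n";        # David Perez
print to_ascii('utf-8', "Osor is na\xC3\xAFve"), "\n";       # Osor is naive
```

Unlike //TRANSLIT, this drops characters that have no decomposition (ß or ø, for example) instead of transliterating them, so it trades completeness for portability.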
Anyway, the text filter works fine on a system with GNU iconv (the text after the “$” and “>” prompts should be typed at the command line):
Code:
$ ./filter.pl << EOF
>To: Osor <sserdda@erehwon.com>
>From: =?iso-8859-1?Q?David=20P=E9rez?= <address@nowhere.com>
>Subject: =?iso-8859-1?q?David=20P=E9rez?= says =?utf-8?B?T3NvciBpcyBuYcOvdmU=?=
>EOF
To: Osor <sserdda@erehwon.com>
From: David Perez <address@nowhere.com>
Subject: David Perez says Osor is naive