LinuxQuestions.org
Old 12-14-2007, 12:42 PM   #1
hsocasnavarro
LQ Newbie
 
Registered: Dec 2007
Posts: 6

Rep: Reputation: 0
Need help: Converting ISO-8859-1 to plain ASCII


Hi folks,

I wrote a little script to parse the header of incoming messages and read out the from and subject fields using the festival speech synthesizer. It works pretty well, except when I receive messages with non-ASCII encoding (typically ISO-8859-1). For example, if I get a message from David Pérez, the From: field of my message reads:

From: =?iso-8859-1?Q?David P=E9rez ?= <address@nowhere.com>

I can use sed to get rid of the question marks but how can I convert the acute e (é) to a plain e? I tried using iconv but it doesn't make any difference.

Thanks!

Last edited by hsocasnavarro; 12-14-2007 at 02:52 PM. Reason: More descriptive title
 
Old 12-14-2007, 10:47 PM   #2
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 64
Quote:
Originally Posted by hsocasnavarro
I tried using iconv but it doesn't make any difference.
Well, you must not have tried hard enough… it definitely can be done. All the examples that I use assume my terminal and shell are inputting ISO-8859-1 characters directly (I had to switch temporarily from my usual UTF-8, though the idea is the same either way).

Here’s the naïve approach:
Code:
$ echo "naïve" | iconv -f ISO-8859-1 -t ASCII
naiconv: illegal input sequence at position 2
The problem is that when iconv() reaches the “ï” character, it stops because there is no “ï” in ASCII (if you had instead said “iconv -f ISO-8859-1 -t UTF-8” the program would have succeeded since both ISO-8859-1 and UTF-8 have representations of the character “ï”).

You have to tell iconv that you will accept an approximate encoding (in this case “ï” can be approximated by “i”). You do this by appending //TRANSLIT to the target encoding. For example, try this:
Code:
$ echo "naïve" | iconv -f ISO-8859-1 -t ASCII//TRANSLIT
naive
Btw, the problem you are trying to solve is known by the general term text normalization. If you want something a little more powerful than iconv, I think both perl and java have some featureful text normalizing abilities.
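
For anyone who wants to see the same strict-versus-lossy distinction outside of iconv, here is a minimal Python sketch (Python is only my choice for illustration; nobody in this thread uses it). It relies solely on the built-in codecs. Note that the `replace` and `ignore` error handlers are cruder than //TRANSLIT: they substitute or drop the character rather than approximate it.

```python
# The ISO-8859-1 bytes for "naïve" (0xEF is "ï" in that charset).
raw = b"na\xefve"
text = raw.decode("iso-8859-1")

# A strict conversion fails, much like plain "iconv -t ASCII" does.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The lossy handlers succeed, but they do not turn "ï" into "i".
replaced = text.encode("ascii", errors="replace").decode("ascii")
ignored = text.encode("ascii", errors="ignore").decode("ascii")

print(strict_ok, replaced, ignored)
```

Getting an actual "i" out of "ï" needs a transliteration step, which is what //TRANSLIT provides.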
 
Old 12-14-2007, 11:14 PM   #3
hsocasnavarro
LQ Newbie
 
Registered: Dec 2007
Posts: 6

Original Poster
Rep: Reputation: 0
Right, but that is not the problem. The problem is that the line that I get from the mail server (using fetchmail and grep) already has the question marks instead of the non-ascii characters. That's why I gave the example above:

From: =?iso-8859-1?Q?David P=E9rez ?= <address@nowhere.com>

So, if I do the following:

Code:
echo "From: =?iso-8859-1?Q?David P=E9rez ?= <address@nowhere.com>" | iconv -f ISO-8859-1 -t ASCII//TRANSLIT
I get the same output

From: =?iso-8859-1?Q?David P=E9rez ?= <address@nowhere.com>
 
Old 12-15-2007, 12:32 PM   #4
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 64
Quote:
Originally Posted by hsocasnavarro
Right, but that is not the problem. The problem is that the line that I get from the mail server (using fetchmail and grep) already has the question marks instead of the non-ascii characters.
I see… so you want to decode a so-called “encoded-word” (cf. RFC 2047). Encoded words exist because mail headers must always be 7-bit (i.e., ASCII), even when they need to carry 8-bit characters. The format of an encoded word is the following:
Code:
=?CHARSET?ENCODING?MESSAGE?=
ENCODING can be either Q or B (case-insensitive), representing quoted-printable or base64 encodings respectively. Now the problem comes: how to decode the encoding? The encodings themselves are rather straightforward, but I am not aware of any standard tools that decode these for you. For base64, I guess the most commonly-available tool would be openssl:
Code:
$ echo "VGhpcyBzdHJpbmcgaXMgaW4gYmFzZTY0Cg==" | openssl enc -d -a
This string is in base64
I am unaware of something similar for quoted-printable, so I guess we’ll have to fashion our own. Instead of doing the gruntwork ourselves, we’ll use the nice perl module MIME::QuotedPrint. Here is a sample program (which I will place in qp_decode.pl):
Code:
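#!/usr/bin/perl

use strict;
use warnings;
use MIME::QuotedPrint;

# Read quoted-printable text from stdin (or from files named on the command line).
while (<>) {
	s/_/=20/g;	# in a Q-encoded word, "_" always stands for byte 0x20
	print decode_qp($_);
}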
So now, we can even do this:
Code:
$ echo "David=20P=E9rez" | ./qp_decode.pl
David Pérez
Or this:
Code:
$ echo "David=20P=E9rez" | ./qp_decode.pl | iconv -f ISO-8859-1 -t ASCII//TRANSLIT
David Perez
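
As a point of comparison, the decoding half of that pipeline can be sketched in Python with the standard `quopri` module. This is only an illustration under my own naming (`decode_q` is a hypothetical helper, not part of any library), and it returns raw bytes, so the charset still has to be applied afterwards.

```python
import quopri

def decode_q(encoded_text: bytes) -> bytes:
    """Decode the "Q" payload of an encoded-word into raw bytes."""
    # In the Q encoding "_" stands for byte 0x20, so rewrite it as "=20"
    # before handing the text to the quoted-printable decoder.
    return quopri.decodestring(encoded_text.replace(b"_", b"=20"))

raw = decode_q(b"David=20P=E9rez")
# The bytes are still ISO-8859-1 at this point; decode them explicitly.
print(raw.decode("iso-8859-1"))
```

The ASCII approximation is deliberately left out here; it is a separate step, just as iconv is a separate stage in the shell pipeline.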
As for the text filtering, I think it would not be so easy to do in plain sed (watch someone prove me wrong). What I am talking about is taking a message that looks like this:
Code:
To: Osor <sserdda@erehwon.com>
From: =?iso-8859-1?Q?David=20P=E9rez?= <address@nowhere.com>
Subject: =?utf-8?B?T3NvciBpcyBuYcOvdmU=?=
and putting it through a text filter which makes the resulting output look like this:
Code:
To: Osor <sserdda@erehwon.com>
From: David Perez <address@nowhere.com>
Subject: Osor is naive
The text filter would do both decoding and normalization (as you can see). I think this would be easiest to do in perl, but I leave the implementation as an exercise for you.
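
If Python happens to be available, its standard library already knows about encoded words, so the decoding half of such a filter can be sketched in a few lines. This is an illustration only: `decode_header_value` is a name I made up, and the result is Unicode text, not yet approximated in ASCII.

```python
from email.header import decode_header, make_header

def decode_header_value(value: str) -> str:
    """Decode every RFC 2047 encoded-word found in one header value."""
    # decode_header() splits the value into (bytes, charset) chunks;
    # make_header() joins them back into a single Unicode string.
    return str(make_header(decode_header(value)))

print(decode_header_value("=?iso-8859-1?Q?David=20P=E9rez?= <address@nowhere.com>"))
print(decode_header_value("=?utf-8?B?T3NvciBpcyBuYcOvdmU=?="))
```

Normalizing the decoded text down to plain ASCII would still be a separate step after this.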

Last edited by osor; 12-17-2007 at 04:44 PM. Reason: substitute =20 instead of space for underscore
 
Old 12-15-2007, 11:21 PM   #5
hsocasnavarro
LQ Newbie
 
Registered: Dec 2007
Posts: 6

Original Poster
Rep: Reputation: 0
Ahhhhh....
This looks very nice indeed. Let me play around a little and I'll get back to you (either with more questions or my gratitude)

Thanks
 
Old 12-16-2007, 11:33 PM   #6
hsocasnavarro
LQ Newbie
 
Registered: Dec 2007
Posts: 6

Original Poster
Rep: Reputation: 0
Great, it works just fine.
I wrote the text filter in C using the following basic assumptions:
1)Any string that has a =? contains encoded content
2)Anything between =? and ?= is an encoded string. It is parsed to find charset and encoding. If these are known, then the perl script is executed along with iconv to do the conversion to ascii//translit

While this seems to work I'm not sure that these assumptions are correct 100% of the time. I mean, couldn't I have some plain ASCII text that just by chance happens to contain a =?
 
Old 12-17-2007, 01:19 PM   #7
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 64
Quote:
Originally Posted by hsocasnavarro
I wrote the text filter in C using the following basic assumptions:
1)Any string that has a =? contains encoded content
2)Anything between =? and ?= is an encoded string. It is parsed to find charset and encoding. If these are known, then the perl script is executed along with iconv to do the conversion to ascii//translit
I don’t think those assumptions are 100% correct. If you want the technical answer, you can look at section 2 of the relevant RFC (RFC 2047).

Basically the rules say this: any string that begins with “=?” and ends with “?=”, contains exactly two other instances of “?” in between, and is no more than 75 characters long in total (counting the charset, the encoding, the encoded text, and all the delimiters) is a candidate for being an encoded word. The problem is that there are more rules (such as the prohibition of spaces), some of which are ignored by message generators. In your parsing, you have to strike a balance between accepting only strict encoded words and also accepting the malformed ones that buggy software generates.

As an example, your original header is not a strict encoded word (since it contains a space). E.g., this:
Code:
From: =?iso-8859-1?Q?David P=E9rez?= <address@nowhere.com>
should ideally be this:
Code:
From: =?iso-8859-1?Q?David=20P=E9rez?= <address@nowhere.com>
this:
Code:
From: =?iso-8859-1?Q?David_P=E9rez?= <address@nowhere.com>
or even this:
Code:
From: David =?iso-8859-1?Q?P=E9rez?= <address@nowhere.com>
Anyway, there are some other characters that are forbidden as well (for example, in the first field, you cannot have certain “especials”).
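
To make those rules concrete, here is a rough Python sketch of a strict check. Treat it as my reading of section 2 of the RFC, not a reference implementation: the name `is_strict_encoded_word` is hypothetical, and the pattern additionally insists on Q or B as the encoding, since those are the only two discussed here.

```python
import re

# charset: a token with no spaces, control characters, or "especials".
# encoded text: printable ASCII other than "?" and space.
STRICT = re.compile(
    r'=\?'
    r'(?P<charset>[^\s\x00-\x1f\x7f()<>@,;:"/\[\]?.=]+)\?'
    r'(?P<encoding>[QqBb])\?'
    r'(?P<text>[\x21-\x3e\x40-\x7e]+)'
    r'\?='
)

def is_strict_encoded_word(word: str) -> bool:
    """True if `word` is at most 75 characters and matches the strict syntax."""
    return len(word) <= 75 and STRICT.fullmatch(word) is not None

for candidate in (
    "=?iso-8859-1?Q?David P=E9rez?=",    # contains a space
    "=?iso-8859-1?Q?David=20P=E9rez?=",
    "=?iso-8859-1?Q?David_P=E9rez?=",
):
    print(candidate, is_strict_encoded_word(candidate))
```

A lenient parser would relax the character classes (for instance, allow the space) while keeping the overall =?…?…?…?= shape.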

The whole thing screams regular expressions, so I think it would be somewhat easier in perl rather than C. For example, here is a text filter in perl:
Code:
#!/usr/bin/perl

use MIME::Base64;
use MIME::QuotedPrint;
use Text::Iconv;

sub decode;

while (<>) {
	# Lenient match: =?charset?q-or-b?text?= with no "?" inside any field.
	s/=\?([^?]+)\?([qb])\?([^?]+)\?=/${\decode($1, $2, $3)}/ig;
#	s/=\?([^? \(\)<>\@,;:\.\/\[\][:cntrl:]=]+)\?([qb])\?([^? [:^graph:]]+)\?=/${\decode($1, $2, $3)}/ig;
	print;
}

# Decode one encoded word and approximate the result in ASCII.
sub decode {
	my ($charset, $encoding, $message) = @_;
	my ($decoded, $normalized);

	my $converter = Text::Iconv->new($charset, "ASCII//TRANSLIT");

	if ($encoding =~ /q/i) {
		# "_" always means byte 0x20 in the Q encoding.
		$message =~ s/_/=20/g;
		$decoded = decode_qp($message);
	} else {
		$decoded = decode_base64($message);
	}

	$normalized = $converter->convert($decoded);

	return $normalized;
}
The commented-out substitution contains the stricter version of the regular expression (although it doesn’t do length checking). Let me explain the first, more lenient version: We are looking for a regular expression that consists of “=?”, followed by one or more non-question-mark characters, followed by a question mark, followed by the letter ‘q’ or ‘b’ (case-insensitive), followed by a question mark, followed by a sequence of one or more non-question-mark characters, followed by “?=”. The first sequence of non-question-mark characters is put into a grouping, the letter ‘q’ or ‘b’ is put into a grouping, and the final sequence of non-question-mark characters is put into a grouping. Backreferences to those groupings are used as the arguments to a subroutine which decodes and normalizes each encoded word.
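
To see the three groupings in isolation, the lenient pattern can be ported to Python's `re` module more or less character for character. This is just a sketch to inspect what gets captured; `LENIENT` is my own name for it.

```python
import re

# Same shape as the lenient Perl pattern: charset, encoding (q or b), text.
LENIENT = re.compile(r"=\?([^?]+)\?([qb])\?([^?]+)\?=", re.IGNORECASE)

subject = "=?iso-8859-1?q?David=20P=E9rez?= says =?utf-8?B?T3NvciBpcyBuYcOvdmU=?="

# findall() returns one (charset, encoding, text) tuple per encoded word.
groups = LENIENT.findall(subject)
for charset, encoding, text in groups:
    print(charset, encoding, text)
```

Each tuple corresponds to the three arguments that the Perl filter passes to its decode subroutine.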

As for the normalization, I still use iconv (though it’s through the perl module Text::Iconv). Although iconv is very convenient, it is also very system-dependent (i.e., non-portable), especially when dealing with //TRANSLIT. If you want a more robust (and more portable) normalization, I suggest using something other than //TRANSLIT (for example, you might do iconv to get the string into utf-8 and then normalize it into NFKD, then strip away any combining and other non-ASCII characters).
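
A minimal sketch of that NFKD route, using Python's standard `unicodedata` module (the helper name `to_ascii` is made up for this example). One assumption to be aware of: characters that have no decomposition, such as ß or ø, are simply dropped here, whereas //TRANSLIT may substitute something for them.

```python
import unicodedata

def to_ascii(text: str) -> str:
    """Approximate `text` in ASCII via NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, then anything else that is still not ASCII.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", errors="ignore").decode("ascii")

raw = b"David P\xe9rez"  # ISO-8859-1 bytes, as produced by the decoding step
print(to_ascii(raw.decode("iso-8859-1")))
```

Because this works on Unicode strings rather than on the system's iconv tables, it should behave the same on any platform with the same Unicode data.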

Anyway, the text filter works fine on a system with GNU iconv (the lines beginning with $ and > are what gets typed at the command line):
Code:
$ ./filter.pl << EOF
>To: Osor <sserdda@erehwon.com>
>From: =?iso-8859-1?Q?David=20P=E9rez?= <address@nowhere.com>
>Subject: =?iso-8859-1?q?David=20P=E9rez?= says =?utf-8?B?T3NvciBpcyBuYcOvdmU=?=
>EOF
To: Osor <sserdda@erehwon.com>
From: David Perez <address@nowhere.com>
Subject: David Perez says Osor is naive

Last edited by osor; 12-17-2007 at 04:44 PM. Reason: substitute =20 instead of space for underscore
 
Old 12-17-2007, 04:44 PM   #8
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 64
Btw, I seem to have made an error in my last post. Specifically, the part that says:
Code:
s/_/ /g
Should be changed to say:
Code:
s/_/=20/g
The reason is taken from section 4.2, clause 2, which states, “the "_" always represents hexadecimal 20, even if the SPACE character occupies a different code position in the character set in use.” This technicality won’t matter for any of the most commonly used charsets (e.g., utf-8, iso-8859-*, KOI8-{R,U}, etc.), but it does affect other rarer charsets (e.g., EBCDIC, where space is 0x40 instead of 0x20).

I have edited both post 4 and post 7 accordingly.
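
The EBCDIC point is easy to check for yourself. The sketch below uses Python's built-in `cp500` codec as a stand-in for "some EBCDIC charset"; that choice is mine and is only meant to show that byte 0x20 and the space character are not the same thing everywhere.

```python
# In ISO-8859-1 (and ASCII, UTF-8, ...) the space character is byte 0x20.
latin1_space = " ".encode("iso-8859-1")

# In cp500, an EBCDIC code page, the space character is byte 0x40 instead.
ebcdic_space = " ".encode("cp500")

# So byte 0x20, which is what "_" stands for, is NOT a space in cp500.
ebcdic_0x20_is_space = b"\x20".decode("cp500") == " "

print(latin1_space, ebcdic_space, ebcdic_0x20_is_space)
```

This is why replacing "_" with "=20" (a byte value) is safer than replacing it with a literal space character.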
 
Old 12-17-2007, 11:13 PM   #9
hsocasnavarro
LQ Newbie
 
Registered: Dec 2007
Posts: 6

Original Poster
Rep: Reputation: 0
Thanks a lot for all this info, it's been really helpful.

I guess my question still remains. If I get a message that happens to contain a =? and a ?= there's no way to tell that from an encoded string, right?
 
Old 12-17-2007, 11:49 PM   #10
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 64
Quote:
Originally Posted by hsocasnavarro
If I get a message that happens to contain a =? and a ?= there's no way to tell that from an encoded string, right?
Wrong. Only if it matches the other rules is there no way to tell the message from an encoded string. Here’s a case where it contains both end-delimiters but no in-between ones:
Code:
$ echo '=?ThisIsACoincidence?=' | ./filter.pl
=?ThisIsACoincidence?=
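Here’s a separate case that does have the two in-between question marks but is still left alone, because “What” is not a valid encoding (only Q and B are):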
Code:
$ echo '=?What?What?What?=' | ./filter.pl
=?What?What?What?=
Encoded-word syntax was devised in such a way as to make it unlikely to occur accidentally in a mail header. The text filter from post 7 isn’t foolproof though. Ideally, the subroutine decode should detect an error (which might occur in three places along the path of execution) and return the original string (according to the RFC, a malformed encoded word should be left unmangled by the reader software).
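
Here is one way that fall-back behaviour might look, sketched in Python instead of Perl so it can be run without extra modules. The names `decode_word` and `decode_line` are hypothetical, and the sketch only decodes to Unicode; the ASCII normalization step is left out to keep it short.

```python
import base64
import binascii
import quopri
import re

LENIENT = re.compile(r"=\?([^?]+)\?([qb])\?([^?]+)\?=", re.IGNORECASE)

def decode_word(match: re.Match) -> str:
    """Decode one encoded word, or return it unchanged if anything fails."""
    charset, encoding, text = match.groups()
    try:
        if encoding.lower() == "q":
            raw = quopri.decodestring(text.replace("_", "=20").encode("ascii"))
        else:
            raw = base64.b64decode(text, validate=True)
        return raw.decode(charset)
    except (LookupError, UnicodeError, binascii.Error, ValueError):
        # Unknown charset, bad base64, or bytes invalid for the charset:
        # hand back the original text untouched.
        return match.group(0)

def decode_line(line: str) -> str:
    return LENIENT.sub(decode_word, line)

print(decode_line("=?iso-8859-1?Q?David=20P=E9rez?="))
print(decode_line("=?no-such-charset?Q?abc?="))
```

The second call prints its input unchanged, because the charset lookup fails and the original encoded word is returned as is.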

Also remember that encoded words only occur in mail headers. You should not apply the text filter to the body of the message, only to the headers. Message bodies are encoded in other ways (e.g., using Content-Type with a charset parameter and Content-Transfer-Encoding), be they multi-part or otherwise. If what would be an encoded word in a mail header occurs in the body, it should be left alone.
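
As a rough illustration of the headers-only rule, the sketch below splits a message at the first blank line and decodes only what comes before it. It leans on Python's `email.header` module and makes two simplifying assumptions that real mail does not guarantee: lines end in "\n", and no header is folded across several lines. `decode_headers_only` is a made-up name.

```python
from email.header import decode_header, make_header

def decode_headers_only(message: str) -> str:
    """Decode encoded words in the header block and leave the body untouched."""
    # The header block ends at the first empty line.
    head, separator, body = message.partition("\n\n")
    decoded = [str(make_header(decode_header(line))) for line in head.split("\n")]
    return "\n".join(decoded) + separator + body

message = (
    "To: Osor <sserdda@erehwon.com>\n"
    "Subject: =?utf-8?B?T3NvciBpcyBuYcOvdmU=?=\n"
    "\n"
    "Body mentions =?utf-8?Q?abc?= literally"
)
print(decode_headers_only(message))
```

The look-alike in the body comes out exactly as it went in, while the Subject header is decoded.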
 
  

