Encountering Odd UTF-8 Error in Python
I'm writing an email scanner and I'm running into an error that doesn't seem to make any sense. At this stage, I'm mainly interested in knowing that I'm reading all of the email files, so I'm not doing anything other than echoing the files to stdout line by line. The script is failing in the loop below:
Code:
with open( filepath, 'r', encoding='utf-8' ) as fh:
Code:
Traceback (most recent call last):

If I issue "od -c" on the file where the error occurred, I see that character in the file (a9h = 251o = 169). (Oddly, it's not at the position displayed in the error message, but I'm not all that concerned about that at the moment.) I have no idea what that character might be. The headers in the email stated that the encoding is "UTF-8", though at this point I can tell you that you cannot trust what's in the headers. My guess is that it's some odd Windows character that made its way into the HTML in the email. There's only the one "bad" character, though who knows, the next file that fails could contain dozens.

Running "file" on the email file results in:
Code:
SMTP mail, ISO-8859 text, with very long lines

Questions:
* Aren't emails supposed to consist of only "printable" characters? (Silly me... I thought that was the whole idea behind using Base-64 encoding.)
* Is there an "open()" option that'll allow reading this character? Or, something that would likely work out just fine for my purposes, simply ignoring it?
* Given that the output from "file" might be different for each email file, am I stuck putting a complex set of tests in place to choose a proper encoding for each file? (Yes, everything in UNIX is a file but what kind of file? :^/ )

So far, all I've found on the 'Net hasn't worked. Any ideas? They'd be appreciated.

UPDATE: After brewing another pot-o-coffee and digging out a couple of references I had packed away, I found a solution. It's a little distasteful, but I've gotten past the fatal errors by using either "errors='ignore'" or "errors='replace'" in the open(). Other files that I ran through this scanner showed bogus characters in other places within the emails, not just in the body as was the case that led me to start this thread. "Subject:" lines were a popular place to slip in odd characters. I guess, since I'm scanning emails that others and I have marked as spam, I shouldn't have been surprised to see something like this. Maybe they're compliant with newer RFCs regarding email than the ones I've read. Who knows. (It has been a decade or so since I last wrote code to fiddle around with email files.)

I'll leave this open for a bit in case someone has a neater solution to add.

-- Rick
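For anyone landing here with the same error, here is a minimal sketch of what that byte is and what the two "errors=" settings do with it. The sample line and the temporary file name are made up for illustration; only the byte value 0xA9 comes from the post above, and the guess that the file is really Latin-1/Windows-1252 is an assumption based on the "file" output.

```python
import os
import tempfile

# Hypothetical sample line: Latin-1/Windows-1252 text containing byte 0xA9.
raw = b"Subject: Copyright \xa9 2020\n"

# In ISO-8859-1 (and Windows-1252) 0xA9 is the copyright sign, U+00A9.
as_latin1 = raw.decode("latin-1")

# Decoded strictly as UTF-8 the same bytes fail: 0xA9 (binary 10101001)
# matches the 10xxxxxx continuation-byte pattern, so it cannot start a character.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason


def read_text(path, errors):
    """Read a file as UTF-8 using the given error handler."""
    with open(path, "r", encoding="utf-8", errors=errors) as fh:
        return fh.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.eml")  # made-up file name
    with open(path, "wb") as fh:
        fh.write(raw)
    replaced = read_text(path, "replace")  # bad byte becomes U+FFFD
    ignored = read_text(path, "ignore")    # bad byte is silently dropped

print(repr(as_latin1))
print(strict_error)
print(repr(replaced))
print(repr(ignored))
```

Both handlers get past the exception, but "replace" at least leaves a visible marker where data was lost, which may matter when scanning spam for odd characters.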
Quote:
Otherwise, get the "charset=" value from the email's Content-Type: header line (if present, of course) and use that encoding. ISO-8859-1 (or one of its variants) is a valid encoding for e-mail. UTF-8 is valid too when the MIME headers declare it, but because it is a multi-byte encoding it is strict about which byte sequences it accepts, so text that is really ISO-8859-x will usually fail to decode as UTF-8. The "start byte" your error is complaining about is a byte that cannot begin a UTF-8 character: 0xA9 has the bit pattern of a continuation byte, which is only legal after the lead byte of a multi-byte sequence. From some of my saved e-mails:
Quote:
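A rough sketch of the "use the charset from Content-Type" idea with the standard library's email package. The message below is made up for illustration (it is not one of the saved e-mails mentioned above), the helper name body_text is hypothetical, and falling back to Latin-1 is an assumption, chosen because Latin-1 can decode any byte.

```python
from email import message_from_bytes

# Made-up single-part message: headers declare ISO-8859-1, body holds byte 0xA9.
raw = (
    b"From: someone@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Copyright \xa9 2020\r\n"
)


def body_text(msg, fallback="latin-1"):
    """Decode a single-part message body using its declared charset.

    Hypothetical helper: falls back to `fallback` when no charset is
    declared or when Python does not know the declared name.
    """
    charset = msg.get_content_charset() or fallback  # lower-cased name or None
    data = msg.get_payload(decode=True)              # bytes, transfer encoding undone
    try:
        return data.decode(charset, errors="replace")
    except LookupError:
        return data.decode(fallback, errors="replace")


msg = message_from_bytes(raw)  # parse from bytes, not from a UTF-8 text file
declared = msg.get_content_charset()
text = body_text(msg)
print(declared, repr(text))
```

Reading the file in binary and letting the parser see the raw bytes avoids having to pick one encoding for the whole file up front. Since the first post notes that headers cannot always be trusted, errors="replace" is kept as a safety net here.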
Quote:
https://docs.python.org/3/library/functions.html#open
Quote:
* Aren't emails supposed to consist of only "printable" characters? (Silly me... I thought that was the whole idea behind using Base-64 encoding.)
The definition of 'printable' depends on the header fields, e.g.:
Code:
Content-Type: text/plain; charset=ISO-8859-2 | UTF-8 | ...

In ISO-8859-1 every byte value is valid (and nearly every one is in the other ISO-8859-x variants), so you won't get decode errors. Or you can open the file in binary mode ('rb') and work with the raw bytes.

* Given that the output from "file" might be different for each email file, am I stuck putting a complex set of tests in place to choose a proper encoding for each file? (Yes, everything in UNIX is a file but what kind of file? :^/ )
Sure. Don't forget multipart letters: every part has its own settings.
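To make the multipart point concrete, here is a small sketch that walks every part of a message and decodes each text part with its own declared charset. The two-part message is invented for the example, decode_text_parts is a hypothetical helper name, and the Latin-1 fallback is an assumption rather than anything the standards require.

```python
import base64
from email import message_from_bytes

# Invented two-part message: an ISO-8859-2 plain part sent as 8bit and a
# UTF-8 HTML part sent as base64. Byte 0xA9 means something different in each.
html_b64 = base64.b64encode("<p>\u00a9</p>".encode("utf-8"))
raw = b"\r\n".join([
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/alternative; boundary="BOUND"',
    b"",
    b"--BOUND",
    b"Content-Type: text/plain; charset=ISO-8859-2",
    b"Content-Transfer-Encoding: 8bit",
    b"",
    b"\xa9",
    b"--BOUND",
    b"Content-Type: text/html; charset=UTF-8",
    b"Content-Transfer-Encoding: base64",
    b"",
    html_b64,
    b"--BOUND--",
    b"",
])


def decode_text_parts(msg, fallback="latin-1"):
    """Return (content_type, charset, text) for every text/* part."""
    results = []
    for part in msg.walk():                 # visits the container and each sub-part
        if part.get_content_maintype() != "text":
            continue                        # skip multipart containers, attachments
        charset = part.get_content_charset() or fallback
        data = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        try:
            text = data.decode(charset, errors="replace")
        except LookupError:                 # charset name unknown to Python
            text = data.decode(fallback, errors="replace")
        results.append((part.get_content_type(), charset, text))
    return results


parts = decode_text_parts(message_from_bytes(raw))
for content_type, charset, text in parts:
    print(content_type, charset, repr(text))
```

This keeps the per-file logic fairly small: rather than guessing one encoding from the output of "file", each part is decoded according to its own headers, with a forgiving error handler for the parts that lie about their charset.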