LinuxQuestions.org


rnturn 03-16-2019 12:24 AM

Encountering Odd UTF-8 Error in Python
 
I'm writing an email scanner and I'm running into an error that doesn't seem to make any sense. At this stage, I mainly want to confirm that I'm reading all of the email files, so I'm doing nothing more than echoing each file to stdout line by line. The script fails in the loop below:
Code:

with open( filepath, 'r', encoding='utf-8' ) as fh:
    for rec in fh:
        # eventually, do some clever stuff with the text in "rec";
        # for now, just echo each line to stdout
        print( rec, end='' )

The error message is confusing:
Code:

Traceback (most recent call last):
  File "<path-to-script>/scan_emails", line 59, in <module>
    for rec in fh:
  File "/usr/lib64/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 1583: invalid start byte

I've removed the "encoding=" option and get the same failure.

If I issue "od -c" on the file where the error occurred, I see that character in the file (a9h = 251o = 169). (Oddly, it's not at the position displayed in the error message but I'm not all that concerned about that at the moment.) I have no idea what that character might be.

The headers in the email stated that the encoding is "UTF-8", though at this point I can tell you cannot trust what's in the headers. My guess is that some odd Windows character made its way into the HTML in the email. There's only the one "bad" character, though the next file that fails could contain dozens, for all I know.

Running "file" on the email file results in:
Code:

SMTP mail, ISO-8859 text, with very long lines

Other emails show other "types" (mostly "ASCII", some "UTF-8", and a lonely "Macintosh HFS Extended").

Questions:

* Aren't emails supposed to consist of only "printable" characters? (Silly me... I thought that was the whole idea behind using Base-64 encoding.)

* Is there an "open()" option that'll allow reading this character? Or, something that would likely work out just fine for my purposes, simply ignoring it?

* Given that the output from "file" might be different for each email file, am I stuck putting a complex set of tests in place to choose a proper encoding for each file? (Yes, everything in UNIX is a file but what kind of file? :^/ )

So far, all I've found on the 'Net hasn't worked.

Any ideas? They'd be appreciated.

UPDATE:

After brewing another pot-o-coffee and digging out a couple of references I had packed away, I found a solution. It's a little distasteful, but I've gotten past the fatal errors by using either "errors='ignore'" or "errors='replace'" in the open(). Other files that I ran through this scanner showed bogus characters in other places within the emails, not just in the body as in the case that led me to start this thread. "Subject:" lines were a popular place to slip in odd characters. Since I'm scanning emails that I and others have marked as spam, I guess I shouldn't have been surprised to see something like this. Maybe they're compliant with newer RFCs regarding email than I've read. Who knows. (It has been a decade or so since I last wrote code to fiddle around with email files.)
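A minimal sketch of the difference between the two error handlers mentioned above. The sample bytes and the temporary file are made up for illustration; they stand in for one of the real email files.

```python
import os
import tempfile

# Hypothetical sample: otherwise-ASCII text containing the lone byte 0xA9.
sample = b"Subject: hello\nCopyright \xa9 2019\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    filepath = tmp.name

try:
    # errors='replace' swaps each undecodable byte for U+FFFD.
    with open(filepath, 'r', encoding='utf-8', errors='replace') as fh:
        replaced = fh.read()
    # errors='ignore' silently drops the undecodable byte.
    with open(filepath, 'r', encoding='utf-8', errors='ignore') as fh:
        ignored = fh.read()
finally:
    os.remove(filepath)

print(repr(replaced))
print(repr(ignored))
```

With 'replace' the damaged spot stays visible in the output, which can be handy when counting how many files are affected; 'ignore' leaves no trace of it.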

I'll leave this open for a bit in case someone has anything to add that's a neater solution.

--
Rick

ehartman 03-16-2019 04:16 AM

Quote:

Originally Posted by rnturn (Post 5974343)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 1583: invalid start byte

0xA9 is the ISO 8859-1 encoding of the "copyright sign" (c in a circle) character, so maybe you should try one of the ISO 8859 alphabets (there are several) as the encoding?
Otherwise, you should get the "charset=" info from the email's Content-Type: header line (if present, of course) and use that encoding. ISO-8859-1 (or one of its variations) IS a valid encoding for email; so is UTF-8, but only when the body really is UTF-8, and this one apparently is not. In UTF-8, 0xA9 can only appear as a continuation byte inside a multi-byte sequence, never as its first byte, which is why the error complains about an "invalid start byte".

From some of MY saved E-mails:
Quote:

Content-Type: text/plain; charset="iso-8859-1"
BTW: the word iso may be in lower AND upper case.
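As a rough sketch of the "use the charset from the header" suggestion, Python's standard email package can parse the raw bytes and report the declared charset. The message below is a made-up single-part example, and the fallbacks to "utf-8" and "latin-1" are assumptions for illustration, not a recommendation from the thread.

```python
from email import message_from_bytes

# Hypothetical single-part message whose body is ISO-8859-1 encoded.
raw = (
    b"Subject: test\r\n"
    b'Content-Type: text/plain; charset="ISO-8859-1"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Copyright \xa9 2019\r\n"
)

msg = message_from_bytes(raw)

# get_content_charset() returns the charset parameter lowercased, or None
# when the header does not declare one.
charset = msg.get_content_charset() or "utf-8"

# decode=True undoes any base64/quoted-printable encoding and returns bytes.
payload = msg.get_payload(decode=True)

try:
    text = payload.decode(charset, errors="replace")
except LookupError:
    # The header named a charset Python does not know about.
    text = payload.decode("latin-1")

print(charset, repr(text))
```

For a real file, the bytes would come from open(filepath, 'rb').read() rather than a literal.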

ntubski 03-16-2019 07:04 AM

Quote:

Originally Posted by rnturn (Post 5974343)
* Given that the output from "file" might be different for each email file, am I stuck putting a complex set of tests in place to choose a proper encoding for each file? (Yes, everything in UNIX is a file but what kind of file? :^/ )

Depending on how thoroughly you want to scan, it might help to read in binary mode (every kind of file in UNIX has bytes in it). Although if you want to make sense of non-ASCII text, figuring out the proper encoding will be necessary.

https://docs.python.org/3/library/functions.html#open
Quote:

Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding.
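A small sketch of what binary mode makes possible: scanning the raw bytes and listing every non-ASCII byte with its offset from the start of the data. The helper name non_ascii_bytes and the sample bytes are hypothetical; in practice the data would come from open(filepath, 'rb').read().

```python
def non_ascii_bytes(data):
    """Return (offset, byte value) pairs for every byte >= 0x80."""
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]

# Hypothetical sample standing in for the contents of an email file.
data = b"Subject: hi\nCopyright \xa9 2019\n"

hits = non_ascii_bytes(data)
for offset, value in hits:
    print(f"offset {offset}: 0x{value:02x}")
```

No decoding happens here, so no UnicodeDecodeError can be raised; the cost is that the bytes still have to be decoded later if the text itself matters.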

NevemTeve 03-16-2019 07:20 AM

* Aren't emails supposed to consist of only "printable" characters? (Silly me... I thought that was the whole idea behind using Base-64 encoding.)

The definition of 'printable' depends on the header fields, e.g.:
Code:

Content-Type: text/plain; charset=ISO-8859-2 | UTF-8 | ...
Content-Transfer-Encoding: 8bit | quoted-printable | base64

* Is there an "open()" option that'll allow reading this character? Or, something that would likely work out just fine for my purposes, simply ignoring it?

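In ISO-8859-1 every byte value is a valid character, so you won't get decoding errors (though the text will be wrong wherever that isn't the file's real encoding). Alternatively, open the file in binary mode ('rb').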
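A quick check of that claim for ISO-8859-1 specifically, using Python's "latin-1" codec, next to the UTF-8 behaviour for the byte from the traceback. This is only a sketch; other ISO-8859 variants are not tested here.

```python
all_bytes = bytes(range(256))

# ISO-8859-1 (latin-1) maps each byte to the code point of the same value,
# so decoding cannot fail and the result round-trips exactly.
text = all_bytes.decode("latin-1")
round_trips = text.encode("latin-1") == all_bytes

# The byte from the traceback is the copyright sign in ISO-8859-1 ...
copyright_sign = b"\xa9".decode("latin-1")

# ... but it is not valid UTF-8 on its own.
try:
    b"\xa9".decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc.reason)

print(round_trips, repr(copyright_sign), utf8_ok)
```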

* Given that the output from "file" might be different for each email file, am I stuck putting a complex set of tests in place to choose a proper encoding for each file? (Yes, everything in UNIX is a file but what kind of file? :^/ )

Sure. Don't forget multipart messages: every part has its own Content-Type and Content-Transfer-Encoding settings.
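The per-part settings can be handled with the standard email package by walking the message and decoding each text part with its own declared charset. The two-part message, the boundary "XX", the helper name decoded_text_parts, and the "ascii"/"latin-1" fallbacks below are all assumptions made for this sketch.

```python
import base64
from email import message_from_bytes

# Hypothetical two-part message: one ISO-8859-1 quoted-printable part and
# one UTF-8 base64 part.
html_b64 = base64.b64encode("<p>\u00a9 2019</p>".encode("utf-8"))
raw = b"\r\n".join([
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/alternative; boundary="XX"',
    b"",
    b"--XX",
    b"Content-Type: text/plain; charset=ISO-8859-1",
    b"Content-Transfer-Encoding: quoted-printable",
    b"",
    b"Copyright =A9 2019",
    b"--XX",
    b"Content-Type: text/html; charset=UTF-8",
    b"Content-Transfer-Encoding: base64",
    b"",
    html_b64,
    b"--XX--",
    b"",
])

def decoded_text_parts(msg):
    """Yield (content type, text) for each text/* part, decoded per part."""
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips the multipart container and any attachments
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        charset = part.get_content_charset() or "ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name in the header: fall back to latin-1.
            text = payload.decode("latin-1")
        yield part.get_content_type(), text

parts = list(decoded_text_parts(message_from_bytes(raw)))
for ctype, text in parts:
    print(ctype, repr(text))
```

Both parts end up containing the same copyright sign even though it arrived as "=A9" in one and as base64-wrapped UTF-8 in the other.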


All times are GMT -5.