LinuxQuestions.org
Old 03-16-2019, 12:24 AM   #1
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,800

Encountering Odd UTF-8 Error in Python


I'm writing an email scanner and I'm running into an error that doesn't seem to make any sense. At this stage, I'm mainly interested in knowing that I'm reading all of the email files, so I'm not doing anything other than echoing the files to stdout line by line. The script is failing in the loop below:
Code:
with open(filepath, 'r', encoding='utf-8') as fh:
    for rec in fh:
        # eventually, do some clever stuff with the text in "rec"
        pass
The error message is confusing:
Code:
Traceback (most recent call last):
  File "<path-to-script>/scan_emails", line 59, in <module>
    for rec in fh:
  File "/usr/lib64/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 1583: invalid start byte
I've removed the "encoding=" option and get the same failure.

If I issue "od -c" on the file where the error occurred, I see that character in the file (a9h = 251o = 169). (Oddly, it's not at the position displayed in the error message but I'm not all that concerned about that at the moment.) I have no idea what that character might be.
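One possible explanation for the position mismatch, offered as a guess: "od" prints offsets in octal by default, while Python reports the position in decimal. A minimal sketch for listing the non-ASCII bytes with both kinds of offset follows; the helper name find_non_ascii and the sample bytes are made up for illustration and are not from the original script.

```python
def find_non_ascii(data: bytes):
    """Return (decimal offset, byte value) pairs for every byte >= 0x80."""
    return [(offset, byte) for offset, byte in enumerate(data) if byte >= 0x80]


# Hypothetical sample standing in for the contents of one email file.
sample = b"Copyright \xa9 2019 Example Corp\n"

for offset, byte in find_non_ascii(sample):
    # Print the offset in decimal (Python's view) and octal (od's default view).
    print(f"offset {offset} (octal {offset:o}): 0x{byte:02x}")
```

For a real file, the same function could be fed the result of open(filepath, 'rb').read().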

The headers in the email stated that the encoding is "UTF-8", though at this point I can tell you cannot trust what's in the headers. My guess is that some odd Windows character made its way into the HTML in the email. There's only the one "bad" character though, who knows, the next file that fails could contain dozens.

Running "file" on the email file results in:
Code:
SMTP mail, ISO-8859 text, with very long lines
Other emails show other "types". (Mostly "ASCII", some "UTF-8", and a lonely "Macintosh HFS Extended".)

Questions:

* Aren't emails supposed to consist of only "printable" characters? (Silly me... I thought that was the whole idea behind using Base-64 encoding.)

* Is there an "open()" option that'll allow reading this character? Or, something that would likely work out just fine for my purposes, simply ignoring it?

* Given that the output from "file" might be different for each email file, am I stuck putting a complex set of tests in place to choose a proper encoding for each file? (Yes, everything in UNIX is a file but what kind of file? :^/ )

So far, all I've found on the 'Net hasn't worked.

Any ideas? They'd be appreciated.

UPDATE:

After brewing another pot-o-coffee and digging out a couple of references I had packed away, I found a solution. It's a little distasteful, but I've gotten past the fatal errors by using either "errors='ignore'" or "errors='replace'" in the open(). Other files that I ran through this scanner showed bogus characters in other places within the emails, not just in the body as was the case that led me to start this thread. "Subject:" lines were a popular place to slip in odd characters. I guess, since I'm scanning emails that I and others have marked as spam, I shouldn't have been surprised to see something like this. Maybe they're compliant with newer RFCs regarding email than I've read. Who knows. (It has been a decade or so since I last wrote code to fiddle around with email files.)
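A small sketch of what the "errors=" handlers do to a stray 0xA9 byte. The temporary file, the sample bytes and the helper read_lines are assumptions for illustration only; "backslashreplace" is a third handler not mentioned above that keeps the offending byte visible.

```python
import os
import tempfile

# Hypothetical sample: ASCII text with one stray ISO-8859-1 byte (0xA9).
raw = b"Subject: Special offer \xa9 2019\n"

# Write the sample to a temporary file so open() can be used as in the script.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    filepath = tmp.name


def read_lines(path, errors):
    """Read a file as UTF-8 text using the given error handler."""
    with open(path, "r", encoding="utf-8", errors=errors) as fh:
        return [rec for rec in fh]


ignored = read_lines(filepath, "ignore")             # bad byte silently dropped
replaced = read_lines(filepath, "replace")           # bad byte becomes U+FFFD
escaped = read_lines(filepath, "backslashreplace")   # bad byte shown as \xa9
os.unlink(filepath)

print(ignored)
print(replaced)
print(escaped)
```

With "ignore" the information is lost entirely, so "replace" or "backslashreplace" may be the safer choice when the goal is to notice where the odd characters occur.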

I'll leave this open for a bit in case someone has anything to add that's a neater solution.

--
Rick

Last edited by rnturn; 03-16-2019 at 05:43 AM. Reason: Update
 
Old 03-16-2019, 04:16 AM   #2
ehartman
Senior Member
 
Registered: Jul 2007
Location: Delft, The Netherlands
Distribution: Slackware
Posts: 1,674

Quote:
Originally Posted by rnturn
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 1583: invalid start byte
0xA9 is the ISO 8859-1 encoding of the "copyright sign" (c in a circle) character (some other ISO 8859 parts, such as -15, agree), so maybe you should try one of the ISO 8859 alphabets (there are multiple ones) as the encoding?
Or otherwise you should get and use the "charset=" info from the Content-Type: header line of the email (if present, of course) and then use that encoding. ISO-8859-1 (or its variations) IS a valid encoding for e-mail; UTF-8 is valid too when the headers declare it, so the problem here is that the file contains bytes that are not UTF-8 at all. The "start byte" your error is complaining about: 0xA9 has the bit pattern 10xxxxxx, which UTF-8 only allows as a continuation byte inside a multi-byte sequence, never as the first byte of a character.

From some of MY saved E-mails:
Quote:
Content-Type: text/plain; charset="iso-8859-1"
BTW: the word iso may be in lower OR upper case.
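A minimal sketch of the "use the charset from the header" suggestion, based on Python's standard email package. The message bytes below are invented for illustration, and the fallback to "ascii" with errors="replace" is an assumption, not something from the thread.

```python
import email

# Hypothetical single-part message whose body contains the raw byte 0xA9.
raw = (
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="ISO-8859-1"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Copyright \xa9 2019\r\n"
)

msg = email.message_from_bytes(raw)

# get_content_charset() returns the charset parameter in lower case,
# or None when the header does not carry one.
charset = msg.get_content_charset() or "ascii"

# get_payload(decode=True) undoes the Content-Transfer-Encoding and
# returns bytes, which are then decoded with the declared charset.
body = msg.get_payload(decode=True).decode(charset, errors="replace")

print(charset)
print(body)
```

For files on disk, email.message_from_binary_file() on a file opened with 'rb' would be the equivalent entry point. Since headers in spam may name a wrong or unknown charset, real code would probably also want to catch LookupError from decode().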
 
Old 03-16-2019, 07:04 AM   #3
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,780

Quote:
Originally Posted by rnturn
* Given that the output from "file" might be different for each email file, am I stuck putting a complex set of tests in place to choose a proper encoding for each file? (Yes, everything in UNIX is a file but what kind of file? :^/ )
Depending on how thoroughly you want to scan, it might help to read in binary mode (every kind of file in UNIX has bytes in it). Although if you want to make sense of non-ASCII text, figuring out the proper encoding will be necessary.

https://docs.python.org/3/library/functions.html#open
Quote:
Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding.
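A short sketch of the binary-mode approach: no decoding happens, so no UnicodeDecodeError can occur, and lines with bytes outside ASCII can be flagged for closer inspection. The temporary file and sample bytes are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical sample file: one clean line, one line with a 0xA9 byte.
raw = b"Header: value\nBody with \xa9 byte\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    filepath = tmp.name

lines = []
with open(filepath, "rb") as fh:
    for rec in fh:
        # In binary mode each "rec" is a bytes object, still split on b"\n".
        lines.append(rec)
os.unlink(filepath)

# Flag lines containing any byte outside the 7-bit ASCII range.
suspicious = [rec for rec in lines if any(b >= 0x80 for b in rec)]
print(suspicious)
```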
 
Old 03-16-2019, 07:20 AM   #4
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,856
Blog Entries: 1

* Aren't emails supposed to consist of only "printable" characters? (Silly me... I thought that was the whole idea behind using Base-64 encoding.)

Definition of 'printable' depends on header-fields, eg:
Code:
Content-Type: text/plain; charset=ISO-8859-2 | UTF-8 | ...
Content-Transfer-Encoding: 8bit | quoted-printable | base64
* Is there an "open()" option that'll allow reading this character? Or, something that would likely work out just fine for my purposes, simply ignoring it?

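In ISO-8859-1 every byte value maps to a character, so you won't get decode errors (though the text may come out wrong if the real encoding is something else). Or you can open the file in binary mode.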
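A quick check of that claim in Python, as a sketch: all 256 byte values decode as ISO-8859-1 and round-trip unchanged, while the same bytes fail as UTF-8 at the first byte above 0x7F.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# ISO-8859-1 maps each byte to the code point with the same number.
text = all_bytes.decode("iso-8859-1")
round_trip = text.encode("iso-8859-1")

# The same data is not valid UTF-8; capture the error for inspection.
utf8_error = None
try:
    all_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = exc

print(len(text), repr(text[0xA9]))
print(utf8_error)
```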

* Given that the output from "file" might be different for each email file, am I stuck putting a complex set of tests in place to choose a proper encoding for each file? (Yes, everything in UNIX is a file but what kind of file? :^/ )

Sure. Don't forget multipart messages: every part has its own settings.
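A hedged sketch of handling multipart messages with the standard email package, decoding each text part with its own charset and transfer encoding. The sample message, the boundary string and the helper decode_text_parts are all invented for illustration.

```python
import base64
import email

# Base64 body for the HTML part, built here rather than typed by hand.
html_b64 = base64.b64encode("<p>\u00a9</p>".encode("utf-8"))

# Hypothetical two-part message: ISO-8859-1/8bit plain text, UTF-8/base64 HTML.
raw = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Copyright \xa9\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + html_b64 + b"\r\n"
    b"--XYZ--\r\n"
)


def decode_text_parts(msg):
    """Return (content type, charset, text) for every text/* part."""
    results = []
    for part in msg.walk():
        # Skip the multipart container and any non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        # Assumption: fall back to ASCII when no charset is declared.
        charset = part.get_content_charset() or "ascii"
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        text = payload.decode(charset, errors="replace")
        results.append((part.get_content_type(), charset, text))
    return results


parts = decode_text_parts(email.message_from_bytes(raw))
for content_type, charset, text in parts:
    print(content_type, charset, repr(text))
```

A declared charset that Python does not know would raise LookupError in decode(), so a scanner for spam would likely need a fallback there as well.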
 
  

