Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
I ran into a problem today while trying to read from a file that turned out to have accented characters in it. (In this case an umlauted "o" and an accented "e").
I have a little function (Thank you, O'Reilly) which seems like it'd work for my needs. The trouble is that the script is actually aborting on the reads from the file before the function gets a crack at translating anything:
Code:
with open( datafile, 'r' ) as input:
    for record in input:        # <--<< Error occurs here
        if string1 in record:
            ...                 # process it...
        if string2 in record:
            ...                 # etc.
Since the variable "record" is never assigned anything, I can never get to any place where I can invoke that hopefully-handy function.
I've tried an alternate means of reading the records from the file:
Code:
with open( inf_file, 'r' ) as inf:
    records = ( record.strip() for record in inf )
    for raw_rec in records:     # <--<< Now error occurs here
which gets me past the assignment from the records on disk, but now everything blows up when I try to assign any data to "raw_rec".
What's the correct, Pythonesque way to read individual records from a file that may contain a Unicode character here and there? These cases are likely going to be few and far between, but I'd sure like to make this as generic and flexible as possible.
Note: I've tried opening the file using 'rb' but I'm still getting stuck on that "for" construct when assigning anything to "record" or "raw_rec".
Any hints as to a way out of this dilemma? (Still digging through my local references for clues. Nothing so far.)
Python is pretty nice but dealing with the labyrinth of methods for just trying to read data out of a file -- especially when reading one record at a time -- can be a real headache. This script had been working just fine until Unicode raised its ugly head. :^(
Quote:
Note: I've tried opening the file using 'rb' but I'm still getting stuck on that "for" construct when assigning anything to "record" or "raw_rec".
You should never open any text file in binary mode, especially as Unicode chars are not (always) single byte ones. Input translations should somehow be brought to bear.
As I don't know Python I don't know how, and of course it depends on the Unicode encoding as well:
UTF-8 is 1 byte (the pure ASCII set) up to 4 bytes (rare characters), so variable-length
UTF-16 is 2 or 4 bytes
Quote:
UTF-16 arose from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set)
and there's UCS-4 too, which is always 4 bytes.
I _believe_ Windows normally uses UTF-16, while Linux mostly uses UTF-8
So input processing should know, especially in UTF-8, when it should read the next byte(s) too for the char to be complete.
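Those byte widths are easy to check directly in Python 3; a quick sketch (the sample characters are arbitrary picks: ASCII, a Latin-1 accented letter, the euro sign, and a rare musical symbol):

```python
# Quick check of the byte widths described above, using Python's
# built-in codecs.
for ch in ("A", "é", "€", "𝄞"):
    print(ch,
          len(ch.encode("utf-8")),      # 1 to 4 bytes, variable
          len(ch.encode("utf-16-le")),  # 2 or 4 bytes
          len(ch.encode("utf-32-le")))  # always 4 bytes (UCS-4 style)
```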
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803
Original Poster
Rep:
Quote:
Originally Posted by ehartman
You should never open any text file in binary mode, especially as Unicode chars are not (always) single byte ones. Input translations should somehow be brought to bear.
That's obvious from my attempts. Binary mode made things a lot worse. :^)
I don't see a problem with your code, and Python (normally) handles Unicode characters just fine.
The following is a direct copy and paste from my terminal:
Code:
~/scratch took 37s
❯ cat text.txt
brütal doom
~/scratch
❯ python3
Python 3.8.2 (default, Feb 28 2020, 00:00:00)
[GCC 10.0.1 20200216 (Red Hat 10.0.1-0.8)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('text.txt') as f:
...     for line in f:
...         print(line)
...
brütal doom
>>>
So where do you go from here?
Have the loop print out each "record" (line), so you know which line in the input file is causing the problem.
Post the input file if you can. Or, ideally, the line in the file that triggers it.
Posting the crash message (stack trace) would have been informative too.
Quote:
Originally Posted by shruggy
I'm not very well-versed in Python, but I'd probably try this (not tested):
Code:
with open( inf_file, 'rb' ) as inf:
    records = ( record.decode('utf-8').strip() for record in inf )
Another option would be
Code:
with open( inf_file, 'r', encoding='utf-8' ) as inf:
    records = ( record.strip() for record in inf )
The first option I'm not looking at, as binary-mode I/O has been problematic: Unicode + binary is a bit of a nightmare.
Running file(1) on the data file returns "ISO-8859 text".
I've tried that second option and it blows up on the first character in the "records" blob that's not ASCII. Without "encoding='utf-8'":
Code:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 464: ordinal not in range(128)
Not unexpected.
With it:
Code:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 464: invalid continuation byte
Can't win fer losing today.
To boot, reading everything into that blob could be a concern if I ever encountered any really large files. So I'm still looking for a record-oriented solution. 'Lutz' hasn't been much help so far. Time to widen the search even further.
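One generic, still record-at-a-time escape hatch (not from the thread, just a sketch): pass an errors= handler to open(), so undecodable bytes degrade instead of raising. The file name here is a made-up stand-in:

```python
# Make a tiny sample file containing a raw Latin-1 byte (0xE9),
# reproducing the bad-byte situation from the tracebacks above.
with open("sample.txt", "wb") as f:
    f.write(b"caf\xe9 au lait\n")

# errors='replace' swaps any undecodable byte for U+FFFD, so the
# loop stays lazy (one record at a time) and never raises.
# errors='surrogateescape' would instead smuggle the raw byte
# through, letting you write it back out unchanged later.
with open("sample.txt", "r", encoding="utf-8", errors="replace") as inf:
    for record in inf:
        print(record.strip())   # caf� au lait
```

Note that iterating over the file handle is already lazy, so even a huge file is only ever one record in memory at a time.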
Update:
I tried the following:
Code:
with open( input_file, 'r', encoding='latin-1' ) as input:
    records = ( record.strip() for record in input )
    for raw_rec in records:
        input_arr = cleanup_string( raw_rec ).split( "=" )
        print( input_arr )
and no Unicode exceptions. I'm worried that this may not be a universal solution and that the type of file the script encounters in the future may cause more problems. Second, it's slo-o-ow: those prints are displayed at about one per second.
Quote:
Originally Posted by ehartman
Certainly not; 0xE9 is an ISO 8859-* character:
Code:
351 233 E9 é LATIN SMALL LETTER E WITH ACUTE
UTF-8 characters above 0x7F have a wholly different encoding (a lead byte above 0x7F normally introduces a multi-byte sequence).
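That difference is easy to demonstrate; a minimal sketch that reproduces the exact error message from the posts above:

```python
# 0xE9 is 'é' in Latin-1, but in UTF-8 it announces a multi-byte
# sequence, so a following ASCII letter is an "invalid continuation".
print(b"\xe9".decode("latin-1"))        # é

try:
    b"\xe9tude".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)                   # invalid continuation byte
```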
Yeah. Sorry, I got sucked into a conference call right before you replied and didn't get my update above, noting the use of the "latin-1" encoding, submitted until just now. I'm not sure just how I'll handle the different file types -- and picking the correct encoding -- that this script might encounter, but it's bound to be "interesting".
Quote:
Originally Posted by shruggy
To explain a bit. The first approach is for the situation when you don't know the encoding of the input data. Then you:
import chardet and let it guess how the data were encoded;
decode from the guessed encoding.
The second one is for when you know the encoding. Then you open the file as text with the encoding= parameter.
I've not seen "chardet" before. More "leisure reading" it seems.
The files I'm reading come from a variety of sources/authors but the majority have been plain ol' ASCII so the current code was designed to work with that. Then "special" cases like ISO-8859/latin-1 came up... :^D "chardet" may save me the headache of having to temporarily mess with the "encoding=" in the open() statement(s). Thanks for the tips.
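If pulling in a third-party module like chardet is overkill, a stdlib-only fallback (just a sketch, not from the thread; the function name is made up) is to try a short list of likely encodings in order. Latin-1 goes last because every byte value is valid Latin-1, so it never raises:

```python
def decode_best_effort(raw: bytes) -> str:
    """Try likely encodings in order; latin-1 can't fail, so it's last."""
    for enc in ("utf-8", "latin-1"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1", errors="replace")  # belt and braces

print(decode_best_effort(b"caf\xe9"))              # café (via latin-1)
print(decode_best_effort("café".encode("utf-8")))  # café (via utf-8)
```

The flip side of latin-1 never failing is that it also happily mis-decodes genuinely non-Latin data, which is exactly the gap chardet's statistical guessing tries to fill.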
Latin-1 is ISO 8859-1; there are at least nine other variations of the "Latin" character set in the ISO 8859 standard:
Code:
The full set of ISO 8859 alphabets includes:
ISO 8859-1 West European languages (Latin-1)
ISO 8859-2 Central and East European languages (Latin-2)
ISO 8859-3 Southeast European and miscellaneous languages (Latin-3)
ISO 8859-4 Scandinavian/Baltic languages (Latin-4)
ISO 8859-5 Latin/Cyrillic
ISO 8859-6 Latin/Arabic
ISO 8859-7 Latin/Greek
ISO 8859-8 Latin/Hebrew
ISO 8859-9 Latin-1 modification for Turkish (Latin-5)
ISO 8859-10 Lappish/Nordic/Eskimo languages (Latin-6)
ISO 8859-11 Latin/Thai
ISO 8859-13 Baltic Rim languages (Latin-7)
ISO 8859-14 Celtic (Latin-8)
ISO 8859-15 West European languages (Latin-9)
ISO 8859-16 Romanian (Latin-10)
(from the man page)
Latin-9 is a later modification of the Latin-1 set, with for instance a Euro currency sign.
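That euro-sign difference is visible from Python directly (codec names as Python spells them):

```python
# Latin-9 put the euro sign at 0xA4, where Latin-1 keeps the
# generic currency sign: the same byte, two different characters.
print("€".encode("iso8859-15"))    # b'\xa4'
print(b"\xa4".decode("latin-1"))   # ¤
```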
Sorry, I need to ask a stupid question. Are you sure you’re using Python 3?
Python 2 did require you to call .encode('utf-8') and .decode('utf-8') on the strings, and could crash in the exact same way you described if you forgot to. Python 3, on the other hand, has perfect UTF-8 support. I can't reproduce your crashes on Python 3, and I don't see how anyone could either.
Your problems would make a lot more sense if they were happening on Python 2.
Quote:
Originally Posted by dugan
Sorry, I need to ask a stupid question. Are you sure you’re using Python 3?
Python 2 did require you to call .encode('utf-8') and .decode('utf-8') on the strings, and could crash in the exact same way you described if you forgot to. Python 3, on the other hand, has perfect UTF-8 support. I can't reproduce your crashes on Python 3, and I don't see how anyone could either.
Your problems would make a lot more sense if they were happening on Python 2.
First, not a stupid question.
Second, sorry for not following up until now... (all the days are blending together).
Third, as for the 2-vs-3 question: Yes. According to the shebang, it's Python 3. (Running this script using Python 2 would result in a slew of print() failures.)
UPDATE 1: I narrowed the problem down to the LC_ALL environment variable. It seems to have gotten set to "C" somewhere in either the system startup or the user login. If I run my unicode reader script against that file using:
Code:
LC_ALL="" ./unicode_reader
it works fine.
[time passes...]
UPDATE 2: I found an ancient file that I've been sourcing as part of my login profile since, well... forever, that was setting "LC_ALL=C". That fixed something ages ago... I just cannot recall what it was now.
After removing that line from the sourced file, logging off and back on, and re-running the unicode test script, all is well.
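That tracks with how open() works: with no encoding= argument it falls back to the locale's preferred encoding, which is plain ASCII under LC_ALL=C. A sketch of the locale-proof fix, which is simply passing encoding= explicitly ('sample.txt' is a made-up file name):

```python
import locale

# What open() would default to on this system; under LC_ALL=C this
# comes back as something like 'ANSI_X3.4-1968' (i.e. ASCII).
print(locale.getpreferredencoding())

# Passing encoding= explicitly makes the read immune to the
# caller's locale settings.
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write("brütal doom\n")
with open("sample.txt", "r", encoding="utf-8") as f:
    print(f.read().strip())   # brütal doom
```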