Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
I ran into a problem today while trying to read from a file that turned out to have accented characters in it. (In this case an umlauted "o" and an accented "e").
I have a little function (Thank you, O'Reilly) which seems like it'd work for my needs. The trouble is that the script is actually aborting on the reads from the file before the function gets a crack at translating anything:
Code:
with open( datafile, 'r' ) as input:
    for record in input:        # <--<< Error occurs here
        if string1 in record:
            ...                 # process it...
        if string2 in record:
            ...                 # etc.
Since the variable "record" is never assigned anything, I can never get to any place where I can invoke that hopefully-handy function.
I've tried an alternate means of reading the records from the file:
Code:
with open( inf_file, 'r' ) as inf:
    records = ( record.strip() for record in inf )
    for raw_rec in records:     # <--<< Now error occurs here
which gets me past the assignment from the records on disk, but now everything blows up when I try to assign any data to "raw_rec".
What's the correct, Pythonesque way to read individual records from a file that may contain a Unicode character here and there? These cases are likely going to be few and far between, but I'd sure like to make this as generic and flexible as possible.
Note: I've tried opening the file using 'rb' but I'm still getting stuck on that "for" construct when assigning anything to "record" or "raw_rec".
Any hints as to a way out of this dilemma? (Still digging through my local references for clues. Nothing so far.)
Python is pretty nice but dealing with the labyrinth of methods for just trying to read data out of a file -- especially when reading one record at a time -- can be a real headache. This script had been working just fine until Unicode raised its ugly head. :^(
Quote:
Note: I've tried opening the file using 'rb' but I'm still getting stuck on that "for" construct when assigning anything to "record" or "raw_rec".
You should never open any text file in binary mode, especially as Unicode chars are not (always) single byte ones. Input translations should somehow be brought to bear.
As I don't know Python I don't know how, and of course it depends on the Unicode encoding as well:
UTF-8 is 1 byte (the pure ASCII set) up to 4 bytes (rare characters), so variable-length
UTF-16 is 2 or 4 bytes
Quote:
UTF-16 arose from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set)
and there's UCS-4 too, which is always 4 bytes.
I _believe_ Windows normally uses UTF-16, while Linux mostly uses UTF-8
So input processing should know, especially in UTF-8, when it should read the next byte(s) too for the char to be complete.
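Those byte widths are easy to check directly in Python 3; a quick sketch (the sample characters are arbitrary picks: ASCII, a Latin-1 accented letter, the euro sign, and a rare musical symbol):

```python
# Quick check of the byte widths described above, using Python's
# built-in codecs.
for ch in ("A", "é", "€", "𝄞"):
    print(ch,
          len(ch.encode("utf-8")),      # 1 to 4 bytes, variable
          len(ch.encode("utf-16-le")),  # 2 or 4 bytes
          len(ch.encode("utf-32-le")))  # always 4 bytes (UCS-4 style)
```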
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803
Original Poster
Rep:
Quote:
Originally Posted by ehartman
You should never open any text file in binary mode, especially as Unicode chars are not (always) single byte ones. Input translations should somehow be brought to bear.
That's obvious from my attempts. Binary mode made things a lot worse. :^)
I don't see a problem with your code, and Python (normally) handles Unicode characters just fine.
The following is a direct copy and paste from my terminal:
Code:
~/scratch took 37s
❯ cat text.txt
brütal doom
~/scratch
❯ python3
Python 3.8.2 (default, Feb 28 2020, 00:00:00)
[GCC 10.0.1 20200216 (Red Hat 10.0.1-0.8)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('text.txt') as f:
...     for line in f:
...         print(line)
...
brütal doom
>>>
So where do you go from here?
Have the loop print out each "record" (line), so you know which line in the input file is causing the problem.
Post the input file if you can. Or, ideally, the line in the file that triggers it.
Posting the crash message (stack trace) would have been informative too.
Quote:
Originally Posted by shruggy
I'm not very well-versed in Python, but I'd probably try this (not tested):
Code:
with open( inf_file, 'rb' ) as inf:
    records = ( record.decode('utf-8').strip() for record in inf )
Another option would be
Code:
with open( inf_file, 'r', encoding='utf-8' ) as inf:
    records = ( record.strip() for record in inf )
The first option I'm not looking at, as binary-mode I/O has been problematic: Unicode + binary is a bit of a nightmare.
Running file(1) on the data file returns "ISO-8859 text".
I've tried that second option and it blows up on the first character in the "records" blob that's not ASCII. Without "encoding='utf-8'":
Code:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 464: ordinal not in range(128)
Not unexpected.
With it:
Code:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 464: invalid continuation byte
Can't win fer losing today.
To boot, reading everything into that blob could be a concern if I ever encountered any really large files. So I'm still looking for a record-oriented solution. 'Lutz' hasn't been much help so far. Time to widen the search even further.
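One generic, still record-at-a-time escape hatch (not from the thread, just a sketch): pass an errors= handler to open(), so undecodable bytes degrade instead of raising. The file name here is a made-up stand-in:

```python
# Make a tiny sample file containing a raw Latin-1 byte (0xE9),
# reproducing the bad-byte situation from the tracebacks above.
with open("sample.txt", "wb") as f:
    f.write(b"caf\xe9 au lait\n")

# errors='replace' swaps any undecodable byte for U+FFFD, so the
# loop stays lazy (one record at a time) and never raises.
# errors='surrogateescape' would instead smuggle the raw byte
# through, letting you write it back out unchanged later.
with open("sample.txt", "r", encoding="utf-8", errors="replace") as inf:
    for record in inf:
        print(record.strip())   # caf� au lait
```

Note that iterating over the file handle is already lazy, so even a huge file is only ever one record in memory at a time.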
Update:
I tried the following:
Code:
with open( input_file, 'r', encoding='latin-1' ) as input:
    records = ( record.strip() for record in input )
    for raw_rec in records:
        input_arr = cleanup_string( raw_rec ).split( "=" )
        print( input_arr )
and no Unicode exceptions. I'm worried that this may not be a universal solution and that the type of file the script encounters in the future may cause more problems. Second, it's slo-o-ow: those prints are displayed at about one per second.
Quote:
Originally Posted by ehartman
Certainly not; 0xE9 is an ISO 8859-* character:
Code:
351 233 E9 é LATIN SMALL LETTER E WITH ACUTE
UTF-8 characters above 0x7F have a wholly different encoding (a lead byte above 0x7F normally introduces a multi-byte sequence).
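That difference is easy to demonstrate; a minimal sketch that reproduces the exact error message from the posts above:

```python
# 0xE9 is 'é' in Latin-1, but in UTF-8 it announces a multi-byte
# sequence, so a following ASCII letter is an "invalid continuation".
print(b"\xe9".decode("latin-1"))        # é

try:
    b"\xe9tude".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)                   # invalid continuation byte
```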
Yeah. Sorry, I got sucked into a conference call right before you replied and didn't get my update above, noting the use of the "latin-1" encoding, submitted until just now. I'm not sure just how I'll handle the different file types -- and picking the correct encoding -- that this script might encounter, but it's bound to be "interesting".
Quote:
Originally Posted by shruggy
To explain a bit. The first approach is for the situation when you don't know the encoding of the input data. Then you:
import chardet and let it guess how the data were encoded;
decode from the guessed encoding.
The second one is for when you know the encoding. Then you open the file as text with the encoding= parameter.
I've not seen "chardet" before. More "leisure reading" it seems.
The files I'm reading come from a variety of sources/authors but the majority have been plain ol' ASCII so the current code was designed to work with that. Then "special" cases like ISO-8859/latin-1 came up... :^D "chardet" may save me the headache of having to temporarily mess with the "encoding=" in the open() statement(s). Thanks for the tips.
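If pulling in a third-party module like chardet is overkill, a stdlib-only fallback (just a sketch, not from the thread; the function name is made up) is to try a short list of likely encodings in order. Latin-1 goes last because every byte value is valid Latin-1, so it never raises:

```python
def decode_best_effort(raw: bytes) -> str:
    """Try likely encodings in order; latin-1 can't fail, so it's last."""
    for enc in ("utf-8", "latin-1"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1", errors="replace")  # belt and braces

print(decode_best_effort(b"caf\xe9"))              # café (via latin-1)
print(decode_best_effort("café".encode("utf-8")))  # café (via utf-8)
```

The flip side of latin-1 never failing is that it also happily mis-decodes genuinely non-Latin data, which is exactly the gap chardet's statistical guessing tries to fill.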
Latin-1 is ISO 8859-1; there are at least nine other variations of the "Latin" character set in the ISO 8859 standard:
Code:
The full set of ISO 8859 alphabets includes:
ISO 8859-1 West European languages (Latin-1)
ISO 8859-2 Central and East European languages (Latin-2)
ISO 8859-3 Southeast European and miscellaneous languages (Latin-3)
ISO 8859-4 Scandinavian/Baltic languages (Latin-4)
ISO 8859-5 Latin/Cyrillic
ISO 8859-6 Latin/Arabic
ISO 8859-7 Latin/Greek
ISO 8859-8 Latin/Hebrew
ISO 8859-9 Latin-1 modification for Turkish (Latin-5)
ISO 8859-10 Lappish/Nordic/Eskimo languages (Latin-6)
ISO 8859-11 Latin/Thai
ISO 8859-13 Baltic Rim languages (Latin-7)
ISO 8859-14 Celtic (Latin-8)
ISO 8859-15 West European languages (Latin-9)
ISO 8859-16 Romanian (Latin-10)
(from the man page)
Latin-9 is a later modification of the Latin-1 set, with for instance a Euro currency sign.
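That euro-sign difference is visible from Python directly (codec names as Python spells them):

```python
# Latin-9 put the euro sign at 0xA4, where Latin-1 keeps the
# generic currency sign: the same byte, two different characters.
print("€".encode("iso8859-15"))    # b'\xa4'
print(b"\xa4".decode("latin-1"))   # ¤
```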
Sorry, I need to ask a stupid question. Are you sure you’re using Python 3?
Python 2 did require you to call .encode('utf-8') and .decode('utf-8') on the strings, and could crash in the exact same way you described if you forgot to. Python 3, on the other hand, has perfect UTF-8 support. I can't reproduce your crashes on Python 3, and I don't see how anyone could either.
Your problems would make a lot more sense if they were happening on Python 2.
Quote:
Originally Posted by dugan
Sorry, I need to ask a stupid question. Are you sure you’re using Python 3?
Python 2 did require you to call .encode('utf-8') and .decode('utf-8') on the strings, and could crash in the exact same way you described if you forgot to. Python 3, on the other hand, has perfect UTF-8 support. I can't reproduce your crashes on Python 3, and I don't see how anyone could either.
Your problems would make a lot more sense if they were happening on Python 2.
First, not a stupid question.
Second, sorry for not following up until now... (all the days are blending together).
Third, as for the 2-vs-3 question: Yes. According to the shebang, it's Python 3. (Running this script using Python 2 would result in a slew of print() failures.)
UPDATE 1: I narrowed the problem down to the LC_ALL environment variable. It seems to have gotten set to "C" somewhere in either the system startup or the user login. If I run my unicode reader script against that file using:
Code:
LC_ALL="" ./unicode_reader
it works fine.
[time passes...]
UPDATE 2: I found an ancient file that I've been sourcing as part of my login profile since, well... forever, that was setting "LC_ALL=C". That fixed something ages ago... I just cannot recall what it was now.
After removing that line from the sourced file, logging off and back on, and re-running the unicode test script, all is well.
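That tracks with how open() works: with no encoding= argument it falls back to the locale's preferred encoding, which is plain ASCII under LC_ALL=C. A sketch of the locale-proof fix, which is simply passing encoding= explicitly ('sample.txt' is a made-up file name):

```python
import locale

# What open() would default to on this system; under LC_ALL=C this
# comes back as something like 'ANSI_X3.4-1968' (i.e. ASCII).
print(locale.getpreferredencoding())

# Passing encoding= explicitly makes the read immune to the
# caller's locale settings.
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write("brütal doom\n")
with open("sample.txt", "r", encoding="utf-8") as f:
    print(f.read().strip())   # brütal doom
```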