Old 05-08-2020, 03:11 PM   #1
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Rep: Reputation: 550
Python3: Reading file containing Unicode characters... How?


I ran into a problem today while trying to read from a file that turned out to have accented characters in it (in this case, an umlauted "o" and an accented "e").

I have a little function (Thank you, O'Reilly) which seems like it'd work for my needs. The trouble is that the script is actually aborting on the reads from the file before the function gets a crack at translating anything:
Code:
with open(datafile, 'r') as input_file:
    for record in input_file:        # <--<< Error occurs here
        if string1 in record:
            pass                     # process it...
        if string2 in record:
            pass                     # etc.
Since the variable "record" is never assigned anything, I can never get to any place where I can invoke that hopefully-handy function.

I've tried an alternate means of reading the records from the file:
Code:
with open(inf_file, 'r') as inf:
    records = (record.strip() for record in inf)
    for raw_rec in records:          # <--<< Now error occurs here
        ...
which gets me past creating the generator from the records on disk, but now everything blows up when I try to assign any data to "raw_rec".

What's the correct, Pythonesque way to read individual records from a file that may contain a Unicode character here and there? These cases are likely going to be few and far between, but I'd sure like to make this as generic and flexible as possible.

Note: I've tried opening the file using 'rb' but I'm still getting stuck on that "for" construct when assigning anything to "record" or "raw_rec".

Any hints as to a way out of this dilemma? (Still digging through my local references for clues. Nothing so far.)

Python is pretty nice but dealing with the labyrinth of methods for just trying to read data out of a file -- especially when reading one record at a time -- can be a real headache. This script had been working just fine until Unicode raised its ugly head. :^(

TIA...
 
Old 05-08-2020, 04:13 PM   #2
ehartman
Senior Member
 
Registered: Jul 2007
Location: Delft, The Netherlands
Distribution: Slackware
Posts: 1,674

Rep: Reputation: 888
Quote:
Originally Posted by rnturn View Post
Note: I've tried opening the file using 'rb' but I'm still getting stuck on that "for" construct when assigning anything to "record" or "raw_rec".
You should never open any text file in binary mode, especially as Unicode chars are not (always) single byte ones. Input translations should somehow be brought to bear.
As I don't know Python, I don't know how, and of course it depends on the Unicode encoding as well:
UTF-8 is 1 byte (the pure ASCII set) up to 4 bytes (rare characters), so variable length
UTF-16 is 2 or 4 bytes
Quote:
UTF-16 arose from an earlier fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set)
and there's UCS-4 too, which is always 4 bytes.
I _believe_ Windows normally uses UTF-16, while Linux mostly uses UTF-8
So the input processing has to know, especially with UTF-8, when it should read the next byte(s) too for the character to be complete.
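For example, the OP's umlauted "o" becomes a different byte sequence in each (hex values):
Code:
'ö' (U+00F6)   ISO 8859-1:  F6       (one byte)
               UTF-8:       C3 B6    (two bytes)
               UTF-16LE:    F6 00    (two bytes)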

Last edited by ehartman; 05-08-2020 at 04:16 PM.
 
Old 05-08-2020, 04:47 PM   #3
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
I'm not very well-versed in Python, but I'd probably try this (not tested):
Code:
with open( inf_file, 'rb' ) as inf:
    records = ( record.decode('utf-8').strip() for record in inf )
Another option would be
Code:
with open( inf_file, 'r', encoding='utf-8' ) as inf:
    records = ( record.strip() for record in inf )

Last edited by shruggy; 05-08-2020 at 05:09 PM.
 
Old 05-08-2020, 05:33 PM   #4
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by ehartman View Post
You should never open any text file in binary mode, especially as Unicode chars are not (always) single byte ones. Input translations should somehow be brought to bear.
That's obvious from my attempts. Binary mode made things a lot worse. :^)
 
Old 05-08-2020, 06:05 PM   #5
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,235

Rep: Reputation: 5320
I don't see a problem with your code, and Python (normally) handles Unicode characters just fine.

The following is a direct copy and paste from my terminal:

Code:
~/scratch took 37s 
❯ cat text.txt
brütal doom

~/scratch 
❯ python3
Python 3.8.2 (default, Feb 28 2020, 00:00:00) 
[GCC 10.0.1 20200216 (Red Hat 10.0.1-0.8)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('text.txt') as f:
...     for line in f:
...         print(line)
... 
brütal doom

>>>

So where do you go from here?

Have the loop print out each "record" (line), so you know which line in the input file is causing the problem.
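For example (my text.txt here stands in for your file):

Code:
with open('text.txt') as f:
    for lineno, line in enumerate(f, start=1):
        print(lineno, repr(line))   # the last number printed before the traceback is the bad line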

Post the input file if you can. Or, ideally, the line in the file that triggers it.

Posting the crash message (stack trace) would have been informative too.

Last edited by dugan; 05-08-2020 at 07:01 PM.
 
Old 05-08-2020, 06:42 PM   #6
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by shruggy View Post
I'm not very well-versed in Python, but I'd probably try this (not tested):
Code:
with open( inf_file, 'rb' ) as inf:
    records = ( record.decode('utf-8').strip() for record in inf )
Another option would be
Code:
with open( inf_file, 'r', encoding='utf-8' ) as inf:
    records = ( record.strip() for record in inf )

I'm not looking at the first option, as binary-mode I/O has been problematic: Unicode + binary is a bit of a nightmare.

Running file(1) on the data file returns "ISO-8859 text".

I've tried that second option and it blows up on the first character in the "records" blob that's not ASCII. Without "encoding='utf-8'":
Code:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 464: ordinal not in range(128)
Not unexpected.

With it:
Code:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 464: invalid continuation byte
Can't win fer losing today.

To boot, reading everything into that blob could be a concern if I ever encountered any really large files. So I'm still looking for a record-oriented solution. 'Lutz' hasn't been much help so far. Time to widen the search even further.
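One record-oriented stopgap I may try in the meantime: keep the line-by-line loop but open with errors='replace', so undecodable bytes come through as the U+FFFD replacement character instead of killing the read (at the cost of mangling those characters):

Code:
with open(inf_file, 'r', encoding='utf-8', errors='replace') as inf:
    for record in inf:
        # bad bytes show up as U+FFFD rather than raising UnicodeDecodeError
        ...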

[bhod]

Update:

I tried the following:
Code:
with open(input_file, 'r', encoding='latin-1') as inf:
    records = (record.strip() for record in inf)
    for raw_rec in records:
        input_arr = cleanup_string(raw_rec).split("=")
        print(input_arr)
and got no Unicode exceptions. I'm worried that this may not be a universal solution, though, and that the types of files the script encounters in the future may cause more problems. The second thing is that it's slo-o-ow: those prints are displayed at about one per second.

Last edited by rnturn; 05-08-2020 at 10:23 PM.
 
Old 05-08-2020, 07:27 PM   #7
ehartman
Senior Member
 
Registered: Jul 2007
Location: Delft, The Netherlands
Distribution: Slackware
Posts: 1,674

Rep: Reputation: 888
Quote:
Originally Posted by rnturn View Post
Code:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 464: ordinal not in range(128)
Not unexpected.
Certainly not; 0xE9 is an ISO 8859-* character:
Code:
351   233   E9    é     LATIN SMALL LETTER E WITH ACUTE
Bytes above 0x7F have a wholly different meaning in UTF-8 (they normally introduce, or continue, a multi-byte character).
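The difference is easy to see with dugan's interpreter trick:
Code:
>>> b'\xe9'.decode('iso-8859-1')   # Latin-1: one byte, always valid
'é'
>>> 'é'.encode('utf-8')            # the same character needs two bytes in UTF-8
b'\xc3\xa9'
>>> b'\xe9 '.decode('utf-8')       # a lone 0xE9 is not valid UTF-8
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte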
 
Old 05-08-2020, 10:28 PM   #8
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by ehartman View Post
Certainly not, 0xE9 is a ISO_8859-* character:
Code:
351   233   E9    é     LATIN SMALL LETTER E WITH ACUTE
UTF-8 chars above the 0x7F one have a wholly different encoding (and normally introduce a multi-byte character).
Yeah. Sorry I got sucked into a conference call right before you replied and didn't get my update above noting the use of the "latin-1" encoding submitted until just now. I'm not sure how just how I'll handle the possible different file types -- and using the correct encoding -- this script might encounter but it's bound to be "interesting".
 
Old 05-09-2020, 12:08 AM   #9
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,235

Rep: Reputation: 5320
Please test your code against this file:

https://www.cl.cam.ac.uk/~mgk25/ucs/...quickbrown.txt
 
Old 05-09-2020, 03:28 AM   #10
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
Quote:
Originally Posted by rnturn View Post
The first option I'm not looking at as the binary mode I/O has been problematic: Unicode + binary is a bit of a nightmare.
To explain a bit: the first approach is for the situation when you don't know the encoding of the input data. Then
  1. open it in binary mode, as a byte stream;
  2. import chardet and let it guess how the data was encoded;
  3. decode from the guessed encoding.
The second one is for when you know the encoding. Then you open the file as text with the encoding= parameter.
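Untested as well, but the first approach might look roughly like this (note it reads the whole file into memory):

Code:
import chardet  # third-party: pip install chardet

with open(inf_file, 'rb') as inf:
    raw = inf.read()                      # 1. raw bytes, nothing decoded yet

guess = chardet.detect(raw)               # 2. e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'])      # 3. decode from the guessed encoding

records = (record.strip() for record in text.splitlines())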
 
1 member found this post helpful.
Old 05-09-2020, 01:02 PM   #11
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by dugan View Post
Test your code against this file please.

https://www.cl.cam.ac.uk/~mgk25/ucs/...quickbrown.txt
Nice. An "acid test" of sorts.

I'll have to pull out my basic I/O loop and see how it handles that. Thanks.

UPDATE: It handles it badly. Crashes on the first line of Danish text.

Last edited by rnturn; 05-09-2020 at 05:14 PM. Reason: Added results of reading `quickbrown.txt'.
 
Old 05-09-2020, 01:14 PM   #12
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by shruggy View Post
To explain a bit. The first approach is for the situation when you don't know the encoding of input data. Then
  1. import chardet and let it guess how the data were encoded.
  2. decode from the guessed encoding.
The second one is for when you know the encoding. Then you open the file as text with the encoding= parameter.
I've not seen "chardet" before. More "leisure reading" it seems.

The files I'm reading come from a variety of sources/authors but the majority have been plain ol' ASCII so the current code was designed to work with that. Then "special" cases like ISO-8859/latin-1 came up... :^D "chardet" may save me the headache of having to temporarily mess with the "encoding=" in the open() statement(s). Thanks for the tips.

Cheers...
 
Old 05-09-2020, 05:36 PM   #13
ehartman
Senior Member
 
Registered: Jul 2007
Location: Delft, The Netherlands
Distribution: Slackware
Posts: 1,674

Rep: Reputation: 888
Quote:
Originally Posted by rnturn View Post
like ISO-8859/latin-1 came up...
Latin-1 is ISO 8859-1; there are at least 9 other variations of the "latin" character set in the ISO 8859 standard:
Code:
The full set of ISO 8859 alphabets includes:

       ISO 8859-1    West European languages (Latin-1)
       ISO 8859-2    Central and East European languages (Latin-2)
       ISO 8859-3    Southeast European and miscellaneous languages (Latin-3)
       ISO 8859-4    Scandinavian/Baltic languages (Latin-4)
       ISO 8859-5    Latin/Cyrillic
       ISO 8859-6    Latin/Arabic
       ISO 8859-7    Latin/Greek
       ISO 8859-8    Latin/Hebrew
       ISO 8859-9    Latin-1 modification for Turkish (Latin-5)
       ISO 8859-10   Lappish/Nordic/Eskimo languages (Latin-6)
       ISO 8859-11   Latin/Thai
       ISO 8859-13   Baltic Rim languages (Latin-7)
       ISO 8859-14   Celtic (Latin-8)
       ISO 8859-15   West European languages (Latin-9)
       ISO 8859-16   Romanian (Latin-10)
(from the man page)

Latin-9 is a later modification of the Latin-1 set, with, for instance, a euro currency sign.
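You can see the practical difference at position 0xA4, for instance:
Code:
0xA4 in ISO 8859-1  (Latin-1):  ¤  CURRENCY SIGN
0xA4 in ISO 8859-15 (Latin-9):  €  EURO SIGN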

Last edited by ehartman; 05-09-2020 at 05:38 PM.
 
Old 05-11-2020, 08:49 AM   #14
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,235

Rep: Reputation: 5320
Sorry, I need to ask a stupid question. Are you sure you’re using Python 3?

Python 2 did require you to call .encode('utf-8') and .decode('utf-8') on the strings, and could crash in the exact same way you described if you forgot to. Python 3, on the other hand, has perfect UTF-8 support. I can't reproduce your crashes on Python 3, and I don't see how anyone could either.

Your problems would make a lot more sense if they were happening on Python 2.
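A quick way to verify from inside the script itself, whatever the shebang says:

Code:
import sys
print(sys.version)   # should start with 3.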

Last edited by dugan; 05-12-2020 at 01:46 AM.
 
Old 05-17-2020, 05:08 PM   #15
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,803

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by dugan View Post
Sorry, I need to ask a stupid question. Are you sure you’re using Python 3?

Python 2 did require you to call .encode('utf-8') and .decode('utf-8') on the strings, and could crash in the exact same way you described if you forgot to. Python 3, on the other hand, has perfect UTF-8 support. I can't reproduce your crashes on Python 3, and I don't see how anyone could either.

Your problems would make a lot more sense if they were happening on Python 2.
First, not a stupid question.

Second, sorry for not following up until now... (all the days are blending together).

Third, as for the 2-vs-3 question: Yes. According to the shebang, it's Python 3. (Running this script using Python 2 would result in a slew of print() failures.)


UPDATE 1: I narrowed the problem down to the LC_ALL environment variable. It seems to have gotten set to "C" somewhere in either the system startup or user login. If I run my unicode reader script against that file using:
Code:
LC_ALL="" ./unicode_reader
it works fine.
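Which makes sense in hindsight: when open() gets no encoding= argument, Python 3 falls back to the locale's preferred encoding, and under LC_ALL=C that is plain ASCII. You can see what it will pick with:

Code:
import locale
print(locale.getpreferredencoding(False))   # 'ANSI_X3.4-1968' (ASCII) under LC_ALL=C; 'UTF-8' in a UTF-8 locale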

[time passes...]

UPDATE 2: I found an ancient file that I've been sourcing as part of my login profile since, well... forever, and it was setting "LC_ALL=C". That fixed something ages ago... I just cannot recall what that was now.

After removing that line from the sourced file, logging off and back on again, and re-running the unicode test script, all is well.

I'll mark this as "Solved".

Thanks all for the feedback.

Later... and stay safe.
 
  

