LinuxQuestions.org


FrancisG 10-17-2016 03:41 AM

[SOLVED] Strange Characters reading .csv file in Python 2
 
I am trying to read a csv file in Python, but strange characters show up in the output!
This file is downloaded from a Solar Photovoltaic Array System.
The file looks fine in gedit, geany and vim. I can import it into Libreoffice Calc, no problem.
Here are a couple of lines in gedit to show what it should look like:-
Code:

20/09/2016 00:00:00;16901.962;0.000
20/09/2016 00:05:00;16901.962;0.000

However when I try and read the file in Python using

Code:

_file = open(fl,'rU')
for line in _file:
    print line

I get gaps between each character and extra lines:-
Code:

2 0 / 0 9 / 2 0 1 6  0 0 : 0 0 : 0 0 ; 1 6 9 0 1 . 9 6 2 ; 0 . 0 0 0

 

 2 0 / 0 9 / 2 0 1 6  0 0 : 0 5 : 0 0 ; 1 6 9 0 1 . 9 6 2 ; 0 . 0 0 0

Without the 'rU' in open() (just 'r'), I get one odd character printed out and then a blank line instead of the actual line, even though I can see the line in the debugger. I am using PyCharm.
In LibreOffice Writer I get:-
#2#0#/#0#9#/#2#0#1#6# #0#0#:#0#0#:#0#0#;#1#6#9#0#1#.#9#6#2#;#0#.#0#0#0##
#2#0#/#0#9#/#2#0#1#6# #0#0#:#0#5#:#0#0#;#1#6#9#0#1#.#9#6#2#;#0#.#0#0#0##
What are all the hashes about? Is this some sort of strange encoding issue? I am using utf-8 encoding at the beginning of my script.
Thanks in advance
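
One point on the question above: the utf-8 declaration at the top of a script only tells Python how to decode the script's own source text. It has no effect on how open() reads a data file. A quick way to check a file's encoding is to look at its first few bytes for a byte order mark (BOM). The sketch below is only an illustration; the helper name sniff_bom is made up for this example, and a file without a BOM can still be in any encoding.

Code:

```python
import codecs

def sniff_bom(first_bytes):
    """Guess a codec name from a leading byte order mark, or None if there is none."""
    # UTF-32 marks are checked first because BOM_UTF32_LE starts with BOM_UTF16_LE.
    marks = [
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ]
    for mark, name in marks:
        if first_bytes.startswith(mark):
            return name
    return None

# Hypothetical first bytes of a little-endian UTF-16 file beginning with "20".
guess = sniff_bom(b'\xff\xfe2\x000\x00')
```

With the real file this could be called as sniff_bom(open(fl, 'rb').read(4)), where fl is the path used in the post above.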

FrancisG 10-17-2016 06:59 AM

Answering my own post: I think this is definitely an encoding problem. The original .csv file comes from a Windows 10 machine.
Here is the string using repr(line):
Code:

'\'\x002\x000\x00/\x000\x009\x00/\x002\x000\x001\x006\x00 \x000\x000\x00:\x000\x000\x00:\x000\x000\x00;\x001\x006\x009\x000\x001\x00.\x009\x006\x002\x00;\x000\x00.\x000\x000\x000\x00\r\x00\n\''
I think this would suggest UTF-16, but I am not sure.
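
The pattern in that repr, a \x00 next to every visible character, is what little-endian UTF-16 looks like when read as plain bytes. The leading \x00 is presumably the second half of the previous line's two-byte newline. A small sketch, assuming the file is UTF-16LE with a BOM; the sample row is copied from the first post and the variable names are only illustrative:

Code:

```python
from __future__ import print_function

row = u'20/09/2016 00:00:00;16901.962;0.000\r\n'

# Two rows encoded the way the file is assumed to be stored: BOM + UTF-16LE.
data = b'\xff\xfe' + (row + row).encode('utf-16-le')

# Splitting on the single byte '\n' cuts each two-byte newline in half,
# so every line after the first starts with a stray '\x00'.
byte_lines = data.split(b'\n')
second = byte_lines[1]
print(repr(second))

# Decoding the whole buffer first and splitting afterwards gives clean text.
text_lines = data.decode('utf-16').splitlines()
print(text_lines)
```

This also shows why encoding each line to utf-8 cannot help: the bytes need to be decoded from UTF-16, and reading a byte file line by line has already broken the two-byte alignment.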
Here is the output from a line
Code:

2 0 / 0 9 / 2 0 1 6  0 0 : 0 0 : 0 0 ; 1 6 9 0 1 . 9 6 2 ; 0 . 0 0 0
I have tried all sorts of Unicode-related things.
Here are some of them, none of which changed the output
(the variable 'line' is a line from the csv file):
Code:

codecs.encode(unicode(line),'utf-8')
line.encode('utf-8')

then from a good presentation this function:
Code:

def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj

called as:
Code:

to_unicode_or_bust(line)
and output is identical.
Anyone good on codecs?

schneidz 10-17-2016 07:03 AM

dos2unix ?

FrancisG 10-17-2016 08:00 AM

Perfect!!! Works a treat
Thank you so much, saves a load of faffing about.
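
For anyone who would rather not convert the file first: recent versions of dos2unix convert UTF-16 files that start with a BOM as well as fixing line endings, which is presumably why it worked here. The same can be done inside Python by opening the file with an explicit encoding. This is a minimal sketch, assuming the file is UTF-16 with a BOM and semicolon-separated as in the first post. The function name read_rows and the temporary demo file are made up for the example. It uses io.open so that it runs on Python 2.7 as well as Python 3.

Code:

```python
from __future__ import print_function
import io
import os
import tempfile

def read_rows(path, encoding='utf-16'):
    """Return each non-empty line as a list of unicode fields split on ';'."""
    rows = []
    # The 'utf-16' codec uses the BOM to pick the byte order and then drops it.
    with io.open(path, 'r', encoding=encoding) as handle:
        for line in handle:
            line = line.rstrip(u'\r\n')
            if line:
                rows.append(line.split(u';'))
    return rows

# Demo: write a small UTF-16 file with Windows line endings, then read it back.
sample = (u'20/09/2016 00:00:00;16901.962;0.000\r\n'
          u'20/09/2016 00:05:00;16901.962;0.000\r\n')
fd, demo_path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-16', newline='') as out:
    out.write(sample)

rows = read_rows(demo_path)
os.remove(demo_path)
print(rows)
```

The Python 2 csv module does not accept unicode input, which is why the sketch splits on ';' by hand instead of using csv.reader.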


All times are GMT -5.