[SOLVED] using iconv to change character encoding

Pedroski · 03-01-2019, 10:22 PM

To help the gf, I'm trying to write a Python program to pull a lot of data from a webpage and write it to Excel. I got one set of data, but the page's encoding seems to be GB2312:

Code:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

I wrote the data to a text file in Python. The file seems to be UTF-8, but the Chinese names do not display correctly.

I want to convert the file to GB2312, in the hope that the characters will display correctly.

Quote:

pedro@pedro-newssd:~/Documents$ file -i page1
page1: text/plain; charset=utf-8
pedro@pedro-newssd:~/Documents$ iconv -f UTF-8 -t GB2312 page1 -o page1gb2312
iconv: illegal input sequence at position 5

I think what I need is to write the file in Python as GB2312.

How do I tell Python to encode GB2312?

Quote:

>>> file = open(path + filename, 'w')
>>> file.write(line)

pedro@pedro-newssd:~/Documents$ iconv -f UTF-8 -t gb2312 page1 -o page1gb2312
iconv: illegal input sequence at position 5
pedro@pedro-newssd:~/Documents$

The above produces the file page1gb2312, but it is empty. I keep getting:

Quote:

iconv: illegal input sequence at position 5

Any tips please? Got to keep the gf happy!

What am I doing wrong?
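
One possible cause of the iconv error, not confirmed at this point in the thread, is that page1 contains characters with no GB2312 equivalent. A minimal Python sketch for checking that; the helper name `unencodable_chars` is hypothetical, and the sample strings are illustrative:

```python
def unencodable_chars(text, encoding="gb2312"):
    """Return (index, character) pairs that cannot be encoded in `encoding`."""
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((index, char))
    return problems


# Genuine Chinese text encodes cleanly, so nothing is reported.
print(unencodable_chars("黄果树"))
# A Latin-1 letter such as 'Æ' has no GB2312 equivalent and is reported.
print(unencodable_chars("aÆb"))
```

Running this over the contents of the file would show whether the text is real Chinese or something that only looks like UTF-8 text.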

RandomTroll · 03-01-2019, 11:57 PM

I never got iconv to work. I just tried an example from its man page: it returned

Quote:

illegal input sequence at position 4

- so much for documentation. I use utf8trans instead. The drawback to that is that the provided tables were incomplete so I had to add entries for new characters. Perhaps I could have found better tables had I searched. A search on gb2312 on my computer turned up some python stuff that looks like it's meant to handle it, perhaps convert to/from UTF.

Quote:

usr/lib64/python2.7/test/cjkencodings/gb2312-utf8.txt
usr/lib64/python2.7/test/cjkencodings/gb2312.txt

for example.

Pedroski · 03-02-2019, 06:39 PM

Thanks, but I did not have much luck with utf8trans either! I get a lot of this:

Quote:

pedro@pedro-newssd:~/Documents$ utf8trans /home/pedro/Documents/page1 -m GB2312
utf8trans:/home/pedro/Documents/page1:6: (parsing codepoint) invalid hex number
utf8trans:/home/pedro/Documents/page1:12: (parsing codepoint) invalid hex number
utf8trans:/home/pedro/Documents/page1:15: (parsing codepoint) invalid hex number

RandomTroll · 03-02-2019, 10:22 PM

I'm not sure you've used utf8trans correctly. You specify a table as the first argument. For example:

Code:

utf8trans utf2gb2312 < FileToTranslate

/usr/share/i18n/charmaps/GB2312.gz, gunzipped, may be it.

Pedroski · 03-03-2019, 02:01 AM

Thanks again!
Well it is a bit unclear to me.

Quote:

pedro@pedro-newssd:~/Documents$ utf8trans --help
Usage: utf8trans [options] CHARMAP [FILES...]
Transliterate UTF-8 characters according to a table.

-m, --modify modify given files in-place
-v, --version display version information and exit
-h, --help display this usage information

See utf8trans(1) for details on this program.

so I tried:

utf8trans -m GB2312 page1

then I get:

Quote:

pedro@pedro-newssd:~/Documents$ utf8trans -m GB2312 page1
utf8trans:GB2312: No such file or directory
pedro@pedro-newssd:~/Documents$

I do have /usr/share/i18n/charmaps/GB2312.gz

Any tips about what to do with it? Unpack it to where?

hydrurga · 03-03-2019, 03:12 AM

How did you originally convert the GB2312-encoded file to UTF-8 in Python?

See this page: https://stackoverflow.com/questions/...ding-in-python

RandomTroll · 03-03-2019, 03:52 PM

Quote:

Originally Posted by Pedroski

I tried:

utf8trans -m GB2312 page1

The first argument to utf8trans is a character map. A character map is a 2-column file, the first column hex characters in utf-8, the second target characters, separated by a tab. Unfortunately GB2312.gz is not in the correct format for utf8trans. I don't remember, but I suspect I created my own character maps. I'd hope there's one for GB2312 somewhere but I don't know where. I'd also hope that i18n's character map would serve some translation utility that would do the same. If it were up to me I'd write a program to translate i18n's character map to utf8trans format, but that's because I've already made utf8trans work for other purposes. Whether that'd be the best use of your time is a different question.

I don't use python but it looks like it has translation facilities for utf & gb2312 in it already. Perhaps a python-knowledgeable person would know.
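
For reference, a minimal sketch of doing the conversion with Python's built-in codecs instead of iconv or utf8trans. The function name `convert_file` and the temporary file names are illustrative assumptions; the sketch only works if the input really is Chinese text stored as UTF-8:

```python
import os
import tempfile


def convert_file(src_path, dst_path, src_encoding="utf-8", dst_encoding="gb2312"):
    """Re-encode a text file, roughly like `iconv -f UTF-8 -t GB2312`."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    # Raises UnicodeEncodeError if a character has no GB2312 equivalent.
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


# Small demonstration in a throwaway directory (file names are illustrative).
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "page1")
    dst = os.path.join(workdir, "page1gb2312")
    with open(src, "w", encoding="utf-8", newline="") as handle:
        handle.write("黄果树\n")
    convert_file(src, dst)
    with open(dst, "rb") as handle:
        converted = handle.read()

print(converted)
```

If the write step raises UnicodeEncodeError, the input contains characters outside GB2312, and the problem lies in how the text was decoded earlier rather than in the conversion itself.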

Pedroski · 03-03-2019, 05:08 PM

Quote:

How did you originally convert the GB2312-encoded file to UTF-8 in Python?

I got the data from the webpage with:

Quote:

line = soup.find('table').text

Then I just opened a file, wrote line to it, and closed it. I presume Python's default is to write UTF-8, but I think the text it was writing was still GB2312-encoded.

Quote:

data = open(path + 'page1', 'w')
data.write(line)
data.close()
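
A minimal sketch of what may have happened here. It assumes, without confirmation from the thread, that the page's GB2312 bytes were decoded as Latin-1 somewhere before `soup.find('table').text` returned them:

```python
# The bytes as they would arrive from a GB2312-encoded page.
raw = "黄果树".encode("gb2312")

# Wrong codec: every byte is turned into one Latin-1 character.
wrong = raw.decode("latin1")
# Codec declared in the page's <meta> tag: the original text comes back.
right = raw.decode("gb2312")

print(wrong)
print(right)
```

Writing `wrong` to a file produces valid UTF-8, which would explain why `file -i` reports utf-8 even though the names are unreadable.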

The link to stackoverflow is very promising. Thank you very much!

Don't know why it will not accept a string, but that can be tweaked I think! Progress! Thanks!

I just tried this in my Python terminal, and I get Chinese:

Quote:

data = '»Æ¹ûÊ÷'
data.encode('latin1').decode('gb2312')
'黄果树'

Great!
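
The same round trip can be wrapped in a small helper and applied to every string before it is written out. This is a sketch, assuming the wrong decode really was Latin-1; the name `repair_mojibake` is hypothetical:

```python
def repair_mojibake(text, wrong_encoding="latin1", real_encoding="gb2312"):
    """Undo a wrong decode: turn the text back into bytes, then decode properly."""
    try:
        return text.encode(wrong_encoding).decode(real_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of the assumed kind, so leave the text untouched.
        return text


print(repair_mojibake("»Æ¹ûÊ÷"))
```

The repaired string is ordinary Python text, so it can be written to a file opened with `encoding='utf-8'` or `encoding='gb2312'`, whichever the target program expects.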

ondoho · 03-05-2019, 12:05 AM

Quote:

Originally Posted by Pedroski

Code:

utf8trans -m GB2312 page1

:sigh:
have you ever thought far enough to try (*)

Code:

utf8trans GB2312 -m page1

???

and please use CODE tags for code. not QUOTE tags, not no tags.

(*) oh, i see you marked this SOLVED. i guess you did then.