LinuxQuestions.org
Old 03-01-2019, 10:22 PM   #1
Pedroski
Senior Member
 
Registered: Jan 2002
Location: Nanjing, China
Distribution: Ubuntu 20.04
Posts: 2,116

Rep: Reputation: 73
using iconv to change character encoding


To help the gf, I'm trying to write a Python program that gets a lot of data from a webpage and writes it to Excel. I got one set of data, but the page's encoding seems to be GB2312:

Code:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
I wrote the data to a text file in Python. The file seems to be UTF-8, but the Chinese names do not display correctly.

I want to convert the file to GB2312, in the hope that the characters will display correctly.

Quote:
pedro@pedro-newssd:~/Documents$ file -i page1
page1: text/plain; charset=utf-8
pedro@pedro-newssd:~/Documents$ iconv -f UTF-8 -t GB2312 page1 -o page1gb2312
iconv: illegal input sequence at position 5

I think what I need is to write the file in Python as GB2312.

How do I tell Python to encode as GB2312?

Quote:
>>> file = open(path + filename, 'w')
>>> file.write(line)
pedro@pedro-newssd:~/Documents$ iconv -f UTF-8 -t gb2312 page1 -o page1gb2312
iconv: illegal input sequence at position 5
pedro@pedro-newssd:~/Documents$
The above produces the file page1gb2312, but it is empty. I keep getting:

Quote:
iconv: illegal input sequence at position 5
Any tips please? Got to keep the gf happy!

What am I doing wrong?
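
On the question of telling Python which encoding to write: a minimal sketch, assuming Python 3, where `open()` accepts an `encoding` argument. The sample string and the temporary output path are made up for illustration. This only helps when the string already holds correct Chinese text; if it holds mis-decoded characters instead, many of them have no GB2312 representation, which would be one plausible reason for iconv stopping with "illegal input sequence".

```python
import os
import tempfile

text = "黄果树"  # sample text standing in for the scraped table contents

# Hypothetical output location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "page1gb2312")

# The encoding argument decides which bytes end up on disk.
with open(path, "w", encoding="gb2312") as f:
    f.write(text)

# Reading the raw bytes back shows two bytes per Chinese character.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
print(raw.decode("gb2312"))
```

Running `file -i` on a file written this way should no longer report utf-8, since the bytes are GB2312 rather than UTF-8.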

Last edited by Pedroski; 03-01-2019 at 10:56 PM.
 
Old 03-01-2019, 11:57 PM   #2
RandomTroll
Senior Member
 
Registered: Mar 2010
Distribution: Slackware
Posts: 1,953

Rep: Reputation: 270
I never got iconv to work. I just tried an example from its man page: it returned
Quote:
illegal input sequence at position 4
- so much for documentation. I use utf8trans instead. The drawback is that the provided tables were incomplete, so I had to add entries for new characters. Perhaps I could have found better tables had I searched. A search for gb2312 on my computer turned up some Python files that look like they are meant to handle it, perhaps to convert to/from UTF-8:
Quote:
usr/lib64/python2.7/test/cjkencodings/gb2312-utf8.txt
usr/lib64/python2.7/test/cjkencodings/gb2312.txt
for example.
 
Old 03-02-2019, 06:39 PM   #3
Pedroski
Senior Member
 
Registered: Jan 2002
Location: Nanjing, China
Distribution: Ubuntu 20.04
Posts: 2,116

Original Poster
Rep: Reputation: 73
Thanks, but I did not have much luck with utf8trans either! I get a lot of this:

Quote:
pedro@pedro-newssd:~/Documents$ utf8trans /home/pedro/Documents/page1 -m GB2312
utf8trans:/home/pedro/Documents/page1:6: (parsing codepoint) invalid hex number
utf8trans:/home/pedro/Documents/page1:12: (parsing codepoint) invalid hex number
utf8trans:/home/pedro/Documents/page1:15: (parsing codepoint) invalid hex number
 
Old 03-02-2019, 10:22 PM   #4
RandomTroll
Senior Member
 
Registered: Mar 2010
Distribution: Slackware
Posts: 1,953

Rep: Reputation: 270
I'm not sure you've used utf8trans correctly. You specify a table as the first argument. For example:
Code:
utf8trans utf2gb2312 < FileToTranslate
/usr/share/i18n/charmaps/GB2312.gz, once gunzipped, may be the table you need.

Last edited by RandomTroll; 03-02-2019 at 11:05 PM.
 
Old 03-03-2019, 02:01 AM   #5
Pedroski
Senior Member
 
Registered: Jan 2002
Location: Nanjing, China
Distribution: Ubuntu 20.04
Posts: 2,116

Original Poster
Rep: Reputation: 73
Thanks again!
Well it is a bit unclear to me.

Quote:
pedro@pedro-newssd:~/Documents$ utf8trans --help
Usage: utf8trans [options] CHARMAP [FILES...]
Transliterate UTF-8 characters according to a table.

-m, --modify modify given files in-place
-v, --version display version information and exit
-h, --help display this usage information

See utf8trans(1) for details on this program.
so I tried:

utf8trans -m GB2312 page1

then I get:

Quote:
pedro@pedro-newssd:~/Documents$ utf8trans -m GB2312 page1
utf8trans:GB2312: No such file or directory
pedro@pedro-newssd:~/Documents$
I do have /usr/share/i18n/charmaps/GB2312.gz

Any tips about what to do with it? Unpack it to where?
 
Old 03-03-2019, 03:12 AM   #6
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
How did you originally convert the GB2312-encoded file to UTF-8 in Python?

See this page: https://stackoverflow.com/questions/...ding-in-python
 
1 members found this post helpful.
Old 03-03-2019, 03:52 PM   #7
RandomTroll
Senior Member
 
Registered: Mar 2010
Distribution: Slackware
Posts: 1,953

Rep: Reputation: 270
Quote:
Originally Posted by Pedroski
I tried:

utf8trans -m GB2312 page1
The first argument to utf8trans is a character map. A character map is a 2-column file, the first column hex characters in utf-8, the second target characters, separated by a tab. Unfortunately GB2312.gz is not in the correct format for utf8trans. I don't remember, but I suspect I created my own character maps. I'd hope there's one for GB2312 somewhere but I don't know where. I'd also hope that i18n's character map would serve some translation utility that would do the same. If it were up to me I'd write a program to translate i18n's character map to utf8trans format, but that's because I've already made utf8trans work for other purposes. Whether that'd be the best use of your time is a different question.
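
As a rough illustration of the first half of that idea (reading the i18n character map), here is a sketch in Python. It assumes the gunzipped file lists one character per line as `<Uxxxx>`, then the encoded bytes written as `/xHH` groups, then a comment; the sample lines below are made up to match that layout rather than copied from the real file. Writing the result out in utf8trans table format is not attempted, because the thread does not pin that format down exactly.

```python
import re

# One entry per line: <Uxxxx>, the encoded bytes as /xHH groups, then a comment.
ENTRY = re.compile(r"^<U([0-9A-Fa-f]{4,8})>\s+((?:/x[0-9A-Fa-f]{2})+)")

def parse_charmap(lines):
    """Map Unicode code points to their encoded byte sequences."""
    table = {}
    for line in lines:
        m = ENTRY.match(line)
        if m:  # header, footer, and any other line shapes are skipped
            codepoint = int(m.group(1), 16)
            table[codepoint] = bytes.fromhex(m.group(2).replace("/x", ""))
    return table

# Hypothetical sample in the assumed layout.
sample = [
    "CHARMAP",
    "<U9EC4>     /xbb/xc6         <CJK>",
    "END CHARMAP",
]
table = parse_charmap(sample)
print(table)
```

For the real file, the lines could presumably be read with `gzip.open("/usr/share/i18n/charmaps/GB2312.gz", "rt")` instead of the sample list; that path is the one mentioned in this thread and has not been checked here.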

I don't use python but it looks like it has translation facilities for utf & gb2312 in it already. Perhaps a python-knowledgeable person would know.
 
1 members found this post helpful.
Old 03-03-2019, 05:08 PM   #8
Pedroski
Senior Member
 
Registered: Jan 2002
Location: Nanjing, China
Distribution: Ubuntu 20.04
Posts: 2,116

Original Poster
Rep: Reputation: 73
Quote:
How did you originally convert the GB2312-encoded file to UTF-8 in Python?
I got the data from the webpage with:

Quote:
line = soup.find('table').text
Then I just opened a file, wrote line to the file, and closed the file. I presume Python's default is to write UTF-8. I think what it was writing was encoded as GB2312.

Quote:
data = open(path + 'page1', 'w')
data.write(line)
data.close()
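
A hedged sketch of handling this at the source instead: decode the downloaded bytes with the charset the page declares before extracting any text. The thread does not show how the page was fetched, so `raw` below is a made-up stand-in for the downloaded bytes, and `decode_page` is a hypothetical helper name. It uses only the standard library.

```python
import re

def decode_page(raw: bytes, default: str = "utf-8") -> str:
    """Decode raw HTML bytes using the charset named in its <meta> tag."""
    # The charset declaration is plain ASCII, so it can be searched for in bytes.
    match = re.search(rb'charset=["\']?([\w-]+)', raw, re.IGNORECASE)
    encoding = match.group(1).decode("ascii") if match else default
    return raw.decode(encoding)

# Stand-in for the downloaded page (assumption: the real page is GB2312 bytes).
raw = ('<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
       '<table>黄果树</table>').encode("gb2312")

html = decode_page(raw)
print("黄果树" in html)
```

With BeautifulSoup, passing the undecoded bytes together with `from_encoding="gb2312"` should have a similar effect, though that is untested here. Pages that declare gb2312 sometimes contain characters outside it, in which case the superset codec `gb18030` may be a safer choice.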
The link to stackoverflow is very promising. Thank you very much!

Don't know why it will not accept a string, but that can be tweaked I think! Progress! Thanks!

I just tried this in my Python terminal, and I get Chinese:

Quote:
data = '»Æ¹ûÊ÷'
data.encode('latin1').decode('gb2312')
'黄果树'
Great!
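
The one-liner above can be wrapped in a small function so it is safe to apply to every scraped string. This is a sketch, with `repair_mojibake` as a hypothetical helper name; it assumes the text was GB2312 bytes wrongly decoded as Latin-1, which is what the example above suggests.

```python
def repair_mojibake(text: str, wrong: str = "latin1", right: str = "gb2312") -> str:
    """Undo a decode done with the wrong codec; return text unchanged on failure."""
    try:
        # Re-create the original bytes, then decode them with the right codec.
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text that is already correct Chinese cannot be encoded as Latin-1,
        # so it falls through here untouched.
        return text

print(repair_mojibake("»Æ¹ûÊ÷"))
print(repair_mojibake("黄果树"))
```

The repaired strings are ordinary Python text, so they can be written out in whatever encoding the target needs.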

Last edited by Pedroski; 03-03-2019 at 05:47 PM.
 
Old 03-05-2019, 12:05 AM   #9
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Quote:
Originally Posted by Pedroski
Code:
utf8trans -m GB2312 page1
:sigh:
have you ever thought far enough to try (*)
Code:
utf8trans GB2312 -m page1
???

and please use CODE tags for code. not QUOTE tags, not no tags.

(*) oh, i see you marked this SOLVED. i guess you did then.

Last edited by ondoho; 03-05-2019 at 12:08 AM.
 
  

