LinuxQuestions.org
Old 01-14-2007, 06:15 PM   #1
merc64
LQ Newbie
 
Registered: Jan 2007
Posts: 4

Rep: Reputation: 0
Unicode/Japanese characters in C


I'm trying to get the hex value for a given UTF-8 character in C like this:
Code:
#include <stdio.h>
#include <wchar.h>

int main(int argc, char **argv) {
  if (argc > 1)
    printf("%X\n", ((wchar_t*)argv[1])[0]);
  return 0;
}
I'm running it from xterm with the UTF-8 character at 0x9858 as the argument, but I'm instead getting 98A1E9 as the output. I tried some other characters and got 9BA1E9 from 0x985B and 8080E9 from 0x9000.

My locale is en_US.UTF-8. Can anyone tell me what's happening to the encoding or whether there's any hope of getting the hex value I need?
 
Old 01-14-2007, 06:49 PM   #2
psisquare
Member
 
Registered: Sep 2004
Location: Germany
Distribution: Gentoo
Posts: 164

Rep: Reputation: 31
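You have to distinguish between "Unicode" and "UTF-8". What you are putting in are Unicode codepoints; what you are getting is the (presumably correct) UTF-8 encoding thereof. For example, U+9858 is encoded as the three bytes E9 A1 98, and your cast reads those bytes as one little-endian integer, which is why they print in reverse order as 98A1E9. See http://unicode.org/unicode/faq/utf_bom.html#UTF8 for links to the definition and sample code.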
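To make the arithmetic concrete, here is a minimal sketch of the UTF-8 encoding rules, plus a helper that packs bytes the way a little-endian machine reads them through a 4-byte pointer. The names utf8_encode and as_little_endian are made up for illustration and are not from any library; the sketch also assumes a 4-byte wchar_t on a little-endian machine, as on x86 Linux.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the byte count,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Pack up to 4 bytes into an integer the way a little-endian machine
 * reads them through a 4-byte pointer (first byte = least significant). */
uint32_t as_little_endian(const unsigned char *bytes, size_t n)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n && i < 4; i++)
        v |= (uint32_t)bytes[i] << (8 * i);
    return v;
}
```

Encoding 0x9858 gives the bytes E9 A1 98, and as_little_endian on those three bytes gives 0x98A1E9, which matches the output in the first post. In the original program the fourth byte read by the cast is the string's terminating NUL, so it contributes nothing.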
 
Old 01-15-2007, 08:53 PM   #3
merc64
LQ Newbie
 
Registered: Jan 2007
Posts: 4

Original Poster
Rep: Reputation: 0
I understood that UTF-8 was not synonymous with Unicode, but I didn't realize that Unicode codepoints were a system in their own right. Is there an official name/abbreviation for the codepoint system? Are there any libraries to help convert safely and easily? It seems like iconv or the mbsrtowcs family of functions might help, but I can't find documentation.
 
Old 01-16-2007, 07:46 PM   #4
merc64
LQ Newbie
 
Registered: Jan 2007
Posts: 4

Original Poster
Rep: Reputation: 0
I still don't have my answer, but I'm looking through a doc here: http://www.tacc.utexas.edu/resources...g/booktoc2.htm that seems like it might help. Hope that helps anyone else looking for the same answers.
 
Old 01-16-2007, 10:19 PM   #5
tuxdev
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 2,012

Rep: Reputation: 111
I think the term is "UTF-16"
 
Old 01-17-2007, 06:46 AM   #6
psisquare
Member
 
Registered: Sep 2004
Location: Germany
Distribution: Gentoo
Posts: 164

Rep: Reputation: 31
Actually, UTF-16 is different again from both UTF-8 and codepoints (also explained in the link I posted). I'm not exactly an expert on this, but I'll try to help as far as I can.

I think you could get much of the desired effect with UTF-32, at the cost of allocating four bytes for every single character. There's also something called UCS-4, but don't ask me about details here. Depending on what you actually want to do with the characters, you could either try to write everything from scratch based on the standard (UTF-8 doesn't look terribly complicated to me), or, as you said, use one of the available libraries.
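For the "from scratch" route, a decoder for a single UTF-8 sequence might look roughly like the sketch below. It produces the code point as a 32-bit value, which is what a UTF-32 unit holds. The function name utf8_decode is hypothetical, and this is only one way to do it, with basic checks for overlong forms, surrogates and truncated input.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the first UTF-8 sequence of the NUL-terminated string s.
 * On success stores the code point in *cp and returns the number of
 * bytes consumed. Returns 0 on empty or malformed input and leaves
 * *cp untouched. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t v;      /* code point being assembled */
    uint32_t min;    /* smallest value allowed for this sequence length */
    size_t len;

    if (p[0] == 0)
        return 0;
    if (p[0] < 0x80) {
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {
        v = p[0] & 0x1F; len = 2; min = 0x80;
    } else if ((p[0] & 0xF0) == 0xE0) {
        v = p[0] & 0x0F; len = 3; min = 0x800;
    } else if ((p[0] & 0xF8) == 0xF0) {
        v = p[0] & 0x07; len = 4; min = 0x10000;
    } else {
        return 0;    /* stray continuation byte or invalid lead byte */
    }

    for (size_t i = 1; i < len; i++) {
        /* Each following byte must look like 10xxxxxx; a premature
         * NUL terminator fails this test as well. */
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        v = (v << 6) | (p[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates and out-of-range values. */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;

    *cp = v;
    return len;
}
```

Called on the bytes E9 A1 98 it consumes 3 bytes and yields 0x9858. Calling it in a loop, advancing by the returned length each time, would turn a whole UTF-8 string into an array of UTF-32 values.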

Documentation for iconv is available via "man 3 iconv" and "info iconv". Depending on your distribution, you may need to install glibc-devel, man-pages or similar for this.
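As a rough illustration of the iconv route, the sketch below converts the first character of a UTF-8 string to UTF-32LE and assembles the code point from the four output bytes. The function name first_code_point_iconv is invented for this example, and it assumes your iconv knows the encoding names "UTF-8" and "UTF-32LE" (glibc and GNU libiconv both do). On systems where iconv is not part of libc you may need to link with -liconv.

```c
#include <iconv.h>
#include <stdint.h>
#include <string.h>

/* Convert the first character of a UTF-8 string to its code point
 * using iconv. Returns 0 on success, -1 on failure. */
int first_code_point_iconv(const char *utf8, uint32_t *cp)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)utf8;        /* iconv's prototype is not const-correct */
    size_t in_left = strlen(utf8);
    unsigned char out_buf[4];       /* room for exactly one UTF-32 unit */
    char *out = (char *)out_buf;
    size_t out_left = sizeof out_buf;

    /* The return value is deliberately ignored: when more input remains
     * than fits in out_buf, iconv reports E2BIG, which is expected here.
     * Success is judged by whether a full unit was written. */
    iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);

    if (out_left != 0)
        return -1;                  /* empty or invalid input */

    /* Assemble the little-endian bytes independently of host byte order. */
    *cp = (uint32_t)out_buf[0]
        | (uint32_t)out_buf[1] << 8
        | (uint32_t)out_buf[2] << 16
        | (uint32_t)out_buf[3] << 24;
    return 0;
}
```

Requesting "UTF-32LE" rather than plain "UTF-32" avoids a byte-order mark in the output and fixes the byte order, so the result does not depend on the machine's endianness.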

Also, Qt and GLib have facilities for handling Unicode text, and there's the dedicated libunicode.
 
Old 03-13-2007, 07:00 PM   #7
merc64
LQ Newbie
 
Registered: Jan 2007
Posts: 4

Original Poster
Rep: Reputation: 0
Okay, the tacc.utexas.edu link wasn't much help to me, but I did find something at http://www.gnu.org/software/libc/man...-Handling.html that looks a bit more promising. Looks like libc itself should do just fine. Thanks for the info, psisquare.
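For anyone landing here later, a minimal sketch of the libc-only approach using the standard mbrtowc function follows. The helper name first_wide_char is made up for this example. It assumes wchar_t holds Unicode code points, which is the case with glibc but is not guaranteed by the C standard, and that the program has switched to a UTF-8 locale such as en_US.UTF-8 before converting.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Convert the first multibyte character of s to a wide character,
 * interpreting the bytes according to the current LC_CTYPE locale.
 * Returns 0 on success, -1 if s is empty, invalid or incomplete. */
int first_wide_char(const char *s, wchar_t *out)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);     /* start in the initial shift state */

    size_t n = mbrtowc(out, s, strlen(s), &state);
    if (n == 0 || n == (size_t)-1 || n == (size_t)-2)
        return -1;
    return 0;
}

/* Typical use from main():
 *
 *     setlocale(LC_ALL, "");    // adopt the user's locale, e.g. en_US.UTF-8
 *     wchar_t wc;
 *     if (argc > 1 && first_wide_char(argv[1], &wc) == 0)
 *         printf("%lX\n", (unsigned long)wc);
 */
```

The setlocale call matters: a C program starts in the "C" locale, and without switching to the user's locale the conversion functions have no reason to treat the argument's bytes as UTF-8.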
 
  


All times are GMT -5.
