Unicode/Japanese characters in C

merc64 · 01-14-2007, 06:15 PM

I'm trying to get the hex value for a given UTF-8 character in C like this:

Code:

#include <stdio.h>
#include <wchar.h>

int main(int argc, char **argv) {
  if (argc > 1)
    printf("%X\n", ((wchar_t*)argv[1])[0]);
  return 0;
}

I'm running it from xterm with the UTF-8 character at 0x9858 as the argument, but instead of 9858 I'm getting 98A1E9 as the output. I tried some other characters and got 9BA1E9 from 0x985B and 8080E9 from 0x9000.

My locale is en_US.UTF-8. Can anyone tell me what's happening to the encoding or whether there's any hope of getting the hex value I need?

psisquare · 01-14-2007, 06:49 PM

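You have to distinguish between "Unicode" and "UTF-8". What you are putting in are Unicode code points; what you are getting is the (presumably correct) UTF-8 encoding of them. U+9858 is encoded as the three bytes E9 A1 98, and casting the byte string to wchar_t* makes a little-endian machine with a 4-byte wchar_t read those bytes (plus the terminating zero) as the single integer 0x0098A1E9. See http://unicode.org/unicode/faq/utf_bom.html#UTF8 for links to the definition and sample code.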
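
To make the relationship between the bytes and the code point concrete, here is a minimal sketch of a hand-written UTF-8 decoder. The function name utf8_decode is made up for illustration and is not part of any library. It only checks the lead byte and the continuation bytes; it does not reject overlong encodings or surrogates, so treat it as a starting point rather than a validating decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): decode the first UTF-8
 * sequence in s into *cp. Returns the number of bytes consumed, or 0
 * if the sequence is malformed or cut short by the end of the string. */
size_t utf8_decode(const char *s, uint32_t *cp) {
  assert(s != NULL && cp != NULL);
  const unsigned char *p = (const unsigned char *)s;
  size_t len;
  uint32_t value;

  if (p[0] < 0x80) {                    /* 0xxxxxxx: plain ASCII */
    *cp = p[0];
    return 1;
  } else if ((p[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
    len = 2;
    value = p[0] & 0x1F;
  } else if ((p[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
    len = 3;
    value = p[0] & 0x0F;
  } else if ((p[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
    len = 4;
    value = p[0] & 0x07;
  } else {
    return 0;                           /* stray continuation or invalid lead byte */
  }

  for (size_t i = 1; i < len; i++) {
    if ((p[i] & 0xC0) != 0x80)          /* must be 10xxxxxx; also catches the NUL terminator */
      return 0;
    value = (value << 6) | (p[i] & 0x3F);  /* append 6 payload bits */
  }

  *cp = value;
  return len;
}
```

Called on argv[1] from the original program, this should yield 0x9858 for the bytes E9 A1 98: the lead byte contributes the bits 1001, and the two continuation bytes contribute 100001 and 011000.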

merc64 · 01-15-2007, 08:53 PM

I understood that UTF-8 was not synonymous with Unicode, but I didn't realize that Unicode code points were a system in their own right. Is there an official name or abbreviation for the code point system? Are there any libraries that help convert safely and easily? It seems like iconv or the mbsrtowcs family of functions might help, but I can't find documentation.

merc64 · 01-16-2007, 07:46 PM

I still don't have my answer, but I'm looking through a doc here: http://www.tacc.utexas.edu/resources...g/booktoc2.htm that seems like it might help. Hope that helps anyone else looking for the same answers.

tuxdev · 01-16-2007, 10:19 PM

I think the term is "UTF-16"

psisquare · 01-17-2007, 06:46 AM

Actually, UTF-16 is different again from both UTF-8 and codepoints (also explained in the link I posted). I'm not exactly an expert on this, but I'll try to help as far as I can.

I think you could get much of the desired effect with UTF-32, which stores every code point directly as one 32-bit value, at the cost of allocating four bytes for every single character. There's also something called UCS-4, but don't ask me about details here. Depending on what you actually want to do with the characters, you could either write the conversion from scratch based on the standard (UTF-8 doesn't look terribly complicated to me) or, as you said, use one of the available libraries.

Documentation for iconv is available via "man 3 iconv" and "info iconv". Depending on your distribution, you may need to install glibc-devel, man-pages or similar for this.
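
As a rough sketch of the iconv route, the helper below converts the first character of a UTF-8 string into a 4-byte little-endian value and assembles the code point from it. The function name first_codepoint_iconv is hypothetical. The encoding name "UCS-4LE" is an assumption about what the installed iconv accepts (glibc does; "UTF-32LE" is the other common spelling for the same 4-byte form), so check iconv -l on your system. With glibc no extra library is needed; other systems may need -liconv.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert the first character of the UTF-8 string s
 * to its code point using iconv. Returns 0 on success, -1 on failure. */
int first_codepoint_iconv(const char *s, uint32_t *cp) {
  assert(s != NULL && cp != NULL);

  /* iconv_open takes (tocode, fromcode), in that order. */
  iconv_t cd = iconv_open("UCS-4LE", "UTF-8");
  if (cd == (iconv_t)-1)
    return -1;

  char *in = (char *)s;          /* iconv wants char **, but only reads the input */
  size_t in_left = strlen(s);
  unsigned char out[4];          /* room for exactly one 4-byte unit */
  char *outp = (char *)out;
  size_t out_left = sizeof out;

  /* If more input follows the first character, iconv reports E2BIG once
   * the 4-byte buffer is full. That is expected here, so the return
   * value is ignored and the buffer fill level is checked instead. */
  iconv(cd, &in, &in_left, &outp, &out_left);
  iconv_close(cd);

  if (out_left != 0)             /* no complete character was written */
    return -1;

  /* Assemble the little-endian bytes independently of host byte order. */
  *cp = (uint32_t)out[0] | ((uint32_t)out[1] << 8) |
        ((uint32_t)out[2] << 16) | ((uint32_t)out[3] << 24);
  return 0;
}
```

For real text you would convert whole buffers in one call rather than one character at a time; the single-character form is only meant to mirror the original test program.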

Also, Qt and GLib have facilities for handling Unicode text, and there's the dedicated libunicode.

merc64 · 03-13-2007, 07:00 PM

Okay, the tacc.utexas.edu link wasn't much help to me, but I did find something at http://www.gnu.org/software/libc/man...-Handling.html that looks a bit more promising. Looks like libc itself should do just fine. Thanks for the info, psisquare.
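
For anyone landing here later, a minimal sketch of the libc approach, using the standard mbrtowc function. The helper name first_wchar is made up for illustration. It assumes the program has already called setlocale(LC_CTYPE, "") so that a UTF-8 locale such as en_US.UTF-8 is active, and that wchar_t holds Unicode code points, which is the case with glibc but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the first multibyte character of s
 * according to the current LC_CTYPE locale. Returns 0 on success,
 * -1 if the input is empty, incomplete, or invalid in that locale. */
int first_wchar(const char *s, wchar_t *out) {
  assert(s != NULL && out != NULL);

  mbstate_t state;
  memset(&state, 0, sizeof state);    /* start in the initial conversion state */

  size_t n = mbrtowc(out, s, strlen(s), &state);
  if (n == (size_t)-1 ||              /* invalid byte sequence */
      n == (size_t)-2 ||              /* incomplete sequence (or empty input) */
      n == 0)                         /* decoded the terminating null */
    return -1;
  return 0;
}
```

In the original program this would replace the wchar_t* cast: call setlocale(LC_CTYPE, "") at the top of main, pass argv[1] to first_wchar, and print the result with printf("%lX\n", (unsigned long)wc). Without the setlocale call the program stays in the "C" locale, where multibyte UTF-8 input is not decoded this way.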