LinuxQuestions.org
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
Old 05-21-2006, 10:47 PM   #1
George2
Member
 
Registered: Oct 2003
Posts: 354

Rep: Reputation: 30
about wide character and multiple byte character


Hello everyone,


In my mind, "wide character" means an encoding scheme in which each character is encoded with the same number of bytes (as opposed to encodings that use a varying number of bytes per character). I have also read in some references that wide character seems to be the same as Unicode (both using 4 bytes to encode a character). Is that correct?

There is also a term called "multi-byte character." What does it mean -- the same as wide character? Does it have any relationship with UTF-8 encoding (since UTF-8 also uses multiple bytes to encode a character)?


thanks in advance,
George
 
Old 05-22-2006, 12:10 AM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Wikipedia has a very good rundown on character encodings:

Character Encoding

Wide Character Encoding

Variable-Width Encoding

Unicode

UTF-8
 
Old 05-22-2006, 09:29 AM   #3
George2
Member
 
Registered: Oct 2003
Posts: 354

Original Poster
Rep: Reputation: 30
Thank you David,


Quote:
Originally Posted by David the H.
Wikipedia has a very good rundown on character encodings:

Character Encoding

Wide Character Encoding

Variable-Width Encoding

Unicode

UTF-8
I have read through some of them and they are very helpful. After reading the materials, I still have a question. In my mind, I think a wide character should be the same as Unicode -- that is, a 4-byte character. But on the Windows platform, why is it 2 bytes long?


regards,
George
 
Old 05-22-2006, 12:01 PM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
I'm really no expert on this. All I know is what I read from the above links, so I'm probably getting in over my head here. But if I'm following it right, all "wide" really means is a character (or other datatype) that's bigger than one byte. So anything that's two bytes or more could be considered "wide", with the exact definition depending on the system and encoding used. Windows is wide at two bytes simply because they define it as such. And as the link explicitly says, wide is NOT the same as unicode.

And unicode is not 4 bytes exactly. The way I understand it, "unicode" is not even an encoding, it's a character map: all the characters of the world laid out in a giant index, each with a unique character address. In order to use them, you need one of the various character encodings that map these addresses to a byte sequence the computer can use. UTF-8, for example, is a variable-byte encoding that uses from one to four bytes to map each character, depending on its position in the unicode map. UTF-16 is a different, fixed-width encoding that always uses two bytes per character, even for the old one-byte ascii symbols. And so on. So there is no specific "byte number" for unicode. Each encoding system maps to the same character addresses, but each does it differently and with a different byte number.

Edit: By the way, the links point out that the Windows NT family uses the UTF-16 encoding internally, which is two bytes fixed, while most Unix systems use UTF-8, which is up to 4 bytes, variable. This would explain your question as to why they are different.

Last edited by David the H.; 05-22-2006 at 12:11 PM.
 
Old 05-22-2006, 08:26 PM   #5
George2
Member
 
Registered: Oct 2003
Posts: 354

Original Poster
Rep: Reputation: 30
Thank you David!


Quote:
Originally Posted by David the H.
I'm really no expert on this. All I know is what I read from the above links, so I'm probably getting in over my head here. But if I'm following it right, all "wide" really means is a character (or other datatype) that's bigger than one byte. So anything that's two bytes or more could be considered "wide", with the exact definition depending on the system and encoding used. Windows is wide at two bytes simply because they define it as such. And as the link explicitly says, wide is NOT the same as unicode.

And unicode is not 4 bytes exactly. The way I understand it, "unicode" is not even an encoding, it's a character map: all the characters of the world laid out in a giant index, each with a unique character address. In order to use them, you need one of the various character encodings that map these addresses to a byte sequence the computer can use. UTF-8, for example, is a variable-byte encoding that uses from one to four bytes to map each character, depending on its position in the unicode map. UTF-16 is a different, fixed-width encoding that always uses two bytes per character, even for the old one-byte ascii symbols. And so on. So there is no specific "byte number" for unicode. Each encoding system maps to the same character addresses, but each does it differently and with a different byte number.

Edit: By the way, the links point out that the WindowsNT family uses the UTF-16 encoding internally which is two bytes fixed, while most Unix systems are UTF-8, which is up to 4 bytes variable. This would explain your question as to why they are different.
Your reply is so great -- clearer than the explanations on MSDN!


regards,
George
 
Old 05-23-2006, 01:03 AM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
I'm glad I could help. It is hard to get your head around it all, especially if, like me, you aren't a programmer.

It looks like I did make a mistake in my last explanation, though. I said that UTF-16 was fixed width, but I was wrong. Closer reading tells me that it is indeed variable, but that the sequences are always broken up into equal-length 16-bit words. UTF-32, however, is a four-byte fixed-width encoding. Again, this is assuming I'm following it right.
 
  


