LinuxQuestions.org


George2 05-21-2006 10:47 PM

About wide characters and multibyte characters
 
Hello everyone,


In my mind, a wide character means an encoding scheme in which every character is encoded with the same number of bytes (as opposed to encodings that use a varying number of bytes per character). From some references I have read, it also seems that wide characters are the same as Unicode (both using 4 bytes per character). Is that correct?

There is also a term called multibyte character. What does it mean -- the same thing as wide character? Does it have any relationship with the UTF-8 encoding (since UTF-8 also uses multiple bytes to encode a character)?
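
To make the question concrete, this is roughly how I picture the two kinds of strings in C (just a rough sketch; part of what I am asking is whether this picture is even right):

Code:

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    /* "multibyte" string: ordinary chars, where one character may
       take several bytes to encode (as in UTF-8) */
    const char *mb_text = "hello";

    /* "wide" string: one wchar_t element per character, whatever
       size wchar_t happens to be on this platform */
    const wchar_t *wc_text = L"hello";

    printf("multibyte length in bytes    : %zu\n", strlen(mb_text));
    printf("wide length in wchar_t units : %zu\n", wcslen(wc_text));
    return 0;
}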


thanks in advance,
George

David the H. 05-22-2006 12:10 AM

Wikipedia has a very good rundown on character encodings:

Character Encoding

Wide Character Encoding

Variable-Width Encoding

Unicode

UTF-8

George2 05-22-2006 09:29 AM

Thank you David,


Quote:

Originally Posted by David the H.
Wikipedia has a very good rundown on character encodings: Character Encoding, Wide Character Encoding, Variable-Width Encoding, Unicode, UTF-8

I have read through some of them and they are very helpful. After reading the materials, I still have a question. In my mind, a wide character should be the same as Unicode -- that is, a 4-byte character. So why is it 2 bytes on the Windows platform?


regards,
George

David the H. 05-22-2006 12:01 PM

I'm really no expert on this. All I know is what I read from the above links, so I'm probably getting in over my head here. But if I'm following it right, all "wide" really means is a character (or other datatype) that's bigger than one byte. So anything that's two bytes or more could be considered "wide", with the exact definition depending on the system and encoding used. Windows is wide at two bytes simply because they define it as such. And as the link explicitly says, wide is NOT the same as unicode.
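
If it helps to see that from code, here is a tiny C sketch (the number it prints depends entirely on the compiler and platform, which is really the whole point):

Code:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* wchar_t is C's "wide character" type. The language standard
       doesn't fix its size -- typically it is 2 bytes with MSVC on
       Windows and 4 bytes with glibc on Linux. */
    printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
    return 0;
}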

And unicode is not 4 bytes exactly. The way I understand it, "unicode" is not even an encoding, it's a character map: all the characters of the world laid out in a giant index, each with a unique character address. In order to use them, you need one of the various character encodings that map these addresses to a byte sequence the computer can use. UTF-8, for example, is a variable-byte encoding that uses from one to four bytes to map each character, depending on its position in the unicode map. UTF-16 is a different, fixed-width encoding that always uses two bytes per character, even for the old one-byte ascii symbols. And so on. So there is no specific "byte number" for unicode. Each encoding system maps to the same character addresses, but each does it differently and with a different byte number.
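
And to make the UTF-8 part concrete, here is a small C sketch. The strings are hand-encoded UTF-8 byte sequences for a 1-, 2-, 3- and 4-byte character, and strlen() counts bytes, not characters, so it shows how many bytes each one needed:

Code:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Each string holds exactly ONE character, written out as its
       UTF-8 byte sequence. strlen() reports how many bytes the
       encoding needed for that single character. */
    printf("U+0041 (A)             : %zu byte(s)\n", strlen("A"));
    printf("U+00E9 (e-acute)       : %zu byte(s)\n", strlen("\xC3\xA9"));
    printf("U+20AC (euro sign)     : %zu byte(s)\n", strlen("\xE2\x82\xAC"));
    printf("U+1D11E (musical clef) : %zu byte(s)\n", strlen("\xF0\x9D\x84\x9E"));
    return 0;
}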

Edit: By the way, the links point out that the Windows NT family uses the UTF-16 encoding internally, which is two bytes fixed, while most Unix systems use UTF-8, which is up to 4 bytes, variable. This would answer your question about why they are different.
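
As a rough illustration of how the multibyte and wide worlds meet in C (just a sketch; it assumes your environment provides a UTF-8 locale, and the error handling is minimal):

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    /* Interpret multibyte strings according to the user's locale
       (on most modern Linux systems that means UTF-8). */
    setlocale(LC_ALL, "");

    /* "a" followed by e-acute: 3 bytes in UTF-8, but only 2 characters. */
    const char *mb = "a\xC3\xA9";

    wchar_t wide[16];
    size_t n = mbstowcs(wide, mb, 16);   /* multibyte -> wide conversion */
    if (n == (size_t)-1) {
        fprintf(stderr, "invalid multibyte sequence for this locale\n");
        return 1;
    }

    printf("bytes in the multibyte string : %zu\n", strlen(mb));
    printf("characters after widening     : %zu\n", n);
    return 0;
}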

George2 05-22-2006 08:26 PM

Thank you David!


Quote:

Originally Posted by David the H.
I'm really no expert on this. All I know is what I read from the above links, so I'm probably getting in over my head here. [...]

Your reply is so great -- clearer than the explanations on MSDN!


regards,
George

David the H. 05-23-2006 01:03 AM

I'm glad I could help. It is hard to get your head around it all, especially if you aren't a programmer, like me.

It looks like I did make a mistake in my last explanation, though. I said that UTF-16 was fixed width, but I was wrong. Closer reading tells me that it is indeed variable, but that the sequences are always broken up into equal-length 16-bit words. UTF-32, however, is a four-byte, fixed-width encoding. Again, this is assuming I'm following it right.
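
For anyone curious, here is roughly what that looks like in code: a sketch of the surrogate-pair arithmetic described in the UTF-16 article, applied to one code point above U+FFFF (as always, assuming I'm following it right):

Code:

#include <stdio.h>

int main(void)
{
    /* U+1D11E (musical G clef) is above U+FFFF, so UTF-16 can't hold
       it in a single 16-bit unit; it becomes a two-unit surrogate pair. */
    unsigned cp = 0x1D11E;

    unsigned v  = cp - 0x10000;            /* leaves a 20-bit value */
    unsigned hi = 0xD800 + (v >> 10);      /* high (lead) surrogate */
    unsigned lo = 0xDC00 + (v & 0x3FF);    /* low (trail) surrogate */

    printf("UTF-16: 0x%04X 0x%04X  (two 16-bit units = 4 bytes)\n", hi, lo);
    printf("UTF-32: 0x%08X          (always one 32-bit unit)\n", cp);
    return 0;
}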

