05-21-2006, 10:47 PM | #1
Member
Registered: Oct 2003
Posts: 354
About wide characters and multibyte characters
Hello everyone,
As I understand it, a wide character belongs to an encoding scheme in which every character is encoded with the same number of bytes (as opposed to encodings that use a varying number of bytes per character). I have also seen references suggesting that wide character is the same as Unicode (both using 4 bytes per character). Is that correct?
There is also a term called multibyte character. What does it mean -- the same thing as wide character? Does it have any relationship to UTF-8 (since UTF-8 also uses multiple bytes to encode a character)?
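For concreteness, here is roughly what I mean in C (a small sketch I made up for illustration; the escape sequences are the UTF-8 and wide-character forms of "é"):
Code:
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    const char    *mb   = "\xC3\xA9"; /* "é" as a multibyte (UTF-8) string: two bytes */
    const wchar_t *wide = L"\xE9";    /* "é" as a wide string: one wchar_t unit */

    printf("multibyte: %zu bytes\n", strlen(mb));   /* prints 2 */
    printf("wide:      %zu units\n", wcslen(wide)); /* prints 1 */
    return 0;
}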
thanks in advance,
George
05-22-2006, 12:10 AM | #2
David the H. -- Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
[links to reference material on character encodings -- not preserved]
05-22-2006, 09:29 AM | #3
Member (Original Poster)
Registered: Oct 2003
Posts: 354
Thank you David,
Quote:
Originally Posted by David the H.
I have read through some of them and they are very helpful. After reading the materials, I still have a question: I had thought a wide character would be the same as Unicode -- that is, a 4-byte character. So why is it 2 bytes on the Windows platform?
regards,
George
05-22-2006, 12:01 PM | #4
David the H. -- Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
I'm really no expert on this. All I know is what I read from the links above, so I'm probably getting in over my head here. But if I'm following it right, all "wide" really means is a character (or other data type) that's bigger than one byte. So anything that's two bytes or more can be considered "wide", with the exact definition depending on the system and encoding used. Windows is wide at two bytes simply because it defines it that way. And as the link explicitly says, wide is NOT the same as Unicode.
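For example, you can check how wide your own platform's wide character is with a quick C sketch (the sizes in the comments are what I'd expect on common systems, not guarantees):
Code:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* "Wide" just means wider than one byte; exactly how wide is up to the platform. */
    printf("sizeof(char)    = %zu\n", sizeof(char));    /* always 1 */
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t)); /* typically 4 on Linux/glibc, 2 on Windows */
    return 0;
}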
And Unicode is not exactly 4 bytes. The way I understand it, "Unicode" is not even an encoding; it's a character map: all the characters of the world laid out in a giant index, each with a unique character address. To use them, you need one of the various character encodings that map these addresses to byte sequences the computer can use. UTF-8, for example, is a variable-width encoding that uses from one to four bytes per character, depending on its position in the Unicode map. UTF-16 is a different, fixed-width encoding that always uses two bytes per character, even for the old one-byte ASCII symbols. And so on. So there is no single "byte count" for Unicode. Each encoding maps to the same character addresses, but each does it differently and with a different number of bytes.
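You can see UTF-8's variable width in a little C sketch (the byte sequences are just the UTF-8 encodings of the code points named in the comments):
Code:
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* One Unicode character each, written out as raw UTF-8 bytes: */
    const char *a      = "A";                /* U+0041, 1 byte  */
    const char *eacute = "\xC3\xA9";         /* U+00E9, 2 bytes */
    const char *euro   = "\xE2\x82\xAC";     /* U+20AC, 3 bytes */
    const char *gclef  = "\xF0\x9D\x84\x9E"; /* U+1D11E, 4 bytes */

    printf("%zu %zu %zu %zu\n", strlen(a), strlen(eacute),
           strlen(euro), strlen(gclef));     /* prints: 1 2 3 4 */
    return 0;
}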
Edit: By the way, the links point out that the Windows NT family uses the UTF-16 encoding internally, which is two bytes fixed, while most Unix systems use UTF-8, which is variable at up to 4 bytes. That would explain why they are different.
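On a Unix system you can watch the standard library convert between the two forms with mbstowcs() (a minimal sketch; it assumes a UTF-8 locale is installed, and the locale name may differ on your system):
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    /* Convert a multibyte (UTF-8) string to the platform's wide representation. */
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL) {
        fprintf(stderr, "UTF-8 locale not available\n");
        return 1;
    }

    const char *mb = "caf\xC3\xA9"; /* "café": 5 bytes in UTF-8, 4 characters */
    wchar_t wide[16];

    size_t n = mbstowcs(wide, mb, 16); /* returns the number of wide characters */
    if (n == (size_t)-1) {
        perror("mbstowcs");
        return 1;
    }
    printf("%zu bytes -> %zu wide characters\n", strlen(mb), n); /* 5 -> 4 */
    return 0;
}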
Last edited by David the H.; 05-22-2006 at 12:11 PM.
05-22-2006, 08:26 PM | #5
Member (Original Poster)
Registered: Oct 2003
Posts: 354
Thank you David!
Your reply is great -- much clearer than the explanations on MSDN!
regards,
George
05-23-2006, 01:03 AM | #6
David the H. -- Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
I'm glad I could help. It is hard to get your head around it all, especially if, like me, you aren't a programmer.
It looks like I did make a mistake in my last explanation, though. I said that UTF-16 was fixed width, but I was wrong. Closer reading tells me that it is indeed variable, but that the sequences are always broken into equal-length 16-bit units. UTF-32, however, is a four-byte fixed-width encoding. Again, this is assuming I'm following it right.
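Here is a small C sketch of what that means for a character above the old 16-bit range: UTF-16 splits it into two 16-bit units (a "surrogate pair"), while UTF-32 holds it in a single unit. The example code point, U+1D11E, is just an arbitrary character outside the Basic Multilingual Plane:
Code:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t cp = 0x1D11E; /* a code point beyond U+FFFF */

    /* Standard UTF-16 surrogate-pair math: subtract 0x10000,
       then split the remaining 20 bits into two 10-bit halves. */
    uint32_t v  = cp - 0x10000;
    uint16_t hi = (uint16_t)(0xD800 | (v >> 10));   /* high surrogate */
    uint16_t lo = (uint16_t)(0xDC00 | (v & 0x3FF)); /* low surrogate  */

    printf("UTF-16: 0x%04X 0x%04X (two 16-bit units)\n",
           (unsigned)hi, (unsigned)lo);             /* 0xD834 0xDD1E */
    printf("UTF-32: 0x%08lX (one 32-bit unit)\n",
           (unsigned long)cp);
    return 0;
}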