Code explanation ... ?

dbee · 12-16-2005, 11:53 PM

So I'm working in Korea at the moment, and when I print out documents I sometimes get a mixture of English and Hangul (Korean) characters. I'm reasonably familiar with the Unicode specs, but I'd like to ask a couple of questions all the same, just for clarification.

1) So ascii was basically a one byte, 256 character code page with the first 1-127 being the most popular printable characters right ? Unicode was then an expansion of that to a 4 byte code page with virtually every character that was ever written and known about ?

2) UTF8 is pretty much the same as unicode right ?

3) But instead of the Koreans having to use a 4 byte character encoding for all of their text files, unicode allows them to switch the 'charsets' and just to use the first 256 characters like we do (or did) with ascii ?

4) why do my sheets then print a little bit of both ?

5) also ... if I print out a data file to my screen why do I get large amounts of the same weird character, instead of a mix of letters, numbers and characters that I'd imagine I would get if I printed out random characters from 1-256 on an ascii codepage ?

Again, I'm eager to understand this a little better so if anyone can point out where I've gone wrong here I'd be much obliged.

Thanks

spooon · 12-17-2005, 01:06 AM

Quote:

Originally Posted by dbee

1) So ascii was basically a one byte, 256 character code page with the first 1-127 being the most popular printable characters right ?

No, ASCII is only 7-bit (0-127). 32-126 are the printable characters. ASCII forms the first 128 characters of Unicode.
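To make those ranges concrete, here is a small Python sketch (Python is just my choice for illustration, and the variable names are made up). It only checks the numbers mentioned above: 7-bit codes, printable characters at 32-126, and ASCII matching the first 128 Unicode code points.

```python
# ASCII code values run from 0 to 127, so they fit in 7 bits.
ascii_codes = range(128)
assert max(ascii_codes) < 2 ** 7

# Codes 32..126 are the printable characters; 0..31 and 127 are control codes.
printable = "".join(chr(i) for i in range(32, 127))
print(len(printable))   # 95
print(printable)

# The first 128 Unicode code points are the ASCII characters:
# encoding each one as ASCII gives back a single byte with the same value.
same_as_ascii = all(chr(i).encode("ascii") == bytes([i]) for i in ascii_codes)
print(same_as_ascii)    # True
```

Anything above 127 is not ASCII at all; the various 256-character "code pages" are separate 8-bit extensions built on top of it.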

Quote:

Originally Posted by dbee

Unicode was then an expansion of that to a 4 byte code page with virtually every character that was ever written and known about ?

No, Unicode is an abstract assignment of integers (code points) to characters and has nothing to do with how things are represented as bytes. That is determined by specific encodings (like UTF-8).
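A rough illustration of the difference, sketched in Python (the sample character and the variable names are only my choices): one character has exactly one code point, but its bytes depend on which encoding you pick.

```python
# One Hangul syllable, written as an escape so the source stays plain ASCII.
ch = "\ud55c"  # U+D55C, HANGUL SYLLABLE HAN

# The code point is just an integer; it says nothing about bytes on disk.
code_point = ord(ch)
print(f"U+{code_point:04X}")  # U+D55C

# The same code point, serialized by three different encodings.
encodings = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, data in encodings.items():
    print(name, data.hex(), len(data), "bytes")
# utf-8 ed959c 3 bytes
# utf-16-be d55c 2 bytes
# utf-32-be 0000d55c 4 bytes
```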

Quote:

Originally Posted by dbee

2) UTF8 is pretty much the same as unicode right ?

3) But instead of the Koreans having to use a 4 byte character encoding for all of their text files, unicode allows them to switch the 'charsets' and just to use the first 256 characters like we do (or did) with ascii ?

4) why do my sheets then print a little bit of both ?

No, UTF-8 is a specific Unicode encoding. It is the most popular encoding because it is ASCII-compatible (i.e. ASCII characters are represented using 1 byte the same way that ASCII is represented, so that ASCII text is automatically also UTF-8). It uses 1, 2, 3, or 4 bytes depending on the character. So it is not "switching" between anything at all; it is natural for different characters to have different widths in UTF-8.
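Here is a minimal Python sketch of that variable width (the sample characters are arbitrary picks of mine, one for each width). It also shows why a page mixing English and Hangul needs no switching: both kinds of character simply sit next to each other in the same byte stream.

```python
# One sample character for each UTF-8 width.
samples = ["A", "\u00e9", "\ud55c", "\U0001f600"]  # A, e-acute, Hangul HAN, an emoji
widths = [len(s.encode("utf-8")) for s in samples]
print(widths)  # [1, 2, 3, 4]

# ASCII text is already valid UTF-8, byte for byte.
ascii_text = "Hello"
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))  # True

# Mixed English and Hangul in one string: 6 one-byte characters
# followed by 2 three-byte characters.
mixed = "Seoul \uc11c\uc6b8"
mixed_bytes = mixed.encode("utf-8")
print(len(mixed), "characters,", len(mixed_bytes), "bytes")  # 8 characters, 12 bytes
```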

Quote:

Originally Posted by dbee

5) also ... if I print out a data file to my screen why do I get large amounts of the same weird character, instead of a mix of letters, numbers and characters that I'd imagine I would get if I printed out random characters from 1-256 on an ascii codepage ?

I am not sure. What character is this?

There are many other Unicode encodings, like UTF-16 (which uses 2 or 4 bytes per character, but is more efficient than UTF-8 for Asian characters), UTF-32 (which always uses 4 bytes per character), etc. If you try to view stuff with the wrong encoding it will show weird things, e.g. if you view UTF-16 text as if it were UTF-8, plain English text will often have an extra garbage character (a zero byte) between every character.
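Both points can be sketched in a few lines of Python. The strings are just examples I picked, and I use the little-endian variants without a byte order mark to keep the output simple; real files may differ.

```python
# English text stored as UTF-16 has a zero byte after every ASCII character.
text = "Hi"
utf16_bytes = text.encode("utf-16-le")
print(utf16_bytes)  # b'H\x00i\x00'

# Reading those bytes as UTF-8 does not fail, but every other
# character comes out as NUL, which viewers show as garbage.
misread = utf16_bytes.decode("utf-8")
print(repr(misread))  # 'H\x00i\x00'

# Size of two Hangul syllables in each encoding.
hangul = "\uc11c\uc6b8"
sizes = {enc: len(hangul.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
```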