Bit 7 in character sets.
Bit 7 of the ASCII character set can be used for programming purposes because a byte with that bit set represents a character that cannot be entered.
Does anyone know if the same rule applies to other character sets (Arabic, etc.)? Or to UTF-8 in general? Thank you for your help. |
Hi there,
It's true that ASCII defines only a 7-bit range, so the MSB is always 0. However, many character encodings extend ASCII by also using the upper 128 bit patterns, such as the IBM PC (OEM) set, the ISO-8859-x family, and many more. These are sometimes also called ASCII, which is not correct; rather, they are supersets of ASCII.

It's also true that if a given system uses ASCII only, programmers might consider using the MSB for some special purpose. By the way, I can't think of any example where this is actually done, so as far as I can see that's more of a theoretical approach.
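To illustrate the "superset" point, here is a minimal Python sketch. The byte value 0xE9 and the two code pages (`cp437` for the IBM PC OEM set, `latin-1` for ISO-8859-1) are only picked as examples; the variable names are made up for this post.

```python
# Bytes below 0x80 are plain ASCII and decode identically in both supersets.
ascii_bytes = b"Bit 7"
same = (
    ascii_bytes.decode("cp437")
    == ascii_bytes.decode("latin-1")
    == ascii_bytes.decode("ascii")
)

# A byte with the MSB set maps to a different character in each code page.
high_byte = bytes([0xE9])
as_oem = high_byte.decode("cp437")       # IBM PC (OEM) interpretation
as_latin1 = high_byte.decode("latin-1")  # ISO-8859-1 interpretation

# Strict ASCII assigns no meaning to that byte at all.
try:
    high_byte.decode("ascii")
    ascii_accepts = True
except UnicodeDecodeError:
    ascii_accepts = False
```

So the lower half is shared, while the meaning of the upper half depends entirely on which encoding the reader assumes.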
The important thing to know is that UTF-8 uses the MSB to indicate whether a byte maps directly to a character (MSB is 0) or is part of a group of bytes that together form one character (MSB is 1). A single character is encoded with 1 to 4 bytes in UTF-8. Hope I could help you with that one. [X] Doc CPU |
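A rough sketch of how the high bits mark the role of each byte in UTF-8. The function name `classify_utf8_byte` and its labels are invented for illustration, and it looks at the bit pattern only; it does not validate whole sequences.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a single byte by its leading bits (bit pattern only)."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: one-byte character, same as ASCII
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: trailing byte of a multi-byte group
    if b < 0xE0:
        return "lead2"         # 110xxxxx: starts a 2-byte group
    if b < 0xF0:
        return "lead3"         # 1110xxxx: starts a 3-byte group
    if b < 0xF8:
        return "lead4"         # 11110xxx: starts a 4-byte group
    return "invalid"           # 11111xxx never appears in UTF-8


# 'A' is one byte; the euro sign takes three bytes.
roles = [classify_utf8_byte(b) for b in "A€".encode("utf-8")]
```

Every byte of a multi-byte character has the MSB set, so in UTF-8 that bit is never free for other uses.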
In other words, while code point 65 does correspond to ASCII 'A', code point 65+128=193 does not correspond to ASCII 'A', and is not shown as 'A' in any implementation I know of. So, you cannot really use the high bit for custom purposes, unless you always strip the bit before treating the data as a string.
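As a small sketch of what "strip the bit before use" means in practice (the helper names `tag` and `strip_flag` are hypothetical, and the input is assumed to be pure 7-bit ASCII):

```python
FLAG = 0x80  # bit 7, the MSB of a byte


def tag(data: bytes) -> bytes:
    """Set the high bit on every byte (assumes 7-bit ASCII input)."""
    return bytes(b | FLAG for b in data)


def strip_flag(data: bytes) -> bytes:
    """Clear the high bit so the bytes can be read as ASCII again."""
    return bytes(b & 0x7F for b in data)


tagged = tag(b"A")              # byte value 65 + 128 = 193
restored = strip_flag(tagged)   # back to b"A"

# Without stripping, a Latin-1 reader sees a different character entirely.
misread = tagged.decode("latin-1")
```

Any code path that forgets to call `strip_flag` will display or store the wrong character, which is the incompatibility risk described above.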
ASCII, ISO Latin, Windows Western European, and UTF-8 all have the same control characters at code points 0 (NUL) through 31 (US, unit separator), of which only 7..13 and 27 are commonly used. Unicode, and therefore UTF-8, has an additional 32 control characters at code points 128 through 159, but I have never encountered them. See Wikibooks for example.

Some variants of UTF-8 allow an ASCII character to be encoded using two bytes (an "overlong" form, e.g. 192, code+128 for codes 0 through 63). Standard UTF-8 forbids these forms, and conforming decoders reject them; the widely used Modified UTF-8 allows this for NUL (ASCII zero) only.

Honestly, I'd say your assumption is invalid even for ASCII. While some variant may be possible in specific character sets, the entire idea -- considering the incompatibilities it would likely introduce -- sounds a bit loony to me.

You can always use a custom character encoding, or escape sequences, if you need to embed extra information into a string. If you need a fixed number of additional bits per character or byte, use parallel but separate storage.

Here is one escape sequence mapping I've used: Code:
\ \/ |
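The "parallel but separate storage" suggestion above could look roughly like this in Python. This is only a sketch: the `flags` array, the `MARKED` bit, and the sample text are made up for illustration and are not taken from the post.

```python
MARKED = 0x01  # hypothetical per-character flag bit

text = "hello"
# One byte of extra bits per character, kept outside the string itself.
flags = bytearray(len(text))

# Mark the second character without touching the text.
flags[1] |= MARKED

# The text stays valid in any ASCII-compatible encoding.
encoded = text.encode("ascii")

# Read the extra bits back by walking both sequences together.
marked = [ch for ch, f in zip(text, flags) if f & MARKED]
```

Because the extra bits never enter the string, no character encoding can misinterpret them.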