Quote:
Originally Posted by rblampain
Bit 7 of the ascii character set can be used for programming purposes because it represents non-enterable characters if set in a byte.
|
Strictly speaking, no. When saving ASCII characters in 8-bit bytes, you can use the topmost bit, bit 7, for other purposes, because ASCII is only a 7-bit character set, and therefore only defines code points 0 through 127. However, you then need to always strip (zero) the bit when referring to the ASCII characters.
In other words, while code point 65 does correspond to ASCII 'A', code point 65+128=193 does not correspond to ASCII 'A', and is not shown as 'A' in any implementation I know of. So you cannot really use the high bit for custom purposes, unless you always strip the bit before treating the data as a string.
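To make the 65 versus 193 point concrete, here is a minimal sketch in Python (chosen only for illustration; the helper names set_flag, strip_flag and decodes_as_ascii are hypothetical). It sets bit 7 on 'A', shows that the result is no longer valid ASCII, and strips the bit again.

```python
# Sketch: using bit 7 of a byte as a custom flag on top of 7-bit ASCII.
# All names here are made up for illustration.
HIGH_BIT = 0x80   # bit 7
ASCII_MASK = 0x7F  # bits 0..6


def set_flag(byte):
    """Return the byte with bit 7 set."""
    return byte | HIGH_BIT


def strip_flag(byte):
    """Return the byte with bit 7 cleared, i.e. the plain ASCII code."""
    return byte & ASCII_MASK


def decodes_as_ascii(data):
    """True if the bytes are valid 7-bit ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


flagged = set_flag(ord("A"))   # 65 + 128 = 193
plain = strip_flag(flagged)    # back to 65

print(flagged, decodes_as_ascii(bytes([flagged])))
print(plain, chr(plain), decodes_as_ascii(bytes([plain])))
```

The flagged byte only becomes usable as text again after strip_flag, which is the "always strip the bit" requirement described above.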
Quote:
Originally Posted by rblampain
Does anyone knows if the same rule applies for other character sets (Arabic etc)? Or in UTF-8 in general?
|
ASCII does not define code points 128..255. ISO Latin variants and most Windows character sets have some undefined code points between 128 and 159. Some implementations (like iconv using default settings) will yield an error when encountering undefined code points, but most will just display some substitute glyph, usually a placeholder or a question mark of some sort.
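As one illustration of the two behaviours (error versus placeholder), the following sketch uses Python's built-in cp1252 codec, where byte 0x81 is one of the unassigned positions. The helper name decode_or_none is hypothetical, and other libraries may well behave differently.

```python
# Sketch: how one implementation (Python's cp1252 codec) treats an
# undefined code point in the 128..159 range.

def decode_or_none(data, encoding):
    """Strict decode; return None instead of raising on undefined bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0x80 is defined in Windows-1252 (the euro sign) ...
euro = decode_or_none(b"\x80", "cp1252")

# ... but 0x81 is not, so a strict decode fails,
strict = decode_or_none(b"\x81", "cp1252")

# while a lenient decode substitutes U+FFFD, the replacement character.
lenient = b"\x81".decode("cp1252", errors="replace")

print(repr(euro), repr(strict), repr(lenient))
```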
ASCII, ISO Latin, Windows Western European, and UTF-8 all have the same control characters at code points 0 (NUL) through 31 (US, unit separator), of which only 7..13 and 27 are commonly used. Unicode (and therefore UTF-8) has an additional 32 control characters at code points 128 through 159, but I have never encountered them. See Wikibooks for example. It is possible to write any ASCII character as a two-byte UTF-8-style sequence (192, code+128 for codes below 64; 193, code+64 for codes 64 through 127), but these "overlong" forms are not valid in standard UTF-8, and a strict UTF-8 consumer rejects them instead of reading them as ASCII characters; the widely used modified UTF-8 allows this for the NUL (ASCII zero) only.
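As a quick check of both points, the sketch below uses Python's standard library (picked only for illustration; is_valid_utf8 is a hypothetical helper). It looks at one of the control characters in the 128..159 range and at how a strict decoder treats two-byte encodings of ASCII characters.

```python
import unicodedata

# One of the 32 extra control characters: U+0085 (code point 133).
nel = "\u0085"
nel_category = unicodedata.category(nel)  # "Cc" means control character
nel_bytes = nel.encode("utf-8")           # two bytes in UTF-8


def is_valid_utf8(data):
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Two-byte ("overlong") encodings of ASCII characters:
overlong_nul = bytes([0xC0, 0x80])  # NUL, the form modified UTF-8 uses
overlong_a = bytes([0xC1, 0x81])    # 'A' (65)

print(nel_category, nel_bytes)
print(is_valid_utf8(b"A"), is_valid_utf8(overlong_nul), is_valid_utf8(overlong_a))
```

Python's decoder is strict here and refuses both two-byte forms; a consumer written for modified UTF-8 would be expected to accept the NUL form.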
Honestly, I'd say your assumption is invalid even for ASCII. While some variant may be possible in specific character sets, the entire idea -- considering the incompatibilities it is likely to introduce -- sounds a bit loony to me. You can always use a custom character encoding, or escape sequences, if you need to embed extra information into a string. If you need a fixed number of additional bits per character or byte, use parallel but separate storage.
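The "parallel but separate storage" option could look roughly like this in Python. This is only a sketch: the flag value MARKED and the variable names are invented for the example, and one flag byte per character is an arbitrary choice.

```python
# Sketch: keep the text untouched and store per-character extra bits
# in a second array of the same length.
MARKED = 0x01  # hypothetical flag bit

text = b"Hello"
flags = bytearray(len(text))  # one flag byte per character, all zero

flags[0] |= MARKED  # mark 'H'
flags[4] |= MARKED  # mark 'o'

# The text itself is still plain ASCII and can be used as a string as-is.
as_string = text.decode("ascii")

# The extra bits are looked up by index when needed.
marked_chars = [chr(c) for c, f in zip(text, flags) if f & MARKED]

print(as_string, marked_chars)
```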
Here is one escape sequence mapping I've used:
Code:
\ \/
< \{
> \}
& \?
" \,
; \.
It is trivial to apply (escape) and reverse (de-escape), and aside from backslash, none of the escaped characters appear in the escaped data; thus you can use the five characters for custom markup. You can trivially check if data has already been escaped properly. Overhead (processing and extra space used) is minimal.
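A minimal Python sketch of the mapping above follows. The function names escape, unescape and is_escaped are hypothetical, and treating an unescaped reserved character as an error is an assumption made for this example (it presumes any custom markup has already been split off before de-escaping).

```python
# Sketch of the escape mapping listed above.
ESCAPES = {
    "\\": "\\/",
    "<": "\\{",
    ">": "\\}",
    "&": "\\?",
    '"': "\\,",
    ";": "\\.",
}
# Reverse lookup keyed by the character that follows the backslash.
UNESCAPES = {seq[1]: ch for ch, seq in ESCAPES.items()}
# Characters that must never appear literally in escaped data.
RESERVED = set('<>&";')


def escape(text):
    """Replace each special character with its two-character sequence."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def unescape(text):
    """Reverse escape(); raise ValueError if the data is not properly escaped."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # A backslash must be followed by one of the known codes.
            if i + 1 >= len(text) or text[i + 1] not in UNESCAPES:
                raise ValueError("malformed escape at index %d" % i)
            out.append(UNESCAPES[text[i + 1]])
            i += 2
        elif ch in RESERVED:
            raise ValueError("unescaped %r at index %d" % (ch, i))
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def is_escaped(text):
    """Check whether data has already been escaped properly."""
    try:
        unescape(text)
        return True
    except ValueError:
        return False


sample = 'say "hi"; a<b & c>d \\ done'
print(escape(sample))
print(unescape(escape(sample)) == sample)
```

Because escaping is a single pass with a one-character lookahead on the way back, the overhead stays small, and is_escaped doubles as the "has this already been escaped" check.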