Bit 7 in character sets.

rblampain · 10-29-2011, 09:41 AM

Bit 7 of the ASCII character set can be used for programming purposes because it represents non-enterable characters if set in a byte.

Does anyone know if the same rule applies for other character sets (Arabic etc)? Or in UTF-8 in general?

Thank you for your help.

Doc CPU · 10-29-2011, 10:37 AM

Hi there,

Quote:

Originally Posted by rblampain

Bit 7 of the ASCII character set can be used for programming purposes because it represents non-enterable characters if set in a byte.

Where did you pick up that statement? It's not true, or at least wildly inaccurate.
It's true that ASCII only defines a 7-bit range, with the MSB always 0. There are many character encodings, however, that extend the ASCII set by also using the upper 128 byte values (128..255), like the IBM PC (OEM) set, the ISO-8859-x family and many more. Sometimes these are also called ASCII, which is not correct; rather, they are supersets of ASCII.
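A minimal Python sketch of the superset point, assuming the arbitrary example byte 0xE9: with the MSB set it is not valid ASCII at all, and the two supersets named above disagree about which character it is.

```python
# One byte with the MSB (bit 7) set; 0xE9 is an arbitrary example value.
raw = bytes([0xE9])

# Strict ASCII has no code point above 127, so decoding fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# The same byte is a different character in two ASCII supersets.
as_latin1 = raw.decode("latin-1")  # ISO-8859-1
as_cp437 = raw.decode("cp437")     # IBM PC (OEM) code page 437

print(ascii_ok, repr(as_latin1), repr(as_cp437))

# Bytes below 128 decode identically in both supersets.
same_low_half = b"A".decode("latin-1") == b"A".decode("cp437") == "A"
print(same_low_half)
```

Only the lower half (0..127) is portable between these encodings; the meaning of the upper half depends entirely on which superset is in use.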

It's also true that if an arbitrary system uses ASCII only, programmers might think about using the MSB for some special purpose. By the way, I can't think of any example where this is actually done, so as far as I can see that's more of a theoretical approach.

Quote:

Originally Posted by rblampain

Does anyone know if the same rule applies for other character sets (Arabic etc)? Or in UTF-8 in general?

I don't know about Asian or other non-European and non-US encodings. But UTF-8 is a very tricky matter. It represents ASCII characters by their ASCII code (with the MSB 0), and encodes all other characters with more than one byte. See Wikipedia for details.
The important thing to know is that UTF-8 uses the MSB to indicate whether a byte directly maps to a character (MSB is 0) or is part of a group of bytes belonging together to form one character. One single character is encoded with 1..4 bytes in UTF-8.
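The role of the top bits can be sketched in a few lines of Python. The helper names `classify` and `count_characters` are made up for illustration, and the sketch assumes its input is already valid UTF-8.

```python
def classify(byte):
    """Classify one byte of valid UTF-8 by its top bits."""
    if (byte & 0x80) == 0:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if (byte & 0xC0) == 0x80:
        return "continuation"   # 10xxxxxx: second or later byte of a group
    return "lead"               # 11xxxxxx: first byte of a multi-byte group


def count_characters(data):
    """Count characters by counting every byte that starts a character."""
    return sum(1 for b in data if classify(b) != "continuation")


# One 1-byte, one 2-byte and one 3-byte character.
sample = "a\u00e9\u20ac".encode("utf-8")
kinds = [classify(b) for b in sample]
print(kinds)
print(len(sample), count_characters(sample))
```

Every byte of a multi-byte character has the MSB set, so in UTF-8 there is no spare bit 7 to borrow for other purposes.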

Hope I could help you with that one.

[X] Doc CPU

Nominal Animal · 10-29-2011, 10:44 AM

Quote:

Originally Posted by rblampain

Bit 7 of the ASCII character set can be used for programming purposes because it represents non-enterable characters if set in a byte.

Strictly speaking, no. When saving ASCII characters in 8-bit bytes, you can use the topmost bit, bit 7, for other purposes, because ASCII is only a 7-bit character set, and therefore only defines code points 0 through 127. However, you then need to always strip (zero) the bit when referring to the ASCII characters.

In other words, while code point 65 does correspond to ASCII 'A', code point 65+128=193 does not correspond to ASCII 'A', and is not shown as 'A' in any implementations I know of. So, you cannot really use the high bit for custom purposes, unless you always strip the bit before referring to the data as a string.
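As a small illustration of the stripping requirement, here is a Python sketch. The names `set_flag` and `strip_flag` are hypothetical, and the data is assumed to be pure 7-bit ASCII to begin with.

```python
FLAG = 0x80  # bit 7, the topmost bit of an 8-bit byte


def set_flag(data):
    """Set bit 7 on every byte (hypothetical custom marker)."""
    return bytes(b | FLAG for b in data)


def strip_flag(data):
    """Zero bit 7 on every byte, recovering the 7-bit ASCII codes."""
    return bytes(b & 0x7F for b in data)


plain = b"A"                 # code point 65
flagged = set_flag(plain)    # 65 + 128
print(flagged[0])

# The flagged byte is no longer ASCII at all.
try:
    flagged.decode("ascii")
    still_ascii = True
except UnicodeDecodeError:
    still_ascii = False

# Only after stripping the bit is the data usable as a string again.
restored = strip_flag(flagged).decode("ascii")
print(still_ascii, restored)
```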

Quote:

Originally Posted by rblampain

Does anyone know if the same rule applies for other character sets (Arabic etc)? Or in UTF-8 in general?

ASCII does not define code points 128..255. ISO Latin variants and most Windows character sets have some undefined code points between 128 and 159. Some implementations (like iconv using default settings) will yield an error when encountering undefined code points, but most will just display some random glyph; usually a placeholder or a question mark of some sort.

ASCII, ISO Latin, Windows Western European, and UTF-8 all have the same control characters at code points 0 (NUL) through 31 (US, unit separator), of which only 7..13 and 27 are commonly used. Unicode, and therefore UTF-8, has an additional 32 control characters at code points 128 through 159 (each takes two bytes in UTF-8), but I have never encountered them. See Wikibooks for example. An ASCII character could in principle be written as an overlong two-byte sequence (192 or 193, followed by a continuation byte), but standard UTF-8 forbids overlong forms and a conforming decoder must reject them; the widely used Modified UTF-8 allows this for NUL (ASCII zero) only, encoded as 192, 128.
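As a hedged check against one particular consumer, Python's built-in strict UTF-8 decoder (other decoders may behave differently), the sketch below encodes one of the control characters from the 128..159 range and then tries to decode the two-byte form of NUL (192, 128).

```python
# U+0085 (NEL) is one of the control characters at code points 128..159.
# In UTF-8 it occupies two bytes, both with the MSB set.
nel = "\u0085".encode("utf-8")
print(nel.hex())

# Two-byte (overlong) form of NUL, as used by Modified UTF-8.
overlong_nul = bytes([0xC0, 0x80])
try:
    overlong_nul.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
print(accepted)

# The ordinary one-byte NUL decodes without trouble.
plain_nul = b"\x00".decode("utf-8")
```

Python's strict decoder refuses the two-byte NUL, so data written as Modified UTF-8 cannot be assumed to pass through every standard UTF-8 decoder unchanged.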

Honestly, I'd say your assumption is invalid even for ASCII. While some variant may be possible in specific character sets, the entire idea -- considering the possible incompatibilities it likely introduces -- sounds a bit loony to me. You can always use a custom character encoding, or escape sequences, if you need to embed extra information into a string. If you need a fixed number of additional bits for each character or byte, use parallel but separate storage.
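A minimal sketch of the parallel-storage idea in Python, assuming one flag byte per character. The `HIGHLIGHT` flag and all variable names are hypothetical.

```python
text = "hello world"

# One flag byte per character, kept beside the text rather than inside it.
flags = bytearray(len(text))

HIGHLIGHT = 0x01  # hypothetical per-character attribute bit

# Mark the first five characters.
for i in range(5):
    flags[i] |= HIGHLIGHT

# The text itself is untouched and remains a normal string.
highlighted = "".join(ch for ch, f in zip(text, flags) if f & HIGHLIGHT)
print(text)
print(highlighted)
```

The string stays valid in whatever encoding it is later written out in, because the extra bits never share a byte with the character data.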

Here is one escape sequence mapping I've used:

Code:

    \        \/
    <        \{
    >        \}
    &        \?
    "        \,
    ;        \.

It is trivial to apply (escape) and reverse (de-escape), and aside from backslash, none of the escaped characters appear in the escaped data; thus you can use the five characters for custom markup. You can trivially check if data has already been escaped properly. Overhead (processing and extra space used) is minimal.
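A Python sketch of how the mapping above could be applied, reversed, and checked. The function names `escape`, `unescape` and `is_escaped` are made up for illustration and are not taken from any existing implementation.

```python
# Mapping from the table above: raw character -> two-character escape.
ESCAPES = {
    "\\": "\\/",
    "<": "\\{",
    ">": "\\}",
    "&": "\\?",
    '"': "\\,",
    ";": "\\.",
}
# Reverse lookup: character following the backslash -> raw character.
UNESCAPES = {seq[1]: raw for raw, seq in ESCAPES.items()}


def escape(text):
    """Replace each reserved character with its escape sequence."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def unescape(text):
    """Reverse escape(); raise ValueError if text is not properly escaped."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            if i + 1 >= len(text) or text[i + 1] not in UNESCAPES:
                raise ValueError("invalid escape sequence")
            out.append(UNESCAPES[text[i + 1]])
            i += 2
        elif ch in ESCAPES:
            # A bare reserved character must not appear in escaped data.
            raise ValueError("unescaped reserved character")
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def is_escaped(text):
    """Check whether text is already properly escaped."""
    try:
        unescape(text)
    except ValueError:
        return False
    return True


sample = 'say "hi" & <go>; ok\\'
escaped = escape(sample)
print(escaped)
print(unescape(escaped) == sample)
```

After escaping, the characters `<`, `>`, `&`, `"` and `;` never occur in the data, which is what frees them up for custom markup.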

SigTerm · 10-29-2011, 10:54 AM

Quote:

Originally Posted by rblampain

Does anyone know if the same rule applies for other character sets (Arabic etc)? Or in UTF-8 in general?

No, it doesn't. ASCII-compatible 8-bit encodings use bit 7 for non-ASCII characters. UTF-8 uses bit 7 to encode data. You should quickly forget about that idea, and use something else as a "marker for programming purposes". See how printf/QString inserts arguments into a string, for example.

dugan · 10-30-2011, 09:56 AM

Quote:

Originally Posted by rblampain

Bit 7 of the ASCII character set can be used for programming purposes because it represents non-enterable characters if set in a byte.

This is not true for platforms that use an Extended ASCII character set.