LinuxQuestions.org
10-29-2011, 09:41 AM   #1
rblampain
Senior Member
 
Registered: Aug 2004
Location: Western Australia
Distribution: Debian 11
Posts: 1,288

Rep: Reputation: 52
Bit 7 in character sets.


Bit 7 of the ASCII character set can be used for programming purposes because it represents non-enterable characters if set in a byte.

Does anyone know if the same rule applies to other character sets (Arabic, etc.)? Or to UTF-8 in general?

Thank you for your help.
 
10-29-2011, 10:37 AM   #2
Doc CPU
Senior Member
 
Registered: Jun 2011
Location: Stuttgart, Germany
Distribution: Mint, Debian, Gentoo, Win 2k/XP
Posts: 1,099

Rep: Reputation: 344
Hi there,

Quote:
Originally Posted by rblampain
Bit 7 of the ASCII character set can be used for programming purposes because it represents non-enterable characters if set in a byte.
Where did you pick up that statement? It's not true, or at least wildly inaccurate.
It's true that ASCII only defines a 7-bit range, with the MSB always 0. There are many character encodings, however, that extend the ASCII set by using the upper 128 bit patterns as well, like the IBM PC (OEM) set, the ISO-8859-x family and many more. Sometimes these are also called ASCII, which is not correct; rather, they are supersets of ASCII.
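
To make that concrete, here is a minimal Python sketch. It assumes the standard library codecs `ascii`, `latin-1` (ISO-8859-1) and `cp437` (the IBM PC OEM set); the byte value 0xE9 is just an arbitrary example, not something from the original question.

```python
# The same byte means different things in different ASCII supersets.
raw = bytes([0x41, 0xE9])        # 'A' followed by a byte with the MSB set

latin1 = raw.decode("latin-1")   # ISO-8859-1 maps 0xE9 to e-acute
cp437 = raw.decode("cp437")      # the IBM PC OEM set maps 0xE9 to Greek Theta

try:
    raw.decode("ascii")          # strict ASCII defines nothing above 127
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(latin1, cp437, ascii_ok)
```

The 'A' survives in every case because all of these sets agree on the lower 128 values; only the byte with the MSB set changes meaning.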

It's also true that if a given system uses ASCII only, programmers might think about using the MSB for some special purpose. By the way, I can't think of any example where this is actually done, so as far as I can see that's more of a theoretical approach.

Quote:
Originally Posted by rblampain
Does anyone know if the same rule applies to other character sets (Arabic, etc.)? Or to UTF-8 in general?
I don't know about Asian or other non-European and non-US encodings. But UTF-8 is a very tricky matter. It represents ASCII characters by their ASCII code (with the MSB 0), and encodes all other characters with more than one byte. See Wikipedia for details.
The important thing to know is that UTF-8 uses the MSB to indicate whether a byte directly maps to a character (MSB is 0) or is part of a group of bytes that together form one character (MSB is 1). A single character is encoded with 1 to 4 bytes in UTF-8.
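
A rough Python sketch of how the leading bits can be read. The function name `classify` and the sample string are made up for illustration, and the check assumes the input is already valid UTF-8:

```python
def classify(byte: int) -> str:
    """Classify one byte of valid UTF-8 by its leading bits."""
    if (byte & 0x80) == 0x00:
        return "ascii"           # 0xxxxxxx: a complete one-byte character
    if (byte & 0xC0) == 0x80:
        return "continuation"    # 10xxxxxx: continues a multi-byte character
    return "lead"                # 11xxxxxx: first byte of a 2..4 byte sequence

text = "A\u20ac"                 # 'A' plus the euro sign
encoded = text.encode("utf-8")   # b'A\xe2\x82\xac'
kinds = [classify(b) for b in encoded]
print(kinds)
```

So every byte of a non-ASCII character has bit 7 set, which leaves no spare bit for other uses.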

Hope I could help you with that one.

[X] Doc CPU
 
10-29-2011, 10:44 AM   #3
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by rblampain
Bit 7 of the ASCII character set can be used for programming purposes because it represents non-enterable characters if set in a byte.
Strictly speaking, no. When saving ASCII characters in 8-bit bytes, you can use the topmost bit, bit 7, for other purposes, because ASCII is only a 7-bit character set, and therefore only defines code points 0 through 127. However, you then need to always strip (zero) the bit when referring to the ASCII characters.

In other words, while code point 65 does correspond to ASCII 'A', code point 65+128=193 does not correspond to ASCII 'A', and is not shown as 'A' in any implementation I know of. So, you cannot really use the high bit for custom purposes, unless you always strip the bit before referring to the data as a string.
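
As a small Python illustration of that arithmetic (the helper names `set_flag` and `strip_flag` are invented for this sketch, not an established API):

```python
FLAG = 0x80  # bit 7, the bit that 7-bit ASCII leaves unused

def set_flag(data: bytes) -> bytes:
    """Set bit 7 on every byte."""
    return bytes(b | FLAG for b in data)

def strip_flag(data: bytes) -> bytes:
    """Clear bit 7 on every byte, recovering plain ASCII."""
    return bytes(b & 0x7F for b in data)

plain = b"A"                     # code point 65
flagged = set_flag(plain)        # code point 65 + 128 = 193

try:
    flagged.decode("ascii")      # 193 is not an ASCII code point
    readable = True
except UnicodeDecodeError:
    readable = False

restored = strip_flag(flagged).decode("ascii")
print(flagged[0], readable, restored)
```

The flagged data is only usable as text again after the strip step, which is exactly the bookkeeping burden described above.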

Quote:
Originally Posted by rblampain
Does anyone know if the same rule applies to other character sets (Arabic, etc.)? Or to UTF-8 in general?
ASCII does not define code points 128..255. ISO Latin variants and most Windows character sets have some undefined code points between 128 and 159. Some implementations (like iconv using default settings) will yield an error when encountering undefined code points, but most will just display some arbitrary glyph; usually a placeholder or a question mark of some sort.

ASCII, ISO Latin, Windows Western European, and UTF-8 all have the same control characters at code points 0 (NUL) through 31 (US, unit separator), of which only 7..13 and 27 are commonly used. Unicode defines an additional 32 control characters (the C1 set) at code points 128 through 159, each of which UTF-8 encodes as two bytes, but I have never encountered them. See Wikibooks for example. A non-standard "overlong" encoding would represent an ASCII character using two bytes (192, code+128 for codes below 64); standard UTF-8 forbids this, and conforming decoders reject such sequences. The widely used Modified UTF-8 permits it for NUL (ASCII zero) only.
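
A quick Python check of both points, assuming CPython's built-in strict `utf-8` codec; the choice of U+0085 (NEL) as the sample control character is arbitrary:

```python
# A control character from the 128..159 range takes two bytes in UTF-8.
nel = "\u0085".encode("utf-8")           # b'\xc2\x85'

# 0xC0 0x80 is the two-byte form of NUL that Modified UTF-8 uses.
overlong_nul = bytes([0xC0, 0x80])
try:
    overlong_nul.decode("utf-8")         # the strict codec refuses this form
    accepted = True
except UnicodeDecodeError:
    accepted = False

print(nel, accepted)
```

Other decoders may behave differently, so treat this as a statement about this one codec rather than about every UTF-8 consumer.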

Honestly, I'd say your assumption is invalid even for ASCII. While some variant may be possible in specific character sets, the entire idea -- considering the incompatibilities it is likely to introduce -- sounds a bit loony to me. You can always use a custom character encoding, or escape sequences, if you need to embed extra information into a string. If you need a fixed number of additional bits for each character or byte, use parallel but separate storage.
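
A minimal Python sketch of what parallel but separate storage could look like. The flag value `MARKED` and the variable names are hypothetical, chosen only for this example:

```python
MARKED = 0x01                    # hypothetical per-character flag bit

text = "Hello"
flags = bytearray(len(text))     # one byte of extra bits per character, all zero

flags[0] |= MARKED               # tag the first character without touching text

marked = [ch for ch, f in zip(text, flags) if f & MARKED]
print(text, marked)
```

The string itself stays valid in whatever encoding it uses, since the extra bits never enter it.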

Here is one escape sequence mapping I've used:
Code:
    \        \/
    <        \{
    >        \}
    &        \?
    "        \,
    ;        \.
It is trivial to apply (escape) and reverse (de-escape), and aside from backslash, none of the escaped characters appear in the escaped data; thus you can use the other five characters (< > & " ;) for custom markup. You can trivially check whether data has already been escaped properly. Overhead (processing and extra space used) is minimal.
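
One possible way to implement the mapping above in Python. This is only a sketch: the function names `escape`, `unescape` and `is_escaped` are invented here, and raising `ValueError` on malformed input is an assumption, not part of the original scheme.

```python
# Mapping from the table above: each special character becomes a backslash pair.
ESCAPES = {"\\": "\\/", "<": "\\{", ">": "\\}", "&": "\\?", '"': "\\,", ";": "\\."}
UNESCAPES = {pair[1]: ch for ch, pair in ESCAPES.items()}

def escape(s: str) -> str:
    """Replace each special character with its two-character escape."""
    return "".join(ESCAPES.get(ch, ch) for ch in s)

def unescape(s: str) -> str:
    """Reverse escape(); raise ValueError on a malformed escape sequence."""
    out = []
    it = iter(s)
    for ch in it:
        if ch != "\\":
            out.append(ch)
            continue
        code = next(it, None)            # character following the backslash
        if code not in UNESCAPES:
            raise ValueError("malformed escape sequence")
        out.append(UNESCAPES[code])
    return "".join(out)

def is_escaped(s: str) -> bool:
    """True if s contains no raw specials and only valid escape pairs."""
    if any(ch in s for ch in '<>&";'):
        return False
    try:
        unescape(s)
    except ValueError:
        return False
    return True

raw = 'a<b> & "c"; d\\e'
escaped = escape(raw)
print(escaped, unescape(escaped) == raw, is_escaped(escaped))
```

Because a backslash in escaped data is always followed by one of the six codes, the de-escape step only ever needs to look one character ahead.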
 
10-29-2011, 10:54 AM   #4
SigTerm
Member
 
Registered: Dec 2009
Distribution: Slackware 12.2
Posts: 379

Rep: Reputation: 234
Quote:
Originally Posted by rblampain
Does anyone know if the same rule applies to other character sets (Arabic, etc.)? Or to UTF-8 in general?
No, it doesn't. ASCII-compatible 8-bit encodings use bit 7 for non-ASCII characters, and UTF-8 uses bit 7 to mark the bytes of multi-byte characters. You should quickly forget about that idea and use something else as a "marker for programming purposes". See how printf/QString inserts arguments into a string, for example.
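
For instance, Python has printf-style placeholders built in, and its numbered `str.format` fields are loosely similar in spirit to QString's `%1`/`%2` markers. The template text and values below are made up for illustration:

```python
# printf-style: the marker is an ordinary printable sequence like %s or %d.
template = "%s has %d new messages"
message = template % ("alice", 3)

# Numbered placeholders, filled in by position.
numbered = "{0} has {1} new messages".format("alice", 3)

print(message)
print(numbered)
```

The markers are plain characters inside the string, so they work the same way in any ASCII-compatible encoding.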
 
10-30-2011, 09:56 AM   #5
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,219

Rep: Reputation: 5309
Quote:
Originally Posted by rblampain
Bit 7 of the ASCII character set can be used for programming purposes because it represents non-enterable characters if set in a byte.
This is not true for platforms that use an Extended ASCII character set.

Last edited by dugan; 10-30-2011 at 10:08 AM.
 
  

