My intention here is to explain some things that are frequently misunderstood in the field of computing. Usually, these will involve something technical, and will be of concern to programmers and others who use computers in ways that are beyond the end-user level.
A Bit About Digital, Part II
Posted 07-08-2012 at 04:47 PM by theNbomr
In Part I, we started to describe how computers use a binary system of storing and manipulating digital data. This is a continuation of that subject.
What we see isn't really what we have.
When we read text on our computer screen, we often see numbers like 0 and 1, just as I've used many times in this text. It may come as a surprise to some that when we see the glyph on the screen that we recognize as a zero, it is put there because the underlying byte that represents it is not numerically a zero. In order for a computer terminal or printer to display that little glyph, we actually had to send it the value 48 (30 in hexadecimal). This is a byproduct of the machinery used early in the computing field to render human-readable versions of numeric and text data.
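We can check this for ourselves on any Linux system with the standard printf and od utilities. The following sends the single character '0' through od, which dumps the byte that actually carries it:

```shell
# Print the character '0', then dump the byte that encodes it, in decimal
printf '0' | od -An -td1
```

The output is 48, not 0: the glyph we see and the byte behind it are two different things.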
Terminals and printers are used to produce a visual representation of the bytes we send them. Somewhere in the early history of computing, someone had the good sense to create a widely used standard for the encoding of printable and other bytes that terminals, printers, and some other devices use. That standard defined what we call ASCII encoding. It allows devices like terminals, printers, modems, etc., to be at least somewhat interchangeable.
What the ASCII standard defines is how bytes are used to render all of the various uppercase and lowercase characters, numerals, and punctuation marks. It also defines what many of the non-printing bytes do, such as carriage returns, linefeeds, tabs, backspaces, and deletes. The standard was fairly straightforward, spelling out what bytes 0 to 127 did. For those of you who didn't do the mental gymnastics right away, I'll make it a little easier: the standard spells out the purpose of all of the bytes from 0 to 7F hexadecimal. Does that make it easier to see that only the bytes with the most significant bit set to 0 were spelled out? This gave rise to the notion of 7-bit data.
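A trivial sketch, using shell arithmetic, shows where that boundary sits in numeric terms:

```shell
# 7F hex is the highest code ASCII defines; 80 hex has the most significant bit set
printf '%d\n' 0x7F    # 127: the highest 7-bit value
printf '%d\n' 0x80    # 128: the smallest value with the top bit set, outside ASCII
```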
7-bit data made a lot of sense at one time, when communication was done via slow modems. Saving one bit in each byte transmitted was an automatic 12.5% increase in speed. Since there was nothing any device could use the high bit for anyway, there seemed to be little reason to use it. It also explains the choice of encoding for the delete character. One type of ASCII device in common use was the paper tape writer/reader, which worked by punching a series of holes in a paper tape. Each of the seven holes represented a bit in a byte: if the hole for a particular bit was left unpunched, it represented a zero bit, and if it was punched through the paper, it was a one bit. Since a bit could never be unpunched, it was impossible to arbitrarily modify the value of a byte. However, the value of a byte could always be changed to 'all bits set to one' by punching every bit out of the tape. Such a byte was treated as ignored, and so you could 'delete' any byte on the tape that way. That is why the delete character is encoded as 7F hexadecimal: all seven bits set.
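We can verify that the delete character really is the all-ones 7-bit byte, again assuming the standard printf and od utilities:

```shell
# Emit the delete character (octal 177) and dump it back in octal and in hex
printf '\177' | od -An -to1    # octal 177: all seven low bits set
printf '\177' | od -An -tx1    # the same byte as hex 7f
```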
Back to the point about the ASCII values not corresponding to their human-readable representations. The printable characters in the ASCII table are all of those with values greater than or equal to 32. The uppercase and lowercase characters are arranged in the table so that they differ in only one bit per alphabetic character. This makes it easy to convert between upper and lower case; simply set the bit with value 32 (bit 5, counting from zero) of an uppercase character's ASCII value to make it lowercase. The values 0-31 are assigned to special purposes. Most terminals can generate ASCII bytes 1 through 26 by typing the 'control' version of each alphabetic character. For instance, ASCII byte 3 is often emitted by holding the control key and pressing 'C'; Ctrl-C for short.
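A small sketch of that single-bit case conversion in shell arithmetic (the numbers, not the tool, are the point here):

```shell
# 'A' is 65 (hex 41); OR-ing in the bit with value 32 (hex 20) gives 97 (hex 61), i.e. 'a'
upper=$(printf '%d' "'A")       # 65: the ASCII code of 'A'
lower=$(( upper | 0x20 ))       # 97: the same letter with the case bit set
printf "\\$(printf '%03o' "$lower")\n"    # render byte 97 as a character: 'a'
```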
It can be a source of confusion that the printable representation of an eight-bit byte may take as many as three bytes to display. For instance, displaying a byte with the bit pattern '01111011' (decimal 123) using printable ASCII characters requires the three ASCII bytes 31, 32, and 33 hexadecimal: the characters '1', '2', and '3'. What's more, the calculation that converts a raw value into that string of decimal digits can be tricky, at least in very low-level languages such as assembler. In contrast, converting a raw value to a printable string in hexadecimal or even octal is simple, because each hex or octal digit corresponds to a fixed group of bits.
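We can see those three bytes directly with a minimal od check:

```shell
# The three characters '1' '2' '3' are themselves the bytes 31, 32, 33 hexadecimal
printf '123' | od -An -tx1
```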
Common Language
In computing there is a lexicon that is commonly used to describe the nature of digital data, especially in the numeric sense. Some of the commonly used expressions are technically inaccurate, but it is nevertheless the standard way to describe certain situations. To the uninitiated, this language can be misleading, and I want to explain a bit about how the language is used and what it is used to describe.
One of the commonly confusing uses relates to the word 'binary'. When we refer to 'binary' data, the term generally describes a data set that is not 'human readable'. It contains a varied mix of all data bytes, both printable and non-printable. Usually this describes something like an object file of compiled code. While there may be some embedded strings of printable data, it is not the sort of data that could be correctly modified using a text editor such as vi or, in Windows, Notepad. The bytes are meaningful to certain kinds of programs, and are almost always created by programs for specific purposes. We sometimes want to visualize these kinds of data, and for that we use tools that can translate the bytes or multi-byte words into some orderly arrangement of human-readable text. In Linux, one such general-purpose tool is the program 'od'.
We can use od to display any data file in a great variety of formats. It is probably instructive to take a few minutes to play around with od, displaying some 'binary' file in a few interesting ways. Doing so may help drive home the distinction between the value of a byte and the manner in which we visualize it. od allows us to display a data file in multiple formats, side by side. This can make it clear how a byte's value may have different meanings depending on the context.
Code:
dd if=/dev/sda count=1 bs=512 | od -tx1c
0000000 fa b8 00 10 8e d0 bc 00 b0 b8 00 00 8e d8 8e c0
        372 270  \0 020 216 320 274  \0 260 270  \0  \0 216 330 216 300
0000020 fb be 00 7c bf 00 06 b9 00 02 f3 a4 ea 21 06 00
        373 276  \0   | 277  \0 006 271  \0 002 363 244 352   ! 006  \0
0000040 00 be be 07 38 04 75 0b 83 c6 10 81 fe fe 07 75
         \0 276 276  \a   8 004   u  \v 203 306 020 201 376 376  \a   u
0000060 f3 eb 16 b4 02 b0 01 bb 00 7c b2 80 8a 74 01 8b
        363 353 026 264 002 260 001 273  \0   | 262 200 212   t 001 213
0000100 4c 02 cd 13 ea 00 7c 00 00 eb fe 00 00 00 00 00
          L 002 315 023 352  \0   |  \0  \0 353 376  \0  \0  \0  \0  \0
0000120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0000660 00 00 00 00 00 00 00 00 7a a8 02 00 00 00 00 20
         \0  \0  \0  \0  \0  \0  \0  \0   z 250 002  \0  \0  \0  \0
0000700 21 00 83 fe ff ff 00 08 00 00 00 00 2a 01 00 fe
          !  \0 203 376 377 377  \0  \b  \0  \0  \0  \0   * 001  \0 376
0000720 ff ff 05 fe ff ff fe df dc 01 02 78 3f 1b 00 fe
        377 377 005 376 377 377 376 337 334 001 002   x   ? 033  \0 376
0000740 ff ff 82 fe ff ff 00 08 2a 01 00 d0 b2 00 00 00
        377 377 202 376 377 377  \0  \b   * 001  \0 320 262  \0  \0  \0
0000760 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa
         \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0   U 252

The above demonstrates a block of bytes, represented in both hex and the corresponding ASCII characters. Where there is no printable ASCII character version of a particular byte, octal notation is used instead. So here we see an arbitrary block of 512 bytes (a hard-disk Master Boot Record, actually) displayed in at least three different forms. The real-world incarnation of these expressions is a pattern of magnetic domains on a spinning platter, and as long as those don't change, the computer will continue to boot correctly. All our example does is point out that the way we express those patterns of magnetic domains can take whatever form we choose, according to our requirements.
Stay tuned for more on this subject.