My intention here is to explain some things that are frequently misunderstood in the field of computing. Usually, these will involve something technical, and will be of concern to programmers and others who use computers in ways that are beyond the end-user level.
A Bit About Digital, Part II
Posted 07-08-2012 at 04:47 PM by theNbomr
In Part I, we started to describe how computers use a binary system of storing and manipulating digital data. This is a continuation of that subject.
What we see isn't really what we have.
When we read text on our computer screen, we often see numbers like 0 and 1, just as I've used many times in this text. It may come as a surprise to some that when we see the glyph on the screen that we recognize as a zero, it is put there because the underlying byte that represents it is not numerically a zero. In order for a computer terminal or printer to display that little glyph, we actually had to send it the value 48 (30 in hexadecimal). This is a byproduct of the machinery used early in the computing field to render human-readable versions of numeric and text data.
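We can check this for ourselves on any Linux system with the standard printf and od utilities. The following sends the single character '0' through od, which dumps the byte that actually carries it:

```shell
# Print the character '0', then dump the byte that encodes it, in decimal
printf '0' | od -An -td1
```

The output is 48, not 0: the glyph we see and the byte behind it are two different things.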
Terminals and printers are used to produce a visual representation of the bytes we send them. Somewhere in the early history of computing, someone had the good sense to create a widely used standard for the encoding of printable and other bytes that terminals, printers, and some other devices use. That standard defined what we call ASCII encoding. It allows devices like terminals, printers, modems, etc., to be at least somewhat interchangeable.
What the ASCII standard defines is how bytes are used to render all of the various uppercase and lowercase characters, numerals, and punctuation marks. It also defines what many of the non-printing bytes do, such as carriage returns, linefeeds, tabs, backspaces, and deletes. The standard was fairly straightforward, spelling out what bytes 0 to 127 did. For those of you who didn't do the mental gymnastics right away, I'll make it a little easier: the standard spells out the purpose of all of the bytes from 0 to 7F hexadecimal. Does that make it easier to see that only the bytes with the most significant bit set to 0 were spelled out? This gave rise to the notion of 7-bit data.
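A trivial sketch, using shell arithmetic, shows where that boundary sits in numeric terms:

```shell
# 7F hex is the highest code ASCII defines; 80 hex has the most significant bit set
printf '%d\n' 0x7F    # 127: the highest 7-bit value
printf '%d\n' 0x80    # 128: the smallest value with the top bit set, outside ASCII
```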
7-bit data made a lot of sense at one time, when communication was done via slow modems. Saving one bit in each byte transmitted was an automatic 12.5% increase in speed. Since there was nothing any device could use the high bit for anyway, there seemed to be little reason to use it. It also explains the choice of encoding for the delete character. One type of ASCII device in common use was the paper tape writer/reader, which worked by punching a series of holes in a paper tape. Each of the seven holes represented a bit in a byte: if the hole for a particular bit was left unpunched, it represented a zero bit, and if it was punched through the paper, it was a one bit. Since a bit could never be unpunched, it was impossible to arbitrarily modify the value of a byte. However, the value of a byte could always be changed to 'all bits set to one' by punching every bit out of the tape. Such a byte was treated as ignored, and so you could 'delete' any byte on the tape that way. That is why the delete character is encoded as 7F hexadecimal: all seven bits set.
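We can verify that the delete character really is the all-ones 7-bit byte, again assuming the standard printf and od utilities:

```shell
# Emit the delete character (octal 177) and dump it back in octal and in hex
printf '\177' | od -An -to1    # octal 177: all seven low bits set
printf '\177' | od -An -tx1    # the same byte as hex 7f
```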
Back to the point about the ASCII values not corresponding to their human-readable representations. The printable characters in the ASCII table are all of those with values greater than or equal to 32. The uppercase and lowercase characters are arranged in the table so that they differ in only one bit per alphabetic character. This makes it easy to convert between upper and lower case; simply set the bit with value 32 (bit 5, counting from zero) of an uppercase character's ASCII value to make it lowercase. The values 0-31 are assigned to special purposes. Most terminals can generate ASCII bytes 1 through 26 by typing the 'control' version of each alphabetic character. For instance, ASCII byte 3 is often emitted by holding the control key and pressing 'C'; Ctrl-C for short.
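A small sketch of that single-bit case conversion in shell arithmetic (the numbers, not the tool, are the point here):

```shell
# 'A' is 65 (hex 41); OR-ing in the bit with value 32 (hex 20) gives 97 (hex 61), i.e. 'a'
upper=$(printf '%d' "'A")       # 65: the ASCII code of 'A'
lower=$(( upper | 0x20 ))       # 97: the same letter with the case bit set
printf "\\$(printf '%03o' "$lower")\n"    # render byte 97 as a character: 'a'
```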
It can be a source of confusion that the printable representation of an eight-bit byte may take as many as three bytes to display. For instance, displaying a byte with the bit pattern '01111011' (decimal 123) using printable ASCII characters requires the three ASCII bytes 31, 32, and 33 hexadecimal: the characters '1', '2', and '3'. What's more, the calculation that converts a raw value into that string of decimal digits can be tricky, at least in very low-level languages such as assembler. In contrast, converting a raw value to a printable string in hexadecimal or even octal is simple, because each hex or octal digit corresponds to a fixed group of bits.
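We can see those three bytes directly with a minimal od check:

```shell
# The three characters '1' '2' '3' are themselves the bytes 31, 32, 33 hexadecimal
printf '123' | od -An -tx1
```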
Common Language
In computing there is a lexicon that is commonly used to describe the nature of digital data, especially in the numeric sense. Some of the commonly used expressions are technically inaccurate, but it is nevertheless the standard way to describe certain situations. To the uninitiated, this language can be misleading, and I want to explain a bit about how the language is used and what it is used to describe.
One of the commonly confusing uses relates to the word 'binary'. When we refer to 'binary' data, the term generally describes a data set that is not 'human readable'. It contains a varied mix of all data bytes, both printable and non-printable. Usually this describes something like an object file of compiled code. While there may be some embedded strings of printable data, it is not the sort of data that could be correctly modified using a text editor such as vi or, in Windows, Notepad. The bytes are meaningful to certain kinds of programs, and are almost always created by programs for specific purposes. We sometimes want to visualize these kinds of data, and for that we use tools that can translate the bytes or multi-byte words into some orderly arrangement of human-readable text. In Linux, one such general-purpose tool is the program 'od'.
We can use od to display any data file in a great variety of formats. It is probably instructive to take a few minutes to play around with od, displaying some 'binary' file in a few interesting ways. Doing so may help drive home the distinction between the value of a byte and the manner in which we visualize it. od allows us to display a data file in multiple formats, side by side. This can make it clear how a byte's value may have different meanings depending on the context.
Code:
dd if=/dev/sda count=1 bs=512 | od -tx1c
0000000 fa b8 00 10 8e d0 bc 00 b0 b8 00 00 8e d8 8e c0
        372 270  \0 020 216 320 274  \0 260 270  \0  \0 216 330 216 300
0000020 fb be 00 7c bf 00 06 b9 00 02 f3 a4 ea 21 06 00
        373 276  \0   | 277  \0 006 271  \0 002 363 244 352   ! 006  \0
0000040 00 be be 07 38 04 75 0b 83 c6 10 81 fe fe 07 75
         \0 276 276  \a   8 004   u  \v 203 306 020 201 376 376  \a   u
0000060 f3 eb 16 b4 02 b0 01 bb 00 7c b2 80 8a 74 01 8b
        363 353 026 264 002 260 001 273  \0   | 262 200 212   t 001 213
0000100 4c 02 cd 13 ea 00 7c 00 00 eb fe 00 00 00 00 00
          L 002 315 023 352  \0   |  \0  \0 353 376  \0  \0  \0  \0  \0
0000120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0000660 00 00 00 00 00 00 00 00 7a a8 02 00 00 00 00 20
         \0  \0  \0  \0  \0  \0  \0  \0   z 250 002  \0  \0  \0  \0
0000700 21 00 83 fe ff ff 00 08 00 00 00 00 2a 01 00 fe
          !  \0 203 376 377 377  \0  \b  \0  \0  \0  \0   * 001  \0 376
0000720 ff ff 05 fe ff ff fe df dc 01 02 78 3f 1b 00 fe
        377 377 005 376 377 377 376 337 334 001 002   x   ? 033  \0 376
0000740 ff ff 82 fe ff ff 00 08 2a 01 00 d0 b2 00 00 00
        377 377 202 376 377 377  \0  \b   * 001  \0 320 262  \0  \0  \0
0000760 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa
         \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0   U 252

The above demonstrates a block of bytes, represented in both hex and the corresponding ASCII characters. Where there is no printable ASCII character version of a particular byte, octal notation is used instead. So here we see an arbitrary block of 512 bytes (a hard-disk Master Boot Record, actually) displayed in at least three different forms. The real-world incarnation of these expressions is a pattern of magnetic domains on a spinning platter, and as long as those don't change, the computer will continue to boot correctly. All our example does is point out that the way we express those patterns of magnetic domains can take whatever form we choose, according to our requirements.
Stay tuned for more on this subject.