There is a difference between a full table scan with string compares and an index; The analogy about reading the book is of course not 100% analogue to how humans read and do things; What I was referring to is the scanning many times hence and forth.
What you must realize is that a string compare is a hard task for a computer, in this sense that computers can only calculate. So to go a bit into detail: a string is an array of characters, which are technically speaking just 8bit integers (or 1 byte); so in order to see if strings are equal, a computer has to compare those in a loop:
for each letter in the string:
check if letter equals the letter in the same location on the other string
(there are shortcuts, which are taken; not going into that detail, as it will dive too deep at this point)
For "string equals other string" the shortcuts are fairly easy and fast enough, but "string contains ..." this is much harder; there are far fewer shortcuts
Since computers are not particulary good at searching in text, but are very good at processing numbers, binary trees were brought forth;
Now, what happens -generally- when you perform a SQL query:
Code:
SELECT field1, field2 FROM table1 WHERE field3 = value
If there is an index, the
index is searched: this will lead to the numbers of the records which are to be presented; The whole record is read, but only the fields of that record will be displayed. In this read operation, an amount of data equal to n times the blocksize is read (n being an arbitrary number); This makes that a table scan can be done fairly fast: each read operation is going to lead to a great amount of records, without having to read those records themselves. The index-read leads to a certain amount of records, and this can speed up things, as the database knows how big each record is and can thus
jump to those records; which is way faster than a complete read operation.
Indexes are not always going to speed up your operations; on small tables, they actually make things slower.
Also, indexes make write operations slower at all times; they're good for retrieval operations.
In analogy to bookreading: In books you also have indexes; I for one, never read the index of a very small book; I can browse to the required data faster if I have some idea what the book is about; However, as a human I can read text fast, which a computer cannot (all in relativity); However, in a big book (take a cookbook with thousands of recipes, and I want to find the recipe for goulash I will take the index and see how far I can skip ahead to find that one; It speeds up my search operation.
However, if there's a small cookbook "5 ways to boil an egg", I'm not going to use the index to find the recipe for a hard-boiled egg. (usually, smart databases will neither; the query optimizer takes care of such things)
As you might notice here, right now: the entire table is never read into memory; it is scanned and parts of the table may be in memory for a while, but they do not "reside" there; what would be in memory would be a temporary copy of the actual data in the database. Because: what happens when there's a powerfailure during the query?
So, what modern RDBMs do, is store the current data to a "before image", perform the query, and when it completes successfully the before image can be destroyed. If there is a powerfailure during the query, the database can always be restored to the original state by performing a roll-back if the before image write has completed; if not: the data is still in the original state; If the powerfail happened during the query, the rollback can place the data back to the original state. The query is lost, but that's not as bad as a database that is in an inconsistent state.
To prevent data-loss, the "after-image" was invented; how the data would look after the data has been committed; Together with the before image, you can do a roll-back and a roll-forward to have the database in a consistent state at all times
and not losing the query. Keeping the data in this consistent state has a cost, all those write operations are costly. But you also see that the database itself, and thus the tables are and must be on the disk.
Now for cross referencing:
If you want to get the idiosyncrasies (I hope I spelled that right) of all the persons, you would do "SELECT * FROM idio_table"; and this would probably be a single read operation. However, if you want to split this up in 50 tables the data is most likely spread out, and thus would cost far more (about 50) read operations; And even with that: how would your query look like
SELECT * FROM people_table INNER JOIN idio1_table ON people_table.id = idio1_table.id INNER JOIN idio2_table ON people_table.id = idio2_table.id ... etc
or
SELECT people_table.name FROM people_table,nose_pickers WHERE people_table.id IN nose_pickers.id; ... etc...
it will take you ages to write the right queries, and with that you also have to maintain as many indexes (to make sure you don't get duplicates) as well as alot of cross referencing; your updates will cost you several read/write operations, whereas if you kept it in 1 table it would be a single read/write operation.
I think this is not too hard to grasp; So: yes, you can write your own program that would manage the data in your way, but I think it'd be slower than the most bloated of databases, where that bloated database provides also all the tools to keep your data consistent at all times.