LinuxQuestions.org
Files, directories and links

Posted 12-26-2023 at 08:09 AM by hazel
Updated 02-20-2024 at 07:29 AM by hazel

Here's the dirty secret: directories/folders on a computer don't actually contain any files at all. Nary a one! In fact, directories are themselves files and therefore can't contain anything but data. They do contain rather unusual data compared with other files, and there are all kinds of safety checks in standard file-handling utilities that stop you from looking inside them as you would with any other type of file, but files they are all the same.

I think we have been seriously misled by the universal desktop convention of representing directories as folders. This may be helpful to complete newbies who simply want to be able to access files and use the data contained in them without having to learn a whole lot of stuff about filesystem organisation, but it leads to a lot of people having quite the wrong idea about how directories actually work. After all, the traditional manila folders that were used in offices when I was young did contain files. The files consisted of sheets clipped or stapled together (originally they were tied together with a thread, fil in French) and all the files in a given folder would have something in common such as date or subject.

Point for consideration: a manila folder could not contain another folder, let alone multiple ones. There simply wasn't enough room. And this should be a warning that there is something very odd about the folder metaphor as applied to computers.

I was reminded about this when reading contributions to a thread that I started about using the rsync program as a simple backup tool. One very interesting suggestion was to maintain a sequence of dumps using rsync's --link-dest option, which creates hard links to an earlier dump for those files that have not changed (by default rsync would simply ignore these). Each dump would therefore be a complete directory tree although no unchanged file would actually be copied repeatedly. Naturally I wondered what would happen to those directories full of hard links when their apparent target, the original dumped directory, was removed in due course. Would the "complete" copy need to be redone periodically?

No, not at all, as I soon realised. Copying actual file data to a safe place on another partition is logically separate from indexing their new positions in a rewritten directory tree, although any program that copies files (for example cp) will necessarily have to do both. Every directory entry actually links a filename to a file located elsewhere. Strictly speaking the link is to the file's inode (see below), which contains everything the kernel needs to know about the file in order to access it, including its physical location on the partition. Thus in the suggested dump scheme, the "master" would actually be no different from any of the later dumps. Each one of them would just be a collection of names and hard links. So I was able to reassure another contributor who had raised the same question as I had been considering.
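This is easy to demonstrate with plain ln, which creates the same kind of hard link that rsync's --link-dest option does. A minimal sketch (all paths invented for the demo):

```shell
# Sketch of the idea using plain ln (the same mechanism that
# rsync's --link-dest relies on). All paths here are invented.
tmp=$(mktemp -d)
cd "$tmp"

mkdir master dump1
echo "important data" > master/file.txt

# "Back up" by hard linking: no data is copied, just a new name.
ln master/file.txt dump1/file.txt

# Now delete the master. The inode's link count drops from 2 to 1,
# so the data blocks stay allocated and dump1 still works.
rm -r master
result=$(cat dump1/file.txt)    # still "important data"
```

Deleting the "master" only removes one name; the dump's own hard link keeps the inode and data blocks alive.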

All Unix variants deal with files in the same way. Every file consists of one or more blocks which contain the data (the information in the file) and an inode, a small structure which contains the metadata (information about the file). This includes its size, owner, access rights and timestamps (last access, last modification and last inode change; classic Unix inodes do not record a creation date). The inode also contains the addresses of all the data blocks. So once the kernel has the inode, it effectively has access to the entire file.
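On a GNU system you can inspect most of this metadata with stat. A quick sketch (the file name is invented):

```shell
# Inspect a file's inode metadata with GNU stat (coreutils).
tmp=$(mktemp -d)
echo "hello" > "$tmp/demo"

stat "$tmp/demo"                  # human-readable dump: size, blocks,
                                  # inode number, link count, owner, times
size=$(stat -c %s "$tmp/demo")    # size in bytes ("hello" plus newline = 6)
links=$(stat -c %h "$tmp/demo")   # hard link count (1 for a new file)
```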

One thing that the inode doesn't contain is the filename. Filenames are stored in directories, not inodes. In fact, directories are nothing more than maps for converting filenames to inodes. That of course is why they are called directories. The correct analogy is not with a manila folder containing files, but with a telephone directory, which does not contain telephones but telephone numbers and the names of the people who own them. Similarly a file directory contains inode numbers and the names of the files that own them.
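You can see the map directly with ls -i, which prints the inode number next to each name. A small sketch with made-up names:

```shell
# Show the name-to-inode map that a directory really is.
tmp=$(mktemp -d)
touch "$tmp/alpha" "$tmp/beta"

ls -i "$tmp"                      # each line: <inode number> <name>
inode=$(stat -c %i "$tmp/alpha")  # the number that 'alpha' maps to
```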

More than one filename can be linked to the same inode. These can be different names in the same directory: on some Linux systems g++ (the C++ compiler driver) is actually hard linked to gcc (the C compiler driver), and busybox does the same thing on a grand scale. These are the same program stored in a single file, but the program is coded to behave differently depending on which name you invoke it by. Or the extra link(s) can be to filenames (perhaps identical filenames) in different directories. The important thing to remember is that, although you must provide a relative or absolute filename when creating a link, the link is made to the file itself (that is, to its inode) and not to the filename. That is the difference between a hard link and a soft one (see below).
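A toy sketch of the same trick, using a shell script in place of a compiled program (all names invented): one file, several names, and behaviour chosen by the name it was invoked under.

```shell
# A toy multi-call program: one file, two extra names, behaviour
# chosen by the invoking name (the trick gcc/busybox-style tools use).
tmp=$(mktemp -d)
cat > "$tmp/greet" <<'EOF'
#!/bin/sh
case "$(basename "$0")" in
    hello) echo "Hello!" ;;
    bye)   echo "Goodbye!" ;;
esac
EOF
chmod +x "$tmp/greet"

ln "$tmp/greet" "$tmp/hello"    # hard links: three names,
ln "$tmp/greet" "$tmp/bye"      # one inode, one file on disk

out1=$("$tmp/hello")            # Hello!
out2=$("$tmp/bye")              # Goodbye!
links=$(stat -c %h "$tmp/greet")   # 3
```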

One very important field in the inode structure is the hard link count. A newly created file will have a link count of 1 (for its one name). If extra hard links are created, the count goes up; if names are deleted, it goes down. Opening a file to access its data, or dynamically loading a library to access its functions, does not touch this on-disk count: instead the kernel keeps a separate in-memory reference to the inode for as long as it is in use, and that reference disappears when the access terminates.

A file's data blocks cannot be deleted as long as the link count in its inode is non-zero. The filename can be deleted from the directory in which it is stored, but the inode and data blocks remain operational as long as the file is held open. Only when the link count has dropped to zero and the last open reference has gone will the inode and blocks be recycled.
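This bookkeeping is easy to watch with stat -c %h, which prints the hard link count. A sketch with invented names:

```shell
# Watch the hard link count rise and fall as names come and go.
tmp=$(mktemp -d)
echo "data" > "$tmp/a"
before=$(stat -c %h "$tmp/a")     # 1: just the original name

ln "$tmp/a" "$tmp/b"
after_link=$(stat -c %h "$tmp/a") # 2: two names, one inode

rm "$tmp/a"                       # deletes one NAME only
after_rm=$(stat -c %h "$tmp/b")   # back to 1; the data is untouched
content=$(cat "$tmp/b")
```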

This explains how Linux can easily update software while the system is running. There is seldom a need to reboot after an update, as is normal in Windows. Some updated daemons may profit from a restart, but the transfer from old to new versions of libraries is completely smooth. What happens is this:
1) The new library is copied over, and the symbolic link that programs actually use to make their first connection to the library is switched to point to the new version.
2) The old version is deleted. That means that its name is deleted from the directory and the library file's link count is decremented by 1. If the count is now zero and no running program has the library open, it is clearly not in use, so the file can be scrapped and its inode and blocks recycled.
3) If the old library is still in use, the inode and blocks remain as they are. Any programs dynamically linked to that library will continue to use it until they terminate, which may not be until shutdown. Newly launched programs, however, will be forced to use the new version, as the old version's name has disappeared from the directory.
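The whole sequence can be mimicked in the shell, with a plain text file standing in for a library and an open file descriptor standing in for a running program (all names invented):

```shell
# A shell sketch of the update sequence. "libfoo.so" is an invented
# name; an open file descriptor plays the part of a running program.
tmp=$(mktemp -d)
cd "$tmp"
echo "old code" > libfoo.so.1
echo "new code" > libfoo.so.2
ln -s libfoo.so.1 libfoo.so     # the link programs actually open

exec 3< libfoo.so.1             # a "running program" opens the old library
ln -sfn libfoo.so.2 libfoo.so   # step 1: switch the symlink
rm libfoo.so.1                  # step 2: delete the old name

old=$(cat <&3)                  # step 3: the open copy is still readable
exec 3<&-                       # the "program" terminates; last reference gone
new=$(cat libfoo.so)            # newcomers follow the link to the new version
```

Note that even after rm, the "program" holding fd 3 can still read the old contents in full; only when it closes the descriptor can the kernel recycle the inode and blocks.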
Linux newbies are usually aware that "." is a synonym for the current working directory and ".." is a synonym for the parent directory. But many people think that this is somehow built into Linux/Unix. It isn't. "." and ".." are perfectly normal filenames (they just look a bit odd), and their equivalence to directory names is due to simple hard linking. When you create a new directory with mkdir or some graphical equivalent, these hard links are automatically inserted into it.

This has an interesting result when it comes to searching directories for information: a directory's hard link count will always be 2+N, where N is the number of subdirectories. Each subdirectory contains a hard link called "..", and the other two links are the directory's entry in its parent and its own local "." entry. So search programs like find can recognise at once how many subdirectories there should be. When that number has been checked off, it is no longer necessary to test whether the remaining daughter files are directories. They won't be! This greatly speeds up tree searches, but it requires a complete ban on users creating additional hard links to any directory.
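A quick check of the 2+N rule (directory names invented; note that btrfs is an exception here and reports a link count of 1 for every directory):

```shell
# A directory with three subdirectories should show a link count of
# 2 + 3 = 5: its entry in its parent, its own ".", and one ".." per
# subdirectory. (Exception: btrfs always reports 1 for directories.)
tmp=$(mktemp -d)
mkdir -p "$tmp/parent/sub1" "$tmp/parent/sub2" "$tmp/parent/sub3"
count=$(stat -c %h "$tmp/parent")
echo "$count"
```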

Though hard linking, especially in its dynamic form, is pretty universal in Linux, most users are unaware of how prevalent it is. When creating shortcuts to files, one almost always uses symbolic or soft links, which serve a similar purpose but have a quite different structure and are much more flexible. A symbolic link (made with ln -s) is a special file containing a reference to the name of another file as listed in some directory, not necessarily the directory that contains the link. The filename may be a full absolute or relative pathname rather than a local name. The target can itself be a directory; soft links to directories are legal. The target file need not even be on the same partition. Hard links can only be made to a file on the same partition, because each partition has its own separate inode table.
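You can see that a symbolic link really is just a stored pathname: its file size is exactly the length of that pathname, and the target doesn't even have to exist (names invented):

```shell
# A symlink is a tiny file whose contents are a pathname.
tmp=$(mktemp -d)
ln -s "/some/target/path" "$tmp/link"   # target needn't exist

stored=$(readlink "$tmp/link")   # prints the stored path verbatim
size=$(stat -c %s "$tmp/link")   # size of a symlink = length of the
                                 # stored path string (17 bytes here)
```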

A pair of hard links is symmetrical: each links directly to the target file's inode. If either is deleted, the other will still work, and so the inode and the blocks will still be accessible. But a soft link is purely to the name of the file, which in turn is linked to the file's inode. If that hard-linked name is deleted, the soft link becomes invalid. It doesn't disappear, because it is a file in its own right, but it no longer points to anything real. Most file managers will display such "broken links" in a different colour to show that something has gone wrong. Typically the names of valid symbolic links are shown in cyan and invalid ones are red.
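A broken link is easy to manufacture and to detect in the shell: test -L sees the link file itself, while test -e follows it to the (missing) target (names invented):

```shell
# Break a symlink by deleting its target, then detect the breakage.
tmp=$(mktemp -d)
echo "hi" > "$tmp/target"
ln -s "$tmp/target" "$tmp/link"
rm "$tmp/target"                        # the link now points at nothing

[ -L "$tmp/link" ] && still_a_file=yes  # the link file itself survives
[ ! -e "$tmp/link" ] && dangling=yes    # but following it finds nothing
```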

Note: If you want to see what a symbolic link actually points to, you can use
Code:
readlink filename
or
Code:
ls -l filename