Combine symlinks into one file?

iantresman · 11-21-2011, 06:51 PM

Is it possible to combine several symlinks into one object? It just seems inelegant to have perhaps a dozen symlinks in a directory, when a single "batch file" would do?

Nominal Animal · 11-21-2011, 11:39 PM

I do not understand what you mean.

Combining different symlinks into one entity makes as much sense as combining multiple entries in a phonebook into one. I cannot even fathom how it would work.

In Unix-like systems, files are just names, or directory entries, that point to inodes: the actual data blobs with ownership, access mode, and other metadata. The relationship between the two is called a hardlink. Each inode can have multiple directory entries (hardlinks) to it.

Symlinks are files (directory entries) whose data is special, and names the (relative) path to the actual file (or next symlink). When accessing a symlink, the kernel does the redirection automatically. Userspace does have a couple of special functions (lstat, readlink) that you can check if a directory entry is a symlink or not, and a couple more to manipulate symlinks (symlink, symlinkat, unlink) but otherwise the symlinks cannot be manipulated.

Although it varies a bit between filesystems how the 'data' part of a symlink is stored, they are extremely efficient. In current Linux filesystems the directory entry management and path walking uses very advanced algorithms; ones that are at par, or better, than those used in relational databases. Symlinks are a native part of this mechanism, and the cost (in memory, storage, and CPU time used) in using them is pretty close to the theoretical minimum (unless you are willing to compromise the performance of the other filesystem features).

In other words, I have no idea what prompted the idea for you. It makes no sense, and it certainly would yield no performance benefits. Have you perhaps seen something like that in Windows? Just because most people use Windows and not Linux on desktop PC's, it does not mean Windows is better. For example, the rock-solid ext4 filesystem I prefer to use beats NTFS in benchmarks.

iantresman · 11-22-2011, 02:46 AM

I think I understand how it works physically. I might have half a dozen separate symlinks:

file1.htm (to new redirected file)
file2.htm "
file3.htm "
file4.htm "
file5.htm "
file6.htm "

But I can see no reason why I can't have a symlink "package" which contains exactly the same information, which the operating system looks at. For example, I could probably simulate my set of symlinks using mod_rewrite in my htaccess file, which makes my directory listing tidier,:

htaccess (single file which redirects file?.htm to new files)

Nominal Animal · 11-22-2011, 08:40 PM

Ah, now I understand: you're wondering if you could have a symlink named using a regular expression, that works for many files, not just one. For example, file(.*).htm -> actual-file-$1 in your case, right?

The kernel does not use glob patterns or regular expressions. It only uses exact filenames. Glob patterns are only used by shells and userspace libraries; the kernel has no notion of them. Actually, even file names are opaque cookies to the kernel; as long as a name does not contain a slash '/', and is terminated by a NUL '\0', and is not overly long, the kernel is happy. It does not even care about the character set used for file names; it is all just bytes to it.

If the kernel were to look for such pattern-based symlinks, it would have to burn quite a lot more CPU time. Processing glob patterns is not trivial, and processing regular expressions takes significant CPU time, compared to the amount needed by current filesystem accesses in Linux. Simply put, they would be quite slow, and need quite a bit of code, to implement.

However, such "virtual symlinks" can be implemented, if desired.

One could add a kernel module that, in case a file was not found, checks the "virtual symlinks" in the directory. These "virtual symlinks" could be specified in a centralized file, and/or a file stored in each directory. Basically it would override certain syscalls, so that if they fail (to find the target file), the "virtual symlinks" are checked instead. This would have minimal impact on normal accesses, but it would be a pretty big module.

One could also use the LD_PRELOAD environment variable to preload a library that overrides the relevant syscalls, and handles the "virtual symlinks". This would work just like the kernel module, except within the userspace processes themselves. You would not need to use the library for all processes, only the ones that should "see" the virtual symlinks.

One might also be able to use the fanotify features available in Linux 2.6.37 and later kernels, to catch the accesses to nonexistent files in userspace. I'm not familiar with this side of fanotify, so I'm not certain it is possible.

The above methods would provide "virtual symlinks" without anything showing up in directory listings.

There is a much simpler alternative, however: You can generate symlinks automatically, based on regular expressions (stored in some configuration file, and/or in each directory), whenever files are created, deleted or renamed.

In this case, one would use either inotify (asynchronous, the events are notified only after they have already occurred) or fanotify (synchronous, the process that caused the event can be "paused" until the event has been processed) to monitor file events, and regenerate the symlinks in each affected directory whenever necessary.

The kernel and userspace ignore symlink mode and owner, so you'd use a dedicated mode and/or owner to mark these generated symlinks. (That way you can still create normal symlinks in a directory, and they won't affect the autogenerated ones.)

Using the inotify-tools package, specifically inotifywait command, this symlink regenerator can be written as a simple Bash script. The delay between file name changes and corresponding symlinks is quite short, certainly less than a second. One could use awk to generate the list of desired symlinks in each directory, so you could define the patterns using regular expressions. (With fanotify , accesses can be delayed until the symlinks are up to date with all changes, but it would need a recent kernel to work, and the code would have to be written in C to minimize the delays.)

Yet, the question remains: Why?

It is rare to have batteries of symlinks, that have a common defining rule. Batteries of symlinks are not rare, for example many distributions use /etc/alternatives to define which command variant is used when using a generic name -- for example, awk may really be gawk, mawk, or nawk --, but there is no general rule on how they are defined. (In fact, most distributions have tools to select between variants, and these tools modify individual links in /etc/alternatives.)

If you can convince me you have a real use case, a real need, for pattern-defined symlinks, I will write a working inotify-based Bash/gawk daemon to manage them. If not for anything else, for testing and as an example.

I myself have managed all kinds of Linux desktops, servers, and even clusters, for well over a decade now, and I've never encountered any need for something like this. (Aesthetics are based on previous experience, and with regards to Linux/Unix machines, the notions tend to change as you use the systems in new ways, so I must consider "it'd be prettier" an insufficient reason.)

Finally, I'd like to really encourage you to continue looking at Linux as something you can change to better suit your needs. It will pay off in the long run; it has for me. The main reason I stopped using Windows systems several years ago, was the prevalent view that there is one way to do things, the way the software companies suggest. Instead of changing the software to conform to users' optimal workflow, users are forced to adopt a workflow the software allows. The users' choice is limited to picking the software they most prefer. The most prevalent complaint about Linux distributions I keep seeing is that the default workflows are not as good as corresponding commercial software. To me, that is irrelevant, as long as I can edit and tweak it for my (actually my users') most efficient workflow. (I'm just waiting for people to realize that banding together user interface specialists, business logic experts, ergonomics experts, and true Linux experts able to adapt software to specified needs, would allow such a team to design and implement much more efficient workflows than currently available, tailored to the needs of individual companies or departments.)

iantresman · 11-23-2011, 04:31 AM

Many thanks for your very comprehensive reply, certainly much to think about. Your last comment is most important, "looking at Linux as something you can change".

Consequently, I need to workout the best solution that I can manage, as nice as the "virtual symlinks" sound. I was ideally hoping that there was a simply "command" or htacess solution that I was unaware of.

Thanks again for highlighting the options, much appreciated. Now I have a lot of reading to do!

Nominal Animal · 11-23-2011, 07:34 AM

If you only need the symlinks for Apache, then mod_rewrite should work well.

There are a couple of things to note:

You need to enable URL rewriting using
RewriteEngine On
and have option FollowSymlinks or FollowSymlinksIfOwnerMatch enabled for the directory (preferably in a parent directory), otherwise the RewriteRules are ignored.
To use mod_rewrite in a .htaccess file, you need to
AllowOverride FileInfo
at least, in the Apache configuration files relevant to this directory tree.
Obviously All will also work.
When using a .htaccess file, or a <Directory>, you need to specify the URL prefix to the directory using
RewriteBase /URL/prefix
Apache automatically removes the base from any URLs, as well as all slashes that follow the prefix, before handing the rest to RewriteRule. (The bit that RewriteRule gets never starts with a slash when RewriteBase is used.) If the rule is applied, the prefix is prepended to the result automatically.
If you omit the directive, Apache will "guess", using filesystem path as the default value.
The settings in the .htaccess file are only considered if the URL is relevant to the directory.
Apache "walks" the URL in steps, one path segment at a time. If this walk passes through this directory, then these settings are applied.
If you want globally applied rules, checked before any path walking is done, you need to add those rules into the VirtualHost. If you define no VirtualHosts, then you add it to the root configuration. In this case you do not use RewriteBase, and the URL given to RewriteRule most likely begins with a slash.
Apache reads and compiles there regular expressions at startup (and whenever told to reload configuration). This means these Rewrite rules are much more efficient than those in .htaccess files (since the latter need to be compiled basically every time).

Here is an example .htaccess file, which redirects all files in that directory (but not in subdirectories) to index.cgi, adding the desired file name as PATH_INFO variable, and keeping all GET (and POST, obviously) query parameters intact:

Code:

Options        +FollowSymlinks
RewriteEngine  On
RewriteBase    /this/url/prefix
RewriteRule    index\.cgi$ - [L]
RewriteRule    index\.cgi/ - [L]
RewriteRule    ^([^/]*)$    index.cgi/$1 [QSA,L]

The same when defined inside a <VirtualHost> directive:

Code:

Options        +FollowSymlinks
RewriteEngine  On
RewriteRule    ^/+this/+url/+prefix/+index\.cgi$ - [L]
RewriteRule    ^/+this/+url/+prefix/+index\.cgi/ - [L]
RewriteRule    ^/+this/+url/+prefix/+([^/]*)$    /this/url/prefix/index.cgi/$1 [QSA,L]

The difference in efficiency is insignificant on small loads, especially if you use a scripting language on the redirected to pages.

iantresman · 11-23-2011, 08:07 AM

Gosh, you're very good. I've been computing for as long as I can remember, and some of this stuff has never clicked. Another question if I may. My understanding of mod_rewrite was that it works only on the URL. Can I bring in material outside the domain space? For example, if I have:

--directory
| +---files
|
|--domain1.com
|--domain2.com

I know that I can use PHP to "include" the stuff from directory/files into the space of domain1.com and domain2.com. Can htaccess also do something like that? For example, if my Browser were to point to domain1.com/files, it would show the files from the internal directory/files?

iantresman · 11-23-2011, 10:07 AM

I think I'm getting close, have now discovered the Apache mod_alias directive, and that I may be able to duplicate it in htaccess with mod_rewrite.

iantresman · 11-23-2011, 05:46 PM

I can almost do what I want. The following rewrites, letting me access files in another location as expected:

PHP Code:



RewriteEngine On 
RewriteRule sub/ /var/www/vhosts/mydomain.com/images [PT]

But it won't work outside of my webspace, eg:

PHP Code:



RewriteRule sub/ /var/www/vhosts/library/d6/subdir/ [PT]

Any suggestions?

Nominal Animal · 11-23-2011, 11:29 PM

Quote:

Originally Posted by iantresman

But it won't work outside of my webspace, eg:

PHP Code:



RewriteRule sub/ /var/www/vhosts/library/d6/subdir/ [PT]

You need to allow access to the rewritten-to directory. If you use a Debian-based Linux distribution like Debian, Ubuntu, or Mint, you can add /etc/apache2/conf.d/library containing

Code:

<Directory /var/www/vhosts/library/d6/subdir/>
    AllowOverride   None
    Options         FollowSymlinks
    Order           Allow,Deny
    Allow           From all
</Directory>

In other distributions, and in general, you can add the above snippet to the Apache base configuration (in which case it is inherited by all VirtualHosts), or just the specific VirtualHosts you want.

Note that you probably also want to use form

Code:

RewriteEngine On
RewriteRule sub/+(.*)$  /var/www/vhosts/library/d6/subdir/$1 [L]

The [L] (Last, ignore all other RewriteRules if this applies) flag is more efficient than [PT] (passthrough; if this applies, restart mapping logic from the start, so all RewriteRules and Aliases are considered again from scratch, but using the result of the rewrite).

However, in the VirtualHost configuration,

Code:

Alias /URL/prefix/to/sub /var/www/vhosts/library/d6/subdir/
<Directory /var/www/vhosts/library/d6/subdir/>
    AllowOverride   None
    Options         FollowSymlinks
    Order           Allow,Deny
    Allow           From all
</Directory>

does the very same thing, but much more efficiently. This latter form is however only processed after all rewrites, so you sometimes have to use the RewriteRule instead. (In the VirtualHost configuration, the full RewriteRule equivalent to the Alias is RewriteRule ^/+URL/+prefix/+to/+sub/+(.*)$ /var/www/vhosts/library/d6/subdir/$1 [L])

Keep in mind the order of processing:

Quote:

Aliases and Redirects occuring in different contexts are processed like other directives according to standard merging rules. But when multiple Aliases or Redirects occur in the same context (for example, in the same <VirtualHost> section) they are processed in a particular order.

First, all Redirects are processed before Aliases are processed, and therefore a request that matches a Redirect or RedirectMatch will never have Aliases applied. Second, the Aliases and Redirects are processed in the order they appear in the configuration files, with the first match taking precedence.

For this reason, when two or more of these directives apply to the same sub-path, you must list the most specific path first in order for all the directives to have an effect.

iantresman · 11-24-2011, 08:12 AM

Nominal Animal, thanks once again for all the help. I'm not sure whether I can implement all the suggestions (depends how much access to my server I can get), but I'll certainly give it a try. I've managed to learn more in two days from your good self, than several weeks of internet research and other so-called experts. Fantastic!