Inverted file

Nominal Animal · 04-08-2011, 12:53 PM

Quote:

Originally Posted by Ramurd

I do believe that databases use a likewise approach to sort and index text for fast searching.

Sure, although often there are a number of knobs you need to tune to get the optimum performance.

The reason I suggested a flat file database is that it is very easy to implement, and does not require other services (like MySQL or PostgreSQL) to be installed on the machine. In partial file name matching cases flat files will be faster, because the matching is done with minimal effort and an optimal I/O pattern. The structures even allow multithreaded searching quite easily: simply split the text data into roughly equal-sized parts (each ending at a slash), one per thread, and work in parallel.
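A minimal sketch of the splitting step, assuming the text file is one big buffer of names separated by slashes. The function name split_at_slashes and the bounds array are made up for illustration; each thread would then scan its own [bounds[i], bounds[i+1]) range.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill bounds[0..nchunks] so that chunk i spans [bounds[i], bounds[i+1]).
 * Every interior boundary sits just after a '/', so no name is cut in half.
 * text must point to at least len bytes; bounds must have nchunks+1 slots. */
void split_at_slashes(const char *text, size_t len,
                      size_t nchunks, size_t *bounds)
{
    assert(text != NULL && bounds != NULL && nchunks > 0);

    bounds[0] = 0;
    for (size_t i = 1; i < nchunks; i++) {
        size_t pos = len / nchunks * i;        /* roughly equal split point */
        const char *slash;

        if (pos < bounds[i - 1])               /* never move backwards */
            pos = bounds[i - 1];

        /* Advance to the next slash; if there is none, the chunk is empty. */
        slash = memchr(text + pos, '/', len - pos);
        bounds[i] = slash ? (size_t)(slash - text) + 1 : len;
    }
    bounds[nchunks] = len;
}
```

Chunks near the end can come out empty when there are more threads than names; a real implementation would probably just skip those.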

The one thing that I dislike about databases on a workstation is their memory usage. They consume RAM like candy. On 64-bit platforms with my suggestion you can simply mmap() the two files, and let the kernel handle the I/O. You should help the kernel by using madvise(,,MADV_DONTNEED) to tell it which parts of the file you've already read and won't need again, and set a SIGBUS handler to catch the case when the file contents vanish from under you (for example, if a file is truncated while you are reading it), but that's about it.
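Roughly, the mmap() + madvise() + SIGBUS combination could look like the sketch below. It only counts slashes, as a stand-in for the real matching; count_slashes, WINDOW and the 1 MiB window size are my own illustrative choices, and it assumes Linux semantics for MADV_DONTNEED and a single-threaded scan (the jump buffer is a global).

```c
#define _DEFAULT_SOURCE            /* for madvise() under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define WINDOW ((size_t)1 << 20)   /* scan 1 MiB at a time (assumed size) */

static sigjmp_buf bus_jump;

/* SIGBUS arrives if the file shrinks while we are reading the mapping. */
static void on_sigbus(int signum)
{
    (void)signum;
    siglongjmp(bus_jump, 1);
}

/* Count the '/' bytes in a file; returns -1 on any error, SIGBUS included. */
long count_slashes(const char *path)
{
    struct sigaction act, old;
    struct stat st;
    volatile long count = 0;       /* volatile: modified across sigsetjmp */
    size_t len;
    char *map;
    int fd;

    /* madvise() wants page-aligned addresses, so WINDOW must be a page multiple. */
    assert(WINDOW % (size_t)sysconf(_SC_PAGESIZE) == 0);

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    if (fstat(fd, &st) == -1) { close(fd); return -1; }
    if (st.st_size == 0)      { close(fd); return 0; }  /* mmap() rejects length 0 */

    len = (size_t)st.st_size;
    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                     /* the mapping stays valid after close() */
    if (map == MAP_FAILED)
        return -1;

    act.sa_handler = on_sigbus;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGBUS, &act, &old);

    if (sigsetjmp(bus_jump, 1) == 0) {
        for (size_t pos = 0; pos < len; pos += WINDOW) {
            size_t end = (len - pos > WINDOW) ? pos + WINDOW : len;
            for (size_t i = pos; i < end; i++)
                if (map[i] == '/')
                    count++;
            /* Tell the kernel we are done with this window. */
            madvise(map + pos, end - pos, MADV_DONTNEED);
        }
    } else {
        count = -1;                /* we got here via the SIGBUS handler */
    }

    sigaction(SIGBUS, &old, NULL); /* restore the previous handler */
    munmap(map, len);
    return count;
}
```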

Quote:

Originally Posted by Ramurd

you can see that test.dat is in /opt and /home

Using my suggestion, you'd have /test.dat/ only once in the text file, but twice in the reference table:

Code:

{ .offset = offset of test.dat, .parent = index of opt with parent=0 (root) },
{ .offset = offset of test.dat, .parent = index of home with parent=0 (root) },

If the reference array is sorted by offset, they are all consecutive in the array. (If using a binary search, the search will most likely hit one in the middle, so you'll need to backtrack a bit to find the first match.)
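As a sketch, here is a lower-bound variant of the binary search, which lands on the first match directly so no backtracking is needed. The struct layout (two uint32_t fields) and the name first_ref are assumptions for illustration, not a fixed format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference table entry: where the name starts in the text
 * file, and the index of the parent directory's entry (0 = root). */
struct ref {
    uint32_t offset;
    uint32_t parent;
};

/* refs[] must be sorted by offset. Returns the index of the first entry
 * with .offset >= offset, or count if every entry is smaller. */
size_t first_ref(const struct ref *refs, size_t count, uint32_t offset)
{
    size_t lo = 0, hi = count;

    assert(refs != NULL || count == 0);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (refs[mid].offset < offset)
            lo = mid + 1;          /* first match is to the right of mid */
        else
            hi = mid;              /* mid may be the first match; keep it */
    }
    return lo;
}
```

The caller then walks forward from the returned index while refs[i].offset equals the wanted offset; each entry visited is one directory that contains the name.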

Ramurd · 04-08-2011, 03:10 PM

one point to make clear:

Nominal's solution is the best; I saw the question "do it in SQL" and I got all lazy :-)

There is one thing that'd be really cute for a database approach, though (since databases are generally accessed over TCP/IP): we could also make a third table, machines;

so, you store: unique directory names, unique filenames and unique machines... Now one thing often leads to another... why search and store the files of only one machine? I want to know which machine contains which files, and I want to know it asap. I leave that as an exercise to the OP ;-)

Tinkster · 04-08-2011, 04:48 PM

Quote:

Originally Posted by Nominal Animal

The reason I suggested a flat file database is that it is very easy to implement, and does not require other services (like MySQL or PostgreSQL) to be installed on the machine.

On the mention of Postgres, and the mention of array data types in the
original post: Postgres does have an array data type ;}

Cheers,
Tink

Nominal Animal · 04-08-2011, 07:26 PM

Quote:

Originally Posted by Ramurd

so, you store: unique directory names, unique filenames and unique machines... Now one thing often leads to another... why search and store the files of only one machine?

I just can't resist, sorry..

If you use my suggestion, you could create a third program, a collator, which reads in the databases from each machine, and combines them into one huge one (incrementally or atomically replacing the old one). Then you need a simple service daemon, which responds to queries from TCP/IP, and queries the central database. In this case, you can set the machine name as a "virtual" directory name, so that each path begins with the machine name.

There are some privacy issues, though. Should there be limits on which users can see which files? What if the user has a private directory, say pr0n, which is only accessible to that user? Should other users see the filenames in there or not?

On a single machine you can most easily handle the privacy issues simply by checking if the file or directory is visible to the querying user. See man 2 access, man 2 stat, and man 7 credentials for further info.
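A minimal sketch of that check, assuming the search tool runs as the querying user: access() tests against the real UID and GID of the calling process, so a daemon running as some other user could not use it this way. The name keep_visible is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Filter an array of matched paths in place, keeping only those the calling
 * user can reach. access(path, F_OK) fails if the file does not exist or if
 * any directory on the way is not searchable by the caller.
 * Returns the number of paths kept (they are moved to the front). */
size_t keep_visible(const char **paths, size_t count)
{
    size_t kept = 0;

    assert(paths != NULL || count == 0);

    for (size_t i = 0; i < count; i++)
        if (access(paths[i], F_OK) == 0)
            paths[kept++] = paths[i];

    return kept;
}
```

Note that this only answers "can the user reach this path"; whether the user could have listed the containing directory is a stricter question, and would need a read-permission check on the parent as well.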

When you have a network of multiple machines, the values of uid and gid are useless: they have meaning only locally, on a single machine. For example, if you are using Ubuntu, your UID is quite likely 1000, and so is the UID of a completely different person on another Ubuntu machine. This means that the privacy measures that are easy to apply on a single machine are quite difficult to apply in a networked environment.

The only possibility I know of is to create a mapping between uids on each machine (and the same for gids as well); perhaps via user and group names. This is kludgy, and does not always work that well, but I don't know of any better way.
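A rough sketch of the name-based mapping, using the standard getpwuid() and getpwnam() lookups. The idea is that the machine holding the file runs uid_to_name() and ships the name along with the entry, and the machine answering the query runs name_to_uid(); both function names are hypothetical, and the same would be done for groups with getgrgid() and getgrnam().

```c
#include <assert.h>
#include <pwd.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* On the machine that owns the file: local uid -> user name.
 * Returns 0 on success, -1 if the uid is unknown or the buffer is too small. */
int uid_to_name(uid_t uid, char *name, size_t size)
{
    struct passwd *pw;

    assert(name != NULL && size > 0);

    pw = getpwuid(uid);
    if (pw == NULL || strlen(pw->pw_name) >= size)
        return -1;
    strcpy(name, pw->pw_name);
    return 0;
}

/* On the machine doing the check: user name -> local uid.
 * Returns 0 on success, -1 if no such user exists here. */
int name_to_uid(const char *name, uid_t *uid)
{
    struct passwd *pw;

    assert(name != NULL && uid != NULL);

    pw = getpwnam(name);
    if (pw == NULL)
        return -1;
    *uid = pw->pw_uid;
    return 0;
}
```

This only works when the same user name means the same person on every machine, which is exactly the kludgy part.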

(Okay, there is another: when possible matches are found, have the user log in to the target machine, and do the checks locally. But I don't think this would be quite sane.)