LinuxQuestions.org

04-08-2011, 12:53 PM   #16
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948

Quote:
Originally Posted by Ramurd
I do believe that databases use a likewise approach to sort and index text for fast searching.

Sure, although often there are a number of knobs you need to tune to get the optimum performance.

The reason I suggested a flat file database is that it is very easy to implement, and does not require other services (like MySQL or PostgreSQL) to be installed on the machine. For partial file name matching, flat files should be faster, because the matching is done with minimal effort and an optimal I/O pattern. The structure even lets you use multithreaded searching quite easily: simply split the text data into roughly equal-sized chunks (each ending at a slash), one per thread, and search them in parallel.
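As a rough sketch of that splitting step: the function below assumes the text file is one buffer of slash-terminated names (e.g. "/opt/home/test.dat/"), and picks chunk boundaries just after a slash so that no name is cut in half. The name split_at_slashes() and its interface are made up for illustration; they are not from the original suggestion.

```c
#include <stddef.h>

/* Split data[0..len) into at most `parts` chunks of roughly equal size.
 * Each tentative boundary is moved forward to just after the next '/',
 * so every chunk starts at the beginning of a name.
 * bounds[] must have room for parts + 1 entries; chunk i is
 * [bounds[i], bounds[i+1]). Returns the number of chunks produced. */
size_t split_at_slashes(const char *data, size_t len, size_t parts, size_t *bounds)
{
    size_t n = 0;

    if (len == 0 || parts == 0)
        return 0;

    bounds[0] = 0;
    for (size_t i = 1; i < parts; i++) {
        size_t pos = (len / parts) * i;      /* tentative boundary */

        if (pos < bounds[n])
            pos = bounds[n];                 /* never move backwards */
        while (pos < len && data[pos] != '/')
            pos++;                           /* find the next slash */
        if (pos < len)
            pos++;                           /* slash stays in the earlier chunk */
        if (pos > bounds[n] && pos < len)
            bounds[++n] = pos;               /* skip empty or duplicate chunks */
    }
    bounds[++n] = len;
    return n;
}
```

Each thread would then scan only its own [bounds[i], bounds[i+1]) range. Long names can make the chunks uneven, which is why the sizes are only "roughly" equal.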

The one thing that I dislike about databases on a workstation is their memory usage. They consume RAM like candy. On 64-bit platforms, with my suggestion you can simply mmap() the two files and let the kernel handle the I/O. You should help the kernel by using madvise(,,MADV_DONTNEED) to tell it which parts of the file you've already read and won't need again, and set a SIGBUS handler to catch the case when the files vanish from under you, but that's about it.
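A minimal sketch of the mmap() and madvise() part, for Linux. The helper names map_readonly() and release_range() are hypothetical, error handling is kept to the bare minimum, and the SIGBUS handler is left out entirely:

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns NULL on failure or if the file
 * is empty; on success stores the file size in *len. */
const char *map_readonly(const char *path, size_t *len)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return NULL;
    if (fstat(fd, &st) == -1 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;
    *len = (size_t)st.st_size;
    return p;
}

/* Tell the kernel that bytes [from, to) of the mapping will not be
 * needed again. madvise() wants a page-aligned address, so only the
 * whole pages inside the range are released. */
int release_range(const char *base, size_t from, size_t to)
{
    size_t page  = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = (from + page - 1) / page * page;   /* round up   */
    size_t end   = to / page * page;                  /* round down */

    if (end <= start)
        return 0;                   /* no complete page in the range */
    return madvise((void *)(base + start), end - start, MADV_DONTNEED);
}
```

A scanner would call release_range(base, 0, position) every few megabytes while walking through the file. The pages are simply read back from the file if they are touched again, so getting the range wrong only costs performance.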

Quote:
Originally Posted by Ramurd
you can see that test.dat is in /opt and /home

Using my suggestion, you'd have /test.dat/ only once in the text file, but twice in the reference table:
Code:
{ .offset = offset of test.dat, .parent = index of opt with parent=0 (root) },
{ .offset = offset of test.dat, .parent = index of home with parent=0 (root) },
If the reference array is sorted by offset, all entries for the same name are consecutive in the array. (If you use a binary search, it will most likely hit one in the middle of the run, so you'll need to backtrack a bit to find the first match.)
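The lookup with backtracking could look roughly like this. The field names .offset and .parent are from the example above; the 32-bit types and the function name find_refs() are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

struct ref {
    uint32_t offset;   /* offset of the name in the text file   */
    uint32_t parent;   /* index of the parent directory's entry */
};

/* refs[0..n) must be sorted by .offset. Returns the index of the first
 * entry whose offset equals target and stores the length of the run in
 * *count. If there is no match, returns n and sets *count to 0. */
size_t find_refs(const struct ref *refs, size_t n, uint32_t target, size_t *count)
{
    size_t lo = 0, hi = n;

    *count = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (refs[mid].offset < target) {
            lo = mid + 1;
        } else if (refs[mid].offset > target) {
            hi = mid;
        } else {
            /* Hit somewhere inside the run: backtrack to its first
             * entry, then walk forward to its last one. */
            size_t first = mid, last = mid;

            while (first > 0 && refs[first - 1].offset == target)
                first--;
            while (last + 1 < n && refs[last + 1].offset == target)
                last++;
            *count = last - first + 1;
            return first;
        }
    }
    return n;
}
```

For test.dat the run would contain two entries, one with the parent index of opt and one with that of home. The linear backtracking is fine as long as a name does not occur in a huge number of directories.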
 
04-08-2011, 03:10 PM   #17
Ramurd
Member
 
Registered: Mar 2009
Location: Rotterdam, the Netherlands
Distribution: Slackwarelinux
Posts: 703

Rep: Reputation: 111
One point to make clear:

Nominal's solution is the best; I saw the question "do it in SQL" and I got all lazy :-)

There is one thing that would be really cute with a database approach, though (since you generally connect to databases over TCP/IP): we could also make a third table, machines.

so, you store: unique directory names, unique filenames and unique machines... Now one thing often leads to another... why search and store the files of only one machine? I want to know which machine contains which files, and I want to know it asap. I leave that as an exercise to the OP ;-)
 
04-08-2011, 04:48 PM   #18
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
Originally Posted by Nominal Animal
The reason I suggested a flat file database is that it is very easy to implement, and does not require other services (like MySQL or PostgreSQL) to be installed on the machine.

On the mention of Postgres, and the mention of array data types in the original post: Postgres does have an array data type ;}




Cheers,
Tink
 
04-08-2011, 07:26 PM   #19
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by Ramurd
so, you store: unique directory names, unique filenames and unique machines... Now one thing often leads to another... why search and store the files of only one machine?

I just can't resist, sorry..

If you use my suggestion, you could create a third program, a collator, which reads in the databases from each machine and combines them into one huge one (incrementally, or atomically replacing the old one). Then you need a simple service daemon, which accepts queries over TCP/IP and answers them from the central database. In this case, you can set the machine name as a "virtual" directory name, so that each path begins with the machine name.
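To illustrate the "virtual" directory idea only: the collator could treat every path from a machine as if it lived under a top-level directory named after that machine. The function prefix_machine() below is hypothetical and works on plain path strings rather than on the reference table itself.

```c
#include <stdio.h>

/* Build "/<machine><path>" into out, e.g. "/hostA" + "/home/test.dat".
 * Returns 0 on success, or -1 if path is not absolute or the result
 * does not fit into out[size]. */
int prefix_machine(char *out, size_t size, const char *machine, const char *path)
{
    int n;

    if (path[0] != '/')
        return -1;
    n = snprintf(out, size, "/%s%s", machine, path);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

In the flat file layout this amounts to adding one directory entry per machine, and making each machine's former root entries point to it as their parent.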

There are some privacy issues, though. Should there be limits on which users can see which files? What if the user has a private directory, say pr0n, which is only accessible to that user? Should other users see the filenames in there or not?

On a single machine you can most easily handle the privacy issues simply by checking if the file or directory is visible to the querying user. See man 2 access, man 2 stat, and man 7 credentials for further info.
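A small sketch of such a check, assuming the search program runs as the querying user (not setuid, not as a daemon under another account). The function names are made up for this example. Note that access() does its checks with the real user and group IDs of the calling process.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if path exists and every directory leading to it is
 * searchable by the calling user, 0 otherwise. */
int path_visible(const char *path)
{
    return access(path, F_OK) == 0;
}

/* Returns 1 if path is a directory whose contents the calling user
 * could list (read and search permission), 0 otherwise. */
int dir_listable(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
        return 0;
    return access(path, R_OK | X_OK) == 0;
}
```

The search tool would drop every match for which path_visible() returns 0 before printing the results. Whether a name inside a searchable but unreadable directory should count as visible is a policy decision this sketch does not settle.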

When you have a network of multiple machines, the values of uid and gid are useless. They have meaning only locally, on that single machine. For example, if you are using Ubuntu, your UID is quite likely 1000, and so is the UID of some completely different user on the next Ubuntu machine. This means that the privacy measures that are easy to apply on a single machine are quite difficult to apply in a networked environment.

The only possibility I know of is to create a mapping between uids on each machine (and the same for gids as well); perhaps via user and group names. This is kludgy, and does not always work that well, but I don't know of any better way.
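The mapping via user names could be sketched as follows, using the standard getpwuid() and getpwnam() lookups. The wrapper names are hypothetical, and the sketch assumes the same login name means the same person on every machine, which is exactly the part that does not always hold.

```c
#include <pwd.h>
#include <string.h>
#include <sys/types.h>

/* On the machine that owns the file: translate a local uid to a login
 * name. Returns 0 on success, -1 if the uid is unknown or the name
 * does not fit into name[size]. */
int uid_to_name(uid_t uid, char *name, size_t size)
{
    struct passwd *pw = getpwuid(uid);

    if (!pw || strlen(pw->pw_name) >= size)
        return -1;
    strcpy(name, pw->pw_name);
    return 0;
}

/* On the machine doing the check: translate a login name back to the
 * local uid. Returns 0 on success, -1 if there is no such user here. */
int name_to_uid(const char *name, uid_t *uid)
{
    struct passwd *pw = getpwnam(name);

    if (!pw)
        return -1;
    *uid = pw->pw_uid;
    return 0;
}
```

Groups can be handled the same way with getgrgid() and getgrnam() from <grp.h>. Users that exist on only one machine simply fail the lookup, and their files would then have to be treated as private.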

(Okay, there is another: when possible matches are found, have the user log in to the target machine, and do the checks locally. But I don't think this would be quite sane.)
 
  


All times are GMT -5.
