ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
I am writing a small search program for my class. I have decided to use indexing for my program. Ive researched online about indexing and how search engines do it. If im gonno do that I need to create inverted files to associate files to numbers ( numbers being the index of my paths ) . Now I was wondering what would be the best way to create an inverted file ? I was going to create sql tables using mysql api in C but then again there is no array data type or vectors to store few numbers in a single column in mysql and it is not advised to use Enum or SET .
Any help would be very appreciated .
Thanks a lot.
I am writing a small search program for my class. I have decided to use indexing for my program. Ive researched online about indexing and how search engines do it. If im gonno do that I need to create inverted files to associate files to numbers ( numbers being the index of my paths ) . Now I was wondering what would be the best way to create an inverted file ? I was going to create sql tables using mysql api in C but then again there is no array data type or vectors to store few numbers in a single column in mysql and it is not advised to use Enum or SET .
Any help would be very appreciated .
Thanks a lot.
What do you mean by "inverted" file ? Please give an example.
Also, what programming language you are using, I would recommend bash shell
Inverted files are used for indexing in searching.
Lets say I have these directories in my root :
/ ---> bin sbin var .....
So I create a file that gives numbers to each of these :
1 . bin
2 . sbin
3 . var
Then I start creating an inverted text file for everything on my hard drive .
Lets say I have a document that is in /home/usr/ . This path has the value 6 in the file above , so in my inverted file it would look like :
docuement { 6 }
If you find the same document in another path , you add the number of the path to that list ( vector ).
You can use any programming language that supports file IO. A language that provides support for list would be a boon. But it all comes down to what you know given that the requirements could be met by so many languages.
I am not sure I follow the example given (could just be me). You say you have 3 directories under root (for the example), but you then add to the
example the path /home/usr that has a document yet home is not listed as being part of this example system. Did I miss something??
So I am trying to create a search program that uses indexing to find results very fast. The first thing you do is you go through your full file system and list all the paths in your system and you give them numbers ( indexes ) . For example you start with / directory so ...
1 . /
2 . /home
3 . /var
4 . /etc
and a lot more . then you enter each of those directories and continue the same process
..
25 . /etc/httpd
26 . /etc/hp
That is the path file. The next step is to list every name for every file or document or whatever is in your system in a new file called the inverted file . So lets say as we go through we bump into files called test_file more than once in our system . On of them is in /var , the other is in /home and one in /etc/httpd,so what the inverted file would look like is :
test_file : { 3 , 2 , 25} ---> as you can 2 and 3 and 25 are the values for /home & /var & /etc/httpd in the path file.
Thats how it works . So now if the user looks for test_file , you lookup test_file and you know the numbers and corresponding paths for that name and you tell them where it is instead of searching the system everytime .
Now my question and the reason I started this post was this : I dont want to use text files for this program , because the names in this list have arbitrary length and accessing the row we want would be very time consuming . So I though I would go with sql and use mysql api in C to create a database , with tables , 1 for my paths , and 1 for my filenames ( with sub tables for each letter , so we could access them very fast ) . But in the case of mysql I cant create the example above :
test_file : { 3 , 2 , 25}
because sql only lets 1 data in 1 column ( there are data types called sets and enum but they are not appropriate ) .
Thats what I would like help with . What kind of approach would you use to create these tables or files or whatever .....
If you do it in a SQL way, you can create 3 tables:
(mind: this is example only, supports directory-names up to 1024 characters long)
create table directorynames(
id integer primary key not null,
name varchar(1024) unique
);
create table filenames(
id integer primary key not null,
name varchar(1024) unique
);
create table file_in_dir(
directory_id integer references directorynames(id) not null,
filename_id integer references filenames(id) not null,
primary key on (directory_id, filename_id)
);
you put the (unique) directory names in the appropriate directory, as you do with the file names;
Then for a filename in a directory you can do something like this:
Code:
INSERT INTO file_in_dir (directory_id, filename_id) SET (
(SELECT id FROM directories WHERE name="directoryname"),
(SELECT id FROM filenames WHERE name="filename")
);
this fixes constraints for a many-to-many relationship.
I'd personally use a different approach: flat files.
(Hey! Why did I hear groaning? )
Assuming you want to be able to find files or directories based on a partial name match too, I'd create a file containing each known file or directory name, separated by a slash (/). EOS (zero) and slash are the only two characters (eight-bit values, actually) that are not allowed in POSIX file and directory names. Put a slash as the first character in the file; it makes both searching easier since every name is then enclosed by slashes (/name/), but it also makes it easier to deal with the root directory which has no name other than /.
To describe the directory tree structure, you only need a binary file containing an array of off_t pairs -- call this the reference array:
Code:
struct ref {
off_t offset; /* Offset of first character of name */
off_t parent; /* Index (in this array) of the parent */
};
offset is the offset of the first character of the name. parent is the index in the reference array of the parent directory. It is best if you keep this array sorted with offsets in ascending order. (It's actually what happens when you build the array normally!)
Since root directory name is effectively empty (since it's just /), root directory is naturally at offset zero, and the first (index 0) of the reference array. (Yes, the first item in the array is then a double zero.)
When you find an interesting name in the text file, you backtrack to find (the offset of) the first character of the file name. Then, you find a record in the reference array that refers to that name (has that offset). If the offset fields are in sorted order, you can use a very fast binary search to locate the entry.
Then, you track the parent references, prepending each parent directory name in front of the file name (to get the full path), until the parent is zero -- root directory.
Note that this also allows you to delete files in the text file, simply by replacing the name with a string of slashes. You can add new files by appending the name and the reference to the end of the files. You could even use inotify to keep track of file name changes in real time. Even then you should rebuild the databases every now and then, to make sure the files stay dense enough, and do not grow too large.
I hope you found this useful, or interesting at least,
Interesting post! I do believe that databases use a likewise approach to sort and index text for fast searching.
I was just triggered by the SQL command, and me being too lazy to think out how to write a good tree and search algorithm in C(++), I thought: hey, SQL: why not ;-)
Ages ago, during my classes at school I learned normalization on databases. One of the normalization steps is to solve any n-on-m relation, often fixed as described above:
create a table for each of the two relations and create a table that describes their relation (and still honors the rules of good database design). Using JOIN, you can even list the data in the 3rd table not by their numbers, but by the names:
Code:
select directorynames.name AS directoryname,
filenames.name AS filename
FROM file_in_dir
INNER JOIN directorynames ON directory_id = directorynames.id
INNER JOIN filenames ON filename_id = filenames.id;
in this small test environment, I got a result that looked like this:
you can see that test.dat is in /opt and /home; it's the same record in the filenames table:
Code:
select * from filenames;
id | name
----+-----------
1 | test.conf
2 | test.dat
Code:
select * from directorynames;
id | name
----+-------
1 | /etc
2 | /opt
3 | /home
(3 rows)
Anyway chymeira, if you find a response useful, it is generally good practice here to click the appropriate links at the bottom of that post or rate the response the person has given you, using the scales at their user-information on the left of each post.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.