Inverted file

chymeira · 04-04-2011, 10:00 PM

Hello all ,

I am writing a small search program for my class. I have decided to use indexing for my program. Ive researched online about indexing and how search engines do it. If im gonno do that I need to create inverted files to associate files to numbers ( numbers being the index of my paths ) . Now I was wondering what would be the best way to create an inverted file ? I was going to create sql tables using mysql api in C but then again there is no array data type or vectors to store few numbers in a single column in mysql and it is not advised to use Enum or SET .

Any help would be very appreciated .
Thanks a lot.

vikas027 · 04-06-2011, 10:10 AM

Quote:

Originally Posted by chymeira

Hello all ,

I am writing a small search program for my class. I have decided to use indexing for my program. Ive researched online about indexing and how search engines do it. If im gonno do that I need to create inverted files to associate files to numbers ( numbers being the index of my paths ) . Now I was wondering what would be the best way to create an inverted file ? I was going to create sql tables using mysql api in C but then again there is no array data type or vectors to store few numbers in a single column in mysql and it is not advised to use Enum or SET .

Any help would be very appreciated .
Thanks a lot.

What do you mean by "inverted" file ? Please give an example.
Also, what programming language you are using, I would recommend bash shell

knudfl · 04-06-2011, 10:18 AM

Not inverted : cat -n text-file.txt > new-file : A file with line numbers added.

Inverted : tac -n text-file.txt > new-file

chymeira · 04-06-2011, 11:34 AM

Inverted files are used for indexing in searching.
Lets say I have these directories in my root :

/ ---> bin sbin var .....

So I create a file that gives numbers to each of these :

1 . bin
2 . sbin
3 . var

Then I start creating an inverted text file for everything on my hard drive .
Lets say I have a document that is in /home/usr/ . This path has the value 6 in the file above , so in my inverted file it would look like :

docuement { 6 }

If you find the same document in another path , you add the number of the path to that list ( vector ).

MTK358 · 04-06-2011, 11:47 AM

Quote:

Originally Posted by chymeira

Then I start creating an inverted text file for everything on my hard drive .

You still didn't say what it means to "invert".

chymeira · 04-06-2011, 12:16 PM

Its just what its called , u dont invert anything ....

Its just a bunch of lines with strings and vectores associated with them

graemef · 04-06-2011, 09:56 PM

Quote:

Originally Posted by MTK358

You still didn't say what it means to "invert".

But wikipedea does

If you want to do this using SQL then you will need to create a resolver table
document table
path table

document_path table (your inverted index which is a many to many relationship between the other two tables)

chymeira · 04-06-2011, 11:49 PM

well Im not sure that sql would be the best way to do it .

Any suggestions ?

graemef · 04-07-2011, 12:08 AM

You can use any programming language that supports file IO. A language that provides support for list would be a boon. But it all comes down to what you know given that the requirements could be met by so many languages.

grail · 04-07-2011, 12:18 AM

I am not sure I follow the example given (could just be me). You say you have 3 directories under root (for the example), but you then add to the
example the path /home/usr that has a document yet home is not listed as being part of this example system. Did I miss something??

chymeira · 04-07-2011, 12:44 AM

OK I will try to be clear this time.

So I am trying to create a search program that uses indexing to find results very fast. The first thing you do is you go through your full file system and list all the paths in your system and you give them numbers ( indexes ) . For example you start with / directory so ...

1 . /
2 . /home
3 . /var
4 . /etc

and a lot more . then you enter each of those directories and continue the same process

..
25 . /etc/httpd
26 . /etc/hp

That is the path file. The next step is to list every name for every file or document or whatever is in your system in a new file called the inverted file . So lets say as we go through we bump into files called test_file more than once in our system . On of them is in /var , the other is in /home and one in /etc/httpd,so what the inverted file would look like is :

test_file : { 3 , 2 , 25} ---> as you can 2 and 3 and 25 are the values for /home & /var & /etc/httpd in the path file.

Thats how it works . So now if the user looks for test_file , you lookup test_file and you know the numbers and corresponding paths for that name and you tell them where it is instead of searching the system everytime .

Now my question and the reason I started this post was this : I dont want to use text files for this program , because the names in this list have arbitrary length and accessing the row we want would be very time consuming . So I though I would go with sql and use mysql api in C to create a database , with tables , 1 for my paths , and 1 for my filenames ( with sub tables for each letter , so we could access them very fast ) . But in the case of mysql I cant create the example above :

test_file : { 3 , 2 , 25}
because sql only lets 1 data in 1 column ( there are data types called sets and enum but they are not appropriate ) .

Thats what I would like help with . What kind of approach would you use to create these tables or files or whatever .....

Thank you very much and I hope I didnt bore you.

Ramurd · 04-07-2011, 02:25 AM

If you do it in a SQL way, you can create 3 tables:

(mind: this is example only, supports directory-names up to 1024 characters long)
create table directorynames(
id integer primary key not null,
name varchar(1024) unique
);

create table filenames(
id integer primary key not null,
name varchar(1024) unique
);

create table file_in_dir(
directory_id integer references directorynames(id) not null,
filename_id integer references filenames(id) not null,
primary key on (directory_id, filename_id)
);

you put the (unique) directory names in the appropriate directory, as you do with the file names;

Then for a filename in a directory you can do something like this:

Code:

INSERT INTO file_in_dir (directory_id, filename_id) SET (
      (SELECT id FROM directories WHERE name="directoryname"),
      (SELECT id FROM filenames WHERE name="filename")
);

this fixes constraints for a many-to-many relationship.

chymeira · 04-07-2011, 06:16 PM

Ramurd ,

Thx a lot for your response . It was very helpful

Nominal Animal · 04-07-2011, 08:05 PM

I'd personally use a different approach: flat files.

(Hey! Why did I hear groaning?

)

Assuming you want to be able to find files or directories based on a partial name match too, I'd create a file containing each known file or directory name, separated by a slash (/). EOS (zero) and slash are the only two characters (eight-bit values, actually) that are not allowed in POSIX file and directory names. Put a slash as the first character in the file; it makes both searching easier since every name is then enclosed by slashes (/name/), but it also makes it easier to deal with the root directory which has no name other than /.

To describe the directory tree structure, you only need a binary file containing an array of off_t pairs -- call this the reference array:

Code:

struct ref {
    off_t  offset;  /* Offset of first character of name */
    off_t  parent;  /* Index (in this array) of the parent */
};

offset is the offset of the first character of the name. parent is the index in the reference array of the parent directory. It is best if you keep this array sorted with offsets in ascending order. (It's actually what happens when you build the array normally!)

Since root directory name is effectively empty (since it's just /), root directory is naturally at offset zero, and the first (index 0) of the reference array. (Yes, the first item in the array is then a double zero.)

When you find an interesting name in the text file, you backtrack to find (the offset of) the first character of the file name. Then, you find a record in the reference array that refers to that name (has that offset). If the offset fields are in sorted order, you can use a very fast binary search to locate the entry.
Then, you track the parent references, prepending each parent directory name in front of the file name (to get the full path), until the parent is zero -- root directory.

Note that this also allows you to delete files in the text file, simply by replacing the name with a string of slashes. You can add new files by appending the name and the reference to the end of the files. You could even use inotify to keep track of file name changes in real time. Even then you should rebuild the databases every now and then, to make sure the files stay dense enough, and do not grow too large.

I hope you found this useful, or interesting at least,

Ramurd · 04-08-2011, 12:45 AM

Hey Nominal,

Interesting post! I do believe that databases use a likewise approach to sort and index text for fast searching.

I was just triggered by the SQL command, and me being too lazy to think out how to write a good tree and search algorithm in C(++), I thought: hey, SQL: why not ;-)

Ages ago, during my classes at school I learned normalization on databases. One of the normalization steps is to solve any n-on-m relation, often fixed as described above:
create a table for each of the two relations and create a table that describes their relation (and still honors the rules of good database design). Using JOIN, you can even list the data in the 3rd table not by their numbers, but by the names:

Code:

select directorynames.name AS directoryname,
       filenames.name AS filename 
       FROM file_in_dir 
       INNER JOIN directorynames ON directory_id = directorynames.id 
       INNER JOIN filenames ON filename_id = filenames.id;

in this small test environment, I got a result that looked like this:

Code:

 directoryname | filename  
---------------+-----------
 /etc          | test.conf
 /opt          | test.dat
 /home         | test.dat

you can see that test.dat is in /opt and /home; it's the same record in the filenames table:

Code:

select * from filenames;
 id |   name    
----+-----------
  1 | test.conf
  2 | test.dat

Code:

 select * from directorynames;
 id | name  
----+-------
  1 | /etc
  2 | /opt
  3 | /home
(3 rows)

Anyway chymeira, if you find a response useful, it is generally good practice here to click the appropriate links at the bottom of that post or rate the response the person has given you, using the scales at their user-information on the left of each post.