LinuxQuestions.org
Visit Jeremy's Blog.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 04-04-2011, 10:00 PM   #1
chymeira
Member
 
Registered: Dec 2008
Location: CH/IL
Distribution: Slackware 13.1 Fedora 12
Posts: 73

Rep: Reputation: 16
Inverted file


Hello all ,

I am writing a small search program for my class. I have decided to use indexing for my program. Ive researched online about indexing and how search engines do it. If im gonno do that I need to create inverted files to associate files to numbers ( numbers being the index of my paths ) . Now I was wondering what would be the best way to create an inverted file ? I was going to create sql tables using mysql api in C but then again there is no array data type or vectors to store few numbers in a single column in mysql and it is not advised to use Enum or SET .

Any help would be very appreciated .
Thanks a lot.
 
Old 04-06-2011, 10:10 AM   #2
vikas027
Senior Member
 
Registered: May 2007
Location: Sydney
Distribution: RHEL, CentOS, Ubuntu, Debian, OS X
Posts: 1,305

Rep: Reputation: 107Reputation: 107
Quote:
Originally Posted by chymeira View Post
Hello all ,

I am writing a small search program for my class. I have decided to use indexing for my program. Ive researched online about indexing and how search engines do it. If im gonno do that I need to create inverted files to associate files to numbers ( numbers being the index of my paths ) . Now I was wondering what would be the best way to create an inverted file ? I was going to create sql tables using mysql api in C but then again there is no array data type or vectors to store few numbers in a single column in mysql and it is not advised to use Enum or SET .

Any help would be very appreciated .
Thanks a lot.

What do you mean by "inverted" file ? Please give an example.
Also, what programming language you are using, I would recommend bash shell
 
Old 04-06-2011, 10:18 AM   #3
knudfl
LQ 5k Club
 
Registered: Jan 2008
Location: Copenhagen, DK
Distribution: pclos2017 CentOS6.9 CentOS7.4 + 50+ other Linux OS, for test only.
Posts: 17,049

Rep: Reputation: 3459Reputation: 3459Reputation: 3459Reputation: 3459Reputation: 3459Reputation: 3459Reputation: 3459Reputation: 3459Reputation: 3459Reputation: 3459Reputation: 3459
Not inverted : cat -n text-file.txt > new-file : A file with line numbers added.

Inverted : tac -n text-file.txt > new-file
 
Old 04-06-2011, 11:34 AM   #4
chymeira
Member
 
Registered: Dec 2008
Location: CH/IL
Distribution: Slackware 13.1 Fedora 12
Posts: 73

Original Poster
Rep: Reputation: 16
Inverted files are used for indexing in searching.
Lets say I have these directories in my root :

/ ---> bin sbin var .....

So I create a file that gives numbers to each of these :

1 . bin
2 . sbin
3 . var

Then I start creating an inverted text file for everything on my hard drive .
Lets say I have a document that is in /home/usr/ . This path has the value 6 in the file above , so in my inverted file it would look like :

docuement { 6 }

If you find the same document in another path , you add the number of the path to that list ( vector ).
 
Old 04-06-2011, 11:47 AM   #5
MTK358
LQ 5k Club
 
Registered: Sep 2009
Posts: 6,443
Blog Entries: 3

Rep: Reputation: 721Reputation: 721Reputation: 721Reputation: 721Reputation: 721Reputation: 721Reputation: 721
Quote:
Originally Posted by chymeira View Post
Then I start creating an inverted text file for everything on my hard drive .
You still didn't say what it means to "invert".
 
Old 04-06-2011, 12:16 PM   #6
chymeira
Member
 
Registered: Dec 2008
Location: CH/IL
Distribution: Slackware 13.1 Fedora 12
Posts: 73

Original Poster
Rep: Reputation: 16
Its just what its called , u dont invert anything ....

Its just a bunch of lines with strings and vectores associated with them
 
Old 04-06-2011, 09:56 PM   #7
graemef
Senior Member
 
Registered: Nov 2005
Location: Hanoi
Distribution: Fedora 13, Ubuntu 10.04
Posts: 2,379

Rep: Reputation: 148Reputation: 148
Quote:
Originally Posted by MTK358 View Post
You still didn't say what it means to "invert".
But wikipedea does

If you want to do this using SQL then you will need to create a resolver table
document table
path table

document_path table (your inverted index which is a many to many relationship between the other two tables)
 
Old 04-06-2011, 11:49 PM   #8
chymeira
Member
 
Registered: Dec 2008
Location: CH/IL
Distribution: Slackware 13.1 Fedora 12
Posts: 73

Original Poster
Rep: Reputation: 16
well Im not sure that sql would be the best way to do it .

Any suggestions ?
 
Old 04-07-2011, 12:08 AM   #9
graemef
Senior Member
 
Registered: Nov 2005
Location: Hanoi
Distribution: Fedora 13, Ubuntu 10.04
Posts: 2,379

Rep: Reputation: 148Reputation: 148
You can use any programming language that supports file IO. A language that provides support for list would be a boon. But it all comes down to what you know given that the requirements could be met by so many languages.
 
Old 04-07-2011, 12:18 AM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,608

Rep: Reputation: 2936Reputation: 2936Reputation: 2936Reputation: 2936Reputation: 2936Reputation: 2936Reputation: 2936Reputation: 2936Reputation: 2936Reputation: 2936Reputation: 2936
I am not sure I follow the example given (could just be me). You say you have 3 directories under root (for the example), but you then add to the
example the path /home/usr that has a document yet home is not listed as being part of this example system. Did I miss something??
 
Old 04-07-2011, 12:44 AM   #11
chymeira
Member
 
Registered: Dec 2008
Location: CH/IL
Distribution: Slackware 13.1 Fedora 12
Posts: 73

Original Poster
Rep: Reputation: 16
OK I will try to be clear this time.

So I am trying to create a search program that uses indexing to find results very fast. The first thing you do is you go through your full file system and list all the paths in your system and you give them numbers ( indexes ) . For example you start with / directory so ...

1 . /
2 . /home
3 . /var
4 . /etc

and a lot more . then you enter each of those directories and continue the same process

..
25 . /etc/httpd
26 . /etc/hp

That is the path file. The next step is to list every name for every file or document or whatever is in your system in a new file called the inverted file . So lets say as we go through we bump into files called test_file more than once in our system . On of them is in /var , the other is in /home and one in /etc/httpd,so what the inverted file would look like is :

test_file : { 3 , 2 , 25} ---> as you can 2 and 3 and 25 are the values for /home & /var & /etc/httpd in the path file.

Thats how it works . So now if the user looks for test_file , you lookup test_file and you know the numbers and corresponding paths for that name and you tell them where it is instead of searching the system everytime .

Now my question and the reason I started this post was this : I dont want to use text files for this program , because the names in this list have arbitrary length and accessing the row we want would be very time consuming . So I though I would go with sql and use mysql api in C to create a database , with tables , 1 for my paths , and 1 for my filenames ( with sub tables for each letter , so we could access them very fast ) . But in the case of mysql I cant create the example above :

test_file : { 3 , 2 , 25}
because sql only lets 1 data in 1 column ( there are data types called sets and enum but they are not appropriate ) .

Thats what I would like help with . What kind of approach would you use to create these tables or files or whatever .....


Thank you very much and I hope I didnt bore you.
 
Old 04-07-2011, 02:25 AM   #12
Ramurd
Member
 
Registered: Mar 2009
Location: Rotterdam, the Netherlands
Distribution: Slackwarelinux
Posts: 682

Rep: Reputation: 105Reputation: 105
If you do it in a SQL way, you can create 3 tables:

(mind: this is example only, supports directory-names up to 1024 characters long)
create table directorynames(
id integer primary key not null,
name varchar(1024) unique
);

create table filenames(
id integer primary key not null,
name varchar(1024) unique
);

create table file_in_dir(
directory_id integer references directorynames(id) not null,
filename_id integer references filenames(id) not null,
primary key on (directory_id, filename_id)
);

you put the (unique) directory names in the appropriate directory, as you do with the file names;

Then for a filename in a directory you can do something like this:
Code:
INSERT INTO file_in_dir (directory_id, filename_id) SET (
      (SELECT id FROM directories WHERE name="directoryname"),
      (SELECT id FROM filenames WHERE name="filename")
);
this fixes constraints for a many-to-many relationship.
 
Old 04-07-2011, 06:16 PM   #13
chymeira
Member
 
Registered: Dec 2008
Location: CH/IL
Distribution: Slackware 13.1 Fedora 12
Posts: 73

Original Poster
Rep: Reputation: 16
Ramurd ,

Thx a lot for your response . It was very helpful

Last edited by chymeira; 04-07-2011 at 08:03 PM.
 
Old 04-07-2011, 08:05 PM   #14
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 947Reputation: 947Reputation: 947Reputation: 947Reputation: 947Reputation: 947Reputation: 947Reputation: 947
I'd personally use a different approach: flat files.

(Hey! Why did I hear groaning? )

Assuming you want to be able to find files or directories based on a partial name match too, I'd create a file containing each known file or directory name, separated by a slash (/). EOS (zero) and slash are the only two characters (eight-bit values, actually) that are not allowed in POSIX file and directory names. Put a slash as the first character in the file; it makes both searching easier since every name is then enclosed by slashes (/name/), but it also makes it easier to deal with the root directory which has no name other than /.

To describe the directory tree structure, you only need a binary file containing an array of off_t pairs -- call this the reference array:
Code:
struct ref {
    off_t  offset;  /* Offset of first character of name */
    off_t  parent;  /* Index (in this array) of the parent */
};
offset is the offset of the first character of the name. parent is the index in the reference array of the parent directory. It is best if you keep this array sorted with offsets in ascending order. (It's actually what happens when you build the array normally!)

Since root directory name is effectively empty (since it's just /), root directory is naturally at offset zero, and the first (index 0) of the reference array. (Yes, the first item in the array is then a double zero.)

When you find an interesting name in the text file, you backtrack to find (the offset of) the first character of the file name. Then, you find a record in the reference array that refers to that name (has that offset). If the offset fields are in sorted order, you can use a very fast binary search to locate the entry.
Then, you track the parent references, prepending each parent directory name in front of the file name (to get the full path), until the parent is zero -- root directory.

Note that this also allows you to delete files in the text file, simply by replacing the name with a string of slashes. You can add new files by appending the name and the reference to the end of the files. You could even use inotify to keep track of file name changes in real time. Even then you should rebuild the databases every now and then, to make sure the files stay dense enough, and do not grow too large.

I hope you found this useful, or interesting at least,
 
Old 04-08-2011, 12:45 AM   #15
Ramurd
Member
 
Registered: Mar 2009
Location: Rotterdam, the Netherlands
Distribution: Slackwarelinux
Posts: 682

Rep: Reputation: 105Reputation: 105
Hey Nominal,

Interesting post! I do believe that databases use a likewise approach to sort and index text for fast searching.

I was just triggered by the SQL command, and me being too lazy to think out how to write a good tree and search algorithm in C(++), I thought: hey, SQL: why not ;-)

Ages ago, during my classes at school I learned normalization on databases. One of the normalization steps is to solve any n-on-m relation, often fixed as described above:
create a table for each of the two relations and create a table that describes their relation (and still honors the rules of good database design). Using JOIN, you can even list the data in the 3rd table not by their numbers, but by the names:

Code:
select directorynames.name AS directoryname,
       filenames.name AS filename 
       FROM file_in_dir 
       INNER JOIN directorynames ON directory_id = directorynames.id 
       INNER JOIN filenames ON filename_id = filenames.id;
in this small test environment, I got a result that looked like this:
Code:
 directoryname | filename  
---------------+-----------
 /etc          | test.conf
 /opt          | test.dat
 /home         | test.dat
you can see that test.dat is in /opt and /home; it's the same record in the filenames table:

Code:
select * from filenames;
 id |   name    
----+-----------
  1 | test.conf
  2 | test.dat
Code:
 select * from directorynames;
 id | name  
----+-------
  1 | /etc
  2 | /opt
  3 | /home
(3 rows)
Anyway chymeira, if you find a response useful, it is generally good practice here to click the appropriate links at the bottom of that post or rate the response the person has given you, using the scales at their user-information on the left of each post.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Is the Linux GMT/UTC definition inverted? jjinno Linux - Software 11 01-05-2010 07:15 PM
Inverted video in Mandriva 2009.1 Richie55 Mandriva 0 05-05-2009 02:39 AM
USB Mouse with inverted axis Teo-sama Linux - Hardware 1 02-04-2009 04:39 PM
Inverted colors after 4 bit color? novalinux DamnSmallLinux 0 06-03-2006 08:39 AM
inverted x server colors. xconspirisist Linux - Software 1 12-03-2005 09:04 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 11:57 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration