LinuxQuestions.org

gawain 01-11-2008 05:55 AM

text indexing program
 
Hi everybody.

I'm looking for a program (preferably a command-line one) that can index the words in a txt file and associate each word with the page numbers it appears on. For instance:
pages
------
word1 - 1,4,6
word2 - 5,9,23
word3 - 7,44,88

Thanks for any help

matthewg42 01-11-2008 06:02 AM

How would you define pages in a plaintext file?

gawain 01-11-2008 07:24 AM

Thanks for answering.

Quote:

How would you define pages in a plaintext file?
I didn't think of that. I just have PDF files which I have turned into txt files with pdftotext (from the command line): the PDF layout has been preserved and the page number has been inserted as text in the txt file.

So I guess there is no answer to my original question? And if I want to index all the words in the pages, do I need to change the format of the file?

matthewg42 01-11-2008 08:00 AM

There may be a control character inserted into the text file at the end of each page, in which case a program could easily work out the page number by counting those characters. The ASCII code for it is FF (form feed, hex value 0C, octal 014). You can insert an FF in vim by typing control-L in insert mode (control-V control-L inserts it literally if control-L is mapped to something else).
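
A quick way to check whether a file contains form feeds at all is to count them from the shell. This is only a sketch: sample.txt is a made-up three-page file built with printf, standing in for real pdftotext output (which, as far as I know, has a form feed between pages unless pdftotext is run with -nopgbrk).

```shell
workdir=$(mktemp -d)
# Hypothetical three-page sample; \f is the form feed (octal 014).
printf 'page one\n\fpage two\n\fpage three\n' > "$workdir/sample.txt"

# Delete every byte except form feeds, then count what is left.
ff_count=$(tr -cd '\f' < "$workdir/sample.txt" | wc -c | tr -d ' ')
echo "form feeds found: $ff_count"
```

A count of 0 on your own file would mean there are no page breaks to work with, and the page numbers would have to be recovered some other way.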

If your text files have this character at the end of each page, you could write a very quick Perl script to do your indexing, something like this:
Code:

#!/usr/bin/perl

use warnings;
use strict;

my $page_number = 1;
my %idx;    # word => { page number => 1 }

while (<>) {
        # split ' ' skips leading whitespace, so no empty "word" is indexed
        for my $word ( split ' ' ) {
                $idx{$word}{$page_number} = 1;
        }
        # a form feed (octal 014) on this line ends the current page
        $page_number++ if ( /\014/ );
}

# print each word followed by the sorted list of pages it appears on
foreach my $word (sort keys %idx) {
        printf "%-20s ", $word;
        print comma_sep(sort {$a <=> $b} keys %{$idx{$word}});
        print "\n";
}

# join the arguments with ", "
sub comma_sep {
        return join(", ", @_);
}

This assumes the FF character is at the END of a line (or on a line of its own), and that you want to index all non-blank words. It would be easy to filter which words get indexed: just add an if clause around the line which sets the %idx values.

gawain 01-11-2008 09:41 AM

Thanks a lot but ... that's too beautiful to be true!

It's my first Perl script.

From the command line, in the directory that holds both the Perl script and the txt file, I run: perl perl_script.pl (I hope the extension is right; I have checked it). But nothing happens: how do I pass in the file to be indexed?

matthewg42 01-12-2008 05:54 AM

First you need to make the script file executable:
Code:

chmod 755 perl_script.pl
And then you can run it like any other program. Presumably the script file's directory is not in your PATH, and maybe the current directory is not in your PATH either (leaving it out is good practice). You can call the script by explicitly saying "./filename", where . means "current directory". In Perl, the <> operator reads input from standard input, or from files named on the command line. Either of these will work for a text file called "input.txt":
Code:

./perl_script.pl input.txt
./perl_script.pl < input.txt

The second example uses shell redirection to pass the input via the standard input file handle.

gawain 01-12-2008 09:23 AM

Grand.

Let me offer you a Guinness when you come to Rome.

One last thing, if you don't mind: I've saved - after a few tries - the output like this

Quote:

./perl_script.pl input.txt | cat > output.txt
It should be OK because I got the words and the pages, but maybe there is a more orthodox way to grab the output.

Thanks again

matthewg42 01-12-2008 10:40 AM

All cat does is read standard input and write to standard output. You can accomplish the same thing like this:
Code:

./perl_script.pl input.txt > output.txt
You'll end up with the same thing, but with one fewer process invoked and without the related pipe overhead.
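
A small sketch to confirm the two forms give byte-identical files. make_index is a hypothetical shell function standing in for "./perl_script.pl input.txt", and the file names are made up.

```shell
workdir=$(mktemp -d)
# make_index is a hypothetical stand-in for "./perl_script.pl input.txt"
make_index() { printf 'word1                1, 4, 6\nword2                5, 9, 23\n'; }

make_index | cat > "$workdir/via_cat.txt"   # extra cat process and a pipe
make_index > "$workdir/direct.txt"          # the shell writes the file directly

# cmp -s is silent and exits 0 only when the two files are byte-identical
if cmp -s "$workdir/via_cat.txt" "$workdir/direct.txt"; then
    result="identical"
else
    result="different"
fi
echo "$result"
```

Note that > overwrites output.txt each time; use >> instead if you want to append to an existing file.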

gawain 01-14-2008 05:50 AM

Thanks a lot


All times are GMT -5.