LinuxQuestions.org
01-11-2008, 05:55 AM   #1
gawain
Member
 
Registered: Dec 2006
Location: Italy -Rome
Distribution: Slackware 11.0
Posts: 55

Rep: Reputation: 15
text indexing program


Hi everybody.

I'm looking for a program (preferably one run from the command line) which can index the words in a txt file, associating each word with the page numbers it appears on. For instance:
pages
------
word1 - 1,4,6
word2 - 5,9,23
word3 - 7,44,88

Thanks for any help
 
01-11-2008, 06:02 AM   #2
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 65
How would you define pages in a plaintext file?
 
01-11-2008, 07:24 AM   #3
gawain
Member
 
Registered: Dec 2006
Location: Italy -Rome
Distribution: Slackware 11.0
Posts: 55

Original Poster
Rep: Reputation: 15
Thanks for answering

Quote:
How would you define pages in a plaintext file?
I didn't think of that. I just have PDF files which I have turned into txt files via pdftotext (from the command line): the PDF layout has been preserved and the page number has been inserted as text in the txt file.

So, I guess there is no answer to my original question? And if I want to index all the words by page, do I need to change the format of the file?
 
01-11-2008, 08:00 AM   #4
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 65
There may be a control character inserted into the text file at the end of each page, in which case a program could easily determine the page number using that. There is an ASCII code for it, named FF (form feed, hex value 0C, octal 014). You can insert a FF in vim using control-L.
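
A quick way to check whether a file contains form feeds at all is to count them from the shell. This is only a sketch: count_ff is a made-up helper name and the sample file exists purely for illustration.

```shell
# count_ff is a hypothetical helper: it prints how many form-feed
# characters (octal 014) the given file contains.
count_ff() {
    # delete everything except FF, count what is left, strip wc's padding
    tr -cd '\014' < "$1" | wc -c | tr -d ' '
}

# Illustration only: a two-page sample with one FF at the end of a line.
sample="${TMPDIR:-/tmp}/ff_sample.txt"
printf 'alpha beta\ngamma\014\ndelta alpha\n' > "$sample"

count_ff "$sample"
```

A count of 0 means the file has no page-break characters, so a script that relies on FF would put every word on page 1.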

If your text files have this character at the end of each page, you could write a very quick Perl script to do your indexing, something like this:
Code:
#!/usr/bin/perl

use warnings;
use strict;

my $page_number = 1;    # pages are counted from 1
my %idx;                # $idx{word}{page} = 1 for each page a word appears on

while(<>) {
        # split ' ' skips leading whitespace, so indented lines
        # do not produce an empty "word"
        for my $word ( split ' ' ) {
                $idx{$word}{$page_number} = 1;
        }
        # a form feed (octal 014) on this line ends the current page
        $page_number++ if ( /\014/ );
}

# print each word followed by the sorted list of its pages
foreach my $word (sort keys %idx) {
        printf "%-20s ", $word;
        print join(", ", sort {$a <=> $b} keys %{$idx{$word}});
        print "\n";
}
This assumes the FF character is at the END of a line (or on a line of its own), and that you want to index all non-blank words. It would be easy to filter what gets indexed: just add an if clause around the line which sets the idx values.

Last edited by matthewg42; 01-11-2008 at 08:03 AM.
 
01-11-2008, 09:41 AM   #5
gawain
Member
 
Registered: Dec 2006
Location: Italy -Rome
Distribution: Slackware 11.0
Posts: 55

Original Poster
Rep: Reputation: 15
Thanks a lot but ... that's too beautiful to be true!

It's my first Perl script.

From the command line, in the directory that holds both the Perl script and the txt file, I run: perl perl_script.pl (I hope the extension is right; I have checked it). But nothing happens: how do I tell it which file to index?
 
01-12-2008, 05:54 AM   #6
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 65
First you need to make the script file executable:
Code:
chmod 755 perl_script.pl
And then you can run it like any other program. Presumably the script file's directory is not in your PATH, and maybe the current directory is not in your PATH (this is good practice). You can call it by explicitly saying "./filename", where . means "current directory". In Perl the <> operator will read input from standard input, or from files named on the command line. Run with no file name, the script simply waits for input on standard input, which is why nothing appeared to happen. Either of these will work for a text file called "input.txt":
Code:
./perl_script.pl input.txt
./perl_script.pl < input.txt
The second example uses shell redirection to pass the input via the standard input file handle.
 
01-12-2008, 09:23 AM   #7
gawain
Member
 
Registered: Dec 2006
Location: Italy -Rome
Distribution: Slackware 11.0
Posts: 55

Original Poster
Rep: Reputation: 15
Grand.

Let me offer you a Guinness when you come to Rome.

One last thing, if you don't mind: I've saved - after a few tries - the output like this

Quote:
./perl_script.pl input.txt | cat > output.txt
It should be OK because I got the words and the pages, but maybe there is a more orthodox way to grab the output.

Thanks again
 
01-12-2008, 10:40 AM   #8
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 65
All cat does is read standard input and write to standard output. You can accomplish the same thing like this:
Code:
./perl_script.pl input.txt > output.txt
You'll end up with the same thing, but with one fewer process invoked and without the related pipe overhead.
 
01-14-2008, 05:50 AM   #9
gawain
Member
 
Registered: Dec 2006
Location: Italy -Rome
Distribution: Slackware 11.0
Posts: 55

Original Poster
Rep: Reputation: 15
Thanks a lot
 
  



All times are GMT -5.
