I'm looking for a program (preferably a command-line one) which can index the words in a txt file, associating each word with the page numbers it appears on. For instance:
        pages
        ------
word1 - 1,4,6
word2 - 5,9,23
word3 - 7,44,88
I didn't think of it. I just have PDF files which I have turned into txt files via pdftotext (from the command line): the PDF layout has been preserved and the page number has been inserted as text in the txt file.
So I guess there is no solution to my original question? And if I want to index all the words by page, do I need to change the format of the file?
There may be a control character inserted into the text file at the end of each page, in which case a program could easily determine the page number using that. There is an ASCII code for it, named FF (form feed, hex value 0C, octal 014). You can insert a FF in vim using control-L.
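To check whether your text files actually contain form feeds, you can simply count them. As far as I know, pdftotext writes one at each page break unless it is told not to (the -nopgbrk option), so the count should roughly match the number of pages. A minimal sketch; the function name count_form_feeds and the sample file are made up for illustration:

```shell
#!/bin/sh
# count_form_feeds FILE: print how many form feed characters (octal 014) FILE contains.
# (count_form_feeds is a made-up helper name, not a standard command.)
count_form_feeds() {
    # tr -cd deletes every byte except form feeds; wc -c counts what is left.
    tr -cd '\014' < "$1" | wc -c | tr -d ' '
}

# Demonstration on a throwaway file with two form feeds in it.
sample=$(mktemp) || exit 1
printf 'text on page one\n\014text on page two\n\014' > "$sample"
count_form_feeds "$sample"    # prints 2
rm -f "$sample"
```

If the count is 0, the page breaks are not marked this way and the page number would have to be picked up from the text itself.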
If your text files have this character at the end of each page, you could write a very quick Perl script to do your indexing, something like this:
Code:
#!/usr/bin/perl
use warnings;
use strict;

my $page_number = 1;
my %idx;    # word => { page number => 1 }

while (<>) {
    # split ' ' skips leading whitespace, so indented lines
    # do not produce an empty "word"
    for my $word ( split ' ' ) {
        $idx{$word}{$page_number} = 1;
    }
    # a form feed (octal 014) on this line ends the current page
    $page_number++ if ( /\014/ );
}

# print each word followed by its page numbers in numeric order
foreach my $word ( sort keys %idx ) {
    printf "%-20s ", $word;
    print join( ", ", sort { $a <=> $b } keys %{ $idx{$word} } );
    print "\n";
}
This assumes the FF character is at the END of a line (or on a line of its own), and that you want to index all non-blank words. It would be easy to add some filtering of which words get indexed... just add an if clause to the line which sets the %idx values.
Last edited by matthewg42; 01-11-2008 at 08:03 AM.
Thanks a lot but ... that's too beautiful to be true!
It's my first perl script
From the command line, in the directory containing the Perl script and the txt file, I run: perl perl_script.pl (I hope the extension is right; I have checked it). But nothing happens: how do I specify the file to be indexed?
First you need to make the script file executable:
Code:
chmod 755 perl_script.pl
And then you can run it like any other program. Presumably the script file's directory is not in your PATH, and maybe the current directory is not in your PATH either (this is good practice). You can call it by explicitly saying "./filename", where . means "current directory". In Perl the <> operator will read input from standard input, or from files named on the command line. Either of these will work for a text file called "input.txt":
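The two invocations are presumably the usual ones: name the file as an argument, or redirect it to standard input. Here is a self-contained sketch of both. Note that the perl_script.pl it creates is only a stand-in that numbers the input lines (an assumption for the demo, not the indexing script), and that everything happens in a scratch directory:

```shell
#!/bin/sh
# Work in a scratch directory so nothing here touches real files.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# Stand-in for the indexing script (NOT the real one): it prints each
# input line prefixed with its line number, enough to show where <> reads from.
cat > perl_script.pl <<'EOF'
#!/usr/bin/perl
use strict;
use warnings;
while (<>) { print $., ": ", $_; }
EOF
chmod 755 perl_script.pl

printf 'first line\nsecond line\n' > input.txt

# 1. Name the file on the command line; <> opens it itself.
from_argument=$(./perl_script.pl input.txt)

# 2. Redirect the file to standard input; <> reads from there.
from_stdin=$(./perl_script.pl < input.txt)

# Both ways give the same result.
echo "$from_argument"

# Tidy up the scratch directory.
cd / && rm -rf "$dir"
```

Running "perl perl_script.pl input.txt" works as well, and does not need the chmod step. With no file name and no redirection the script just sits waiting for input typed at the keyboard, which is why nothing appeared to happen.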