LinuxQuestions.org
Old 06-23-2009, 07:16 PM   #1
mikemrh9
Member
 
Registered: Nov 2003
Distribution: Arch
Posts: 136

Rep: Reputation: 21
Searching multiple pdf files for text


Hi.

I have a big stack of .pdf files that I need to reference, and would like to be able to quickly search them for key words.

For example, it would be great if I could get output that is simply a list of filenames for the PDF files containing the word `bananas`.

I would like to do this without creating a mountain of text files, but am struggling a little as my bash skills are not up to much.

Here is what I have so far:

for i in *.pdf; do pdftotext "$i" -; done

I would like to pipe this through `grep -l` to get just the filenames containing my key words, but I don't think that will work: it's only the file's contents, and not its filename, that `pdftotext` passes along. I've tried piping to grep and redirecting standard output, but so far everything has been doomed to failure.

I can do it if I use pdftotext to create text files first, then run grep on those, but I am in search of a more elegant solution!

Ideas gratefully accepted...
 
Old 06-23-2009, 08:56 PM   #2
TomAmundsen
LQ Newbie
 
Registered: Jun 2009
Location: Los Angeles, CA
Distribution: FreeBSD
Posts: 3

Rep: Reputation: 0
I'm afraid what you want to do is impossible, since pdftotext writes to files and not standard out. You'd have to somehow rewrite pdftotext to output to stdout instead of to a file named on the command line.

The best solution I can think of is something like this:

pdftotext "$file_name.pdf"; grep pattern "$file_name.txt"; rm "$file_name.txt"

You can use a template like this on the inside of your loop. It's not elegant, but at least you can remove the intermediate text files....

Last edited by TomAmundsen; 06-23-2009 at 10:09 PM.
 
Old 06-24-2009, 12:06 AM   #3
Uncle_Theodore
Member
 
Registered: Dec 2007
Location: Charleston WV, USA
Distribution: Slackware 12.2, Arch Linux Amd64
Posts: 896

Rep: Reputation: 71
Well, not quite. You can make pdftotext write to stdout just fine, like this:

Code:
pdftotext filename.pdf -
(notice the dash at the end). So, something like this should work:

Code:
 for filename in *.pdf; do if [ -n "$(pdftotext "$filename" - | grep banana)" ]; then echo "There is a banana in $filename"; fi; done
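If all you want is the list of matching filenames (as in the original question), a slightly tighter variant is to use grep's `-q` flag and print the name on success. A sketch, assuming `pdftotext` (from xpdf/poppler) is on the PATH; the quotes guard against spaces in filenames:

```shell
# Print only the names of PDFs whose text contains the word.
# grep -q is silent and exits 0 on the first match.
for f in *.pdf; do
    if pdftotext "$f" - 2>/dev/null | grep -q banana; then
        echo "$f"
    fi
done
```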
 
Old 06-24-2009, 02:41 AM   #4
TomAmundsen
LQ Newbie
 
Registered: Jun 2009
Location: Los Angeles, CA
Distribution: FreeBSD
Posts: 3

Rep: Reputation: 0
Well played, sir. I should have read the man page more carefully.
 
Old 06-24-2009, 04:17 AM   #5
mikemrh9
Member
 
Registered: Nov 2003
Distribution: Arch
Posts: 136

Original Poster
Rep: Reputation: 21
That's fantastic - it's going to save me hours!

Thanks very much!

What's the "-n" for? It seems to work with or without it.
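For what it's worth, `-n` makes the `[`/`test` builtin check that a string is non-empty; a quoted non-empty string with no operator at all also tests true, which is why it works either way. A quick demonstration:

```shell
# grep prints the matching line, so $out is non-empty on a match
out="$(echo 'a banana' | grep banana)"
[ -n "$out" ] && echo "found"       # -n: true if the string is non-empty
[ "$out" ] && echo "also found"     # a lone non-empty string also tests true
```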
 
Old 06-24-2009, 06:06 AM   #6
Phieth6o
LQ Newbie
 
Registered: Dec 2007
Posts: 17

Rep: Reputation: 0
But don't take the output to be complete! This method won't work on PDFs containing hyphenation at line breaks, e.g. `ba-\nnana`. It would be best to strip hyphens and line breaks from the pdftotext output before you grep it; otherwise you shouldn't rely on the one-liner too much.

[EDIT] Checked it on one paper, and it seems that it's not line breaks that matter but page breaks, especially when there's header or footer text, which would be very hard to strip with sed.
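One way to handle the hyphenation case, at least (a sketch, not a complete fix): slurp the text and join any hyphen at end-of-line back together with sed before grepping:

```shell
# ':a' 'N' '$!ba' reads the whole input into the pattern space,
# then s/-\n//g joins words split by a hyphen at end-of-line.
printf 'ba-\nnana split\n' \
    | sed -e ':a' -e 'N' -e '$!ba' -e 's/-\n//g' \
    | grep -c banana
# prints 1
```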

Last edited by Phieth6o; 06-24-2009 at 06:14 AM.
 
Old 06-24-2009, 06:47 AM   #7
mikemrh9
Member
 
Registered: Nov 2003
Distribution: Arch
Posts: 136

Original Poster
Rep: Reputation: 21
Good point - thanks! There's a "-nopgbrk" switch to pdftotext which may help here.

Of course the other issue is this (from the man page):

"Some PDF files contain fonts whose encodings have been mangled beyond recognition. There is no way (short of OCR) to extract text from these files."

However, that's just a small subset of my papers, and I'm happy with the effort which all this is going to save me!
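Putting the thread's pieces together, a final sketch (assumes poppler/xpdf `pdftotext`; `bananas` is just the example search term from this thread): `-nopgbrk` drops the page-break characters, sed joins hyphenated line breaks, and `grep -qi` matches case-insensitively.

```shell
word=bananas
for f in *.pdf; do
    if pdftotext -nopgbrk "$f" - 2>/dev/null \
         | sed -e ':a' -e 'N' -e '$!ba' -e 's/-\n//g' \
         | grep -qi "$word"; then
        echo "$f"    # print only the names of matching PDFs
    fi
done
```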
 
  

