LinuxQuestions.org
Old 06-23-2009, 07:16 PM   #1
mikemrh9
Member
 
Registered: Nov 2003
Distribution: Arch
Posts: 136

Rep: Reputation: 21
Searching multiple pdf files for text


Hi.

I have a big stack of .pdf files that I need to reference, and would like to be able to quickly search them for key words.

For example, it would be great if I could get output that is simply a list of filenames for the PDF files containing the word `bananas`.

I would like to do this without creating a mountain of text files, but am struggling a little as my bash skills are not up to much.

Here is what I have so far:

for i in *.pdf; do pdftotext "$i" -; done

I would like to pipe this through `grep -l` to get just the filenames containing my key words, but I don't think that will work: it's only the file's contents, and not its filename, that `pdftotext` passes along. I've tried piping to grep and redirecting standard output, but so far everything has been doomed to failure.

I can do it if I use pdftotext to create text files first, then run grep on those, but I am in search of a more elegant solution!

Ideas gratefully accepted...
 
Old 06-23-2009, 08:56 PM   #2
TomAmundsen
LQ Newbie
 
Registered: Jun 2009
Location: Los Angeles, CA
Distribution: FreeBSD
Posts: 3

Rep: Reputation: 0
I'm afraid what you want to do is impossible, since pdftotext writes to files and not standard out. You'd have to somehow rewrite pdftotext to output to stdout instead of to a file named on the command line.

The best solution I can think of is something like this:

pdftotext "$file_name.pdf"; grep pattern "$file_name.txt"; rm "$file_name.txt"

You can use a template like this on the inside of your loop. It's not elegant, but at least you can remove the intermediate text files....

Last edited by TomAmundsen; 06-23-2009 at 10:09 PM.
 
Old 06-24-2009, 12:06 AM   #3
Uncle_Theodore
Member
 
Registered: Dec 2007
Location: Charleston WV, USA
Distribution: Slackware 12.2, Arch Linux Amd64
Posts: 896

Rep: Reputation: 71
Well, not quite. You can make pdftotext write to stdout just fine, like this:

Code:
pdftotext filename.pdf -
(notice the dash at the end). So, something like this should work:

Code:
 for filename in *.pdf; do if [ -n "$(pdftotext "$filename" - | grep banana)" ]; then echo "There is a banana in $filename"; fi; done
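If all you want is the list of matching filenames (as in the original question), a slightly tighter variant is to use grep's `-q` flag and print the name on success. A sketch, assuming `pdftotext` (from xpdf/poppler) is on the PATH; the quotes guard against spaces in filenames:

```shell
# Print only the names of PDFs whose text contains the word.
# grep -q is silent and exits 0 on the first match.
for f in *.pdf; do
    if pdftotext "$f" - 2>/dev/null | grep -q banana; then
        echo "$f"
    fi
done
```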
 
Old 06-24-2009, 02:41 AM   #4
TomAmundsen
LQ Newbie
 
Registered: Jun 2009
Location: Los Angeles, CA
Distribution: FreeBSD
Posts: 3

Rep: Reputation: 0
Well played, sir. I should have read the man page more carefully.
 
Old 06-24-2009, 04:17 AM   #5
mikemrh9
Member
 
Registered: Nov 2003
Distribution: Arch
Posts: 136

Original Poster
Rep: Reputation: 21
That's fantastic - it's going to save me hours!

Thanks very much!

What's the "-n" for? It seems to work with or without it.
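For what it's worth, `-n` makes the `[`/`test` builtin check that a string is non-empty; a quoted non-empty string with no operator at all also tests true, which is why it works either way. A quick demonstration:

```shell
# grep prints the matching line, so $out is non-empty on a match
out="$(echo 'a banana' | grep banana)"
[ -n "$out" ] && echo "found"       # -n: true if the string is non-empty
[ "$out" ] && echo "also found"     # a lone non-empty string also tests true
```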
 
Old 06-24-2009, 06:06 AM   #6
Phieth6o
LQ Newbie
 
Registered: Dec 2007
Posts: 17

Rep: Reputation: 0
But don't take the output to be complete! This method won't work on PDFs containing hyphenation at line breaks, e.g. `ba-\nnana`. It would be best to strip hyphens and line breaks from the pdftotext output before you grep it; otherwise you shouldn't rely on the one-liner too much.

[EDIT] Checked it on one paper, and it seems that it's not line breaks that matter but page breaks, especially when there's header or footer text, which would be very hard to strip with sed.
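One way to handle the hyphenation case, at least (a sketch, not a complete fix): slurp the text and join any hyphen at end-of-line back together with sed before grepping:

```shell
# ':a' 'N' '$!ba' reads the whole input into the pattern space,
# then s/-\n//g joins words split by a hyphen at end-of-line.
printf 'ba-\nnana split\n' \
    | sed -e ':a' -e 'N' -e '$!ba' -e 's/-\n//g' \
    | grep -c banana
# prints 1
```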

Last edited by Phieth6o; 06-24-2009 at 06:14 AM.
 
Old 06-24-2009, 06:47 AM   #7
mikemrh9
Member
 
Registered: Nov 2003
Distribution: Arch
Posts: 136

Original Poster
Rep: Reputation: 21
Good point - thanks! There's a "-nopgbrk" switch to pdftotext which may help here.

Of course the other issue is this (from the man page):

"Some PDF files contain fonts whose encodings have been mangled beyond recognition. There is no way (short of OCR) to extract text from these files."

However, that's just a small subset of my papers, and I'm happy with the effort which all this is going to save me!
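Putting the thread's pieces together, a final sketch (assumes poppler/xpdf `pdftotext`; `bananas` is just the example search term from this thread): `-nopgbrk` drops the page-break characters, sed joins hyphenated line breaks, and `grep -qi` matches case-insensitively.

```shell
word=bananas
for f in *.pdf; do
    if pdftotext -nopgbrk "$f" - 2>/dev/null \
         | sed -e ':a' -e 'N' -e '$!ba' -e 's/-\n//g' \
         | grep -qi "$word"; then
        echo "$f"    # print only the names of matching PDFs
    fi
done
```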
 
  

