Search text inside PDF files

yogomix · 07-13-2010, 09:27 AM

Hello again folks,

I'm just wondering if there are any tools that would allow me to search through text in a bundle of PDF files?

I know this is possible on Mac OS X with Spotlight, so something similar for RHEL5 would be good. I have heard of tools called Beagle and Recoll, but I can't find versions of these for RHEL5 (64-bit). I'm not altogether sure how to install things that aren't scripts or RPMs.

Thanks in advance,
yog

pljvaldez · 07-13-2010, 10:41 AM

One thing I've found with PDF files is that, depending on which program generated the PDF, the text may not be stored as true text, and therefore you can't search it. I recently blundered across this with a huge directory of PDFs created from a CAD package. When you "Save as PDF" it renders the PDF fine, but doesn't store the text as text. So searching is useless...

As for searching PDF files whose text is true text, you should be able to string a pipe together using find, grep, and pstotext.
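
A minimal sketch of such a pipe, assuming pdftotext from poppler-utils is installed and swapped in here for pstotext. The function name pdf_find_text is made up for illustration. It prints the path of every PDF under a directory whose extracted text matches a pattern, ignoring case.

```shell
# pdf_find_text PATTERN [DIR]
# Hypothetical helper: lists PDFs under DIR (default: current directory)
# whose extracted text contains PATTERN. Assumes pdftotext is on the PATH.
pdf_find_text() (
    pattern=$1
    dir=${2:-.}
    # find hands the file names to a small inline script, so names
    # containing spaces are passed through intact.
    find "$dir" -type f -name '*.pdf' -exec sh -c '
        pattern=$1; shift
        for file do
            # "-" sends the extracted text to stdout; -q silences error messages.
            if pdftotext -q "$file" - 2>/dev/null | grep -q -i -e "$pattern"; then
                printf "%s\n" "$file"
            fi
        done' sh "$pattern" {} +
)
```

Calling pdf_find_text "invoice" ~/docs would list the matching files under ~/docs. PDFs without real text simply never match.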

Toonses82 · 07-22-2010, 11:16 AM

A PDF is an image of a document so it's treated like a picture, not text. There are two ways to create a PDF; print directly to PDF using a print driver, or scan a paper document from a copier or scanner. If you're printing from a text file to PDF using a print driver, you want to see if it has an option to print as "text-searchable" PDF. If you're scanning a paper document, the scanner/copier needs to have OCR (Optical Character Recognition) capabilities. This is a feature that looks at the image and recognizes text.

Alternatively, if you already have a PDF that is not text searchable, you can find some PDF-editing software that has OCR. I don't know if any linux options exist, but you're looking for an application that probably has additional PDF-editing tools. Many apps like this usually have a few different annotation options in addition to OCR. Things like white-out, redaction, highlighting, sticky notes, and various stamps.

Keep in mind, when you OCR a PDF, this will increase the file size of the document. This is because a text-searchable PDF is still just an image of a document, but now it has an additional text layer behind it that stores the information. If you have a small novel in PDF form, this can significantly increase the size of the file. For this reason, if you have a large repository of PDFs, it's usually not a good idea to OCR them all. For most casual home users, this likely isn't too big of an issue.

Again, I don't know what software is available in the linux world, but I figured it would be useful to know exactly what you're looking for. Hope it helps.

Toonses82 · 07-23-2010, 11:24 AM

I started another thread on a related topic because I'm looking for software to edit and merge PDFs. I'm not sure if it's relevant to you, but you can read it here.

MTK358 · 07-24-2010, 07:56 AM

Quote:

Originally Posted by Toonses82

A PDF is an image of a document so it's treated like a picture, not text.

I'm not convinced that's true.

In a PDF viewer, I can select and copy text.

Some PDFs have text, and some are just a big picture with no actual "text".

Toonses82 · 07-24-2010, 12:44 PM

I think you're talking about the difference between a text searchable PDF and a regular one. If you can highlight and copy text, then you can search the document as well.

I might be wrong. It's not like I invented PDF or something. I just work extensively with PDF and document management software in the enterprise environment, and it's my job to understand the format.

MTK358 · 07-24-2010, 02:02 PM

Yes. The PDFs that contain actual text information can be searched, while the PDFs that just contain pictures of text obviously can't.

Many PDFs are just photocopies of paper documents, and thus are just pictures.
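
A rough way to tell the two kinds apart from the command line, as a sketch only: it assumes pdftotext from poppler-utils is installed, and pdf_has_text is a made-up helper name. It extracts the text and checks whether anything other than whitespace comes out.

```shell
# pdf_has_text FILE
# Hypothetical helper: succeeds (exit status 0) if pdftotext extracts at
# least one non-whitespace character from FILE, fails otherwise.
pdf_has_text() {
    # Picture-only PDFs typically yield nothing but blank lines and
    # form feeds, all of which fall under [:space:].
    pdftotext -q "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

For example, pdf_has_text scan.pdf || echo "scan.pdf has no searchable text" would flag a scanned document. A PDF that mixes real text with scanned pages would still pass, so this is only a first check.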

Linuxant · 09-15-2014, 05:12 AM

It can be done by piping the output of pdftotext into grep:
pdftotext document.pdf - | grep -C5 -n -i "search term"
http://askubuntu.com/questions/18458...m-command-line
for f in pdf_directory/*.pdf; do echo "$f"; pdftotext "$f" - | grep -i "search_term"; done

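I also found this shell function, which prints the number of matching lines for each PDF that contains the search term:
spdf () { find . -name "*.pdf" -print0 | while IFS= read -r -d '' file; do co=$(pdftotext -q "$file" - | grep -c -- "$1"); if [ "$co" -ne 0 ]; then echo "$co - $file"; fi; done; }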

There is also the pdfgrep command:
http://manpages.ubuntu.com/manpages/...pdfgrep.1.html
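
A small sketch of how pdfgrep might be used on a whole directory tree. The wrapper name pdfgrep_tree is made up, and the -r option assumes pdfgrep 1.3 or later.

```shell
# pdfgrep_tree PATTERN [DIR]
# Hypothetical wrapper around pdfgrep; DIR defaults to the current directory.
pdfgrep_tree() {
    # -r: search PDFs in DIR recursively
    # -i: ignore case
    # -n: prefix each match with its page number
    pdfgrep -r -i -n "$1" "${2:-.}"
}
```

For example, pdfgrep_tree "search term" ~/docs would print each matching line with its file name and page number.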

Another program, Recoll, can also help:
http://xmodulo.com/2013/08/how-to-se...-on-linux.html

I can confirm that some PDFs are just images exported as PDF; these are unsearchable.