I'm just wondering if there are any tools that would allow me to search through text in a bundle of PDF files?
I know this is possible on Mac OS X with Spotlight, so something similar for RHEL5 would be good. I have heard of tools called Beagle and Recoll, but I can't find versions of these for RHEL5 (64-bit). I'm not altogether sure how to install things that aren't scripts or RPMs.
One thing I've found with PDF files is that, depending on what program generated the PDF, the text may not be true text and therefore you can't search it. I recently blundered across this with a huge directory of PDFs created from a CAD package. When you "Save as PDF" it renders the PDF fine, but doesn't store the text as text. So searching is useless...
As for searching PDF files whose text is true text, you should be able to string a pipe together using find, grep, and pstotext.
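The find/grep/extract pipeline described above could be sketched roughly like this in Python. This is only a sketch under some assumptions: it swaps in the pdftotext command-line tool (from xpdf/poppler) in place of pstotext, assumes that tool is installed and on the PATH, and the names pdf_to_text and search_pdfs are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path):
    """Extract text using the pdftotext CLI (assumed installed); '-' writes to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    # Treat extraction failures (encrypted or damaged files) as "no text".
    return result.stdout if result.returncode == 0 else ""


def search_pdfs(root, needle, extract=pdf_to_text):
    """Return (path, line_number, line) for each case-insensitive match under root."""
    hits = []
    needle = needle.lower()
    # rglob plays the role of find; the substring test plays the role of grep -i.
    for pdf in sorted(Path(root).rglob("*.pdf")):
        for number, line in enumerate(extract(pdf).splitlines(), start=1):
            if needle in line.lower():
                hits.append((str(pdf), number, line.strip()))
    return hits
```

Something like search_pdfs("/path/to/pdfs", "invoice") would then list every matching line along with the file it came from (the path here is just a placeholder). PDFs whose text isn't true text will simply produce no hits.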
A scanned PDF is an image of a document, so it's treated like a picture, not text. There are two ways to create a PDF: print directly to PDF using a print driver, or scan a paper document with a copier or scanner. If you're printing from a text file to PDF using a print driver, check whether it has an option to print as a "text-searchable" PDF. If you're scanning a paper document, the scanner/copier needs to have OCR (Optical Character Recognition) capabilities. This is a feature that looks at the image and recognizes the text in it.
Alternatively, if you already have a PDF that is not text-searchable, you can look for PDF-editing software that has OCR. I don't know if any Linux options exist, but you're looking for an application that probably has additional PDF-editing tools. Apps like this usually have a few different annotation options in addition to OCR, such as white-out, redaction, highlighting, sticky notes, and various stamps.
Keep in mind that running OCR on a PDF will increase the file size of the document. This is because a text-searchable scanned PDF is still just an image of a document, but now it has an additional text layer behind it that stores the recognized text. If you have a small novel in PDF form, this can noticeably increase the size of the file. For this reason, if you have a large repository of PDFs, it's usually not a good idea to OCR them all. For most casual home users, this likely isn't too big of an issue.
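If you'd rather OCR only the files that actually need it, one rough way to find them is to try extracting text from each PDF and flag the ones that come back (nearly) empty. The sketch below assumes the pdftotext command-line tool is installed; the function names and the 20-character threshold are arbitrary choices for illustration, not anything standard.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path):
    """Extract text using the pdftotext CLI (assumed installed); '-' writes to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    return result.stdout if result.returncode == 0 else ""


def has_text_layer(text, min_chars=20):
    """Heuristic: searchable if extraction yields at least min_chars non-space characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def pdfs_needing_ocr(root, extract=extract_text, min_chars=20):
    """List PDFs under root whose extracted text is empty or nearly empty."""
    return [
        str(pdf)
        for pdf in sorted(Path(root).rglob("*.pdf"))
        if not has_text_layer(extract(pdf), min_chars)
    ]
```

This is only a heuristic: a file that fails extraction for some other reason (for example, encryption) will also be reported, so it's worth spot-checking the list before running OCR on everything in it.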
Again, I don't know what software is available in the Linux world, but I figured it would be useful to know exactly what you're looking for. Hope it helps.