I'm just wondering if there are any tools that would allow me to search through text in a bundle of PDF files?
I know this is possible on Mac OS X with Spotlight, so something similar for RHEL5 would be good. I have heard of tools called Beagle and Recoll, but I can't find versions of them for RHEL5 (64-bit). I'm not altogether sure how to install things that aren't scripts or RPMs.
One thing I've found with PDF files is that, depending on which program generated the PDF, the text may not be true text, and therefore you can't search it. I recently blundered across this with a huge directory of PDFs created from a CAD package. When you "Save as pdf" it renders the PDF fine, but doesn't store the text as text. So searching is useless...
As to searching pdf files with text that is true text, you should be able to string a pipe together using find, grep, and pstotext.
A PDF is an image of a document, so it's treated like a picture, not text. There are two ways to create a PDF: print directly to PDF using a print driver, or scan a paper document with a copier or scanner. If you're printing from a text file to PDF using a print driver, check whether it has an option to print as a "text-searchable" PDF. If you're scanning a paper document, the scanner/copier needs to have OCR (Optical Character Recognition) capabilities. This is a feature that looks at the image and recognizes text.
Alternatively, if you already have a PDF that is not text-searchable, you can look for PDF-editing software that has OCR. I don't know if any Linux options exist, but you're looking for an application that probably has additional PDF-editing tools. Apps like this usually offer a few annotation options in addition to OCR: things like white-out, redaction, highlighting, sticky notes, and various stamps.
Keep in mind, when you OCR a PDF, this will increase the file size of the document. This is because a text-searchable PDF is still just an image of a document, but now it has an additional text layer behind it that stores the information. If you have a small novel in PDF form, this can significantly increase the size of the file. For this reason, if you have a large repository of PDFs, it's usually not a good idea to OCR them all. For most casual home users, this likely isn't too big of an issue.
Again, I don't know what software is available in the Linux world, but I figured it would be useful to know exactly what you're looking for. Hope it helps.
I started another thread on a related topic because I'm looking for software to edit and merge PDFs. I'm not sure if it's relevant to you, but you can read it here.
I think you're talking about the difference between a text searchable PDF and a regular one. If you can highlight and copy text, then you can search the document as well.
I might be wrong. It's not like I invented PDF or something. I just work extensively with PDF and document management software in the enterprise environment, and it's my job to understand the format.
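The "can you highlight and copy text" test can be approximated from the command line. Below is a minimal sketch, assuming pdftotext is installed; the function names has_text_layer and list_unsearchable are made up for illustration. A PDF counts as searchable here if pdftotext extracts anything other than whitespace, which is only a rough heuristic.

```shell
# has_text_layer FILE: succeed if pdftotext extracts any non-whitespace text.
# (Hypothetical helper name; -q and "-" are the same options used elsewhere
# in this thread: quiet mode, and write the text to stdout.)
has_text_layer() (
    text=$(pdftotext -q "$1" - 2>/dev/null | tr -d '[:space:]')
    [ -n "$text" ]
)

# list_unsearchable DIR: print every PDF under DIR that has no extractable text.
# Assumes file names contain no newlines.
list_unsearchable() (
    find "${1:-.}" -type f -name '*.pdf' -print | while IFS= read -r file; do
        has_text_layer "$file" || printf '%s\n' "$file"
    done
)
```

Running list_unsearchable on the directory of CAD-generated PDFs should show which files would need OCR before grep can find anything in them.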
It can be done with pdftotext, piping its output into grep:
pdftotext document.pdf - | grep -C5 -n -i "search term"
Source: http://askubuntu.com/questions/18458...m-command-line
To search every PDF in a directory:
for f in pdf_directory/*.pdf; do echo "$f"; pdftotext "$f" - | grep -i "search_term"; done
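I found this command. It defines a shell function that prints the number of matching lines next to each PDF under the current directory that contains the search term:
spdf () { find . -name "*.pdf" -print0 | while IFS= read -r -d '' file; do co=$(pdftotext -q "$file" - | grep -c -- "$1"); if [ "$co" -ne 0 ]; then echo "$co - $file"; fi; done; }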
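The commands above re-run pdftotext on every search, which gets slow for a large bundle of PDFs. One possible workaround, short of installing a real indexer like Recoll, is to extract the text once and grep the cached copies. This is only a sketch under assumptions: build_cache and search_cache are hypothetical names, the cache layout is my own choice, file names are assumed to contain no newlines, and grep is assumed to support -r (GNU grep does).

```shell
# build_cache SRC_DIR CACHE_DIR: extract each PDF's text into CACHE_DIR,
# mirroring the directory layout, and skip PDFs whose cached text is current.
build_cache() (
    src=${1%/}
    cache=${2%/}
    find "$src" -type f -name '*.pdf' -print | while IFS= read -r pdf; do
        txt="$cache/${pdf#"$src"/}.txt"
        mkdir -p "$(dirname "$txt")"
        # Re-extract only if the cache file is missing or older than the PDF.
        if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
            pdftotext -q "$pdf" "$txt"
        fi
    done
)

# search_cache CACHE_DIR TERM: print the PDFs (paths relative to SRC_DIR)
# whose cached text contains TERM, case-insensitively.
search_cache() (
    cache=${1%/}
    grep -rli -- "$2" "$cache" | while IFS= read -r txt; do
        rel=${txt#"$cache"/}
        printf '%s\n' "${rel%.txt}"
    done
)
```

For example, build_cache ~/docs ~/.pdfcache followed by search_cache ~/.pdfcache "search term" lists the matching PDFs; rerun build_cache after adding or changing files.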