LinuxQuestions.org
Linux - Desktop This forum is for the discussion of all Linux Software used in a desktop context.

Old 07-13-2010, 09:27 AM   #1
yogomix
LQ Newbie
 
Registered: May 2008
Posts: 7

Rep: Reputation: 0
Search text inside PDF files


Hello again folks,

I'm just wondering if there are any tools that would allow me to search through text in a bundle of PDF files?

I know this is possible on Mac OS X with Spotlight, so something similar for RHEL5 would be good. I have heard of tools called Beagle and Recoll, but I can't find versions of these for RHEL5 (64-bit). I'm not altogether sure how to install things that aren't scripts or RPMs.

Thanks in advance,
yog
 
Old 07-13-2010, 10:41 AM   #2
pljvaldez
Guru
 
Registered: Dec 2005
Location: Somewhere on the String
Distribution: Debian Squeeze (x86)
Posts: 6,092

Rep: Reputation: 269
One thing I've found with PDF files is that, depending on which program generated the PDF, the text may not be stored as true text, and therefore you can't search it. I recently blundered across this with a huge directory of PDFs created from a CAD package. When you "Save as pdf", it renders the PDF fine but doesn't store the text as text, so searching is useless...

As for searching PDF files whose text is true text, you should be able to string a pipeline together using find, pstotext, and grep.
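
A minimal sketch of such a pipeline, assuming pstotext is installed and prints the extracted text on standard output. The function name grep_pdfs is made up for illustration, and I haven't tried it on RHEL5 specifically:

```shell
#!/usr/bin/env bash
# grep_pdfs TERM [DIR]: print the name of every PDF under DIR whose text contains TERM.
# Assumes pstotext writes the extracted text to standard output.
grep_pdfs() {
    local term=$1 dir=${2:-.}
    # -print0 with read -d '' keeps file names containing spaces or newlines intact
    find "$dir" -type f -iname '*.pdf' -print0 |
    while IFS= read -r -d '' file; do
        # -q: stop at the first hit, -i: ignore case, -F: treat TERM as a literal string
        if pstotext "$file" 2>/dev/null | grep -q -i -F -- "$term"; then
            printf '%s\n' "$file"
        fi
    done
}
```

Something like grep_pdfs "invoice" ~/Documents would then list the matching files. PDFs that only contain pictures of text will simply never match.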
 
Old 07-22-2010, 11:16 AM   #3
Toonses82
Member
 
Registered: Sep 2004
Location: Olympia, WA, USA
Distribution: Linux Mint 16 Cinnamon
Posts: 117

Rep: Reputation: 15
A PDF is an image of a document so it's treated like a picture, not text. There are two ways to create a PDF: print directly to PDF using a print driver, or scan a paper document with a copier or scanner. If you're printing from a text file to PDF using a print driver, check whether it has an option to print as a "text-searchable" PDF. If you're scanning a paper document, the scanner/copier needs to have OCR (Optical Character Recognition) capabilities, a feature that looks at the image and recognizes the text in it.

Alternatively, if you already have a PDF that is not text-searchable, you can look for PDF-editing software that has OCR. I don't know if any Linux options exist, but the application you're looking for will probably have other PDF-editing tools as well. Apps like this usually offer a few annotation options in addition to OCR: white-out, redaction, highlighting, sticky notes, and various stamps.

Keep in mind that when you OCR a PDF, the file size of the document increases. This is because a text-searchable PDF is still just an image of the document, but now it has an additional text layer behind it that stores the recognized text. If you have a small novel in PDF form, this can significantly increase the size of the file. For this reason, if you have a large repository of PDFs, it's usually not a good idea to OCR them all. For most casual home users, this likely isn't much of an issue.

Again, I don't know what software is available in the Linux world, but I figured it would be useful to know exactly what you're looking for. Hope it helps.
 
Old 07-23-2010, 11:24 AM   #4
Toonses82
Member
 
Registered: Sep 2004
Location: Olympia, WA, USA
Distribution: Linux Mint 16 Cinnamon
Posts: 117

Rep: Reputation: 15
I started another thread on a related topic because I'm looking for software to edit and merge PDFs. I'm not sure if it's relevant to you, but you can read it here.
 
Old 07-24-2010, 07:56 AM   #5
MTK358
LQ 5k Club
 
Registered: Sep 2009
Posts: 6,443
Blog Entries: 3

Rep: Reputation: 713
Quote:
Originally Posted by Toonses82
A PDF is an image of a document so it's treated like a picture, not text.
I'm not convinced that's true.

In a PDF viewer, I can select and copy text.

Some PDFs have text, and some are just a big picture with no actual "text".
 
Old 07-24-2010, 12:44 PM   #6
Toonses82
Member
 
Registered: Sep 2004
Location: Olympia, WA, USA
Distribution: Linux Mint 16 Cinnamon
Posts: 117

Rep: Reputation: 15
I think you're talking about the difference between a text searchable PDF and a regular one. If you can highlight and copy text, then you can search the document as well.

I might be wrong. It's not like I invented PDF or something. I just work extensively with PDF and document management software in the enterprise environment, and it's my job to understand the format.
 
Old 07-24-2010, 02:02 PM   #7
MTK358
LQ 5k Club
 
Registered: Sep 2009
Posts: 6,443
Blog Entries: 3

Rep: Reputation: 713
Yes. The PDFs that contain actual text information can be searched, while the PDFs that just contain pictures of text obviously can't.

Many PDFs are just photocopies of paper documents, and thus are just pictures.
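
One rough way to tell the two kinds apart from the command line is to check whether a text extractor gets anything out of the file. The sketch below assumes pdftotext (from poppler-utils or xpdf) is installed; the function names are made up for illustration. Treat it as a heuristic, since an encrypted or damaged PDF would also come back empty:

```shell
#!/usr/bin/env bash
# has_text_layer FILE: succeed if pdftotext extracts at least one letter or digit.
has_text_layer() {
    # -q silences pdftotext's own messages; "-" sends the extracted text to stdout
    pdftotext -q "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# list_image_only_pdfs [DIR]: print the PDFs under DIR that yield no extractable text.
list_image_only_pdfs() {
    local dir=${1:-.}
    find "$dir" -type f -iname '*.pdf' -print0 |
    while IFS= read -r -d '' file; do
        has_text_layer "$file" || printf '%s\n' "$file"
    done
}
```

Running list_image_only_pdfs ~/Documents would then print the files that a text search cannot see into.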
 
Old 09-15-2014, 05:12 AM   #8
Linuxant
LQ Newbie
 
Registered: Apr 2013
Posts: 4

Rep: Reputation: Disabled
search text in pdf

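It can be done by converting the PDF with pdftotext and piping the output to grep:
pdftotext document.pdf - | grep -C5 -n -i "search term"
http://askubuntu.com/questions/18458...m-command-line
To search every PDF in a directory:
for f in pdf_directory/*.pdf; do echo "$f"; pdftotext "$f" - | grep -i "search_term"; done

I also found this function, which searches recursively and prints the number of matching lines next to each file that contains the term:
spdf () { find . -name "*.pdf" -print0 | while IFS= read -r -d '' file; do co=$(pdftotext -q "$file" - | grep -c -- "$1"); if [ "$co" -ne 0 ]; then echo "$co - $file"; fi; done; }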

There is also the pdfgrep command:
http://manpages.ubuntu.com/manpages/...pdfgrep.1.html
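
A small wrapper around it might look like this. It is only a sketch: the function name pdf_search is made up, and it assumes a pdfgrep version recent enough to support -r:

```shell
#!/usr/bin/env bash
# pdf_search TERM [DIR]: case-insensitive pdfgrep over every PDF under DIR.
pdf_search() {
    local term=$1 dir=${2:-.}
    if ! command -v pdfgrep >/dev/null 2>&1; then
        echo "pdf_search: pdfgrep is not installed" >&2
        return 127
    fi
    # -r: recurse into DIR, -i: ignore case, -n: show the page number of each match
    pdfgrep -r -i -n "$term" "$dir"
}
```

For example, pdf_search "search term" ~/Documents prints each matching line prefixed with its file name and page number.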

Another program, Recoll, can also help:
http://xmodulo.com/2013/08/how-to-se...-on-linux.html

I can confirm that some PDFs are just images exported as PDF; these are unsearchable.
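
For image-only PDFs like those, an OCR pass can add a text layer so that the commands above start to work. Nobody in this thread has named a Linux tool for that, so take this as an assumption on my part: the sketch below uses OCRmyPDF (the ocrmypdf command), and both the function name and the ".ocr.pdf" output naming are made up for illustration:

```shell
#!/usr/bin/env bash
# ocr_pdf INPUT [OUTPUT]: write a copy of INPUT with an OCR text layer added.
# OUTPUT defaults to INPUT with ".ocr.pdf" in place of ".pdf" (a naming choice made here).
ocr_pdf() {
    local src=$1
    local dst=${2:-${src%.pdf}.ocr.pdf}
    # --skip-text leaves pages that already contain text untouched
    ocrmypdf --skip-text "$src" "$dst"
}
```

After ocr_pdf scan.pdf, the new scan.ocr.pdf should be searchable with pdftotext or pdfgrep, at the cost of a somewhat larger file.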
 
  





All times are GMT -5. The time now is 07:05 PM.
