LinuxQuestions.org
10-16-2013, 08:44 PM   #1
SpitFU
LQ Newbie
 
Registered: Oct 2013
Posts: 1

Rep: Reputation: Disabled
Seeking PDF library solution


Hi Folks,

I'm seeking a Linux software solution to something I'd like to do. I run Ubuntu but I'm not tied to it, or at least I can compile what I need. I have about 1200 PDF documents that are scans of a non-profit's brochures and other miscellaneous publications. They are a mix of text and images, but all have text. I'd like to see if there is software that can ingest the documents, create a searchable index, and serve up the documents over a local network to PCs and Android tablets, perhaps even Android phones. By "serve" I mean it doesn't need to render them; it can pipe the raw data and I can use software on the client to view the documents. I've worked with commercial search engines and repositories before, like Verity and Autonomy, and they would work well, but I'd like to find an open source solution because this is for personal use, not business. I'm thinking of doing this locally on a headless Ubuntu server, with no need for it to be accessible over the web.

Thanks in advance for your suggestions.
Spit
 
10-17-2013, 05:25 AM   #2
Pastychomper
Member
 
Registered: Sep 2011
Location: Scotland
Distribution: Slackware, Devuan, Android
Posts: 132

Rep: Reputation: 243
You could try Recoll or Calibre.

Calibre does a good job of indexing metadata, and can serve its library to remote clients. It doesn't do full-text search natively, but there is a plugin (which I haven't tried) intended to do just that, though it looks like it takes a bit of work to get it going.

The plugin uses Recoll, which - from a quick look at its home page - looks like it should be able to do the job on its own.
 
10-17-2013, 08:10 AM   #3
michaelk
Moderator
 
Registered: Aug 2002
Posts: 25,699

Rep: Reputation: 5895
I would assume that the PDFs are scanned images, which means the text is also just graphics and is not searchable. I have not played with Recoll, but it appears to use pdftotext, which will not work on text that is part of an image. There is OCR software, but depending on the quality of the scanned documents and the type of text, it could be a lot of work.
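
One way to find out how many of the PDFs actually need OCR is to run pdftotext over each file and see whether anything comes back. The following is only a rough Python sketch, not something anyone in this thread has tested against these documents. It assumes pdftotext is installed (on Ubuntu it ships in the poppler-utils package); the function names and the 20-character threshold are made up for illustration.

```python
import subprocess
from pathlib import Path


def looks_searchable(extracted_text: str, min_chars: int = 20) -> bool:
    """True if the text has at least min_chars non-whitespace characters.

    The threshold is an arbitrary assumption; image-only scans usually
    yield nothing but whitespace and form feeds.
    """
    visible = "".join(extracted_text.split())
    return len(visible) >= min_chars


def extract_text(pdf_path: Path) -> str:
    """Run pdftotext on one file; '-' as the output name means stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if pdftotext fails
    )
    return result.stdout


def find_image_only_pdfs(folder: Path) -> list[Path]:
    """Return the PDFs in folder that appear to have no text layer."""
    needs_ocr = []
    for pdf in sorted(folder.glob("*.pdf")):
        if not looks_searchable(extract_text(pdf)):
            needs_ocr.append(pdf)
    return needs_ocr
```

Calling `find_image_only_pdfs(Path("/srv/brochures"))` (a hypothetical directory) would list the files that probably need an OCR pass before an indexer that relies on pdftotext can find anything in them. Files that come back with text may still contain scanned pages, so treat the result as a first cut.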


https://help.ubuntu.com/community/OCR

Some more suggestions for search tools.
http://www.linux.com/news/software/a...gines-compared
 
  



All times are GMT -5.
