LinuxQuestions.org
Old 05-28-2014, 04:55 PM   #1
di11rod
Member
 
Registered: Jan 2004
Location: Austin, TEXAS
Distribution: CentOS 6.5
Posts: 211

Rep: Reputation: 32
Export PDF pages as individual text files?


I'm putting together a search engine using Apache Solr. I have a few dozen PDF documents that I want to break apart, exporting each page to an individual text file so that I can then write scripts to convert the text to XML files for submission to Solr for indexing.

Can anyone recommend a tool or script for exporting each page as an individual text file? If you have experience with indexing PDF files within Solr, then I'd also like comments about whether this is a good or bad way to approach indexing PDF-sourced content.
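
For the XML step described above, here is a minimal sketch of wrapping one page of extracted text in Solr's XML update format (`<add><doc><field name="...">`). The field names `id`, `source`, `page`, and `text` are assumptions for illustration and would have to match whatever the Solr schema actually defines. Extracted PDF text can contain control characters such as form feeds, which XML 1.0 does not allow, so the sketch strips them first.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that are not allowed in XML 1.0 documents
# (tab, newline, and carriage return are allowed and kept).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def page_to_solr_xml(doc_id: str, source: str, page: int, text: str) -> str:
    """Return a Solr <add> message holding one page as one document.

    The field names used here are hypothetical; adjust them to the schema.
    """
    clean_text = _INVALID_XML_CHARS.sub("", text)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        "id": doc_id,          # unique key, e.g. "<file>-<page>"
        "source": source,      # name of the original PDF
        "page": str(page),     # page number within that PDF
        "text": clean_text,    # extracted page text
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value     # ElementTree escapes &, <, and > on output
    return ET.tostring(add, encoding="unicode")
```

Calling `page_to_solr_xml("manual-3", "manual.pdf", 3, page_text)` returns a string that can be saved to a file and posted to Solr's update handler. One document per page keeps search hits pointing at a specific page rather than at the whole PDF.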

Appreciatively,

di11rod
 
Old 05-29-2014, 03:41 AM   #2
j-ray
Senior Member
 
Registered: Jan 2002
Location: germany
Distribution: ubuntu, mint, suse
Posts: 1,591

Rep: Reputation: 145
pdf2txt is the tool you need, though there may be others, of course. This page shows how to use it:

http://pnaplinux.blogspot.com/2008/1...o-pdf2txt.html
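
As one possible way to script the per-page export, here is a minimal sketch. It does not use pdf2txt; it swaps in `pdftotext` and `pdfinfo` from poppler-utils, assuming both are installed and on the PATH. `pdfinfo` prints a `Pages:` line, and `pdftotext` accepts `-f` and `-l` for the first and last page to convert. The output naming scheme (`<name>_p0001.txt`) and the function names are assumptions made up for this example.

```python
import re
import subprocess
from pathlib import Path
from typing import List


def parse_page_count(pdfinfo_output: str) -> int:
    """Pull the page count out of the text printed by `pdfinfo`."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def page_command(pdf: Path, page: int, out_dir: Path) -> List[str]:
    """Build the pdftotext call that extracts exactly one page."""
    # Hypothetical naming scheme: manual_p0001.txt, manual_p0002.txt, ...
    target = out_dir / f"{pdf.stem}_p{page:04d}.txt"
    # -f and -l set the first and last page, so equal values give one page.
    return ["pdftotext", "-f", str(page), "-l", str(page), str(pdf), str(target)]


def split_pdf(pdf: Path, out_dir: Path) -> int:
    """Write one text file per page of `pdf` into `out_dir`; return the page count."""
    out_dir.mkdir(parents=True, exist_ok=True)
    info = subprocess.run(
        ["pdfinfo", str(pdf)], check=True, capture_output=True, text=True
    )
    pages = parse_page_count(info.stdout)
    for page in range(1, pages + 1):
        subprocess.run(page_command(pdf, page, out_dir), check=True)
    return pages
```

For a few dozen documents, something like `for pdf in Path("pdfs").glob("*.pdf"): split_pdf(pdf, Path("pages"))` would process a whole directory; the directory names here are placeholders.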
 
Old 06-04-2014, 12:20 AM   #3
di11rod
Member
 
Registered: Jan 2004
Location: Austin, TEXAS
Distribution: CentOS 6.5
Posts: 211

Original Poster
Rep: Reputation: 32

J-Ray,

That is perfect. I really appreciate you posting that link showing the different arguments. It illustrates exactly how I intend to use the pdf2txt tool.

Many thanks!!!!

di11rod
 
  

