Export PDF pages as individual text files?
I'm building a search engine with Apache Solr. I have a few dozen PDF documents, and I want to export each page as an individual text file so that I can then write scripts to convert those files to XML for submission to Solr for indexing.
Can anyone recommend a tool or script for exporting each page as an individual text file? If you have experience with indexing PDF files within Solr, then I'd also like comments about whether this is a good or bad way to approach indexing PDF-sourced content.
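To make the question concrete, here is a minimal sketch of the kind of pipeline I have in mind. It assumes the `pdftotext` command-line tool from Poppler is installed and relies on its behavior of ending each page with a form feed character. The function names and the Solr field names (`id`, `page`, `text`) are hypothetical and would need to match the actual schema.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path


def split_pages(raw_text):
    """Split pdftotext output into pages; each page ends with a form feed."""
    pages = raw_text.split("\f")
    # The final form feed leaves an empty trailing element; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def page_to_solr_xml(doc_id, page_number, text):
    """Build a Solr XML <add> message for one page (field names are assumed)."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = (
        ("id", f"{doc_id}-p{page_number}"),
        ("page", str(page_number)),
        ("text", text),
    )
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes <, >, and & on output
    return ET.tostring(add, encoding="unicode")


def export_pdf(pdf_path, out_dir):
    """Write one .txt and one .xml file per page of pdf_path into out_dir."""
    pdf_path, out_dir = Path(pdf_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # The "-" argument sends the extracted text to stdout instead of a file.
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True, encoding="utf-8", check=True,
    )
    for number, text in enumerate(split_pages(result.stdout), start=1):
        stem = f"{pdf_path.stem}-p{number:04d}"
        (out_dir / f"{stem}.txt").write_text(text, encoding="utf-8")
        (out_dir / f"{stem}.xml").write_text(
            page_to_solr_xml(pdf_path.stem, number, text), encoding="utf-8"
        )
```

Calling `export_pdf("manual.pdf", "pages")` would then produce files such as `pages/manual-p0001.txt` and `pages/manual-p0001.xml`. Extracted text can contain control characters that are not valid in XML, so a real script would probably need to strip those before building the documents.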
Appreciatively,
di11rod