LinuxQuestions.org


forbinproject 09-06-2012 08:39 PM

Tesseract - Bulk OCR whole folder of read only pdfs
 
I have a large number of student essays archived on a CD-R in PDF format (read only). I mainly want to use the text directly to make grammar examples, and also to remove names so my current students can read the essays without learning whose they originally were. I've used tesseract before with reasonably good results, but that was with fresh scans, so I had the option of saving them as TIFF (as far as I know, Tesseract only works with uncompressed TIFF files).

Question 1: What would be the best solution for converting a bunch of read only pdfs into tiff files?

Question 2: Is there a way to bulk convert a whole folder of pdfs to tiffs--leaving them with the same base name, but now as .tiff?

Question 3: Is there a way to bulk OCR a whole folder of the newly-made tiffs to yield the same base name (except, of course, with the results becoming .txt)?

Please let me know! I'm open to any suggestions and ideas here.

fakie_flip 09-06-2012 11:17 PM

1) http://xmodulo.blogspot.com/2012/06/...df-format.html

2) cd to the directory

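for x in *.pdf; do gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 -sOutputFile="${x%.pdf}.tiff" "$x"; done

This uses Ghostscript (gs). ${x%.pdf} strips the extension, so each TIFF keeps the same base name and no rename step is needed. A multi-page PDF becomes a multi-page TIFF.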

3) Not sure if I understand your question. I'll let somebody else answer it.
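
For question 3, one possible approach is a loop over the TIFFs. This is only a minimal sketch, assuming the files end in .tiff, sit in a single directory, and tesseract is on the PATH. The function name ocr_dir and the OCR_CMD override are made up for illustration (OCR_CMD just lets you dry-run the loop with a stand-in command). Whether multi-page TIFFs work depends on how your tesseract was built, so try it on one file first.

```shell
#!/bin/sh
# Bulk OCR sketch: run tesseract on every .tiff in a directory.
# ocr_dir and OCR_CMD are hypothetical names, not part of tesseract.
ocr_dir() {
    dir=${1:-.}
    for f in "$dir"/*.tiff; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        # tesseract takes an output *base* name and appends .txt itself,
        # so essay1.tiff yields essay1.txt.
        "${OCR_CMD:-tesseract}" "$f" "${f%.tiff}"
    done
}

# Usage (the path is an example):
#   ocr_dir ~/essays
```

Quoting "$f" keeps file names with spaces intact, and because tesseract adds the .txt suffix on its own, passing the name without .tiff gives the same base name the question asks for.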


All times are GMT -5.