09-06-2012, 09:39 PM   #1
LQ Newbie
Registered: Nov 2011
Distribution: Debian Testing (Wheezy)
Posts: 22
Blog Entries: 1

Rep: Reputation: Disabled
Tesseract - Bulk OCR whole folder of read only pdfs

I have a large number of student essays that I have archived on a CD-R in PDF format (read only). I mainly want to be able to use the text directly to make grammar examples, and also to remove names so my current students can read the essays (without disclosing whose essays they originally were). I've used Tesseract before and had reasonably good results, but that was with fresh scans, so I had the option to save them as TIFF (as far as I know, Tesseract only works with uncompressed TIFF files).

Question 1: What would be the best solution for converting a bunch of read only pdfs into tiff files?

Question 2: Is there a way to bulk convert a whole folder of pdfs to tiffs--leaving them with the same base name, but now as .tiff?

Question 3: Is there a way to bulk OCR a whole folder of the newly-made tiffs to yield the same base name (except, of course, with the results becoming .txt)?

Please let me know! I'm open to any suggestions and ideas here.
09-07-2012, 12:17 AM   #2
Senior Member
Registered: Feb 2005
Location: San Antonio, Texas
Distribution: Gentoo Hardened using OpenRC not Systemd
Posts: 1,495

Rep: Reputation: 85

2) cd to the directory

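for x in *.pdf; do gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 -sOutputFile="${x%.pdf}.tiff" "$x"; done

This assumes Ghostscript (gs) is installed. Note that tiff2pdf converts in the other direction (TIFF to PDF), so it won't work here. The "${x%.pdf}" expansion strips the .pdf suffix, so essay.pdf becomes essay.tiff and no separate rename step is needed.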

3) Not sure if I understand your question. I'll let somebody else answer it.
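For question 3, a loop over the TIFF files is probably enough. The following is only a minimal sketch, assuming the `tesseract` command-line tool is installed and the files end in `.tiff`; `ocr_dir` and the `OCR` override variable are hypothetical names made up for this example, not part of Tesseract.

```shell
#!/bin/sh
# Sketch: OCR every .tiff in a directory, keeping the same base name.
# ocr_dir is a hypothetical helper; OCR lets you swap in another command.
ocr_dir() {
    dir=${1:-.}                       # directory to process, default: current
    for f in "$dir"/*.tiff; do
        [ -e "$f" ] || continue       # nothing matched the glob, skip
        base=${f%.tiff}               # essay1.tiff -> essay1
        # tesseract takes an output base and appends .txt itself
        "${OCR:-tesseract}" "$f" "$base"
    done
}
```

After sourcing this, `ocr_dir /path/to/tiffs` should leave an `essay1.txt` next to each `essay1.tiff`. Quoting `"$f"` and `"$base"` keeps file names with spaces intact.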