LinuxQuestions.org
Old 08-12-2011, 04:13 AM   #1
dimothy
LQ Newbie
 
Registered: Jun 2010
Posts: 10

Rep: Reputation: 0
OCR Server scripting


I have had some success from all you wonderful people here with some of my fairly noob-like script questions, and I was hoping to call on you all again!

Basically I am trying to set up an OCR server that takes scanned PDFs, OCRs them, and spits out searchable PDFs at the other end. After much trial and tribulation I have managed to get the various parts working on their own. The difficulty I am having is feeding multiple filenames and extensions into the for loop for the hocr2pdf command. The syntax for the command is:
Code:
hocr2pdf -i [TIFF IMAGE] -o [OUTPUT PDF] < [HOCR HTML FILE]
At present each PDF is split into single-page TIFFs and tesseracted (OCR'd) with hOCR output. This leaves me with a load of TIFFs (called ocrbook-n.tif) and HTML files (called ocrbook-n.html), where n is the corresponding page number. The script I have so far is below:

Code:
#!/bin/sh
mkdir tmp
file=$1
cp ${file} tmp
cd tmp
pdftoppm * -f 1 -l 10 -r 600 ocrbook
for i in *.ppm; do convert "$i" "`basename "$i" .ppm`.tif"; done
for i in *.tif; do tesseract "$i" "`basename "$i" .tif`" -l eng +/home/talexander/hocr.txt; done
for i in ocrbook-*.tif
        do
        base="${ocrbook%.tif}"
        hocr2pdf -i "$base.tiff" -o "$base.pdf" < "$base.html"
                done
I have tried to cobble together various suggestions from other people's findings online, but I feel the end result is a melting pot of errors and incorrect or misunderstood ideas. Can anyone help with what I suppose would be a procedural march through the ascending TIFFs and their corresponding HTML files?


P.S. The output I get at present is this:

Code:
talexander@alfocr:~$ ./ocrd.sh test.pdf
Tesseract Open Source OCR Engine v3.01 with Leptonica
Tesseract Open Source OCR Engine v3.01 with Leptonica
Tesseract Open Source OCR Engine v3.01 with Leptonica
Tesseract Open Source OCR Engine v3.01 with Leptonica
Tesseract Open Source OCR Engine v3.01 with Leptonica
Tesseract Open Source OCR Engine v3.01 with Leptonica
Tesseract Open Source OCR Engine v3.01 with Leptonica
./ocrd.sh: 13: cannot open .html: No such file
./ocrd.sh: 13: cannot open .html: No such file
./ocrd.sh: 13: cannot open .html: No such file
./ocrd.sh: 13: cannot open .html: No such file
./ocrd.sh: 13: cannot open .html: No such file
./ocrd.sh: 13: cannot open .html: No such file
./ocrd.sh: 13: cannot open .html: No such file

Last edited by dimothy; 08-12-2011 at 05:39 AM.
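
A note on the error in the post above: the loop variable is `i`, but the script expands `${ocrbook%.tif}`. No variable named `ocrbook` is ever set, so `base` ends up empty and the redirect looks for a file literally called `.html`. The `-i` argument also uses `.tiff` while the files end in `.tif`. A minimal sketch of the loop with those two points changed might look like the following. It is a dry run under assumptions: the scratch directory and the empty `ocrbook-1`/`ocrbook-2` files are made up for illustration, and `echo` prints each hocr2pdf command rather than running it.

```shell
#!/bin/sh
# Dry run: print the hocr2pdf command for each page instead of running it.
# The scratch directory and dummy files below are illustrative assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch ocrbook-1.tif ocrbook-1.html ocrbook-2.tif ocrbook-2.html

for i in ocrbook-*.tif
do
    base="${i%.tif}"    # strip the .tif suffix from the loop variable i
    # "<" is quoted so echo prints it instead of the shell redirecting input
    echo hocr2pdf -i "$base.tif" -o "$base.pdf" "<" "$base.html"
done
```

If the printed commands look right, dropping the `echo` and the quotes around `<`, then running the loop in the real tmp directory, should let hocr2pdf run once per page.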
 
Old 08-12-2011, 09:34 AM   #2
dimothy
LQ Newbie
 
Registered: Jun 2010
Posts: 10

Original Poster
Rep: Reputation: 0
Well I came across this at StackOverflow but am still unsure as to how it applies.

"First, get file without path:

Code:
filename=$(basename $fullfile)
extension=${filename##*.}
filename=${filename%.*}
"

From what I understand, I would be creating a variable called "filename" that I can call. What I am unsure of is what $fullfile is. Is this a built-in way to reference the file, or is it a variable I would have had to specify earlier in the script? Once I have a variable called $filename, I assume the use of % and # is the method in bash to break it down into its constituent parts?
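
On the question above: `$fullfile` in that snippet is not built in. It is an ordinary variable that has to be set beforehand, for example from `$1` or from a loop variable. The `##` and `%` forms are standard shell parameter expansions that trim a matching prefix or suffix from a variable's value. A small sketch, assuming a hypothetical path chosen only for illustration:

```shell
#!/bin/sh
# Hypothetical path; in a real script this would come from $1 or a loop variable.
fullfile="/home/talexander/tmp/ocrbook-3.tif"

filename=$(basename "$fullfile")   # drop the directory part: ocrbook-3.tif
extension="${filename##*.}"        # remove the longest prefix matching "*.": tif
filename="${filename%.*}"          # remove the shortest suffix matching ".*": ocrbook-3

echo "$filename $extension"        # prints: ocrbook-3 tif
```

Both expansions are part of POSIX sh, so they should behave the same under `#!/bin/sh` as under bash.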
 
  

