OCR Software and Slackware

Woodsman · 10-25-2012, 06:28 PM

I'm looking for free/libre OCR software recommendations. I don't need scanner support at this time. I only need the ability to convert existing scanned images.

I'm looking for the big picture. For example, is unpaper helpful?

I get the impression that cuneiform or tesseract are the only credible engine options.

I need support for two-column text layouts. A GUI front-end probably is easier for that.

YAGF?

At this stage I'm looking to convert the scanned images to text. Proofreading and editing come later.

Side question: although I have a flat bed scanner, I'm wondering whether a digital camera and tripod might be faster and provide higher resolution. Thoughts? Experience?

lkraemer · 10-25-2012, 08:01 PM

Woodsman,
I've tried Tesseract 3.0 and TextBridge Classic 2.0, and in my experience TextBridge Classic 2.0 works better. I have also used
unpaper, and it worked very well. I am running TextBridge in Wine. I've also used convert to turn my JPG camera images of OLD cookbooks
into a format that TextBridge/Tesseract and unpaper can work with.

There is an OCR program among Irfanview's plugins, but it is a lot slower than TextBridge.

Somewhere on my other hard drive I've got a detailed txt document on the process I use. It's also on the following forums:
http://ubuntuforums.org/showthread.p...light=cookbook
http://forums.fedoraforum.org/showthread.php?t=255946
http://forums.fedoraforum.org/showthread.php?t=255875

I'll keep searching.....

Method One........
1. Take pictures of the Cookbook. (Or scan the Cookbook to BMPs.)
These JPGs or BMPs will have two Cookbook pages on each image.
2. Convert the JPGs/BMPs to PBMs to allow making separate pages with unpaper.
Under ADVANCED select 1 Bit Black, NO Dithering, and NO Compression
convert P102004043.jpg image001.pbm -- or use Irfanview's Batch Convert for this
3. Use unpaper to make two pages of each PBM file.
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 image%03d.pbm out%03d.pbm
rm image*.pbm -- remove unnecessary files
4. Convert the PBM to a TIF for Tesseract OCR.
convert out001.pbm out001.tif -- for one file
for i in out*.pbm; do convert "$i" "${i%.pbm}.tif"; done -- for multiple files
rm out*.pbm -- remove unnecessary files
5. Use Tesseract OCR to create the text file.
tesseract out001.tif out001 -- for one file
for i in out*.tif; do tesseract "$i" "${i%.tif}"; done -- for multiple files
rm out*.tif -- remove unnecessary files
6. Create the Cookbook from all the converted pages.
cat out*.txt > CookBook.txt
rm out*.txt -- remove unnecessary files
7. Edit the CookBook text file to correct the mistakes.
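
Steps 4 to 6 above can be folded into one loop. This is only a minimal sketch: it assumes ImageMagick's convert and tesseract are on the PATH and that unpaper has already written out001.pbm, out002.pbm, and so on into the current directory. The function name ocr_pages is made up for the example. Unlike the steps above, it leaves the PBM files in place so a bad run can be repeated.

```shell
#!/bin/sh
# Sketch of steps 4-6: PBM -> TIF -> text -> one combined file.
# Assumes `convert` (ImageMagick) and `tesseract` are installed.
# ocr_pages is a hypothetical helper name, not part of either tool.

ocr_pages() {
    combined=$1                         # e.g. CookBook.txt
    : > "$combined"                     # start with an empty output file

    for pbm in out*.pbm; do
        [ -e "$pbm" ] || continue       # glob matched nothing
        base=${pbm%.pbm}                # out001.pbm -> out001
        convert "$pbm" "$base.tif"      # step 4
        tesseract "$base.tif" "$base"   # step 5, writes $base.txt
        cat "$base.txt" >> "$combined"  # step 6
        rm -f "$base.tif" "$base.txt"   # drop the intermediates
    done
}

# Usage: ocr_pages CookBook.txt
```

Looping over the glob directly (out*.pbm) rather than over the output of ls keeps file names with spaces intact and avoids picking up files with other extensions.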

Script for processing a scanned Cookbook that has been saved as a PDF.

https://help.ubuntu.com/community/OCR

Code:

#!/bin/sh
PAGES=100 # set to the number of pages in the PDF
SOURCE=book.pdf # set to the file name of the PDF
OUTPUT=book.txt # set to the final output file
RESOLUTION=600 # set to the resolution the scanner used (the higher, the better)

touch "$OUTPUT"
for i in $(seq 1 "$PAGES"); do
    # convert counts PDF pages from 0, so page $i is index $i - 1
    convert -monochrome -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" "page$i.tif"
    tesseract "page$i.tif" "page$i" # writes page$i.txt
    cat "page$i.txt" >> "$OUTPUT" # append this page's text
    rm "page$i.tif" "page$i.txt"
done
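
The PAGES value in the script above has to be set by hand. One possible way around that, assuming pdfinfo (shipped with poppler or xpdf) is installed, is to read the count from the PDF itself. The function name pdf_page_count is made up for this sketch.

```shell
#!/bin/sh
# Sketch: read the page count from the PDF instead of hard-coding it.
# Assumes `pdfinfo` is available; it prints a line like "Pages:   12".
# pdf_page_count is a hypothetical helper name.

pdf_page_count() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# Usage inside the script:
#   PAGES=$(pdf_page_count "$SOURCE")
```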

Method Two........
1. Take pictures of the Cookbook. (Or scan the Cookbook at 600 DPI to BMPs.)
These JPGs or BMPs will have two Cookbook pages on each image.
2. Convert the JPGs/BMPs to PBMs to allow making separate pages with unpaper.
Under ADVANCED select 1 Bit Black, NO Dithering, and NO Compression
convert P102004043.jpg image001.pbm -- or use Irfanview's Batch Convert for this

for i in P10*.JPG; do convert -despeckle -monochrome "$i" "${i%.JPG}.pbm"; done -- for multiple files
This doesn't work as well as using Irfanview and running the Batch.

3. Use unpaper to make two pages of each PBM file.
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 image%03d.pbm out%03d.pbm
rm image*.pbm -- remove unnecessary files
4. Convert the PBM to a TIF for Tesseract OCR.
convert out001.pbm out001.tif -- for one file
for i in out*.pbm; do convert "$i" "${i%.pbm}.tif"; done -- for multiple files
rm out*.pbm -- remove unnecessary files
5. Use Tesseract OCR to create the text file.
tesseract out001.tif out001 -- for one file
for i in out*.tif; do tesseract "$i" "${i%.tif}"; done -- for multiple files
rm out*.tif -- remove unnecessary files
6. Create the Cookbook from all the converted pages.
cat out*.txt > CookBook.txt
rm out*.txt -- remove unnecessary files
7. Edit the CookBook text file to correct the mistakes.
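
One thing to watch in step 2: the multi-file loop names its output after the camera file (P10...pbm), while the unpaper command in step 3 reads image%03d.pbm, so the two do not line up as written. A possible fix, sketched here on the assumption that ImageMagick's convert is on the PATH, is to number the PBMs while converting. The function name number_images is made up for the example.

```shell
#!/bin/sh
# Sketch: convert camera JPGs to sequentially numbered PBMs
# (image001.pbm, image002.pbm, ...) so they match the image%03d.pbm
# pattern given to unpaper. Assumes `convert` (ImageMagick).
# number_images is a hypothetical helper name.

number_images() {
    n=0
    for jpg in "$@"; do
        [ -e "$jpg" ] || continue            # skip globs that matched nothing
        n=$((n + 1))
        pbm=$(printf 'image%03d.pbm' "$n")   # 1 -> image001.pbm
        convert "$jpg" -despeckle -monochrome "$pbm"
    done
    echo "$n"                                # how many files were converted
}

# Usage: number_images P10*.JPG P10*.jpg
```

Passing both P10*.JPG and P10*.jpg covers cameras that write either case; whichever pattern matches nothing is skipped.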

RAMBLING & Testing.................

convert P1020215.JPG -depth 8 lk001.pbm
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in `ls lkout*.pbm`; do convert $i ${i%.pbm}.tif ; done
for i in $(ls lkout*.tif) ; do tesseract $i ${i%.tif} ; done
cat lkout*.txt > lkcookbook.txt

convert P1020215.JPG -despeckle -depth 1 lk001.pbm
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in `ls lkout*.pbm`; do convert $i ${i%.pbm}.tif ; done
for i in $(ls lkout*.tif) ; do tesseract $i ${i%.tif} ; done
cat lkout*.txt > lkcookbook.txt

BEST OUTPUT.
convert P1020215.JPG -despeckle -depth 1 lk001.pbm
unpaper --layout double --overwrite --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in `ls lkout*.pbm`; do convert $i ${i%.pbm}.tif ; done
for i in $(ls lkout*.tif) ; do tesseract $i ${i%.tif} ; done
cat lkout*.txt > jscookbook1.txt

-colorspace Gray

convert P1020215.JPG -depth 8 lk001.pbm -- not good picture
convert P1020215.JPG -depth 1 lk001.pbm -- not good picture
convert P1020215.JPG -despeckle -depth 8 lk001.pbm -- not good picture
convert P1020215.JPG -despeckle -monochrome lk001.pbm -- Better picture
convert P1000460.JPG -despeckle -depth 8 -monochrome lk001.pbm -- Better picture
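
The option sets tried above can also be run in one pass, so the results sit next to each other for comparison. This is just a sketch, assuming ImageMagick's convert is on the PATH; try_variants and the variantN.pbm output names are made up for the example.

```shell
#!/bin/sh
# Sketch: render one photo with several convert option sets so the
# results can be compared side by side. Assumes `convert` (ImageMagick).
# try_variants and variantN.pbm are hypothetical names.

try_variants() {
    src=$1
    n=0
    while read -r opts; do
        n=$((n + 1))
        # $opts is left unquoted on purpose so it splits into options;
        # stdin is redirected so convert cannot consume the option list.
        convert "$src" $opts "variant$n.pbm" </dev/null
        echo "variant$n.pbm: $opts"
    done <<'EOF'
-depth 8
-depth 1
-despeckle -depth 8
-despeckle -monochrome
-despeckle -depth 8 -monochrome
EOF
}

# Usage: try_variants P1020215.JPG
```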

This should save you several hours of work.........

For some strange reason I can't get TextBridge Classic to run in Wine or Crossover in Slackware 14.

Thanks.

Larry

bosth · 10-25-2012, 08:33 PM

Cuneiform can do one trick that Tesseract cannot: take an images-only PDF, run it through OCR, and then reassemble the text and original images so that you get a text-searchable PDF. There are a few intermediary steps using other software, but it can all be nicely scripted.

I should have mentioned that you can also take a set of images and create a text-searchable PDF from scratch.
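
For a single page image, the idea might look something like the sketch below. It assumes cuneiform and hocr2pdf (from the ExactImage package) are installed and that the image is in a format cuneiform can read; those tool choices are an assumption here, not necessarily the exact set of helpers meant above. The function name searchable_pdf is made up.

```shell
#!/bin/sh
# Sketch: one scanned page image -> one text-searchable PDF.
# Assumes `cuneiform` and `hocr2pdf` (ExactImage) are installed.
# searchable_pdf is a hypothetical helper name.

searchable_pdf() {
    img=$1                          # e.g. page001.png
    base=${img%.*}                  # strip the extension
    # Recognize the text and keep its position data (hOCR format).
    cuneiform -f hocr -o "$base.hocr" "$img" || return 1
    # Lay the recognized text over the original image in a PDF.
    hocr2pdf -i "$img" -o "$base.pdf" < "$base.hocr"
}

# Usage: searchable_pdf page001.png
```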

Alien Bob · 10-26-2012, 02:48 AM

This reminds me that I have to rebuild my scanning/OCR software packages (tesseract, cuneiform, ocropus, scantailor) and finally upload them. They should help you kickstart your OCR efforts.

Eric

metageek · 10-26-2012, 02:51 AM

Quote:

Originally Posted by Woodsman

Side question: although I have a flat bed scanner, I'm wondering whether a digital camera and tripod might be faster and provide higher resolution. Thoughts? Experience?

A digital camera and tripod will be faster than a flatbed scanner and can give higher resolution. However, there are issues with keeping the book flat, and I have never managed to solve this adequately. Some people report using glass on top of the book, but you would have to be careful with reflections.

Of course, if you do not need the book again, you could remove the binding and then it would be easy to keep the pages flat...
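
To compare a camera against a scanner setting, the effective resolution is roughly the image width in pixels divided by the width of the area photographed in inches. A small sketch with made-up numbers (a 4000-pixel-wide photo covering a 10-inch-wide open book); effective_dpi is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: effective DPI of a photo = pixels across / inches covered.
# The numbers in the usage line are hypothetical, not measurements.

effective_dpi() {
    awk -v px="$1" -v w="$2" 'BEGIN { printf "%d\n", px / w }'
}

# Usage: effective_dpi 4000 10    # prints 400
```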

lkraemer · 11-10-2012, 04:07 PM

Woodsman,
I installed tesseract 3.01 and repeated my tests. It looks as if any JPG (Camera MACRO Snapshot) can be easily converted to text.

The convert.png attached describes the settings I used. I can't figure out how to get TextBridge Classic 2.0 to
convert a BMP file. I've done it before, but need to stumble across my notes again.

Maybe this information will be of help to you.

Larry

Woodsman · 11-10-2012, 08:00 PM

Thanks for sharing. I haven't forgotten this thread --- I just haven't yet found time to do anything related to the topic.

lkraemer · 11-11-2012, 02:59 PM

Woodsman,
I finally got TextBridge Classic 2.0 to process a page, and its OCR output is compared with tesseract 3.01's.
(I had to set the TIFF to BMP conversion to 1 Bit Black, versus 4 bits (4 * 3 = 12) or 8 bits (8 * 3 = 24) per RGB color.)

In my opinion TextBridge Classic 2.0 does a better job with the text conversion, but not on the layout as compared
to the original document.

Tesseract does a better job of keeping the original layout in the processed text, but doesn't do as well at
the text conversion (OCR) itself.

I scanned a Cookbook page at 300 DPI and greyscale. Then processed the Tiff to make a BMP for TextBridge,
using Irfanview ver 4.33 in Wine 1.5.5.

TextBridge Classic 2.0 was also running in Wine 1.5.5 on Slackware 14.

Your results may vary.

Thanks.

Larry

RoyaleWitCheese · 11-17-2012, 09:16 PM

www.slackatomic.com