Woodsman,
I've tried Tesseract 3.0 and TextBridge Classic 2.0, and in my experience TextBridge Classic 2.0 works better. I have also used
unpaper, and it worked very well. I am running TextBridge in Wine. I've also used convert (from ImageMagick) to turn my JPG camera images of old cookbooks
into a format that TextBridge/Tesseract and unpaper can work with.
There is an OCR program in Irfanview's plugins, but it is a lot slower than TextBridge.
Somewhere on my other hard drive I've got a detailed txt document on the process I use. It's also on the following forums:
http://ubuntuforums.org/showthread.p...light=cookbook
http://forums.fedoraforum.org/showthread.php?t=255946
http://forums.fedoraforum.org/showthread.php?t=255875
I'll keep searching.....
Method One........
1. Take pictures of the Cookbook (or scan the Cookbook to BMPs).
These JPGs or BMPs will have two Cookbook pages on each image.
2. Convert the JPGs/BMPs to PBMs so that unpaper can split them into separate pages.
convert P102004043.jpg image001.pbm   # or use Irfanview's Batch Convert for this
If you use Irfanview, under ADVANCED select 1 Bit Black, NO Dithering, and NO Compression.
3. Use unpaper to make two pages of each PBM file.
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 image%03d.pbm out%03d.pbm
rm image*.pbm   # remove unnecessary files
4. Convert the PBM to a TIF for Tesseract OCR.
convert out001.pbm out001.tif   # for one file
for i in out*.pbm; do convert "$i" "${i%.pbm}.tif"; done   # for multiple files
rm out*.pbm   # remove unnecessary files
5. Use Tesseract OCR to create the text file.
tesseract out001.tif out001   # for one file
for i in out*.tif; do tesseract "$i" "${i%.tif}"; done   # for multiple files
rm out*.tif   # remove unnecessary files
6. Create the Cookbook from all the converted pages.
cat out*.txt > CookBook.txt
rm out*.txt   # remove unnecessary files
7. Edit the CookBook Text file to correct the mistakes.
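If it helps, steps 4 to 6 can be rolled into one shell function. This is only a sketch: the name pbm_to_txt and the idea of passing the working directory as an argument are mine, not part of the process above, and it assumes convert and tesseract are on the PATH.

```shell
#!/bin/sh
# Sketch of Method One, steps 4-6. pbm_to_txt is a made-up name.
# Usage: pbm_to_txt DIR   (DIR holds the out*.pbm pages written by unpaper)
pbm_to_txt() {
    dir=$1
    for pbm in "$dir"/out*.pbm; do
        [ -e "$pbm" ] || continue                   # no pages found: glob stayed literal
        base=${pbm%.pbm}
        convert "$pbm" "$base.tif" || return 1      # step 4: PBM -> TIF
        tesseract "$base.tif" "$base" || return 1   # step 5: writes $base.txt
        rm -f "$pbm" "$base.tif"                    # remove unnecessary files
    done
    # step 6: zero-padded names (out001, out002, ...) keep the pages in order
    cat "$dir"/out*.txt > "$dir/CookBook.txt" || return 1
    rm -f "$dir"/out*.txt
}
```

Calling it as pbm_to_txt . from the directory that holds the unpaper output should leave just CookBook.txt behind, ready for step 7. It stops at the first page that fails instead of deleting files it could not convert.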
Script for processing photos taken of Cookbooks (it expects the pages in a single PDF).
https://help.ubuntu.com/community/OCR
Code:
#!/bin/sh
PAGES=100        # set to the number of pages in the PDF
SOURCE=book.pdf  # set to the file name of the PDF
OUTPUT=book.txt  # set to the final output file
RESOLUTION=600   # set to the resolution the scanner used (the higher, the better)
touch "$OUTPUT"
for i in $(seq 1 "$PAGES"); do
    # ImageMagick numbers PDF pages from 0, so page i is frame i-1
    convert -monochrome -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" "page$i.tif"
    tesseract "page$i.tif" "page$i"
    cat "page$i.txt" >> "$OUTPUT"   # append this page's text to the output
    rm "page$i.tif" "page$i.txt"    # remove unnecessary files
done
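The minus one in the convert line is there because ImageMagick counts the pages of a PDF from 0. Two small helpers make that mapping explicit, along with zero-padded page names like the ones unpaper writes in Method One. Both function names are hypothetical and are not part of the script above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the script above.

# frame_of SOURCE PAGE: print the ImageMagick selector for a 1-based page.
frame_of() {
    printf '%s[%d]\n' "$1" "$(($2 - 1))"
}

# padded_name PAGE: print a zero-padded base name, so that a glob such as
# page*.txt lists the pages in reading order (page010 after page002).
padded_name() {
    printf 'page%03d\n' "$1"
}

# frame_of book.pdf 7   prints book.pdf[6]
# padded_name 7         prints page007
```

Padding only matters if you keep the per-page files and join them with a glob later. The script above appends each page as it goes, so it does not need it.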
Method Two........
1. Take pictures of the Cookbook (or scan the Cookbook at 600 DPI to BMPs).
These JPGs or BMPs will have two Cookbook pages on each image.
2. Convert the JPGs/BMPs to PBMs so that unpaper can split them into separate pages.
convert P102004043.jpg image001.pbm   # or use Irfanview's Batch Convert for this
If you use Irfanview, under ADVANCED select 1 Bit Black, NO Dithering, and NO Compression.
for i in P10*.JPG; do convert "$i" -despeckle -monochrome "${i%.JPG}.pbm"; done   # for multiple files
The convert loop doesn't work as well as running Irfanview's Batch Convert.
3. Use unpaper to make two pages of each PBM file.
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 image%03d.pbm out%03d.pbm
rm image*.pbm   # remove unnecessary files
4. Convert the PBM to a TIF for Tesseract OCR.
convert out001.pbm out001.tif   # for one file
for i in out*.pbm; do convert "$i" "${i%.pbm}.tif"; done   # for multiple files
rm out*.pbm   # remove unnecessary files
5. Use Tesseract OCR to create the text file.
tesseract out001.tif out001   # for one file
for i in out*.tif; do tesseract "$i" "${i%.tif}"; done   # for multiple files
rm out*.tif   # remove unnecessary files
6. Create the Cookbook from all the converted pages.
cat out*.txt > CookBook.txt
rm out*.txt   # remove unnecessary files
7. Edit the CookBook Text file to correct the mistakes.
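One thing to watch in step 2: the multiple-file loop names each PBM after its photo (P10...pbm), while the unpaper line in step 3 reads image%03d.pbm. Below is a sketch of a loop that numbers the PBMs as it converts them. number_images is a name I made up, and it assumes the photos match P10*.JPG as in the example and that convert is on the PATH.

```shell
#!/bin/sh
# Sketch only. number_images is a hypothetical helper.
# Usage: number_images DIR
# Converts DIR/P10*.JPG to DIR/image001.pbm, DIR/image002.pbm, ...
# and prints how many images were converted.
number_images() {
    dir=$1
    n=0
    for jpg in "$dir"/P10*.JPG; do
        [ -e "$jpg" ] || continue       # no photos found: glob stayed literal
        n=$((n + 1))
        out=$(printf '%s/image%03d.pbm' "$dir" "$n")
        convert "$jpg" -despeckle -monochrome "$out" || return 1
    done
    echo "$n"
}
```

The photos are taken in glob order, which matches shooting order as long as the camera's file numbers have not wrapped around.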
RAMBLING & Testing.................
convert P1020215.JPG -depth 8 lk001.pbm
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in lkout*.pbm; do convert "$i" "${i%.pbm}.tif"; done
for i in lkout*.tif; do tesseract "$i" "${i%.tif}"; done
cat lkout*.txt > lkcookbook.txt
convert P1020215.JPG -despeckle -depth 1 lk001.pbm
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in lkout*.pbm; do convert "$i" "${i%.pbm}.tif"; done
for i in lkout*.tif; do tesseract "$i" "${i%.tif}"; done
cat lkout*.txt > lkcookbook.txt
BEST OUTPUT.
convert P1020215.JPG -despeckle -depth 1 lk001.pbm
unpaper --layout double --overwrite --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in lkout*.pbm; do convert "$i" "${i%.pbm}.tif"; done
for i in lkout*.tif; do tesseract "$i" "${i%.tif}"; done
cat lkout*.txt > jscookbook1.txt
-colorspace Gray
convert P1020215.JPG -depth 8 lk001.pbm   # not good picture
convert P1020215.JPG -depth 1 lk001.pbm   # not good picture
convert P1020215.JPG -despeckle -depth 8 lk001.pbm   # not good picture
convert P1020215.JPG -despeckle -monochrome lk001.pbm   # better picture
convert P1000460.JPG -despeckle -depth 8 -monochrome lk001.pbm   # better picture
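To compare settings like the ones above side by side, instead of overwriting lk001.pbm each time, a loop can write one PBM per option set. Sketch only: make_variants and the _vN suffix are hypothetical names, and the option list is just some of the combinations tried above.

```shell
#!/bin/sh
# Sketch only. make_variants is a hypothetical helper.
# Usage: make_variants PHOTO
# Writes PHOTO_v1.pbm, PHOTO_v2.pbm, ... (extension stripped from PHOTO),
# one per option set, and prints which options produced which file.
make_variants() {
    photo=$1
    base=${photo%.*}
    n=0
    for opts in "-depth 8" "-depth 1" "-despeckle -depth 8" "-despeckle -monochrome"; do
        n=$((n + 1))
        # $opts is left unquoted on purpose so it splits into separate options
        convert "$photo" $opts "${base}_v$n.pbm" || return 1
        echo "${base}_v$n.pbm: $opts"
    done
}
```

Running make_variants P1020215.JPG and opening the four PBMs next to each other makes it easier to judge which settings give the cleanest page before running unpaper and tesseract on the whole batch.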
This should save you several hours of work.........
For some strange reason I can't get TextBridge Classic to run in Wine or Crossover in Slackware 14.
Thanks.
Larry