LinuxQuestions.org


Woodsman 10-25-2012 07:28 PM

OCR Software and Slackware
 
I'm looking for free/libre OCR software recommendations. I don't need scanner support at this time. I only need the ability to convert existing scanned images.

I'm looking for the big picture. For example, is unpaper helpful?

I get the impression that cuneiform or tesseract are the only credible engine options.

I need support for two-column text layouts. A GUI front-end probably is easier for that. :) YAGF?

At this stage I'm looking to convert the scanned images to text. Proofreading and editing come later. :)

Side question: although I have a flat bed scanner, I'm wondering whether a digital camera and tripod might be faster and provide higher resolution. Thoughts? Experience?

lkraemer 10-25-2012 09:01 PM

2 Attachment(s)
Woodsman,
I've tried Tesseract 3.0 and TextBridge Classic 2.0, and in my experience TextBridge Classic 2.0 works better. I have also used
unpaper, and it worked very well. I am running TextBridge in Wine. I've also used convert to turn my JPG camera images of OLD cookbooks
into a format that TextBridge/Tesseract and unpaper can work with.

There is an OCR program among IrfanView's plugins, but it is a lot slower than TextBridge.

Somewhere on my other hard drive I've got a detailed txt document on the process I use. It's also on the following forums:
http://ubuntuforums.org/showthread.p...light=cookbook
http://forums.fedoraforum.org/showthread.php?t=255946
http://forums.fedoraforum.org/showthread.php?t=255875

I'll keep searching.....


Method One........
1. Take pictures of the cookbook (or scan the cookbook to BMPs).
These JPGs or BMPs will have two cookbook pages on each image.
2. Convert the JPGs/BMPs to PBMs so that unpaper can split them into separate pages.
convert P102004043.jpg image001.pbm # or use IrfanView's Batch Convert for this
In IrfanView, under ADVANCED select 1 Bit Black, NO Dithering, and NO Compression
3. Use unpaper to make two pages out of each PBM file.
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 image%03d.pbm out%03d.pbm
rm image*.pbm # remove unnecessary files
4. Convert the PBMs to TIFs for Tesseract OCR.
convert out001.pbm out001.tif # for one file
for i in out*.pbm; do convert "$i" "${i%.pbm}.tif"; done # for multiple files
rm out*.pbm # remove unnecessary files
5. Use Tesseract OCR to create the text files.
tesseract out001.tif out001 # for one file
for i in out*.tif; do tesseract "$i" "${i%.tif}"; done # for multiple files
rm out*.tif # remove unnecessary files
6. Create the cookbook from all the converted pages.
cat out*.txt > CookBook.txt
rm out*.txt # remove unnecessary files
7. Edit the CookBook text file to correct the mistakes.
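
Steps 3 to 6 above can be strung together in one function. This is only a minimal sketch: it assumes unpaper, ImageMagick's convert, and tesseract are on the PATH and that step 2 has already produced image001.pbm, image002.pbm, and so on. The name ocr_cookbook and its optional output-file argument are made up for illustration.

```shell
#!/bin/sh
# Sketch of steps 3-6: split double pages, OCR each page, join the text.
# Assumes image001.pbm, image002.pbm, ... already exist (step 2).
# ocr_cookbook is a hypothetical name, not part of any of the tools.
ocr_cookbook() {
    out=${1:-CookBook.txt}

    # Step 3: split each two-page image into out001.pbm, out002.pbm, ...
    unpaper --layout double --overwrite --deskew-scan-range 10 \
        --output-pages 2 image%03d.pbm out%03d.pbm || return 1

    # Steps 4 and 5: convert every page to TIF, then OCR it
    for pbm in out*.pbm; do
        [ -e "$pbm" ] || continue               # glob matched nothing
        base=${pbm%.pbm}
        convert "$pbm" "$base.tif" || return 1
        tesseract "$base.tif" "$base" || return 1   # writes $base.txt
        rm "$pbm" "$base.tif"                   # drop the intermediates
    done

    # Step 6: join the pages; zero-padded names keep them in page order
    cat out*.txt > "$out"
}
```

Calling `ocr_cookbook` with no argument writes CookBook.txt. Unlike the manual steps, the sketch leaves image*.pbm and out*.txt in place, so a bad run can be inspected before anything is deleted.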



Script for processing photos taken of cookbooks.

https://help.ubuntu.com/community/OCR

Code:

#!/bin/sh
# OCR a scanned PDF one page at a time with ImageMagick and Tesseract.
PAGES=100 # set to the number of pages in the PDF
SOURCE=book.pdf # set to the file name of the PDF
OUTPUT=book.txt # set to the final output file
RESOLUTION=600 # set to the resolution the scanner used (the higher, the better)

touch "$OUTPUT"
for i in $(seq 1 "$PAGES"); do
    # ImageMagick counts PDF pages from 0, hence the i - 1
    convert -monochrome -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" "page$i.tif"
    tesseract "page$i.tif" "page$i" # writes page$i.txt
    cat "page$i.txt" >> "$OUTPUT" # append this page's text to the output
    rm "page$i.tif" "page$i.txt" # remove the intermediate files
done
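
PAGES has to be set by hand in the script above. One possible way to fill it in automatically is to ask pdfinfo for the page count. This is a sketch under the assumption that pdfinfo (from poppler or xpdf) is installed; it is not used anywhere in the original script, and pdf_pages is a name invented here.

```shell
#!/bin/sh
# Sketch: read the page count from pdfinfo instead of hard-coding PAGES.
# Assumes pdfinfo (poppler or xpdf) is installed; pdf_pages is a
# hypothetical helper name.
pdf_pages() {
    # pdfinfo prints a line such as "Pages:          12"
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}
```

In the script this would replace the fixed value with `PAGES=$(pdf_pages "$SOURCE")`, placed after the SOURCE line.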




Method Two........
1. Take pictures of the cookbook (or scan the cookbook at 600 DPI to BMPs).
These JPGs or BMPs will have two cookbook pages on each image.
2. Convert the JPGs/BMPs to PBMs so that unpaper can split them into separate pages.
convert P102004043.jpg image001.pbm # or use IrfanView's Batch Convert for this
In IrfanView, under ADVANCED select 1 Bit Black, NO Dithering, and NO Compression

for i in P10*.JPG; do convert "$i" -despeckle -monochrome "${i%.JPG}.pbm"; done # for multiple files
This doesn't work as well as running the batch conversion in IrfanView.

3. Use unpaper to make two pages out of each PBM file.
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 image%03d.pbm out%03d.pbm
rm image*.pbm # remove unnecessary files
4. Convert the PBMs to TIFs for Tesseract OCR.
convert out001.pbm out001.tif # for one file
for i in out*.pbm; do convert "$i" "${i%.pbm}.tif"; done # for multiple files
rm out*.pbm # remove unnecessary files
5. Use Tesseract OCR to create the text files.
tesseract out001.tif out001 # for one file
for i in out*.tif; do tesseract "$i" "${i%.tif}"; done # for multiple files
rm out*.tif # remove unnecessary files
6. Create the cookbook from all the converted pages.
cat out*.txt > CookBook.txt
rm out*.txt # remove unnecessary files
7. Edit the CookBook text file to correct the mistakes.
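
One thing to watch in step 2: the batch loop keeps the camera's file names (P10....pbm), while the unpaper command in step 3 looks for image001.pbm, image002.pbm, and so on. A possible way around that is to number the files during conversion. This is a sketch only; number_pages is a made-up name, and the convert options are simply the ones from the loop above.

```shell
#!/bin/sh
# Sketch: convert camera JPGs to sequentially numbered 1-bit PBMs so that
# unpaper's image%03d.pbm pattern finds them.
# number_pages is a hypothetical helper name.
number_pages() {
    n=1
    for jpg in "$@"; do
        [ -e "$jpg" ] || continue       # skip an unmatched glob
        convert "$jpg" -despeckle -monochrome \
            "$(printf 'image%03d.pbm' "$n")" || return 1
        n=$((n + 1))
    done
}
```

It would be called as `number_pages P10*.JPG`; the shell expands the glob in sorted order, so the page numbers follow the camera's own numbering.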



RAMBLING & Testing.................

convert P1020215.JPG -depth 8 lk001.pbm
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in lkout*.pbm; do convert "$i" "${i%.pbm}.tif"; done
for i in lkout*.tif; do tesseract "$i" "${i%.tif}"; done
cat lkout*.txt > lkcookbook.txt

convert P1020215.JPG -despeckle -depth 1 lk001.pbm
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in lkout*.pbm; do convert "$i" "${i%.pbm}.tif"; done
for i in lkout*.tif; do tesseract "$i" "${i%.tif}"; done
cat lkout*.txt > lkcookbook.txt

BEST OUTPUT.
convert P1020215.JPG -despeckle -depth 1 lk001.pbm
unpaper --layout double --overwrite --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in lkout*.pbm; do convert "$i" "${i%.pbm}.tif"; done
for i in lkout*.tif; do tesseract "$i" "${i%.tif}"; done
cat lkout*.txt > jscookbook1.txt

-colorspace Gray



convert P1020215.JPG -depth 8 lk001.pbm # not a good picture
convert P1020215.JPG -depth 1 lk001.pbm # not a good picture
convert P1020215.JPG -despeckle -depth 8 lk001.pbm # not a good picture
convert P1020215.JPG -despeckle -monochrome lk001.pbm # better picture
convert P1000460.JPG -despeckle -depth 8 -monochrome lk001.pbm # better picture
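
Trials like these overwrite lk001.pbm each time, which makes them hard to compare. A small sketch of one way to keep every result: write one file per option set. The option sets are the ones tried above; try_variants and the variantN.pbm names are invented for the example.

```shell
#!/bin/sh
# Sketch: write one PBM per candidate option set so the results can be
# compared side by side. try_variants is a hypothetical helper name.
try_variants() {
    src=$1
    n=1
    # One option set per line. $opts is left unquoted on purpose so the
    # shell splits it into separate convert arguments.
    while read -r opts; do
        convert "$src" $opts "variant$n.pbm" || return 1
        echo "variant$n.pbm: $opts"
        n=$((n + 1))
    done <<EOF
-depth 8
-depth 1
-despeckle -depth 8
-despeckle -monochrome
-despeckle -depth 8 -monochrome
EOF
}
```

Running `try_variants P1020215.JPG` leaves variant1.pbm through variant5.pbm in the current directory and prints which options produced each one.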

This should save you several hours of work.........

For some strange reason I can't get TextBridge Classic to run in Wine or Crossover in Slackware 14.


Thanks.

Larry

bosth 10-25-2012 09:33 PM

Cuneiform can do one trick that Tesseract cannot: take an images-only PDF, run it through an OCR program and then reassemble the text and original images so that you get a text-searchable PDF. There are a few intermediate steps using other software, but it can all be nicely scripted.

I should have mentioned that you can also take a set of images and create a text-searchable PDF from scratch.
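
The post does not say which other software is involved. One combination that could do this, and it is purely an assumption here, is Cuneiform's hOCR output fed into hocr2pdf from the ExactImage package. A rough sketch for a single page image follows; searchable_page is a made-up name.

```shell
#!/bin/sh
# Sketch: one scanned page image -> one text-searchable PDF page.
# Assumes cuneiform and hocr2pdf (ExactImage) are installed; the thread
# does not name the tools actually used.
# searchable_page is a hypothetical helper name.
searchable_page() {
    img=$1
    base=${img%.*}
    # Cuneiform writes the recognized text plus its position as hOCR
    cuneiform -f hocr -o "$base.hocr" "$img" || return 1
    # hocr2pdf reads the hOCR on stdin and combines it with the image
    hocr2pdf -i "$img" -o "$base.pdf" < "$base.hocr"
}
```

The per-page PDFs would still have to be joined into one document with a separate PDF tool.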

Alien Bob 10-26-2012 03:48 AM

This reminds me that I have to rebuild my scanning/OCR software packages (tesseract, cuneiform, ocropus, scantailor) and finally upload them. They should help you kickstart your OCR efforts.

Eric

metageek 10-26-2012 03:51 AM

Quote:

Originally Posted by Woodsman (Post 4815021)
Side question: although I have a flat bed scanner, I'm wondering whether a digital camera and tripod might be faster and provide higher resolution. Thoughts? Experience?

A digital camera and tripod will be faster than a flatbed scanner and can give higher resolution. However, there are issues with keeping the book flat, and I have never managed to solve this adequately. Some people report using glass on top of the book, but you would have to be careful with reflections.

Of course, if you do not need the book again, you could remove the binding and then it would be easy to keep the pages flat...
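
Whether a camera really gives higher resolution can be checked with a little arithmetic: the effective DPI is the pixel count along one edge of the photo divided by the length of paper that edge covers. A small sketch; effective_dpi is a made-up helper, and the numbers in the usage note are invented, not measurements from this thread.

```shell
#!/bin/sh
# Sketch: effective resolution of a photographed page.
# effective_dpi is a hypothetical helper; arguments are the pixel count
# along one edge and the length in inches that the edge covers.
effective_dpi() {
    awk -v px="$1" -v len="$2" 'BEGIN { printf "%d\n", px / len }'
}
```

For example, `effective_dpi 4000 11` (a 4000-pixel-wide photo covering an 11-inch-wide open book) prints 363, which can then be compared with the DPI setting of the scanner.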

lkraemer 11-10-2012 05:07 PM

2 Attachment(s)
Woodsman,
I installed tesseract 3.01 and repeated my tests. It looks as if any JPG (Camera MACRO Snapshot) can be easily converted to text.

The convert.png attached describes the settings I used. I can't figure out how to get TextBridge Classic 2.0 to
convert a BMP file. I've done it before, but need to stumble across my notes again.

Maybe this information will be of help to you.

Larry

Woodsman 11-10-2012 09:00 PM

Thanks for sharing. I haven't forgotten this thread --- I just haven't yet found time to do anything related to the topic. :)

lkraemer 11-11-2012 03:59 PM

3 Attachment(s)
Woodsman,
I finally got TextBridge Classic 2.0 to process a page, and its OCR output is compared with that of tesseract 3.01.
(I had to save the TIFF as a BMP with 1 Bit Black, rather than 4 bits (4 * 3 = 12) or 8 bits (8 * 3 = 24) per RGB color.)

In my opinion TextBridge Classic 2.0 does a better job with the text conversion, but not with reproducing the layout of
the original document.

Tesseract does a better job of keeping the original layout in the processed text, but doesn't do as well at the
OCR conversion itself.
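
One way to put a number on which engine converts the text better is to compare each engine's output with a hand-corrected copy of the same page. The sketch below counts the reference words that the OCR output got wrong or dropped, ignoring layout. It uses only standard tools (tr, diff, grep); ocr_word_errors is a made-up name.

```shell
#!/bin/sh
# Sketch: count the words of a proofread reference that an OCR output
# got wrong or dropped. ocr_word_errors is a hypothetical helper name.
# Usage: ocr_word_errors reference.txt ocr-output.txt
ocr_word_errors() {
    ref=$(mktemp) || return 1
    ocr=$(mktemp) || return 1
    # One word per line, so diff compares word by word and ignores layout
    tr -s '[:space:]' '\n' < "$1" > "$ref"
    tr -s '[:space:]' '\n' < "$2" > "$ocr"
    # Lines starting with "<" are reference words missing from the OCR text
    diff "$ref" "$ocr" | grep -c '^<'
    rm -f "$ref" "$ocr"
}
```

Running it once per engine against the same corrected page gives two counts that can be compared directly; the lower count is the better conversion.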

I scanned a cookbook page at 300 DPI in greyscale, then processed the TIFF to make a BMP for TextBridge,
using IrfanView ver 4.33 in Wine 1.5.5.

TextBridge Classic 2.0 was also running in Wine 1.5.5 on Slackware 14.


Your results may vary.

Thanks.

Larry

RoyaleWitCheese 11-17-2012 10:16 PM

www.slackatomic.com


All times are GMT -5.