LinuxQuestions.org
Old 10-25-2012, 06:28 PM   #1
Woodsman
Senior Member
 
Registered: Oct 2005
Distribution: Slackware 14.1
Posts: 3,482

Rep: Reputation: 534
OCR Software and Slackware


I'm looking for free/libre OCR software recommendations. I don't need scanner support at this time. I only need the ability to convert existing scanned images.

I'm looking for the big picture. For example, is unpaper helpful?

I get the impression that cuneiform or tesseract are the only credible engine options.

I need support for two-column text layouts. A GUI front-end probably is easier for that. YAGF?

At this stage I'm looking to convert the scanned images to text. Proofreading and editing come later.

Side question: although I have a flat bed scanner, I'm wondering whether a digital camera and tripod might be faster and provide higher resolution. Thoughts? Experience?
 
Old 10-25-2012, 08:01 PM   #2
lkraemer
Member
 
Registered: Aug 2008
Posts: 111

Rep: Reputation: 10
Woodsman,
I've tried Tesseract 3.0 and TextBridge Classic 2.0, and in my experience TextBridge Classic 2.0 works better. I have also used
unpaper, and it worked very well. I am running TextBridge in Wine. I've also used convert (ImageMagick) to turn my JPG camera images of old cookbooks
into a format that TextBridge/Tesseract and unpaper can work with.

There is an OCR program among Irfanview's plugins, but it is a lot slower than TextBridge.

Somewhere on my other hard drive I've got a detailed text document on the process I use. It's also on the following forums:
http://ubuntuforums.org/showthread.p...light=cookbook
http://forums.fedoraforum.org/showthread.php?t=255946
http://forums.fedoraforum.org/showthread.php?t=255875

I'll keep searching.....


Method One........
1. Take pictures of the cookbook (or scan the cookbook to BMPs).
These JPGs or BMPs will have two cookbook pages on each image.
2. Convert the JPGs/BMPs to PBMs so that unpaper can split them into separate pages.
Under ADVANCED select 1 Bit Black, NO Dithering, and NO Compression
convert P102004043.jpg image001.pbm -- or use Irfanview's Batch Convert for this
3. Use unpaper to make two pages from each PBM file.
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 image%03d.pbm out%03d.pbm
rm image*.pbm -- remove unnecessary files
4. Convert the PBMs to TIFs for Tesseract OCR.
convert out001.pbm out001.tif -- for one file
for i in out*.pbm; do convert "$i" "${i%.pbm}.tif"; done -- for multiple files
rm out*.pbm -- remove unnecessary files
5. Use Tesseract OCR to create the text files.
tesseract out001.tif out001 -- for one file
for i in out*.tif; do tesseract "$i" "${i%.tif}"; done -- for multiple files
rm out*.tif -- remove unnecessary files
6. Create the cookbook from all the converted pages.
cat out*.txt > CookBook.txt
rm out*.txt -- remove unnecessary files
7. Edit the CookBook text file to correct the mistakes.
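
Steps 3 to 6 can also be wrapped in a single shell function. This is only a minimal sketch, assuming the two-page scans already exist as image001.pbm, image002.pbm, and so on from step 2. The function name ocr_cookbook and its optional output-name argument are made up for illustration; the unpaper, convert, and tesseract invocations are the ones listed above.

```shell
#!/bin/sh
# Sketch of Method One, steps 3 to 6, as one function.
# Assumes image001.pbm, image002.pbm, ... exist in the current directory.
ocr_cookbook() {
    book=${1:-CookBook.txt}          # hypothetical default output name

    # Step 3: split each two-page scan into single pages.
    unpaper --layout double --overwrite --deskew-scan-range 10 \
        --output-pages 2 image%03d.pbm out%03d.pbm || return 1

    # Steps 4 and 5: convert each page to TIFF, then run Tesseract on it.
    for page in out*.pbm; do
        [ -e "$page" ] || continue   # skip the literal pattern if nothing matched
        base=${page%.pbm}
        convert "$page" "$base.tif" &&
        tesseract "$base.tif" "$base" || return 1
    done

    # Step 6: join the per-page text files in glob (sorted) order.
    cat out*.txt > "$book"
}
```

Calling ocr_cookbook with no argument writes CookBook.txt. Unlike the step list, the sketch leaves the intermediate files in place so a bad page can be inspected before cleaning up.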



Script for processing photos taken of cookbooks.

https://help.ubuntu.com/community/OCR

Code:
#!/bin/sh
PAGES=100 # set to the number of pages in the PDF
SOURCE=book.pdf # set to the file name of the PDF
OUTPUT=book.txt # set to the final output file
RESOLUTION=600 # set to the resolution the scanner used (the higher, the better)

touch "$OUTPUT"
for i in $(seq 1 "$PAGES"); do
    # ImageMagick numbers PDF pages from 0, hence i - 1
    convert -monochrome -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" "page$i.tif"
    tesseract "page$i.tif" "page$i"
    # append this page's text to the output, then clean up
    cat "page$i.txt" >> "$OUTPUT"
    rm "page$i.tif" "page$i.txt"
done
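
The script above needs PAGES set by hand. A possible shortcut, assuming the pdfinfo tool from poppler (or xpdf) is installed, is to read the count from its "Pages:" line. The helper name pdf_page_count is made up for this sketch.

```shell
#!/bin/sh
# Print the page count of a PDF, taken from pdfinfo's "Pages:" line.
# Assumes pdfinfo (poppler or xpdf) is on the PATH.
pdf_page_count() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# Possible use in the script above (book.pdf is the example name it uses):
# PAGES=$(pdf_page_count book.pdf)
```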



Method Two........
1. Take pictures of the cookbook (or scan the cookbook at 600 DPI to BMPs).
These JPGs or BMPs will have two cookbook pages on each image.
2. Convert the JPGs/BMPs to PBMs so that unpaper can split them into separate pages.
Under ADVANCED select 1 Bit Black, NO Dithering, and NO Compression
convert P102004043.jpg image001.pbm -- or use Irfanview's Batch Convert for this

for i in P10*.JPG; do convert "$i" -despeckle -monochrome "${i%.JPG}.pbm"; done -- for multiple files
This doesn't work as well as using Irfanview and running the Batch.

3. Use unpaper to make two pages from each PBM file.
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 image%03d.pbm out%03d.pbm
rm image*.pbm -- remove unnecessary files
4. Convert the PBMs to TIFs for Tesseract OCR.
convert out001.pbm out001.tif -- for one file
for i in out*.pbm; do convert "$i" "${i%.pbm}.tif"; done -- for multiple files
rm out*.pbm -- remove unnecessary files
5. Use Tesseract OCR to create the text files.
tesseract out001.tif out001 -- for one file
for i in out*.tif; do tesseract "$i" "${i%.tif}"; done -- for multiple files
rm out*.tif -- remove unnecessary files
6. Create the cookbook from all the converted pages.
cat out*.txt > CookBook.txt
rm out*.txt -- remove unnecessary files
7. Edit the CookBook text file to correct the mistakes.
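
One detail in step 2: the batch loop names each PBM after its camera file (P10....pbm), while unpaper in step 3 reads image%03d.pbm. A sketch of one way to bridge that is to number the output files during conversion. The function name jpgs_to_numbered_pbms is made up for illustration; the convert options are the -despeckle -monochrome pair used above.

```shell
#!/bin/sh
# Convert camera JPGs to sequentially numbered PBMs (image001.pbm, ...)
# so that unpaper's image%03d.pbm input pattern in step 3 can find them.
jpgs_to_numbered_pbms() {
    n=1
    for photo in "$@"; do
        [ -e "$photo" ] || continue   # skip an unmatched glob pattern
        out=$(printf 'image%03d.pbm' "$n")
        convert "$photo" -despeckle -monochrome "$out" || return 1
        n=$((n + 1))
    done
}

# Example: jpgs_to_numbered_pbms P10*.JPG
```

The shell expands P10*.JPG in sorted order, so the numbering follows the camera's file numbering as long as the pages were photographed in order.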



RAMBLING & Testing.................

convert P1020215.JPG -depth 8 lk001.pbm
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in `ls lkout*.pbm`; do convert $i ${i%.pbm}.tif ; done
for i in $(ls lkout*.tif) ; do tesseract $i ${i%.tif} ; done
cat lkout*.txt > lkcookbook.txt

convert P1020215.JPG -despeckle -depth 1 lk001.pbm
unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in `ls lkout*.pbm`; do convert $i ${i%.pbm}.tif ; done
for i in $(ls lkout*.tif) ; do tesseract $i ${i%.tif} ; done
cat lkout*.txt > lkcookbook.txt

BEST OUTPUT.
convert P1020215.JPG -despeckle -depth 1 lk001.pbm
unpaper --layout double --overwrite --output-pages 2 lk%03d.pbm lkout%03d.pbm
for i in `ls lkout*.pbm`; do convert $i ${i%.pbm}.tif ; done
for i in $(ls lkout*.tif) ; do tesseract $i ${i%.tif} ; done
cat lkout*.txt > jscookbook1.txt

-colorspace Gray



convert P1020215.JPG -depth 8 lk001.pbm -- not good picture
convert P1020215.JPG -depth 1 lk001.pbm -- not good picture
convert P1020215.JPG -despeckle -depth 8 lk001.pbm -- not good picture
convert P1020215.JPG -despeckle -monochrome lk001.pbm -- Better picture
convert P1000460.JPG -despeckle -depth 8 -monochrome lk001.pbm -- Better picture
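
To repeat this kind of comparison on another photo, the option sets can be run in one go, writing one PBM per variant so the results can be viewed side by side. This is just a sketch: the function name and the output file suffixes are invented for the example, and only option sets from the tests above are used.

```shell
#!/bin/sh
# Produce one PBM per option set so the results can be compared by eye.
# The suffixes (-depth8, -despeckle-mono, ...) are made up for this sketch.
try_convert_variants() {
    photo=$1
    base=${photo%.*}
    convert "$photo" -depth 8               "$base-depth8.pbm"
    convert "$photo" -depth 1               "$base-depth1.pbm"
    convert "$photo" -despeckle -depth 8    "$base-despeckle8.pbm"
    convert "$photo" -despeckle -monochrome "$base-despeckle-mono.pbm"
}

# Example: try_convert_variants P1020215.JPG
```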

This should save you several hours of work.........

For some strange reason I can't get TextBridge Classic to run in Wine or Crossover in Slackware 14.


Thanks.

Larry
Attached Images
File Type: jpg P1020044.jpg (252.8 KB, 34 views)
Attached Files
File Type: txt out004.txt (1.1 KB, 16 views)

Last edited by lkraemer; 10-25-2012 at 09:26 PM.
 
2 members found this post helpful.
Old 10-25-2012, 08:33 PM   #3
bosth
Member
 
Registered: Apr 2011
Posts: 222

Rep: Reputation: 68
Cuneiform can do one trick that Tesseract cannot: take an images-only PDF, run it through OCR, and then reassemble the text and original images so that you get a text-searchable PDF. There are a few intermediary steps using other software, but it can all be nicely scripted.

I should have mentioned that you can also take a set of images and create a text-searchable PDF from scratch.
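
The post does not say which intermediary tools are involved. One possible combination, offered here only as an assumption, is Cuneiform's hOCR output (recognized text plus positions) fed to hocr2pdf from the ExactImage package, which layers that text over the page image. The function name searchable_page and the .hocr intermediate file name are made up for the sketch.

```shell
#!/bin/sh
# Build a one-page searchable PDF from a page image.
# Assumes cuneiform and ExactImage's hocr2pdf are installed; this tool
# pairing is a guess, not necessarily what the poster used.
searchable_page() {
    image=$1                     # e.g. a single-page TIFF
    pdf=${2:-${image%.*}.pdf}    # default: same name with .pdf
    hocr=${image%.*}.hocr        # hypothetical intermediate file name

    # Cuneiform writes the recognized text with layout data as hOCR.
    cuneiform -f hocr -o "$hocr" "$image" || return 1
    # hocr2pdf reads the hOCR on stdin and combines it with the image.
    hocr2pdf -i "$image" -o "$pdf" < "$hocr"
}

# Example: searchable_page page001.tif
```

For a multi-page book, the per-page PDFs would still need to be joined with a separate PDF tool.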

Last edited by bosth; 10-26-2012 at 10:00 AM.
 
Old 10-26-2012, 02:48 AM   #4
Alien Bob
Slackware Contributor
 
Registered: Sep 2005
Location: Eindhoven, The Netherlands
Distribution: Slackware
Posts: 5,257

Rep: Reputation: Disabled
This reminds me that I have to rebuild my scanning/OCR software packages (tesseract, cuneiform, ocropus, scantailor) and finally upload them. They should help you kickstart your OCR efforts.

Eric
 
1 members found this post helpful.
Old 10-26-2012, 02:51 AM   #5
metageek
Member
 
Registered: Jun 2007
Location: manchester, uk
Distribution: Slackware
Posts: 118

Rep: Reputation: 23
Quote:
Originally Posted by Woodsman View Post
Side question: although I have a flat bed scanner, I'm wondering whether a digital camera and tripod might be faster and provide higher resolution. Thoughts? Experience?
A digital camera and tripod will be faster than a flatbed scanner and can give higher resolution. However, there are issues with keeping the book flat, and I have never managed to solve this adequately. Some people report using glass on top of the book, but you would have to be careful with reflections.

Of course, if you do not need the book again, you could remove the binding and then it would be easy to keep the pages flat...
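
Whether the camera really gives higher resolution depends on how much of the book fills the frame. A rough way to check is to divide the photo's pixel width by the width of the area photographed, which gives an effective DPI comparable to a scanner setting. The sketch below uses made-up numbers, and the function name effective_dpi is hypothetical.

```shell
#!/bin/sh
# Effective resolution of a photo: pixels across the frame divided by the
# width of the photographed area in inches (integer arithmetic, rounds down).
effective_dpi() {
    echo $(( $1 / $2 ))
}

# Example with made-up numbers: a 4000-pixel-wide photo framing a
# 10-inch-wide two-page spread.
# effective_dpi 4000 10
```

In that example the result is 400, so the photo would sit between a 300 DPI and a 600 DPI scan.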

Last edited by metageek; 10-26-2012 at 02:55 AM. Reason: added more info
 
Old 11-10-2012, 04:07 PM   #6
lkraemer
Member
 
Registered: Aug 2008
Posts: 111

Rep: Reputation: 10
Woodsman,
I installed tesseract 3.01 and repeated my tests. It looks as if any JPG (Camera MACRO Snapshot) can be easily converted to text.

The convert.png attached describes the settings I used. I can't figure out how to get TextBridge Classic 2.0 to
convert a BMP file. I've done it before, but need to stumble across my notes again.

Maybe this information will be of help to you.

Larry
Attached Images
File Type: png convert.png (45.7 KB, 29 views)
Attached Files
File Type: txt out.txt (3.4 KB, 5 views)

Last edited by lkraemer; 11-11-2012 at 03:08 PM.
 
Old 11-10-2012, 08:00 PM   #7
Woodsman
Senior Member
 
Registered: Oct 2005
Distribution: Slackware 14.1
Posts: 3,482

Original Poster
Rep: Reputation: 534
Thanks for sharing. I haven't forgotten this thread --- I just haven't yet found time to do anything related to the topic.
 
Old 11-11-2012, 02:59 PM   #8
lkraemer
Member
 
Registered: Aug 2008
Posts: 111

Rep: Reputation: 10
Woodsman,
I finally got TextBridge Classic 2.0 to process a page, and its OCR output is compared with tesseract 3.01's.
(I had to convert the TIFF to a BMP at 1 Bit Black rather than 4 bits (4 * 3 = 12) or 8 bits (8 * 3 = 24) per RGB color.)

In my opinion TextBridge Classic 2.0 does a better job with the text conversion, but not with the layout as compared
to the original document.

Tesseract does a better job of keeping the original layout in the processed text, but doesn't do as well when
converting (OCR) to text.

I scanned a cookbook page at 300 DPI in greyscale, then processed the TIFF to make a BMP for TextBridge,
using Irfanview ver 4.33 in Wine 1.5.5.

TextBridge Classic 2.0 was also running in Wine 1.5.5 on Slackware 14.


Your results may vary.

Thanks.

Larry
Attached Images
File Type: png bmp_setup.png (46.6 KB, 4 views)
Attached Files
File Type: txt conversion2.zip.txt (65.8 KB, 2 views)
File Type: txt README.txt (570 Bytes, 2 views)

Last edited by lkraemer; 11-11-2012 at 03:10 PM.
 
Old 11-17-2012, 09:16 PM   #9
RoyaleWitCheese
LQ Newbie
 
Registered: Feb 2012
Location: Acadia,New Brunswick, Canada
Distribution: Slackware64-current
Posts: 8

Rep: Reputation: Disabled
www.slackatomic.com
 
  

