OCR Software and Slackware
I'm looking for free/libre OCR software recommendations. I don't need scanner support at this time. I only need the ability to convert existing scanned images.
I'm looking for the big picture. For example, is unpaper helpful? I get the impression that cuneiform and tesseract are the only credible engine options. I need support for two-column text layouts. A GUI front-end is probably easier for that. :) YAGF? At this stage I'm looking to convert the scanned images to text. Proofreading and editing come later. :) Side question: although I have a flatbed scanner, I'm wondering whether a digital camera and tripod might be faster and provide higher resolution. Thoughts? Experience? |
2 Attachment(s)
Woodsman,
I've tried Tesseract 3.0 and TextBridge Classic 2.0, and in my experience TextBridge Classic 2.0 works better. I have also used unpaper, and it worked very well. I am running TextBridge in Wine. I've also used convert to turn my JPG camera images of OLD cookbooks into a format that TextBridge/Tesseract and unpaper can work with. There is an OCR program among Irfanview's plugins, but it is a lot slower than TextBridge.

Somewhere on my other hard drive I've got a detailed txt document on the process I use. It's also on the following forums:

http://ubuntuforums.org/showthread.p...light=cookbook
http://forums.fedoraforum.org/showthread.php?t=255946
http://forums.fedoraforum.org/showthread.php?t=255875

I'll keep searching.....

Method One........

1. Take pictures of the cookbook (or scan the cookbook to BMPs). These JPGs or BMPs will have two cookbook pages on each image.

2. Convert the JPGs/BMPs to PBMs so that unpaper can make separate pages from them. Under ADVANCED select 1 Bit Black, NO Dithering, and NO Compression.

    convert P102004043.jpg image001.pbm    # or use Irfanview's Batch Convert for this

3. Use unpaper to make two pages of each PBM file.

    unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 image%03d.pbm out%03d.pbm
    rm image*.pbm    # remove unnecessary files

4. Convert the PBM to a TIF for Tesseract OCR.

    convert out001.pbm out001.tif    # for one file
    for i in out*.pbm; do convert "$i" "${i%.pbm}.tif"; done    # for multiple files
    rm out*.pbm    # remove unnecessary files

5. Use Tesseract OCR to create the text file.

    tesseract out001.tif out001    # for one file
    for i in out*.tif; do tesseract "$i" "${i%.tif}"; done    # for multiple files
    rm out*.tif    # remove unnecessary files

6. Create the cookbook from all the converted pages.

    cat out*.txt > CookBook.txt
    rm out*.txt    # remove unnecessary files

7. Edit the CookBook text file to correct the mistakes.
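For anyone who wants to run steps 2 to 6 of Method One in one go, here is a minimal sketch as shell functions. It is not Larry's own script. The function names and the `RUN` dry-run switch are made up for illustration, and it assumes the photos should be numbered image001.pbm, image002.pbm, ... so that they match the `image%03d.pbm` pattern given to unpaper.

```shell
#!/bin/sh
# Sketch of Method One, steps 2-6. RUN is a hypothetical dry-run switch:
# RUN=echo prints each command instead of executing it.
RUN=${RUN:-}

# Step 2: photos -> image001.pbm, image002.pbm, ... (the names that the
# image%03d.pbm pattern in step 3 expects).
to_pbm() {
    n=1
    for f in "$@"; do
        $RUN convert "$f" "$(printf 'image%03d.pbm' "$n")"
        n=$((n + 1))
    done
}

# Step 3: split each two-page image into two single pages.
split_pages() {
    $RUN unpaper --layout double --overwrite --deskew-scan-range 10 \
        --output-pages 2 image%03d.pbm out%03d.pbm
}

# Steps 4-5: PBM -> TIF -> text, one page at a time.
ocr_pages() {
    for f in out*.pbm; do
        [ -e "$f" ] || continue        # the glob matched nothing
        b=${f%.pbm}
        $RUN convert "$f" "$b.tif"
        $RUN tesseract "$b.tif" "$b"   # tesseract writes $b.txt
    done
}

# Step 6: join the pages in file-name order.
join_pages() {
    cat out*.txt > CookBook.txt
}
```

After sourcing the file, something like `to_pbm P10*.jpg && split_pages && ocr_pages && join_pages` runs the whole chain. Setting `RUN=echo` first shows the convert, unpaper and tesseract commands without touching any files. Globbing with `out*.pbm` instead of parsing `ls` keeps file names with odd characters intact.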
Script for processing photos taken of cookbooks: https://help.ubuntu.com/community/OCR

Code:
#!/bin/sh

Method Two........

1. Take pictures of the cookbook (or scan the cookbook at 600 DPI to BMPs). These JPGs or BMPs will have two cookbook pages on each image.

2. Convert the JPGs/BMPs to PBMs so that unpaper can make separate pages from them. Under ADVANCED select 1 Bit Black, NO Dithering, and NO Compression.

    convert P102004043.jpg image001.pbm    # or use Irfanview's Batch Convert for this
    for i in P10*.JPG; do convert -despeckle -monochrome "$i" "${i%.JPG}.pbm"; done    # for multiple files

   This doesn't work as well as using Irfanview and running the Batch.

3. Use unpaper to make two pages of each PBM file.

    unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 image%03d.pbm out%03d.pbm
    rm image*.pbm    # remove unnecessary files

4. Convert the PBM to a TIF for Tesseract OCR.

    convert out001.pbm out001.tif    # for one file
    for i in out*.pbm; do convert "$i" "${i%.pbm}.tif"; done    # for multiple files
    rm out*.pbm    # remove unnecessary files

5. Use Tesseract OCR to create the text file.

    tesseract out001.tif out001    # for one file
    for i in out*.tif; do tesseract "$i" "${i%.tif}"; done    # for multiple files
    rm out*.tif    # remove unnecessary files

6. Create the cookbook from all the converted pages.

    cat out*.txt > CookBook.txt
    rm out*.txt    # remove unnecessary files

7. Edit the CookBook text file to correct the mistakes.

RAMBLING & Testing.................
    convert P1020215.JPG -depth 8 lk001.pbm
    unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 lk%03d.pbm lkout%03d.pbm
    for i in lkout*.pbm; do convert "$i" "${i%.pbm}.tif"; done
    for i in lkout*.tif; do tesseract "$i" "${i%.tif}"; done
    cat lkout*.txt > lkcookbook.txt

    convert P1020215.JPG -despeckle -depth 1 lk001.pbm
    unpaper --layout double --overwrite --deskew-scan-range 10 --output-pages 2 lk%03d.pbm lkout%03d.pbm
    for i in lkout*.pbm; do convert "$i" "${i%.pbm}.tif"; done
    for i in lkout*.tif; do tesseract "$i" "${i%.tif}"; done
    cat lkout*.txt > lkcookbook.txt

BEST OUTPUT.

    convert P1020215.JPG -despeckle -depth 1 lk001.pbm
    unpaper --layout double --overwrite --output-pages 2 lk%03d.pbm lkout%03d.pbm
    for i in lkout*.pbm; do convert "$i" "${i%.pbm}.tif"; done
    for i in lkout*.tif; do tesseract "$i" "${i%.tif}"; done
    cat lkout*.txt > jscookbook1.txt

-colorspace Gray

    convert P1020215.JPG -depth 8 lk001.pbm    # not good picture
    convert P1020215.JPG -depth 1 lk001.pbm    # not good picture
    convert P1020215.JPG -despeckle -depth 8 lk001.pbm    # not good picture
    convert P1020215.JPG -despeckle -monochrome lk001.pbm    # Better picture
    convert P1000460.JPG -despeckle -depth 8 -monochrome lk001.pbm    # Better picture

This should save you several hours of work.........

For some strange reason I can't get TextBridge Classic to run in Wine or Crossover in Slackware 14.

Thanks. Larry |
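The trial-and-error above can be made a bit less tedious by writing every variant to its own file, so the results can be compared side by side instead of overwriting lk001.pbm each time. A rough sketch follows. The `try_variants` name, the tag labels and the `RUN` dry-run switch are invented for this example, and the option sets are simply three of the ones tried in the post.

```shell
#!/bin/sh
# Sketch: run one photo through several convert settings, one output file
# per setting. RUN is a hypothetical dry-run switch (RUN=echo prints only).
try_variants() {
    src=$1
    stem=${src%.*}
    # Each line of the list below is "<tag> <convert options>".
    while read -r tag opts; do
        [ -n "$tag" ] || continue
        # $opts is left unquoted on purpose so it splits into separate flags
        ${RUN:-} convert "$src" $opts "${stem}_$tag.pbm"
    done <<'EOF'
depth1 -depth 1
despeckle-depth1 -despeckle -depth 1
despeckle-mono -despeckle -monochrome
EOF
}
```

Calling `try_variants P1020215.JPG` would leave P1020215_depth1.pbm, P1020215_despeckle-depth1.pbm and P1020215_despeckle-mono.pbm next to the photo. Adding another line to the list adds another variant to the comparison.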
Cuneiform can do one trick that Tesseract cannot: take an images-only PDF, run it through an OCR program, and then reassemble the text and original images so that you get a text-searchable PDF. There are a few intermediate steps using other software, but it can all be nicely scripted.
I should have mentioned that you can also take a set of images and create a text-searchable PDF from scratch. |
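The post does not say which helper programs were used, so the following is only one plausible way to script the "set of images to searchable PDF" case. It assumes cuneiform's hOCR output format, `hocr2pdf` from the ExactImage package to lay the recognized text over each page image, and `pdftk` to join the pages. The function names and the `RUN` dry-run switch are made up for the sketch.

```shell
#!/bin/sh
# Sketch only: cuneiform (hOCR output), hocr2pdf (ExactImage) and pdftk are
# assumed tools; the post does not name the ones actually used.
RUN=${RUN:-}    # hypothetical dry-run switch: RUN=echo prints the commands

# One searchable PDF per page image: page1.bmp -> page1.hocr -> page1.pdf
page_to_pdf() {
    img=$1
    b=${img%.*}
    $RUN cuneiform -f hocr -o "$b.hocr" "$img"
    # hocr2pdf reads the hOCR on stdin and overlays it on the image
    # (the redirect needs the .hocr file to exist, even in a dry run)
    $RUN hocr2pdf -i "$img" -o "$b.pdf" < "$b.hocr"
}

# All pages, then one combined book.pdf
make_book() {
    pdfs=
    for img in "$@"; do
        page_to_pdf "$img"
        pdfs="$pdfs ${img%.*}.pdf"
    done
    # $pdfs is unquoted on purpose so it splits into one argument per page
    $RUN pdftk $pdfs cat output book.pdf
}
```

With the three tools installed, `make_book page*.bmp` would produce one PDF per page plus book.pdf. Starting from an images-only PDF would need one more step up front to extract the page images, which is not shown here.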
This reminds me that I have to rebuild my scanning/OCR software packages (tesseract, cuneiform, ocropus, scantailor) and finally upload them. They should help you kickstart your OCR efforts.
Eric |
Quote:
Of course, if you do not need the book again, you could remove the binding and then it would be easy to keep the pages flat... |
2 Attachment(s)
Woodsman,
I installed tesseract 3.01 and repeated my tests. It looks as if any JPG (Camera MACRO Snapshot) can be easily converted to text. The convert.png attached describes the settings I used. I can't figure out how to get TextBridge Classic 2.0 to convert a BMP file. I've done it before, but need to stumble across my notes again. Maybe this information will be of help to you. Larry |
Thanks for sharing. I haven't forgotten this thread --- I just haven't yet found time to do anything related to the topic. :)
|
3 Attachment(s)
Woodsman,
I finally got TextBridge Classic 2.0 to process a page, and its OCR output is compared with that of tesseract 3.01. (I had to convert the TIFF to a 1-bit black BMP, rather than 4 bits per RGB color (4 * 3 = 12) or 8 bits per color (24).) In my opinion TextBridge Classic 2.0 does a better job with the text conversion, but not with the layout as compared to the original document. Tesseract does a better job of keeping the original layout in the processed text, but doesn't do as well at the actual conversion (OCR) to text. I scanned a cookbook page at 300 DPI in greyscale, then processed the TIFF to make a BMP for TextBridge, using Irfanview ver 4.33 in Wine 1.5.5. TextBridge Classic 2.0 was also running in Wine 1.5.5 on Slackware 14. Your results may vary. Thanks. Larry |