LinuxQuestions.org


taylorkh 11-22-2010 11:24 AM

gscan2pdf - does an image at the top of a document prevent OCR?
 
Just installed gscan2pdf 0.9.20 on Ubuntu 10.04. It is hooked up to a Brother MFC 240c scanner, which scans fine with both xsane and gscan2pdf. I started with a document I printed on my laser printer - B&W at 300 dpi. GOCR produced very poor results. I switched to Tesseract and the OCR was 100% :D

Then I tried to scan the document I was actually interested in converting to a searchable PDF. NOTHING! The only differences I see between the two documents are as follows:

- document 2 has an image (State seal) at the top
- document 2 has various sizes of type
- document 2 has a signature in ink

I would have expected the program to at least OCR some of the document. It should have been able to find SOME text.

I have tried increasing the resolution of the scan to 400 then 600 dpi. No help. Set it to 1200 dpi - still waiting for the OCR to run.

I am at a loss. Any suggestions?

TIA,

Ken

p.s. I have an OLD version of Omnipage. I guess I will dig it out and install it on a VMware XP guest.

p.p.s. The 1200 DPI OCR just ran - nothing.

taylorkh 11-22-2010 12:12 PM

Well, on an XP virtual machine I just installed not my OLD purchased full version of Omnipage but the free, stripped-down teaser version which came with the MFC. It OCRed the offending document, and a second one on the same letterhead, at 99+%. The only issue was with apostrophes and quotes, but that may have been a problem with the resulting text files when I moved them to the Linux host - I have not gone back and looked at them in Windows.

Boy am I depressed. I wish I could find a good OCR program for Linux.

Ken

H_TeXMeX_H 11-22-2010 12:24 PM

Unfortunately there is no 100% solution, nor a 99%, nor 90%, but maybe 70-80%.

I would use:
http://unpaper.berlios.de/
http://www.gnu.org/software/ocrad/ocrad.html

You may also want to use imagemagick or gimp to run a few filters like white balance, maybe some brightness/contrast. Sometimes unsharp mask, threshold, levels or despeckle help too.

You kinda have to give it standard black text on a white background, almost perfectly aligned, with no specks. Then it may work up to 80% or so, sometimes more.

Tesseract is ok too. What was wrong with it? I don't get it.
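
A rough sketch of the cleanup-then-OCR idea described above, written in Python. It only assumes that ImageMagick's `convert` and the `tesseract` command are installed; the function names, file names and the 50% threshold are made up for illustration and would need tuning per document. On newer ImageMagick releases the command may be `magick` rather than `convert`.

```python
import shutil
import subprocess


def cleanup_command(src, dst, threshold_percent=50):
    """Build an ImageMagick command that pushes a scan toward
    plain black text on a white background."""
    return [
        "convert", src,
        "-colorspace", "Gray",                   # drop colour (seal, ink signature)
        "-normalize",                            # stretch the contrast
        "-despeckle",                            # remove isolated specks
        "-threshold", f"{threshold_percent}%",   # force pure black and white
        dst,
    ]


def ocr_command(image, output_base):
    """Build a Tesseract command; Tesseract writes output_base + '.txt'."""
    return ["tesseract", image, output_base]


def run_pipeline(src, cleaned="cleaned.tif", output_base="page"):
    """Clean the scan, then OCR it. Returns the name of the text file."""
    for cmd in (cleanup_command(src, cleaned), ocr_command(cleaned, output_base)):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} is not installed")
        subprocess.run(cmd, check=True)
    return output_base + ".txt"
```

Calling `run_pipeline("scan.tif")` would write `cleaned.tif` and then `page.txt`. Comparing the OCR output with and without the cleanup step is a quick way to see whether preprocessing is what the document needs.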

taylorkh 11-22-2010 04:39 PM

When I had GOCR selected, the program ran unpaper first and then the OCR - at least that is what I recall seeing as the message windows flashed up.

Ken

H_TeXMeX_H 11-23-2010 06:33 AM

When you scan, make sure to run a preview scan first. It will auto-adjust some contrast and brightness settings, which might help. Either way, try ocrad.

