Looking for OCR software tips. Anybody got any?

rnturn · 02-12-2013, 03:55 PM

I have some very old typewritten documents that I've been trying to scan/convert-to-text using xsane->PBM->ocrad and I've been getting really, really awful results. So bad, in fact, that re-typing the documents looks like a better way to go than wasting any more time with OCR. I realize that some cleanup of the OCR output is almost a given, but what I'm seeing is more like 99% would have to be rekeyed. I haven't used any OCR software since the Win3.11 days and, while OCR's results weren't 100% accurate back then, they were orders of magnitude better than what I'm seeing. I would have expected typewritten text to be far easier to convert than, say, a photocopied magazine article with proportional fonts, kerning, etc.

"ocrad" recommends having at least 20 pixels per character and I've scanned the original documents at resolutions ranging from 128dpi to 2400dpi (maybe excessive, I know) and the results stink no matter what.
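
As a rough sanity check on that 20-pixel guideline, the sketch below estimates how tall a typewritten capital letter comes out at a given scan resolution. The 12-point type size and the 0.7 cap-height ratio are assumptions for illustration, not values from ocrad's documentation, so measure the actual documents if possible.

```python
# Rough estimate of character height in pixels at a given scan resolution.
# ASSUMPTIONS (not from ocrad): 12-point type, capitals about 0.7 of the
# point size. Both are illustrative defaults only.

POINTS_PER_INCH = 72.0


def char_height_px(dpi, point_size=12.0, cap_ratio=0.7):
    """Approximate height of a capital letter in pixels."""
    return dpi * (point_size / POINTS_PER_INCH) * cap_ratio


def min_dpi(target_px=20.0, point_size=12.0, cap_ratio=0.7):
    """Lowest resolution at which a capital reaches target_px pixels."""
    return target_px * POINTS_PER_INCH / (point_size * cap_ratio)


if __name__ == "__main__":
    for dpi in (128, 300, 600, 2400):
        print(dpi, "dpi ->", round(char_height_px(dpi), 1), "px")
    print("minimum dpi for 20 px:", round(min_dpi()))
```

Under these assumptions a 128 dpi scan falls short of 20 pixels per character while 300 dpi clears it, which suggests that very high resolutions such as 2400 dpi are not needed to meet this particular guideline.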

Has anyone used this combination of software and gotten reasonable results? Is there a better OSS OCR package than ocrad?

TIA...

--
Rick

ArfaSmif · 02-12-2013, 09:34 PM

I haven't used any of these, but the following are "popular" and noted in the literature. You can try "tesseract" and/or "gocr", both command-line tools. There are a few GUIs for these command-line OCR programs, for example "gImageReader" and "OcrGui". You look like you use an rpm-based Linux, so you may get lucky at rpmfind.net or rpm.pbone.net. Good luck. Hope it helps. Let us know how you go.

jefro · 02-12-2013, 09:42 PM

I've tried all the free linux ocr stuff (that I know of) without success. I have not tested any of the current windows apps either in wine or windows.

I too played with some very old windows OCR and a few very specialized ocr uses. I have no idea why the current linux ocr is so bad. Remember when you could actually see it trying to resolve each single character?

At one time we used special type disks or balls for typing into OCRX or some of the other fonts.

I gave up and found that a not-for-profit I help has a Xerox copier that somehow seems to get fantastic results after a few minutes of waiting.

H_TeXMeX_H · 02-13-2013, 02:06 AM

Use tesseract and try to make sure the images you scanned are lined up properly, and use GIMP to make the pages as close to pure black text on a white background as possible, reducing noise if needed.

Possibly useful:
https://github.com/Flameeyes/unpaper
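
The cleanup-then-OCR advice could be sketched roughly as below. The thresholding step is a plain-Python stand-in for what one would do by hand in GIMP, and the threshold of 128, the function names, and the file names are all hypothetical. Loading and saving the image is left out. Only the basic `tesseract IMAGE OUTPUTBASE -l LANG` invocation form comes from the tesseract command line.

```python
import subprocess


def binarize(rows, threshold=128):
    """Map each gray value (0-255) to pure black (0) or pure white (255).

    `rows` is a list of rows, each a list of integers. The default
    threshold of 128 is an arbitrary starting point, not a recommendation.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in rows]


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run tesseract (must be installed); text lands in OUTPUTBASE.txt."""
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
```

For example, `run_ocr("page.png", "page")` would write the recognized text to `page.txt`, assuming tesseract and its English language data are installed.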