dgermann 08-04-2004 10:16 PM

Best OCR application?

I would like to do a lot of document imaging in my small office.

What is the best OCR app for use with scanned images in Linux?

Tinkster 08-04-2004 10:19 PM

I only know gocr and the front-end kooka ...
Not overly exciting if you've ever used omnipage
or recognita.

Did a search:


dgermann 08-04-2004 10:33 PM


Thanks for doing that search! As I spend more time with Linux, I will know where to go to do those sorts of searches.

Tinkster 08-04-2004 10:43 PM

ilnli 09-30-2010 09:14 PM


Use tesseract-ocr or, if you want an easy-to-use service, use this online optical character recognition tool.

dgermann 09-30-2010 10:35 PM



Glad you found this thread! I am still looking.

What I have done for the interim is to use a WinXP machine to support both the scanning and then the OCR work via ABBYY FineReader. That has been satisfactory for me, but it is about the only daily reason I have to have any Windows-based machine on the premises.

I'll have to check into how reliable tesseract-ocr is these days. I cannot use the online service because of confidentiality needs. Do you use either?

:- Doug.

ilnli 10-01-2010 06:23 AM

I've used tesseract-ocr, which is good. It runs on Linux, but you have to do some tweaks to your image to get good results from it. The stuff I work on is confidential too, so I mainly used the which works for me.

pwc101 10-01-2010 06:46 AM

The best out-of-the-box solution I've found is WatchOCR. It's a liveCD distro whose sole purpose is OCR. You put your images in a watch directory, and then a little script converts them into searchable PDFs. With some tweaking, it ought to be possible to save the text as well as the searchable PDF. For OCR it uses Cuneiform, and layout analysis is done with ExactCode.
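For anyone curious what a watch-directory setup looks like, here is a minimal sketch of the idea. It is not WatchOCR's actual script: it swaps in the tesseract command line instead of Cuneiform, and it assumes a tesseract build new enough to accept the `pdf` output config. The function names, the polling interval, and the folder layout are all made up for illustration.

```python
import subprocess
import time
from pathlib import Path

# Assumption: these are the scan formats dropped into the watch folder.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}


def pending_images(watch_dir, out_dir):
    """Return images in watch_dir that have no matching PDF in out_dir yet."""
    watch_dir, out_dir = Path(watch_dir), Path(out_dir)
    return sorted(
        p for p in watch_dir.iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
        and not (out_dir / (p.stem + ".pdf")).exists()
    )


def ocr_to_pdf(image, out_dir):
    """Call the tesseract CLI; the trailing 'pdf' asks for a searchable PDF."""
    out_base = Path(out_dir) / Path(image).stem  # tesseract appends .pdf itself
    subprocess.run(["tesseract", str(image), str(out_base), "pdf"], check=True)


def watch(watch_dir, out_dir, convert=ocr_to_pdf, interval=5.0, max_cycles=None):
    """Poll watch_dir and convert each new image; max_cycles=None runs forever."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for image in pending_images(watch_dir, out_dir):
            convert(image, out_dir)
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(interval)
```

Something like `watch("/srv/scans/in", "/srv/scans/pdf")` would then run until interrupted. Polling is cruder than an inotify-based watcher, but it keeps the sketch dependency-free, and an image that fails to convert is simply retried on the next pass.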

It's presumably possible to get Cuneiform and ExactCode installed on an existing system, though my understanding is that Cuneiform is difficult to get working.

Otherwise, there's OCROpus, which I haven't used, but seems promising.

H_TeXMeX_H 10-01-2010 10:45 AM

I've used tesseract and ocrad in the past, and you can get decent quality out of them if the input quality is good. Also check unpaper:

It will help the OCR work better. Sometimes you can also help it by using image filters like white balance and auto-levels, etc.
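A rough sketch of how that clean-up could be chained in front of the OCR step is below. It assumes ImageMagick's `convert`, `unpaper`, and `tesseract` are all on the PATH; the helper names and the intermediate file names are invented for the example, and the filter choice (grayscale plus `-normalize` as a stand-in for auto-levels) is only one option.

```python
import subprocess
from pathlib import Path


def build_pipeline(image, work_dir):
    """Return the three commands to run, in order, for one scanned page."""
    image, work_dir = Path(image), Path(work_dir)
    gray = work_dir / (image.stem + ".gray.pgm")    # unpaper reads PNM files
    clean = work_dir / (image.stem + ".clean.pgm")
    text_base = work_dir / image.stem               # tesseract appends .txt
    return [
        # Stretch the contrast and write a grayscale PGM for unpaper.
        ["convert", str(image), "-colorspace", "Gray", "-normalize", str(gray)],
        # Let unpaper clean up the page with its default settings.
        ["unpaper", str(gray), str(clean)],
        # OCR the cleaned page to plain text.
        ["tesseract", str(clean), str(text_base)],
    ]


def run_pipeline(image, work_dir):
    """Run each step, stopping at the first failure; return the text path."""
    for cmd in build_pipeline(image, work_dir):
        subprocess.run(cmd, check=True)
    return Path(work_dir) / (Path(image).stem + ".txt")
```

Keeping the command list separate from the runner makes it easy to print the commands first and try them by hand on one page before batch-processing a stack of scans.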

I don't think you can get as good as, say, ABBYY, but it can be close if the input is good.

dgermann 10-12-2010 10:02 PM



I had not heard of unpaper before. I see it is in the repos for Ubuntu.

All of this stuff together still looks a little much for our production environment. We scan some pages every day, maybe only a dozen or two on most days, but then there are some days when we need to scan a hundred or so in an hour. It is important to be able to do reliable searches on the scanned documents.

So far, it sounds like having the scanner attached to a WinXP machine using ABBYY is still the easiest thing to have a non-technical person running: she merely feeds the paper in, chooses in the GUI whether to scan one side or two, then lets it rip. When all are scanned, she comes back to the ABBYY main screen and saves them to a file after the OCR does its work. Pretty simple, and it allows some turning of pages and rearranging the order of pages.

I think a CLI would blow a couple of my people away!

Oh well, another reason to keep at least one Windows box on the system for another year or two....

Thanks, H_TeXMeX_H!

qrange 10-13-2010 03:17 PM

I tried tesseract but it was a disappointment. It couldn't OCR a .png screenshot.
ABBYY FineReader is probably the best, but not free. They even charge by page!

dgermann 10-13-2010 10:11 PM


Thanks for that tip about tesseract, qrange!

Have never had a per-page charge from ABBYY, so not sure what you're experiencing. It is a really good program. Just wish it were available in Linux. Hopefully some day soon--they have an SDK for Linux.

:- Doug.

qrange 10-14-2010 02:17 AM


Well, I was talking about the Linux version (it exists!):

There's a trial version.

dgermann 10-14-2010 08:39 PM


OIC! Thanks!

It does appear to actually be an ABBYY site. But I agree it is pretty pricey. At 12,000 pages per year it prices out to 1.75 cents per page at current exchange rates.

That's a lot, particularly since you can buy it for Windows and have it forever for $400, which is about 2 years' cost of the Linux version.
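The arithmetic behind that comparison, as a quick check. The three figures are the ones quoted in this post; the variable names are just labels for the example.

```python
PAGES_PER_YEAR = 12_000       # yearly page volume quoted above
CENTS_PER_PAGE = 1.75         # per-page price at the exchange rate of the time
WINDOWS_LICENSE_USD = 400     # one-time price of the Windows version

# Yearly cost of the per-page Linux licence, converted from cents to dollars.
annual_cost_usd = PAGES_PER_YEAR * CENTS_PER_PAGE / 100

# How many years of per-page fees add up to the one-time Windows price.
break_even_years = WINDOWS_LICENSE_USD / annual_cost_usd

print(f"Linux version: ${annual_cost_usd:.2f} per year")
print(f"Windows licence pays for itself in {break_even_years:.1f} years")
```

That works out to $210 a year, so the $400 Windows licence equals a little under two years of the per-page pricing.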

Knowing that there is a Linux version gives me hope that there will be reasonable pricing and perhaps some other commercial products soon. And maybe a GUI Linux version!

Thanks, qrange, for pointing this out!

