Best OCR application?

dgermann · 08-04-2004, 09:16 PM

Hi--

I would like to do a lot of document imaging in my small office.

What is the best OCR app for use with scanned images in Linux?

Tinkster · 08-04-2004, 09:19 PM

I only know gocr and the front-end kooka ...
Not overly exciting if you've ever used omnipage
or recognita.

[edit]
Did a freshmeat.net search:
http://freshmeat.net/search/?q=%2BOC...y_percent_DESC
[/edit]

Cheers,
Tink

dgermann · 08-04-2004, 09:33 PM

Tink--

Thanks for doing that search! As I spend more time with Linux, I will know where to go to do those sorts of searches.

BTW, what is the reference to kiwis in your sig file?

Tinkster · 08-04-2004, 09:43 PM

New Zealanders are commonly referred to as Kiwis,
the Kiwi bird is their national symbol.

Cheers,
Tink

ilnli · 09-30-2010, 08:14 PM

Quote:

Originally Posted by dgermann

Tink--

Thanks for doing that search! As I spend more time with Linux, I will know where to go to do those sorts of searches.

BTW, what is the reference to kiwis in your sig file?

Use tesseract-ocr, or if you want an easy-to-use service, use this online optical character recognition tool.

dgermann · 09-30-2010, 09:35 PM

ilnli--

Thanks!

Glad you found this thread! I am still looking.

What I have done for the interim is to use a WinXP machine to support both the scanning and then the OCR work via ABBYY FineReader. That has been satisfactory for me, but it is about the only daily reason I have to have any Windows-based machine on the premises.

I'll have to check into the tesseract-ocr reliability currently. I cannot use the online service because of confidentiality needs. Do you use either?

:- Doug.

ilnli · 10-01-2010, 05:23 AM

I've used tesseract-ocr, which is good. It runs on Linux, but you have to tweak your images to get good results from it. The stuff I work on is confidential, and I have mainly used ocrconvert.com, which works for me.

pwc101 · 10-01-2010, 05:46 AM

The best out-of-the-box solution I've found is WatchOCR. It's a liveCD distro whose sole purpose is OCR. You put your images in a watch directory, and then a little script converts them into searchable PDFs. With some tweaking, it ought to be possible to save the text as well as the searchable PDF. For OCR it uses Cuneiform, and layout analysis is done with ExactCode.
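A rough sketch of the watch-directory idea, not WatchOCR's actual script. It assumes Tesseract stands in for Cuneiform and that the installed build supports searchable PDF output (the `pdf` output option, added in releases newer than the ones discussed in this thread). The function names, the suffix list, and the polling interval are all made up for illustration.

```python
import subprocess
import time
from pathlib import Path

# Assumed set of scanner output formats; adjust to what your scanner writes.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}


def tesseract_command(image: Path, out_dir: Path) -> list:
    """Build a Tesseract call that writes <out_dir>/<image stem>.pdf."""
    # `tesseract IMAGE OUTPUTBASE pdf` requests a searchable PDF; this is an
    # assumption about the installed version (3.03 or later).
    return ["tesseract", str(image), str(out_dir / image.stem), "pdf"]


def pending_images(watch_dir: Path, out_dir: Path) -> list:
    """Return images in watch_dir that have no matching PDF in out_dir yet."""
    return sorted(
        p for p in watch_dir.iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
        and not (out_dir / (p.stem + ".pdf")).exists()
    )


def watch(watch_dir: Path, out_dir: Path, interval: float = 5.0) -> None:
    """Poll watch_dir forever, converting each new image once."""
    out_dir.mkdir(parents=True, exist_ok=True)
    while True:
        for image in pending_images(watch_dir, out_dir):
            # check=True stops the loop on the first failed conversion.
            subprocess.run(tesseract_command(image, out_dir), check=True)
        time.sleep(interval)
```

Calling `watch(Path("scans/in"), Path("scans/out"))` would start the loop. A production version would also need to wait until the scanner has finished writing a file before picking it up, which this sketch does not do.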

It's presumably possible to get Cuneiform and ExactCode installed on an existing system, though my understanding is that Cuneiform is difficult to get working.

Otherwise, there's OCRopus, which I haven't used, but it seems promising.

H_TeXMeX_H · 10-01-2010, 09:45 AM

I've used tesseract and ocrad in the past, and you can get decent quality out of them if the input quality is good. Also check unpaper:
http://unpaper.berlios.de/

It will help the OCR work better. Sometimes you can also help it by using image filters like white balance and auto-levels, etc.
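To illustrate what an auto-levels filter does, here is a minimal pure-Python sketch of a linear contrast stretch on grayscale values. It is a simplification under stated assumptions: real image tools usually work on whole image files and often clip a small percentage of outlier pixels first, and the `auto_levels` name and the sample values are made up.

```python
def auto_levels(pixels, low=0, high=255):
    """Linearly stretch grayscale values so the darkest maps to `low`
    and the brightest maps to `high`."""
    lo, hi = min(pixels), max(pixels)
    if lo == hi:
        # A flat image has no contrast to stretch; return it unchanged.
        return list(pixels)
    span = hi - lo
    # Integer arithmetic keeps the results valid 8-bit pixel values.
    return [low + (p - lo) * (high - low) // span for p in pixels]


# Made-up values from a faded scan: nothing truly black or white.
faded_scan = [90, 120, 150, 180, 210]
print(auto_levels(faded_scan))
```

After the stretch, the darkest ink is pure black and the paper is pure white, which gives the OCR engine a cleaner separation between text and background.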

I don't think you can get results as good as, say, ABBYY, but it can be close if the input is good.

dgermann · 10-12-2010, 09:02 PM

H_TeXMeX_H--

Thanks!

I had not heard of unpaper before. I see it is in the repos for Ubuntu.

All of this stuff together still looks a little much for our production environment. We scan some pages every day, maybe only a dozen or two on most days, but then there are some days when we need to scan a hundred or so in an hour. It is important to be able to do reliable searches on the scanned documents.

So far, it sounds like having the scanner attached to a WinXP machine using ABBYY is still the easiest thing to have a nontechnical person running: she merely feeds the paper in, chooses in the GUI whether to scan one side or two, then lets it rip. When all are scanned, she comes back to the ABBYY main screen and saves them to a file after the OCR does its work. Pretty simple, and it allows some turning of pages and rearranging the order of pages.

I think CLI would blow a couple of my people away!

Oh well, another reason to keep at least one Windows box on the system for another year or two....

Thanks, H_TeXMeX_H!

qrange · 10-13-2010, 02:17 PM

I tried tesseract but it was a disappointment. It couldn't OCR a .png screenshot.
ABBYY FineReader is probably the best, but not free. They even charge by page!

dgermann · 10-13-2010, 09:11 PM

qrange--

Thanks for that tip about tesseract, qrange!

Have never had a per-page charge from ABBYY, so not sure what you're experiencing. It is a really good program. Just wish it were available in Linux. Hopefully some day soon--they have an SDK for Linux.

:- Doug.

qrange · 10-14-2010, 01:17 AM

@dgermann

well I was talking about Linux version (it exists!):
http://www.ocr4linux.com/en

ricing

there's trial version.

dgermann · 10-14-2010, 07:39 PM

qrange--

OIC! Thanks!

It does appear to actually be an ABBYY site. But I agree it is pretty pricey. At 12,000 pages per year it prices out to 1.75 cents per page at current exchange rates.

That's a lot particularly since you can buy it for Windows and have it forever, for $400--about 2 years' cost of the Linux version.
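The arithmetic behind that comparison, as a quick check. It uses only the figures quoted in this post (12,000 pages per year, 1.75 cents per page, $400 for the Windows version); the variable names are made up.

```python
pages_per_year = 12_000      # yearly volume quoted in the post
cents_per_page = 1.75        # per-page figure quoted in the post
windows_price_usd = 400      # one-time Windows price quoted in the post

# Yearly cost of the per-page Linux pricing, in dollars.
linux_cost_per_year_usd = pages_per_year * cents_per_page / 100

# How many years of Linux fees add up to the one-time Windows price.
break_even_years = windows_price_usd / linux_cost_per_year_usd

print(f"Linux version: ${linux_cost_per_year_usd:.2f} per year")
print(f"Windows price equals {break_even_years:.1f} years of Linux fees")
```

That works out to $210 per year, so the $400 Windows purchase matches a little under two years of the Linux pricing.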

Knowing that there is a Linux version gives me hope that there will be reasonable pricing and perhaps some other commercial products soon. And maybe a GUI Linux version!

Thanks, qrange, for pointing this out!