OCR in Linux, unsatisfactory results

yermandu · 01-27-2010, 10:33 AM

Hello guys,

I have tested several programs for doing OCR with my HP printer. Unfortunately, the software that comes with it is only available for Mac OS and Windows, and none of the programs I installed on Linux worked well.
In my search I found that Tesseract is supposed to be the best OCR application for Linux.

However I found two problems:

It does not have a GUI. It can be driven from the command line, but that is very tedious when you want to scan several pages.
The results were very unsatisfactory, at least in my language, Portuguese: in a text of 1000 words it recognized only two or three, which is very little. I tested with several letters, magazines, books and folders.

So, I would like help installing and using an OCR program on Linux.

I have an HP Photosmart C4480.

H_TeXMeX_H · 01-27-2010, 01:37 PM

Well, there are many GUI programs that do OCR, but I'm uncertain about their state of development; they may be alpha:
http://freshmeat.net/search?q=ocr&submit=Search
http://sourceforge.net/search/?type_...soft&words=ocr

I usually use ocrad, and it actually produces decent results, at least for English. You should probably apply some image filters first, and maybe run something like unpaper on the scan before you run ocrad on it, or the output will not be as good. The better the input image, the better the OCR output.
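To make that order of steps concrete, here is a rough sketch in Python. It assumes unpaper and ocrad are both installed and on the PATH, and that the scan is already a PNM file (for example .pgm). The function names and the "-clean" suffix are made up for illustration, and the default unpaper settings may not suit every scan.

```python
import subprocess
from pathlib import Path


def preprocess_and_ocr_cmds(scan):
    """Build the two commands: unpaper cleanup first, then ocrad on the result.

    Hypothetical helper: the "-clean" suffix and the .txt output name are
    arbitrary choices made for this sketch.
    """
    src = Path(scan)
    cleaned = src.with_name(src.stem + "-clean" + src.suffix)
    text = src.with_suffix(".txt")
    return [
        # unpaper takes an input file and an output file
        ["unpaper", str(src), str(cleaned)],
        # ocrad writes the recognized text to the file given with -o
        ["ocrad", "-o", str(text), str(cleaned)],
    ]


def run_pipeline(scan):
    """Run both steps in order, stopping at the first one that fails."""
    for cmd in preprocess_and_ocr_cmds(scan):
        subprocess.run(cmd, check=True)
```

Calling run_pipeline("page1.pgm") would leave the cleaned image in page1-clean.pgm and the text in page1.txt. Any extra filtering (contrast, thresholding) would go in front of the unpaper step.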

It's true I haven't seen a truly professional OCR for Linux, but try some out; maybe there is one out there that you find acceptable.

TB0ne · 01-27-2010, 01:38 PM

You can install GOCR, as it has a GUI, but Tesseract is much more accurate, provided you use it correctly. A quick Google search turns up:

http://www.linux.com/archive/feature/138511

which has examples, instructions, and basic shell scripts to 'automate' OCR of a bunch of pages. Note that if you don't load the right language (German, English, etc.), accuracy is always going to be bad.
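As a rough idea of what such automation can look like (this is not the script from the linked article), here is a minimal Python sketch. It assumes the tesseract binary is on the PATH, that the Portuguese language data ("por") is installed, and that the scanned pages are TIFF files in one directory. The function names and the "*.tif" pattern are only illustrative.

```python
import subprocess
from pathlib import Path


def tesseract_cmd(image, lang="por"):
    """Build a tesseract call: tesseract <image> <output base> -l <lang>.

    tesseract appends .txt to the output base on its own, so the base is
    simply the image path without its extension.
    """
    image = Path(image)
    out_base = image.with_suffix("")
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_directory(directory, pattern="*.tif", lang="por"):
    """OCR every matching page in a directory, in sorted file-name order.

    Returns the list of text files tesseract is expected to have written.
    """
    outputs = []
    for image in sorted(Path(directory).glob(pattern)):
        subprocess.run(tesseract_cmd(image, lang), check=True)
        outputs.append(image.with_suffix(".txt"))
    return outputs
```

For example, ocr_directory("scans") would process scans/page1.tif, scans/page2.tif and so on with the Portuguese data. Passing lang="eng" or lang="deu" switches the language, which is the part that matters most for accuracy.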