Does OCR work in practice, kooka ocrad?

lugoteehalt · 02-02-2009, 11:08 AM

Just started using ocrad to do optical character recognition on scanned images inside kooka. To put it mildly, it does not work well: a clean letter (i.e. one that came through the post) came out as gibberish. Very occasionally it nearly gets a word right.

I gather it is possible to use ocrad with some success. Could someone give me a starting point?

H_TeXMeX_H · 02-02-2009, 02:45 PM

Well, I have to admit that it's not that good of an OCR. But you can make it significantly better if you do a bit of image filtering / enhancing with GIMP or ImageMagick first. I did some tests a while ago, and doing some basic color balancing, white balancing, and a few other filters can make a big difference. Another useful program is:
http://freshmeat.net/projects/unpaper/
You should also try messing around with the ocrad command line options to fine-tune results.

In the end you can get quite decent results using these methods. Experiment and see what works best.
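
As a rough sketch of the kind of preprocessing meant here, assuming ImageMagick's convert and ocrad are installed: the file names are placeholders, and the level, scale and threshold values are only starting guesses to tune per scan, not settings taken from the tests mentioned above.

```shell
#!/bin/sh
# Sketch: clean up a scan before handing it to ocrad.
# scan.png and the output names are placeholders, not files from this thread.
in=scan.png
clean=scan-clean.pnm
text=scan.txt

if [ -f "$in" ] && command -v convert >/dev/null 2>&1 && command -v ocrad >/dev/null 2>&1; then
    # Convert to greyscale, stretch the contrast, then push near-white
    # to white and near-black to black. 10%,90% is a guess to adjust.
    convert "$in" -colorspace Gray -normalize -level 10%,90% "$clean"

    # --scale and --threshold are two ocrad options worth experimenting with.
    ocrad --scale=2 --threshold=50% -o "$text" "$clean"
else
    echo "need $in plus ImageMagick (convert) and ocrad" >&2
fi
```

unpaper can be slotted in between the two steps in the same way, since it also reads and writes PNM files.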

lugoteehalt · 02-03-2009, 10:11 AM

Discovered that most of the problem was kooka, the K graphical scanner thing. I think it is the most buggy program I have ever come across - it has more bugs than my grandmother's pubes, as they say in Newcastle.

So I dropped kooka and used the command line. Scanned at a resolution of 600, converted to .pnm (this seems essential) and results improved dramatically. A clean letter is almost without error. I'll try unpaper.
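
For anyone wanting to repeat this, a minimal sketch of a command-line run, assuming SANE's scanimage is the scanning tool (the exact commands used were not recorded here). File names are placeholders, and the --mode value depends on the scanner backend.

```shell
#!/bin/sh
# Sketch of a command-line scan-and-OCR run. File names are placeholders.
scan=letter.pnm
text=letter.txt

if command -v scanimage >/dev/null 2>&1 && command -v ocrad >/dev/null 2>&1; then
    # scanimage writes PNM by default, so no separate conversion is needed.
    # Option names such as --mode and the value "Gray" vary by backend;
    # check `scanimage --help` for the options your scanner offers.
    scanimage --resolution 600 --mode Gray > "$scan" && ocrad -o "$text" "$scan"
else
    echo "scanimage (SANE) and ocrad are both needed" >&2
fi
```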

(Incidentally is it the most effective bit of propaganda ever that anarchy = no law?)

Thanks.

H_TeXMeX_H · 02-03-2009, 10:44 AM

I've never used kooka, and recently I finally got rid of all KDE programs and found replacements for them. I did this because many were either very buggy, very bloated and slow, or very annoying (they would add themselves to the taskbar without my permission and give me pop-ups and sounds that I didn't want, yuck).

[offtopic]
(Incidentally is it the most effective bit of propaganda ever that anarchy = no law?)
Indeed it is, that's one of the reasons I found that definition, the only definition in any dictionary (that I know of) that comes close to being accurate. Anarchy is not really about laws or chaos or destruction, but instead about government and especially coercive, hierarchical, bureaucratic, mechanistic, and corrupt government. The more power the more corruption.

farslayer · 02-03-2009, 11:08 AM

Might want to look at other OCR options as well. I've always heard that Tesseract is the best OCR for Linux. Tesseract was originally written by HP, but is now open source and one of the Google Code projects, available under the Apache license.

Quote:

http://www.mscs.dal.ca/~selinger/ocr-test/
* Tesseract gives extremely good output at a reasonable speed. It is the clear overall winner of the test.
* Ocrad gives reasonable output at extremely high speed. It can be useful in applications where speed is more important than accuracy.
* GOCR gives poor output at a slow speed.

Quote:

http://groundstate.ca/ocr
The combination of Tesseract and Ocropus is clearly the project we can most rely on to provide the missing elements of a full-featured Free OCR suite.

http://www.linuxjournal.com/article/9676
http://www.linux-archive.org/debian-...ocr-works.html
http://www.linux.com/articles/57222

Just figured if you were going to put effort into working with Linux OCR, you might want to check out what is reported as one of the more accurate programs.
http://code.google.com/p/tesseract-ocr/

tesseract-ocr is in the Debian repositories.

H_TeXMeX_H · 02-03-2009, 12:00 PM

Thanks, didn't know about Tesseract, must try it ...

lugoteehalt · 02-04-2009, 11:52 AM

I'll try tesseract again. Probably doing something wrong before. Thanks.

Tried it and got better results than ocrad.

For reference:

The way I do it: In GIMP; File, Acquire, XSane, <scanner name>
In xsane; select the correct area and clean it up a bit with the eyedroppers. Use defaults except set to grey.
In GIMP, clean it up some more if necessary (Colours -> Levels is useful), get rid of logos, and save as image.tif (note only one f).
Then:
$ tesseract image.tif image

It saves a file called image.txt. This has few errors but no layout.

H_TeXMeX_H · 02-04-2009, 12:12 PM

Just to follow up, I did manage to install tesseract and it works quite well. It only seems to work on TIFF, but that's not too much of an issue.
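
If the scans are in another format, a small loop can do the TIFF conversion first. This is only a sketch, assuming ImageMagick's convert is available and that the scans are PNG files in the current directory. The -compress none option is there on the assumption that the Tesseract build in use cannot read compressed TIFF.

```shell
#!/bin/sh
# Sketch: convert each PNG scan to TIFF, then OCR it with tesseract.
for img in *.png; do
    [ -f "$img" ] || continue    # skip the literal pattern when no PNGs exist
    base=${img%.png}             # page1.png -> page1

    # Greyscale, uncompressed TIFF (see the assumption above).
    convert "$img" -colorspace Gray -compress none "$base.tif"

    # tesseract appends .txt itself, so this writes $base.txt
    tesseract "$base.tif" "$base"
done
```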