LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 02-02-2009, 11:08 AM   #1
lugoteehalt
Senior Member
 
Registered: Sep 2003
Location: UK
Distribution: Debian
Posts: 1,215
Blog Entries: 2

Rep: Reputation: 49
Does OCR work in practice, kooka ocrad?


Just started using ocrad to run optical character recognition on scanned images inside kooka. To put it mildly, it does not work well: a clean letter (i.e. one that came through the post) came out as gibberish. Very occasionally it nearly gets a word right.

I gather it is possible to use ocrad with some success. Could someone give me a starting point?
 
Old 02-02-2009, 02:45 PM   #2
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Well, I have to admit that ocrad is not that good an OCR program. But you can make it significantly better if you do a bit of image filtering / enhancing with GIMP or ImageMagick first. I did some tests a while ago, and basic color balancing, white balancing, and a few other filters can make a big difference. Another useful program is:
http://freshmeat.net/projects/unpaper/
You should also try experimenting with the ocrad command-line options to fine-tune the results.

In the end you can get quite decent results using these methods. Experiment and see what works best.
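
As a rough sketch of the kind of pre-filtering described above, something like the following could be tried. It assumes ImageMagick's `convert` (called `magick` in newer releases) and `ocrad` are installed. The function names, the file names, and the `--scale=2` value are illustrative assumptions, not settings tested in this thread.

```shell
#!/bin/sh
# Sketch only: clean up a scan with ImageMagick, then run ocrad on it.
# clean_scan and ocr_scan are hypothetical helper names.

# clean_scan INPUT OUTPUT.pgm
# Convert to greyscale and stretch the contrast so the text stands out.
clean_scan() {
    convert "$1" -colorspace Gray -normalize "$2"
}

# ocr_scan INPUT.pgm OUTPUT.txt
# --scale=2 enlarges the image before recognition (an assumed starting
# value, worth varying); -o names the text file to write.
ocr_scan() {
    ocrad --scale=2 -o "$2" "$1"
}
```

Usage would be along the lines of `clean_scan letter.png letter.pgm && ocr_scan letter.pgm letter.txt`, then comparing the text output while changing one setting at a time.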
 
Old 02-03-2009, 10:11 AM   #3
lugoteehalt
Senior Member
 
Registered: Sep 2003
Location: UK
Distribution: Debian
Posts: 1,215

Original Poster
Blog Entries: 2

Rep: Reputation: 49
Discovered that most of the problem was kooka, the KDE graphical scanning program. I think it is the buggiest program I have ever come across - it has more bugs than my grandmother's pubes, as they say in Newcastle.

So I dropped kooka and used the command line. Scanned at a resolution of 600 dpi, converted to .pnm (this seems essential), and results improved dramatically. A clean letter is almost without error. I'll try unpaper.

(Incidentally is it the most effective bit of propaganda ever that anarchy = no law?)

Thanks.
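
For anyone wanting to repeat the command-line route, a minimal sketch follows. It assumes SANE's `scanimage`, ImageMagick's `convert`, and `ocrad` are installed. The function names are hypothetical, and the `--mode Gray` value is an assumption: the accepted mode names depend on the scanner backend.

```shell
#!/bin/sh
# Sketch only: scan at 600 dpi straight to PNM, or convert an existing
# image to PNM, then hand the PNM file to ocrad.

# scan_page OUTPUT.pnm
# scanimage writes the image to standard output, so redirect it to a file.
scan_page() {
    scanimage --resolution 600 --mode Gray --format=pnm > "$1"
}

# to_pnm INPUT OUTPUT.pnm
# For scans already saved in another format; ImageMagick picks the
# output format from the .pnm extension.
to_pnm() {
    convert "$1" "$2"
}

# ocr_page INPUT.pnm OUTPUT.txt
ocr_page() {
    ocrad -o "$2" "$1"
}
```

For example: `scan_page letter.pnm && ocr_page letter.pnm letter.txt`.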
 
Old 02-03-2009, 10:44 AM   #4
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
I've never used kooka, and recently I have finally gotten rid of all KDE programs and found replacements for them. I did this because many were either very buggy or very bloated and slow or very annoying (they would add themselves to the taskbar without my permission, would give me pop-ups and sounds that I didn't want, yuck).

[offtopic]
Quote:
(Incidentally is it the most effective bit of propaganda ever that anarchy = no law?)
Indeed it is; that's one of the reasons I found that definition, the only definition in any dictionary (that I know of) that comes close to being accurate. Anarchy is not really about laws or chaos or destruction, but instead about government, and especially coercive, hierarchical, bureaucratic, mechanistic, and corrupt government. The more power, the more corruption.
 
Old 02-03-2009, 11:08 AM   #5
farslayer
LQ Guru
 
Registered: Oct 2005
Location: Northeast Ohio
Distribution: linuxdebian
Posts: 7,249
Blog Entries: 5

Rep: Reputation: 191
Might want to look at other OCR options as well. I've always heard that Tesseract is the best OCR for Linux. Tesseract was originally written by HP, but is now open source and one of the Google Code projects, available under the Apache license.

Quote:
http://www.mscs.dal.ca/~selinger/ocr-test/
* Tesseract gives extremely good output at a reasonable speed. It is the clear overall winner of the test.
* Ocrad gives reasonable output at extremely high speed. It can be useful in applications where speed is more important than accuracy.
* GOCR gives poor output at a slow speed.
Quote:
http://groundstate.ca/ocr
The combination of Tesseract and Ocropus is clearly the project we can most rely on to provide the missing elements of a full-featured Free OCR suite.
http://www.linuxjournal.com/article/9676
http://www.linux-archive.org/debian-...ocr-works.html
http://www.linux.com/articles/57222


Just figured if you were going to put effort into working with Linux OCR, you might want to check out what is reported as one of the more accurate programs.
http://code.google.com/p/tesseract-ocr/

tesseract-ocr is in the Debian repositories.
 
Old 02-03-2009, 12:00 PM   #6
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Thanks, didn't know about Tesseract, must try it ...
 
Old 02-04-2009, 11:52 AM   #7
lugoteehalt
Senior Member
 
Registered: Sep 2003
Location: UK
Distribution: Debian
Posts: 1,215

Original Poster
Blog Entries: 2

Rep: Reputation: 49
I'll try tesseract again. Probably doing something wrong before. Thanks.

Tried it and got better results than ocrad.

For reference:

The way I do it: in GIMP, File -> Acquire -> XSane -> <scanner name>
In xsane, select the part of the page you want and clean it up a bit with the eyedroppers. Use the defaults, except set the mode to grey.
In GIMP, clean it up some more if necessary (Colours -> Levels is useful), get rid of logos, and save as image.tif (note only one f).
Then:
$ tesseract image.tif image

This saves a file called image.txt, which has few errors but keeps none of the layout.
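
With several pages to process, the same command can be wrapped in a loop. This is only a sketch: `ocr_dir` is a hypothetical helper name, and it assumes every page has already been saved as a .tif file in one directory.

```shell
#!/bin/sh
# Sketch only: run tesseract on every .tif file in a directory.

# ocr_dir DIR
# tesseract appends .txt to the output base name, so page1.tif
# produces page1.txt in the same directory.
ocr_dir() {
    for img in "$1"/*.tif; do
        [ -e "$img" ] || continue   # skip if no .tif files matched
        tesseract "$img" "${img%.tif}"
    done
}
```

For example, `ocr_dir ~/scans` would leave one .txt file next to each .tif file.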

Last edited by lugoteehalt; 02-07-2009 at 04:47 AM.
 
Old 02-04-2009, 12:12 PM   #8
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Just to follow up, I did manage to install tesseract and it works quite well. It only seems to work on TIFF files, but that's not too much of an issue.
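
If the scan is in some other format, one way around the TIFF-only limitation is to convert it first. A minimal sketch, assuming ImageMagick's `convert` is installed; `ocr_any` is a hypothetical helper name, and the .tif spelling follows the note above about using only one f.

```shell
#!/bin/sh
# Sketch only: convert an image to TIFF if necessary, then run tesseract.

# ocr_any IMAGE
# Writes BASENAME.tif (when a conversion is needed) and BASENAME.txt
# next to the input file.
ocr_any() {
    base="${1%.*}"                      # file name without its extension
    case "$1" in
        *.tif) ;;                       # already TIFF, nothing to convert
        *) convert "$1" "$base.tif" ;;
    esac
    tesseract "$base.tif" "$base"
}
```

For example, `ocr_any page.png` would create page.tif and then page.txt.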
 
  


All times are GMT -5.
