Just started using ocrad to do optical character recognition on scanned images inside kooka. To put it mildly, it does not work well: a clean letter (i.e. one that came through the post) came out as gibberish. Very occasionally it nearly gets a word right.
I gather it is possible to use ocrad with some success. Can someone give me a starting point?
Well, I have to admit that it's not that good an OCR program. But you can make it significantly better if you do a bit of image filtering and enhancing with GIMP or ImageMagick first. I did some tests a while ago, and basic color balancing, white balancing, and a few other filters can make a big difference. Another useful program is unpaper: http://freshmeat.net/projects/unpaper/
You should also try messing around with the ocrad command line options to fine-tune the results.
In the end you can get quite decent results using these methods. Experiment and see what works best.
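A rough sketch of that kind of preprocessing, in case it helps as a starting point. It assumes ImageMagick's convert and ocrad are installed; the file names and the choice of filters (greyscale plus contrast stretch) are only placeholders to experiment with, not settings anyone in this thread tested.

```shell
#!/bin/sh
# Sketch: clean up a scan with ImageMagick, then feed it to ocrad.
# scan.png and the filter choices are assumptions; adjust to taste.

# Name of the cleaned-up file for a given input: scan.png -> scan-clean.pnm
clean_name() {
    printf '%s-clean.pnm\n' "${1%.*}"
}

# Convert to greyscale, stretch the contrast, and write a PNM file.
# Prints the name of the file it wrote.
preprocess() {
    src=$1
    out=$(clean_name "$src")
    convert "$src" -colorspace Gray -normalize "$out" && printf '%s\n' "$out"
}

# Usage (needs ImageMagick and ocrad):
#   cleaned=$(preprocess scan.png)
#   ocrad -o scan.txt "$cleaned"
```

See `ocrad --help` for the other options worth trying once the image itself is clean.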
Discovered that most of the problem was kooka, the KDE graphical scanner thing. I think it is the most buggy program I have ever come across - it has more bugs than my grandmother's pubes, as they say in Newcastle.
So I dropped kooka and used the command line. Scanned at a resolution of 600, converted to .pnm (this seems essential) and the results improved dramatically. A clean letter comes out almost without error. I'll try unpaper.
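For anyone wanting to try the same route, here is a minimal sketch. The post does not say which scanning tool was used, so scanimage (from SANE) is an assumption, as are the file names. The --resolution and --mode options depend on the scanner backend, so check `scanimage --help` for what your device accepts.

```shell
#!/bin/sh
# Sketch: scan from the command line and OCR the result with ocrad.
# Assumes scanimage (SANE); the thread does not name the tool used.

# Text file name for a given scan: letter.pnm -> letter.txt
txt_name() {
    printf '%s.txt\n' "${1%.pnm}"
}

# Scan at resolution 600 in greyscale, then run ocrad on the result.
# scanimage writes PNM to standard output by default, so no separate
# conversion step is needed here.
scan_and_ocr() {
    pnm=$1
    scanimage --resolution 600 --mode Gray > "$pnm" || return 1
    ocrad -o "$(txt_name "$pnm")" "$pnm"
}

# Usage: scan_and_ocr letter.pnm
```

If the scan is already saved in another format, `convert letter.tif letter.pnm` (ImageMagick) should produce a PNM that ocrad can read.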
(Incidentally is it the most effective bit of propaganda ever that anarchy = no law?)
I've never used kooka, and I have recently gotten rid of all my KDE programs and found replacements for them. I did this because many were either very buggy, very bloated and slow, or very annoying (they would add themselves to the taskbar without my permission and give me pop-ups and sounds that I didn't want, yuck).
[offtopic]
(Incidentally is it the most effective bit of propaganda ever that anarchy = no law?)
Indeed it is, that's one of the reasons I found that definition, the only definition in any dictionary (that I know of) that comes close to being accurate. Anarchy is not really about laws or chaos or destruction, but instead about government and especially coercive, hierarchical, bureaucratic, mechanistic, and corrupt government. The more power the more corruption.
Might want to look at other OCR options as well. I've always heard that Tesseract is the best OCR for Linux. Tesseract was originally written by HP, but is now open source and one of the Google Code projects available under the Apache license.
Quote:
http://www.mscs.dal.ca/~selinger/ocr-test/
* Tesseract gives extremely good output at a reasonable speed. It is the clear overall winner of the test.
* Ocrad gives reasonable output at extremely high speed. It can be useful in applications where speed is more important than accuracy.
* GOCR gives poor output at a slow speed.
Quote:
http://groundstate.ca/ocr
The combination of Tesseract and Ocropus is clearly the project we can most rely on to provide the missing elements of a full-featured Free OCR suite.
Just figured that if you were going to put effort into working with Linux OCR, you might want to check out what is reported as one of the more accurate programs. http://code.google.com/p/tesseract-ocr/
The way I do it: In Gimp: File, Acquire, xsane, <scanner name>
In xsane: select the part of the page you want and clean it up a bit with the eyedroppers. Use the defaults, except set the mode to grey.
In Gimp: clean it up some more if necessary (Colours -> Levels is useful), get rid of logos, and save as image.tif (note only one f).
Then:
$ tesseract image.tif image
It saves a file called image.txt. This has few errors but no layout.
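If there are several pages to do, the same command can be looped over every .tif in a directory. This is only a sketch: the helper names are made up for illustration, and it assumes each page was saved as a .tif as described above.

```shell
#!/bin/sh
# Sketch: run tesseract over every .tif file in a directory.
# out_base and ocr_all are hypothetical helper names.

# tesseract takes an output base name and appends .txt itself,
# so strip the .tif extension: image.tif -> image
out_base() {
    printf '%s\n' "${1%.tif}"
}

# OCR each .tif in the given directory (default: current directory).
ocr_all() {
    dir=${1:-.}
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue   # nothing matched the glob
        tesseract "$img" "$(out_base "$img")"
    done
}

# Usage: ocr_all ~/scans
```

Each page ends up as its own .txt file next to the image, e.g. page1.tif gives page1.txt.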
Last edited by lugoteehalt; 02-07-2009 at 04:47 AM.