OCRAD returns gibberish EVERY time

KimVette · 07-17-2005, 12:08 AM

I am running into issues with ocrad and kooka. No matter what image I feed it, regardless of format, typeface, DPI, color depth, etc., all ocrad returns for text is gibberish. Is there a howto out there that explains ocrad troubleshooting?

I am currently at:

ocrad 0.11
kooka 0.44
KDE 3.4.1

I have never been successful when trying to get any OCR package to work correctly under Linux. I've searched both Google and Yahoo for info on ocrad (searched for ocrad howto, ocrad how to, and ocrad troubleshooting) and all I can find are download sites for ocrad and doorway pages (search engine spam).

Can anyone point me in the right direction? Is there any sort of "benchmark" image used to calibrate ocrad so we can compare results?

KimVette · 07-17-2005, 05:28 PM

I tried gocr as well after posting this and got the same results. I am thinking of trying OmniPage under wine, if it will run.

iamjiwjr · 08-17-2005, 05:32 PM

I gave up on Kooka and went to Vuescan. It works perfectly for me. It wasn't free, but it is very good.

OCR is unformatted text only, but accurate.

Good luck.

aikempshall · 08-23-2005, 06:32 AM

See my reply to

http://www.linuxquestions.org/questi...87#post1814987

Regards

KimVette · 08-23-2005, 08:36 PM

Hmm so that is my problem. How do I fix it?

aikempshall · 08-24-2005, 08:42 AM

I assume you mean the UTF-8 / Suse issue?

KimVette · 08-24-2005, 08:46 PM

aikempshall · 08-25-2005, 05:03 AM

If I type locale at the command line I get -

LANG=en_GB.iso88591
LC_CTYPE="en_GB"
LC_NUMERIC="en_GB"
LC_TIME="en_GB"
LC_COLLATE="en_GB"
LC_MONETARY="en_GB"
LC_MESSAGES="en_GB"
LC_PAPER="en_GB"
LC_NAME="en_GB"
LC_ADDRESS="en_GB"
LC_TELEPHONE="en_GB"
LC_MEASUREMENT="en_GB"
LC_IDENTIFICATION="en_GB"
LC_ALL=en_GB

What response do you get?

Also at command line ocrad --charset=help

What response?

KimVette · 08-25-2005, 09:04 PM

Code:

kim@kimp4:~> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
kim@kimp4:~> ocrad --charset=help
Valid charset names are:  ascii  iso-8859-9  iso-8859-15
kim@kimp4:~>

Based on that, I presume I have to switch the system to iso-8859-15? Doing a quick search (man -k), I see that this is defined in /etc/sysconfig/language and the setting I need to change is:

Code:

RC_LANG="en_US.UTF-8"

to iso-8859-15

However, then it wouldn't be unicode, right?

Is this correct?

aikempshall · 08-26-2005, 04:03 AM

Correct - that would be my interpretation also. If ocrad's output is iso-8859 by design, there may be a tool that will convert the output into UTF-8.
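
For what it's worth, iconv is one tool that does this kind of conversion. A minimal sketch, assuming ocrad's byte output really is ISO-8859-15 (one of the charsets it lists) and that iconv is installed; the function name to_utf8 is only illustrative:

```shell
# Convert ISO-8859-15 text on stdin to UTF-8 on stdout.
# Assumption: the input is ISO-8859-15; change -f if ocrad used another charset.
to_utf8() {
    iconv -f ISO-8859-15 -t UTF-8
}

# Example: byte 0xE9 (octal 351) is "e with acute accent" in ISO-8859-15;
# after conversion it becomes the two-byte UTF-8 sequence 0xC3 0xA9.
printf 'caf\351\n' | to_utf8
```

With that in place, something like `ocrad page.pbm | to_utf8 > page.txt` would be the idea, where page.pbm is a placeholder file name.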

KimVette · 08-26-2005, 09:11 PM

Thanks, aikempshall

KimVette · 08-26-2005, 09:46 PM

still getting:

Code:

kim@kimp4:~> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
kim@kimp4:~>

after changing /etc/sysconfig/language... and of course still gibberish from gocr (and ocrad).

Trying RC_LANG="iso-8859-15" next.

KimVette · 08-26-2005, 09:57 PM

Changed it again (this time to en_US.ISO-8859-1) and rebooted (again!); no dice.

Code:

kim@kimp4:/etc/sysconfig> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

aikempshall · 08-27-2005, 06:34 AM

I think what you need to do, on a temporary basis, is set LC_ALL to the value you want, export LC_ALL, and then run locale.

When I did that for en_US.UTF-8, the change was reflected in locale but made no difference to the output of ocrad!

ocrad -help

returns

GNU Ocrad - Optical Character Recognition program.
Reads pbm file(s), or standard input, and sends text to standard output.

Usage: ocrad [options] [files]
Options:
  -h, --help               display this help and exit
  -V, --version            output version information and exit
  -a, --append             append text to output file
  -b, --block=<n>          process only the specified text block
  -c, --charset=<name>     try `--charset=help' for a list of names
  -f, --force              force overwrite of output file
  -F, --format=<fmt>       output format (byte, utf8)
  -i, --invert             invert image levels (white on black)
  -l, --layout=<n>         layout analysis, 0=none, 1=column, 2=full
  -o <file>                place the output into <file>
  -s, --scale=[-]<n>       scale input image by [1/]<n>
  -t, --transform=<name>   try `--transform=help' for a list of names
  -v, --verbose            be verbose
  -x <file>                export OCR Results File to <file>

The -F flag suggests ocrad can produce utf8 output.

Have you tried ocrad at the command line?
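
A command-line test could look roughly like the sketch below. It sets LC_ALL for a single command only (LC_ALL overrides LANG and the other LC_* variables), so nothing in /etc/sysconfig has to change and no reboot is needed. The helper name with_locale and the file name page.pbm are made up for illustration, and the -F utf8 and -o options are taken from the help text above, not tested against every ocrad version:

```shell
# Run one command with LC_ALL set for that command only,
# leaving the rest of the shell session untouched.
with_locale() {
    loc=$1
    shift
    LC_ALL=$loc "$@"
}

# Check that the override is visible to the child process.
with_locale C sh -c 'echo "LC_ALL is $LC_ALL"'

# Hypothetical OCR run (page.pbm is a placeholder file name):
#   with_locale en_US.ISO-8859-1 ocrad -F utf8 -o page.txt page.pbm
```

If the text in page.txt is readable here but not in kooka, that would point at how kooka calls ocrad and not at ocrad itself.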

KimVette · 08-27-2005, 12:03 PM

I have not tried that (ocrad from the command line) - I'm trying to find a solution anyone can use. Oh, and I just exported LC_ALL with the ISO setting. Here is the output from locale now:

Code:

LC_ALL=en_US.ISO-8859-1
kim@kimp4:~> export LC_ALL
kim@kimp4:~> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
kim@kimp4:~>

Is it possible to reconfigure kooka, etc. to call ocrad or gocr with that command line argument to force unicode (is it defined in a config file somewhere), or is that bit hard-coded into kooka, requiring recompilation?

Also, if I were to set an alias for, say, ocrad pointing to ocrad -F "en_us.UTF-8", will applications pick up the alias and use that, or will their command-line arguments completely override the alias?

I suppose as a last resort I could rename or move ocrad and gocr and write a shell script to take kooka's arguments and pass them on, substituting only the desired argument for UTF-8.
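
On the alias question: aliases are expanded only by the interactive shell that defines them, so a program that executes ocrad directly would not see one. The wrapper-script idea could be sketched as follows. This is only a sketch under assumptions: /usr/bin/ocrad.real is a hypothetical location for the renamed original binary, -F utf8 comes from the help text quoted earlier, and whether kooka passes its own -F (or tolerates an extra one) is untested.

```shell
#!/bin/sh
# Wrapper sketch: forward all arguments to the real ocrad, adding -F utf8.
# OCRAD_REAL is a hypothetical path to the renamed original binary.
OCRAD_REAL=${OCRAD_REAL:-/usr/bin/ocrad.real}

ocrad_utf8() {
    # "$@" passes the caller's arguments through unchanged, quoting intact.
    "$OCRAD_REAL" -F utf8 "$@"
}

# When installed as a script named "ocrad", the body would simply be:
#   exec "$OCRAD_REAL" -F utf8 "$@"
```

This version prepends the option and does not replace an -F that the caller already supplies; doing that would need a loop over the arguments.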