LinuxQuestions.org


J_Szucs 08-23-2005 03:55 PM

Anyone using kooka + ocr for accented chars?
 
After scanning and running the OCR under SuSE 9.1, documents containing accented characters come out full of garbage, and all the accented characters disappear.

I think it is due to a character encoding issue.

Have you faced the same problem, and have you found a solution?

fragos 08-25-2005 09:08 PM

OCR is almost a black art, particularly with accented characters. Recognition can be affected by many things. For example, make sure the scanned sheet is lined up square on the glass. Kooka with SuSE 9.3 supports GOCR and OCRAD; perhaps trying the other one will help. A perfect OCR translation may not be possible. Spell checkers can help fix errors. There may be a commercial package that is more accurate, but if it exists it won't be free.

J_Szucs 08-28-2005 12:48 PM

I finally found a completely undocumented kooka option by reading the kooka source code.

I had to put the following option in the [ocrad] section of the file ~/.kde/share/config/kookarc:
extraArguments=--format=utf8
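
The same change can be scripted. What follows is only a minimal sketch: the helper name set_ocrad_extra_args is made up for illustration, and it assumes kookarc is a plain INI-style file with one key=value pair per line. It rewrites the file it is given, so it is safer to try it on a copy first.

```shell
#!/bin/sh
# Hypothetical helper: make sure the [ocrad] section of an INI-style file
# contains exactly one "extraArguments=<value>" line.
set_ocrad_extra_args() {
    rc_file=$1
    value=$2
    tmp_file="$rc_file.tmp.$$"
    # Create the file if it does not exist yet.
    [ -f "$rc_file" ] || : > "$rc_file"
    awk -v val="$value" '
        function emit() { print "extraArguments=" val; done = 1 }
        /^\[/ {
            # Leaving [ocrad] without having seen the key: add it first.
            if (in_ocrad && !done) emit()
            in_ocrad = ($0 == "[ocrad]")
            if (in_ocrad) seen = 1
            print
            next
        }
        # Replace an existing key inside [ocrad] instead of duplicating it.
        in_ocrad && /^extraArguments=/ {
            if (!done) emit()
            next
        }
        { print }
        END {
            # No [ocrad] section at all: append one.
            if (!seen) print "[ocrad]"
            if (!done) emit()
        }
    ' "$rc_file" > "$tmp_file" && mv "$tmp_file" "$rc_file"
}
```

It would be called as: set_ocrad_extra_args ~/.kde/share/config/kookarc --format=utf8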

Now I have the accented characters, yet ocrad makes far too many mistakes in the recognized text.
Since vuescan makes even more mistakes (not a single correctly recognized word in a small printed Hungarian text, black letters on a white background, at 300 dpi resolution), I am still looking for a usable OCR solution for Linux.

aikempshall 08-29-2005 02:35 AM

I've scanned an article from the business section of a newspaper that is printed on pink paper. I scanned in gray at 600dpi. Then at the command line -

jpegtopnm gray600dpi.jpeg | pamditherbw -threshold -value 0.50 | pamtopnm | ocrad --

This gave almost perfect results: 4 wrong characters out of 33 lines, 181 words and 1109 characters. The problem characters were "gt", as in Washington and strength, where the "g" was joined to the "t".
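
To put that figure in perspective, the share of wrong characters can be worked out with a one-line calculation. This is just a sketch; the function name char_error_rate is made up for illustration.

```shell
#!/bin/sh
# Percentage of wrongly recognised characters, rounded to two decimals.
char_error_rate() {
    wrong=$1
    total=$2
    awk -v w="$wrong" -v t="$total" 'BEGIN { printf "%.2f\n", 100 * w / t }'
}

# 4 wrong characters out of 1109, as reported for this scan.
char_error_rate 4 1109   # prints 0.36
```

So roughly 0.36% of the characters in that scan were wrong.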

J_Szucs 08-29-2005 04:15 PM

Quote:

gave almost perfect results 4 wrong characters out of 33 lines
And how many of those 33 lines contained accented characters in a language other than English, French or German?

Just try it and see for yourself.

I can only repeat: none of those OCR programs is even close to usable for this language, and probably not for any other language that has accented characters and no dictionary file (e.g. for vuescan).
Here, 4 erroneous letters are common within one single word (not within 33 lines).

aikempshall 08-31-2005 05:21 AM

The ocrad source code includes the files examples/test.pbm and examples/test.txt. These files contain accented characters, so it has the ability.

I've scanned a document, obtained from http://simba.ara.bme.hu , in gray at 600dpi to a jpeg file and run it through ocrad using -

jpegtopnm kscan_0005.jpeg | pamditherbw -threshold -value 0.50 | pamtopnm | ocrad -- > hugarian.txt

got the following result -

A na_ befogadóképességü közösségi épületek tüzvédelme a na_számú érintett
miatt kiemelten fontos terület. A megelözésen túl elöírások szabályozzák az esetleg
kiáakuló tüz esetére teljesítendö paramétereket, például a füstelszívó berendezés
szükséges teljesítményét. A feláramló füstgáznak nem szabad az evakuációs idönél
hamarabb elérnie a nézöteret, de fontos szempont az is, ho_ a tetöszerkezetet érö
höterhelés nem _engíti-e kritikus mértékben az épületet. A Budapest
Sportcsarnokot megsemmisítö tüzeset különösen ráirányította a fi_elmet a terület
fontosságára.
Vizsgálatunk célja a Budapest Sportaréna füstelszívó berendezésének
hatékonyságvizsgálata volt.

Is this what you get?

J_Szucs 09-01-2005 03:20 PM

Here are the best results I could reach with kooka+ocrad at 300 dpi:

A na befogadóképessé kôzôssé ép?letek ?wédelme a naszámú érintett
miatt bemelten fontos ter?let. A megelôzésen túl elôírások szabályoyák az esetleg Walló t?z esetére teljesítendô parétereket, például a füstelszívó berendezés sz?kséges teljesiményét. A felárló f?stgáznak nem szabad az evációs idánél harabb elémie a nézôteret, de fontos szempont az is, ho a tetöszerkezetet érô hôterhelés nem ent-e Úrms mértékben az épületet. A Budapest sportcsarnokot megsemmisítô tzeset különôsen ráirántotta a fielmet a ter?let
fontosságára.
Vúsgatunk célja a Budapest sportaréna f?stelszívó berendezésének
hatékonysá&?sgálata volt.

It's terrible.

Using the netpbm tools is out of the question, as the OCR will be used by a disabled (almost blind) person who would be unable to use the command line. Besides, she is unable to check the scanning quality and fine-tune the necessary options.
She would use kooka + ocr + mbrola as a daily routine to convert newspaper articles or books to audible speech. Being able to select areas (newspaper columns) to read is a must, which would be difficult or even impossible from the command line (as a matter of fact, it is quite difficult with kooka, too).
The tts part works well, but some unnecessary kooka dialog panels that cannot be disabled make the reading process quite complicated, and the OCR makes too many mistakes. Increasing the resolution for better scan quality drastically increases scanning time and tends to crash kooka (at about 500dpi).

I am on the point of giving up.

aikempshall 09-01-2005 03:55 PM

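I can't really suggest much more. I have created a script -

#!/bin/sh
# kooka passes: -x <result file> <image> ...
# $1 and $2 go straight to ocrad; $3 is the image to convert first
jpegtopnm "$3" | pamditherbw -threshold -value 0.50 | pamtopnm | ocrad "$1" "$2"
echo myocrad finished
exit 0

which I called myocrad and saved in the bin directory of my home directory. I then entered it in place of the command /usr/bin/ocrad in the OCR settings of kooka.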

I then amended void KSANEOCR::startOCRAD( ) in ksaneocr.cpp as follows -

{
/* The url is empty. Start the program to fill up a temp file */
//AIK m_ocrResultImage = ImgSaver::tempSaveImage( m_img, "BMP", 8 ); // m_tmpFile->name();
m_ocrResultImage = ImgSaver::tempSaveImage( m_img, "BMP", 8 ); // m_tmpFile->name();
kdDebug(28000) << "The BMP image name is <" << m_ocrResultImage << ">" << endl;
}

//AIK m_ocrImagePBM = ImgSaver::tempSaveImage( m_img, "PBM", 1 );
m_ocrImagePBM = ImgSaver::tempSaveImage( m_img, "JPEG", -1 );

This may get you a bit closer, but it still won't get you close to your specific requirements. I use 600dpi. I guess that mbrola would need almost perfect results. The only other OCR I've used is omnipage, and that can't give perfect results under windows or wine; even if it could, it would still need help identifying pictures and poor quality text.

Could the articles not be obtained electronically, perhaps with wget?

In Britain we have an organisation that records newspapers onto tape cassettes and distributes them to those who need them, i.e. a few people do the work so that many others can enjoy.

good luck.

J_Szucs 09-01-2005 06:10 PM

Quote:

I have created a script ... which I called myocrad in the bin directory of my home directory and replaced the command /usr/bin/ocrad in the ocr settings of kooka.
That is really interesting! I wanted to do something similar with a bash script. The script would have replaced ocrad in the kooka settings, started ocrad, and then started mbrola to speak the results automatically. But kooka never ran my script, even when I saved it as /usr/bin/ocrad.
Now I can see how you did it, thanks!
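
The wrapper idea described here (run ocrad, then speak the result) could be sketched roughly as follows. This is only an illustration under several assumptions: the function name ocr_and_speak and the variables OCRAD_CMD and SPEAK_CMD are made up, the recognised text is assumed to arrive on the OCR program's standard output, and SPEAK_CMD defaults to a harmless placeholder because the real text-to-speech command depends on the local setup.

```shell
#!/bin/sh
# Hypothetical wrapper: run the real OCR program with the arguments it was
# given, pass its output through unchanged, then feed a copy to a speech command.
OCRAD_CMD=${OCRAD_CMD:-/usr/bin/ocrad}   # real ocrad binary (path taken from the thread)
SPEAK_CMD=${SPEAK_CMD:-cat}              # placeholder: a single command that reads text on stdin

ocr_and_speak() {
    text_file=$(mktemp) || return 1
    # tee keeps stdout intact for the caller while saving a copy of the text.
    "$OCRAD_CMD" "$@" | tee "$text_file"
    # The speech command reads the saved text; its own output is discarded.
    "$SPEAK_CMD" < "$text_file" > /dev/null 2>&1
    rm -f "$text_file"
}
```

Saved as a script, its last line would be: ocr_and_speak "$@"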

Quote:

I guess that mbrola would need almost perfect results.
Fortunately not; mbrola simply speaks what you give it to speak. Besides, I found that a text is even easier to understand when you hear it (human ears seem to be good at error correction), provided that it does not contain too many errors. But as the number of errors increases, at a certain point it suddenly becomes "un-understandable".

There are some other ideas that could simplify the "read-and-speak" process, but I cannot implement them with my limited programming knowledge.
I wanted to modify the kooka sources to
- either show the scanned image in "page fit" size first (at present it opens at the original size, so most of the image falls off the screen and the user must always zoom out several times), or
- let kooka scroll the image while you are making a selection
- disable the "ocrad dialog panel" (this would save the user one more click)
- make the boundary of the selected area thicker, so that it is more visible
- disable the automatic "cropping" of the image to the selected area when the user clicks the "ocr of the selection" button (since in this case the user is very likely to select another area from the same image next)


I wonder how difficult the above modifications would be.

aikempshall 09-02-2005 03:27 AM

If you will only be using ocrad, I suggest stripping out the gocr and kadmos code to see what you are left with.

Then customise the rest. For instance, at -

daemon = new KProcess;
Q_CHECK_PTR(daemon);

*daemon << cmd;
*daemon << QString("-x");
*daemon << m_tmpOrfName; // the orf result file
*daemon << QFile::encodeName( m_ocrImagePBM ); // The name of the image
*daemon << QString("-l");
*daemon << QString::number( ocrDia->layoutDetectionMode());


You could hard-code your requirements. In my case it would look like the code below, though the script
jpegtopnm $3 | pamditherbw -threshold -value 0.50 | pamtopnm | ocrad $1 $2
is probably the better choice because it is easier to change -

// "|" is only treated as a pipe when KProcess runs the command through a shell
daemon->setUseShell( true );
*daemon << QString("jpegtopnm");
*daemon << KProcess::quote( m_ocrImagePBM ); // The name of the image, quoted for the shell
*daemon << QString("|");
*daemon << QString("pamditherbw");
*daemon << QString("-threshold");
*daemon << QString("-value");
*daemon << QString("0.50");
*daemon << QString("|");
*daemon << QString("pamtopnm");
*daemon << QString("|");
*daemon << QString("ocrad");
*daemon << QString("-x");
*daemon << KProcess::quote( m_tmpOrfName ); // the orf result file



The following code is the most interesting -

/*
* this part is independant from the engine again
*/
if( m_ocrProcessDia )
{
m_ocrProcessDia->setupGui();

m_ocrProcessDia->introduceImage( m_img );
visibleOCRRunning = true;

connect( m_ocrProcessDia, SIGNAL( user1Clicked()), this, SLOT( startOCRProcess() ));
connect( m_ocrProcessDia, SIGNAL( closeClicked()), this, SLOT( slotClose() ));
connect( m_ocrProcessDia, SIGNAL( user2Clicked()), this, SLOT( slotStopOCR() ));
m_ocrProcessDia->show();

}

It's the code that displays the ocrad dialog panel. It also proceeds to startOCRProcess() if the user presses the start button. You could comment out the three connect statements and replace them with a single line calling startOCRProcess().

It would look like this -

// connect( m_ocrProcessDia, SIGNAL( user1Clicked()), this, SLOT( startOCRProcess() ));
// connect( m_ocrProcessDia, SIGNAL( closeClicked()), this, SLOT( slotClose() ));
// connect( m_ocrProcessDia, SIGNAL( user2Clicked()), this, SLOT( slotStopOCR() ));

m_ocrProcessDia->show();
startOCRProcess();

If you start down this road and run into problems, start a new thread.

However, none of this will help if the results of the scan are not of suitable quality.

Good luck

J_Szucs 09-04-2005 11:19 AM

Thanks to you, aikempshall, kooka now speaks!

I also found that the ocr thingy improves a bit if I scan at 600 dpi.

However, may I ask for a little more help with the kooka sources?
I cannot find how to modify kooka to open the preview image in "scaletowidth" mode by default (it is always scaled to page, so the resulting image is too small and the user must always mess about zooming it).
I have already made some changes here and there in libkscan/img_canvas.cpp, but they did not work. I am pretty sure it would be just a damn small modification somewhere, but I cannot find where.

Besides, have you noticed that it is not possible to define keyboard shortcuts for the "scan final image" and "scan preview image" tasks? I think it is because those buttons are in a different window, not in Kooka's main window. Anyway, it would be nice if those shortcuts appeared in a kooka version soon, 'cause I badly need them :-(.

aikempshall 09-07-2005 07:11 AM

Have you tried detaching the windows? Use the upward/left pointing arrow in the top right-hand corner of the various kooka views, next to the x.

Be prepared to experiment, as once they were detached I found it difficult to attach them again.

As an aside, do you have problems with OCR hanging if you Preview scan, then Final Scan, then OCR? On my machine I end up with a zombie process, and although kooka can scan to its heart's content, I cannot OCR unless I restart kooka. It's probably something I've done in changing the code. Thought I'd ask before reinstalling the original package.

Regards

J_Szucs 09-08-2005 07:11 AM

Quote:

Have you tried detaching the windows?
Yes, but unfortunately that is not an option, since floating windows would only confuse a disabled user. (She expects all buttons to always be in the same place.)
The ideal solution would be one single window with just the controls needed: the preview image window (with the preview image as large as possible by default) plus the buttons "scan final", "scan preview" and "ocr".

Quote:

As an aside, do you have problems with OCR hanging if you Preview scan, then Final Scan, then OCR? On my machine I end up with a zombie process, and although kooka can scan to its heart's content, I cannot OCR unless I restart kooka. It's probably something I've done in changing the code. Thought I'd ask before reinstalling the original package.
Probably. The OCR from kooka is stable here; I never get zombie processes.
My sources are probably different, as I did not make all the changes you posted here, just the minimum needed to call a wrapper script for ocrad.
Besides, my wrapper script is simpler, as I do not use the netpbm tools to preprocess the image before OCR; there is no conversion to jpeg and such.
(That is because I realized that my netpbm package is an older version and lacks some of the tools you use in your code. I did not want to confuse the SuSE package manager by installing a newer netpbm version from source, so I stayed with the image types kooka uses by default.)
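
A quick way to see which of the netpbm tools used earlier in this thread are missing from an installed package is to probe the PATH. A minimal sketch, with missing_tools as a made-up helper name:

```shell
#!/bin/sh
# Print the name of every command given as an argument that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# The three netpbm programs used in the pipeline from this thread.
missing_tools jpegtopnm pamditherbw pamtopnm
```

If it prints nothing, all three programs are available; otherwise it lists the ones the installed netpbm version lacks.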


All times are GMT -5.