Including GOCR in a C++ program

modafine · 05-16-2008, 09:49 AM

Hello everybody,

I'm working on a C++ project.

I want to extract the text contained in an image.

I need to apply OCR to my input image to convert it into a text document, so I want to integrate GOCR into my C++ program.

Could you help me find the steps to follow to integrate gocr into my program?

Thank you for your help.

matthewg42 · 05-17-2008, 10:11 AM

In my experience gocr doesn't give nearly as accurate results as tesseract. There's API documentation for tesseract here.

modafine · 05-19-2008, 04:45 AM

Thank you, matthewg42.

I downloaded tesseract-2.01 and installed it. The installation process was as follows:

./configure
make
make install
export TESSDATA_PREFIX="/usr/local/share/"

But when I run "tesseract phototest.tif phototest -l eng",

I get this error message:

Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset

Can you help me test this OCR engine? I have to choose one of these programs (tesseract or gocr) to integrate into my C++ program.

Thanks.

matthewg42 · 05-19-2008, 10:07 AM

I only ever installed it from the Ubuntu repositories, and it 'just worked'.
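
For the unicharset error, one thing that might be worth checking is whether the file named in the message actually exists. Going by that message, tesseract seems to look for <prefix>tessdata/<lang>.unicharset, so here is a minimal C++ sketch that rebuilds the path and tests whether it can be opened. The function names (unicharset_path, file_is_readable, tessdata_prefix_from_env) are made up for illustration and are not part of tesseract, and the path layout is only an assumption based on the error text.

Code:

```cpp
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <string>

// Rebuild the path that the error message suggests tesseract looks for:
//   <prefix>tessdata/<lang>.unicharset
// (hypothetical helper; the layout is inferred from the error text only)
std::string unicharset_path(const std::string& prefix, const std::string& lang) {
    assert(!lang.empty());
    std::string p = prefix;
    // Add a trailing slash if the prefix does not already end with one.
    if (!p.empty() && p.back() != '/') p += '/';
    return p + "tessdata/" + lang + ".unicharset";
}

// True if the file can be opened for reading.
bool file_is_readable(const std::string& path) {
    std::ifstream f(path.c_str());
    return f.good();
}

// Value of TESSDATA_PREFIX from the environment, or "" if it is not set.
std::string tessdata_prefix_from_env() {
    const char* v = std::getenv("TESSDATA_PREFIX");
    return v ? std::string(v) : std::string();
}
```

Calling file_is_readable(unicharset_path(tessdata_prefix_from_env(), "eng")) should tell you whether the language data is where the program expects it. Note that a prefix without a leading slash is treated as relative to the current directory.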

To use the command line tool, you need to convert input images into TIFF format first (or "MDR", whatever that is). I used the ImageMagick convert program to do this. For example, using a page of text from the Distributed Proofreaders project:

Code:

wget http://www.pgdp.net/projects/projectID47d3b81d1228b/005.png
convert 005.png 005.tiff
tesseract 005.tiff 005

This produces the file 005.txt containing the OCR'd text.

I don't know how easy or otherwise it will be to use it from a program, rather than with the command line program.
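
That said, if calling the API directly turns out to be awkward, a C++ program could simply run the command line tool and read back the .txt file it writes. Below is a rough sketch of that idea, not a tested integration: it assumes a POSIX shell, assumes tesseract is on the PATH and returns a non-zero exit status on failure, and all the function names (shell_quote, tesseract_command, read_file, ocr_tiff) are my own inventions.

Code:

```cpp
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>

// Wrap an argument in single quotes for a POSIX shell,
// escaping any single quotes it contains.
std::string shell_quote(const std::string& arg) {
    std::string out = "'";
    for (char c : arg) {
        if (c == '\'') out += "'\\''";
        else out += c;
    }
    out += "'";
    return out;
}

// Build "tesseract <image> <output_base>", the same form as the
// command line shown above.
std::string tesseract_command(const std::string& image,
                              const std::string& output_base) {
    assert(!image.empty() && !output_base.empty());
    return "tesseract " + shell_quote(image) + " " + shell_quote(output_base);
}

// Read a whole text file into 'contents'; false if it cannot be opened.
bool read_file(const std::string& path, std::string& contents) {
    std::ifstream in(path.c_str());
    if (!in) return false;
    std::ostringstream buf;
    buf << in.rdbuf();
    contents = buf.str();
    return true;
}

// Run tesseract on a TIFF image and load the text from <output_base>.txt.
// Assumption: a non-zero status from std::system means the run failed.
bool ocr_tiff(const std::string& image, const std::string& output_base,
              std::string& text) {
    if (std::system(tesseract_command(image, output_base).c_str()) != 0)
        return false;
    return read_file(output_base + ".txt", text);
}
```

With the example page above, ocr_tiff("005.tiff", "005", text) would run the same tesseract command and, if it succeeds, leave the contents of 005.txt in text. Starting a process per image is slower than linking against the library, but it keeps your program independent of tesseract's internals.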

Like almost all OCR programs, it's not perfect, but it's pretty good compared to other free OCR software.