LinuxQuestions.org


damgar 05-18-2010 07:56 PM

I need OCR software.
 
I'm looking for OCR software. I've installed gocr, and I'm not sure if I'm missing something or not, but the resulting text file was completely illegible, even though the original document was just a few sentences typed by my son's teacher.

Any recommendations?

kurwongbah 05-18-2010 09:19 PM

This has always done a pretty darn good job for me!

http://code.google.com/p/tesseract-ocr/

damgar 05-18-2010 11:48 PM

Thanks. I've been trying to get tesseract to work all night. I finally got it to build, but I get a seg fault each time I try to test it. I'm not really sure what the problem is. :doh:

catkin 05-19-2010 12:15 AM

Quote:

Originally Posted by damgar (Post 3973587)
Thanks. I've been trying to get tesseract to work all night. I finally got it to build, but I get a seg fault each time I try to test it. I'm not really sure what the problem is. :doh:

Try giving it a really simple input file name like x.tif

kurwongbah 05-19-2010 04:38 AM

Come to think of it, I believe it was in my distro...
"yum install tesseract" did the trick!

damgar 05-19-2010 07:43 AM

Quote:

Originally Posted by catkin (Post 3973620)
Try giving it a really simple input file name like x.tif

Same results.
Quote:

bash-4.1# tesseract /home/dtest/x.tif /home/dtest/x.txt -l eng
Tesseract Open Source OCR Engine
Segmentation fault

catkin 05-19-2010 09:37 AM

Quote:

Originally Posted by damgar (Post 3974005)
Same results.

tesseract 2.04 (built it using a slightly modified tesseract 2.03 SlackBuild) is working for me on Slackware 13.0 32-bit but did segfault on long input names. The command line that worked was tesseract z2.tif z2 (z2 is kind of catchy huh?).
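
If long input names are what trigger the segfault, one workaround is to copy the image to a short name in a scratch directory and run tesseract from there. A minimal sketch, assuming that is the cause; ocr_short_name and the TESSERACT override variable are made-up names for illustration, not part of tesseract. Note that the second argument to tesseract is an output base, and tesseract adds the .txt itself.

```shell
# Hypothetical helper: OCR an image through a short scratch file name.
# TESSERACT is an assumed override variable, handy for testing.
ocr_short_name() {
    src=$1     # original image, however long its path is
    dest=$2    # where the recognised text should end up
    work=$(mktemp -d) || return 1
    status=1
    # Copy to a short name and run from inside the scratch directory,
    # so both arguments tesseract sees are short and relative.
    # tesseract appends ".txt" to the output base ("x") by itself.
    if cp -- "$src" "$work/x.tif" &&
       ( cd "$work" && "${TESSERACT:-tesseract}" x.tif x -l eng ) &&
       cp -- "$work/x.txt" "$dest"
    then
        status=0
    fi
    rm -rf -- "$work"
    return "$status"
}
```

Usage would be something like ocr_short_name /path/to/a-long-scan-name.tif result.txt (both paths are placeholders).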

kurwongbah 05-19-2010 06:29 PM

Gave it a go on my work pc. I was able to install from yum.
Still seems to work reasonably well.
I remembered it was very sensitive to the input resolution and file format.
The best results I'm getting are with TIFF at 600 dpi.
How are you going?
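
For anyone whose document is not already a 600 dpi TIFF, here is a rough sketch of a conversion with ImageMagick's convert. ImageMagick is not mentioned in this thread, so treat it as one option among several; to_ocr_tif and the CONVERT override variable are made-up names. The -density setting controls how a vector input such as a PDF is rasterised. It does not add detail to a bitmap scanned at a lower resolution, so rescanning at 600 dpi is the better fix in that case.

```shell
# Hypothetical helper: make a greyscale, uncompressed TIFF for OCR.
# CONVERT is an assumed override variable; by default ImageMagick's
# "convert" command is used.
to_ocr_tif() {
    # -density 600 before the input: rasterise vector input at 600 dpi.
    # -colorspace Gray: drop colour, which OCR does not need.
    # -compress none: write an uncompressed TIFF.
    "${CONVERT:-convert}" -density 600 "$1" \
        -colorspace Gray -compress none "$2"
}
```

Usage would be something like to_ocr_tif letter.pdf letter.tif (placeholder file names), followed by tesseract letter.tif letter.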

damgar 05-19-2010 08:13 PM

Quote:

Originally Posted by catkin (Post 3974182)
tesseract 2.04 (built it using a slightly modified tesseract 2.03 SlackBuild) is working for me on Slackware 13.0 32-bit but did segfault on long input names. The command line that worked was tesseract z2.tif z2 (z2 is kind of catchy huh?).

Yes, I do like the name. I'm thinking it's probably a slackware-almost-current+tesseract issue. I had to do some manual patching to both the source and slackbuild, the slacky.eu packages give errors about libjpeg versions, and their slackbuild does a weird time-out thing. I don't have the time it would likely take me to figure this out, but with slack 13.1 just around the corner I'm hoping the slackbuild maintainers will know more about it than I do. :)

catkin 05-20-2010 03:27 AM

On standard Slackware 13.0 the build was very simple. All I did to modify the SlackBuild from 2.03 to 2.04 was edit tesseract.SlackBuild:
  • changed the version.
  • removed the patch commands.
I put the desired language file (tesseract-2.00.eng.tar.gz) in the build directory in the normal way and ran the modified tesseract.SlackBuild.
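
The two edits above could be scripted roughly as follows. This is only a sketch: it assumes the 2.03 SlackBuild sets the version on a line starting with VERSION= and applies its patches on lines containing "patch -p", which may not match the real file, so check the result by hand. bump_slackbuild is a made-up helper name.

```shell
# Hypothetical helper: derive a 2.04 SlackBuild from the 2.03 one.
# Assumptions (check against the real script):
#   - the version is set on a line starting with "VERSION="
#   - patches are applied on lines containing "patch -p"
bump_slackbuild() {
    # $1 = original SlackBuild, $2 = modified copy to write
    sed -e '/^VERSION=/s/2\.03/2.04/' \
        -e '/patch -p/d' \
        "$1" > "$2"
}
```

Writing to a copy keeps the original SlackBuild around for comparison, for example with diff -u.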

Hopefully you are right and all will be well for you on Slackware 13.1.

ilnli 09-30-2010 03:56 PM

Or you can use http://www.ocrconvert.com online to convert your PDF file into text. I've found it very fast, and the conversion is quite accurate.


All times are GMT -5.