LinuxQuestions.org


business_kid 04-13-2016 10:14 AM

OCR pre processing help
 
2 Attachment(s)
I am trying to OCR some pre-scanned, anti-aliased PDF page images, and seek advice. I need GOOD OCR, because I would like to zoom the text to read it instead of reading the (low-res) images. To extract the page images, I tried various utilities:
  • pdfimages returns only junk.
  • pdftoppm -r400 -tiff sort of does the job, but leaves a grey mess around the print no matter what antialias & font options are used.
  • GIMP was used to set thresholds; that got rid of the light grey (Thresholded.png), but it couldn't be automated and gave varied results on the same page.
  • ImageMagick has endless option permutations, and gs likewise. No winning combo was found.
Using the sample below (as original.png) in TIFF format with various options, I can't do much better than this:
Code:

Born in i923 in the small fishing village of Stanley;
Tasinania,Iiilll!vloliisonleftsci1oolattlieagteot'I5
to hel run the family bakery. He soon went to sea

The anti-aliasing is there from the start. If you zoom in on the image you can see all the grey injected around the letters. Getting rid of it with GIMP (Thresholded.png) gave this OCR:
Code:

Born in l923 in the small fishing village of Stanley.
Tasmania, Bill Mollison left school at the age of 15
to hel run the family bakery. He soon went to sea
as a s fisherman and seaman bringing vessels

Should I give up, or is there hope? Does anyone have any 'convert', 'mogrify', or other magic they would recommend? I'm using tesseract-3.02 for OCR. Cuneiform-1.1 returns floating point exceptions 100% of the time on Slackware-14.1.
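
To make clear what I mean by automating the GIMP threshold step, here is a rough sketch of the sort of thing I am after. It is only an illustration under my own assumptions: it uses Otsu's method to pick the cut-off per page (I have not tried this on the real scans, and it may not be the right method for them), it works on a plain list of 0-255 grey values loaded by whatever image library is at hand, and the function names otsu_threshold and binarize are made up for this example.

```python
def otsu_threshold(pixels):
    """Pick a grey level (0-255) separating dark print from light paper.

    Otsu's method: choose the level that maximises the between-class
    variance of the two groups (values <= level, values > level).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    sum_dark = 0      # sum of grey values at or below the candidate level
    n_dark = 0        # number of pixels at or below the candidate level
    best_level, best_var = 0, -1.0

    for level in range(256):
        n_dark += hist[level]
        if n_dark == 0:
            continue              # nothing on the dark side yet
        n_light = total - n_dark
        if n_light == 0:
            break                 # everything is on the dark side now
        sum_dark += level * hist[level]
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        between_var = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_level = between_var, level

    return best_level


def binarize(pixels, level):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= level else 255 for p in pixels]


if __name__ == "__main__":
    # Hypothetical page: dark print, grey anti-aliasing halo, white paper.
    page = [30] * 40 + [180] * 20 + [255] * 140
    level = otsu_threshold(page)
    clean = binarize(page, level)
    print("threshold:", level, "black pixels:", clean.count(0))
```

The point of computing the level from each page's own histogram is that it should not vary from run to run the way my manual GIMP attempts did. Whether the grey halo ends up black or white depends on the page, so the result would still need checking against tesseract.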

business_kid 04-13-2016 10:20 AM

Sorry - Double post.

