[SOLVED] OCR pre-processing help

business_kid · 04-13-2016, 10:14 AM

I am trying to OCR some pre-scanned, antialiased PDF images, and would like advice. I need GOOD OCR, because I want to be able to zoom the text to read it, instead of reading the (low-res) images. To extract the page images, I tried various utilities:

pdfimages returns only junk.
pdftoppm -r 400 -tiff sort of does the job, but leaves a grey mess around the print no matter which antialias and font options are used.
GIMP was used to set thresholds; that got rid of the light grey (Thresholded.png), but it couldn't be automated, and it gave varied results on the same page.
ImageMagick has endless option permutations, and so does gs. No winning combination was found.
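
The GIMP threshold step is the part that resisted automation, so here is a rough sketch of one way a threshold could be picked automatically. It uses Otsu's method, which is not what was tried above; it is swapped in purely as an illustration. The function names and the sample pixel values are made up, and the code assumes an 8-bit greyscale page already loaded as a flat list of integers from 0 to 255.

```python
def otsu_threshold(pixels):
    """Pick the grey level (0-255) that best splits pixels into dark and light."""
    # Histogram of grey levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark = 0  # number of pixels at or below the candidate level
    sum_dark = 0     # sum of the grey values of those pixels
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        if weight_dark == 0:
            continue  # no dark pixels yet, nothing to split
        weight_light = total - weight_dark
        if weight_light == 0:
            break     # everything is on the dark side from here on
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when the two groups are well separated.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Hypothetical page: black ink, a light grey antialias halo, white paper.
    page = [10] * 50 + [200] * 10 + [245] * 100
    level = otsu_threshold(page)
    print("chosen threshold:", level)
    print("distinct output values:", sorted(set(binarize(page, level))))
```

Because the threshold is computed from each page's own histogram, the same page should always give the same result, which was the problem with doing it by hand. Whether it actually helps tesseract on these scans is untested.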

Using the sample below (original.png) in TIFF format with various options, I can't do much better than this:

Code:

Born in i923 in the small fishing village of Stanley;
Tasinania,Iiilll!vloliisonleftsci1oolattlieagteot'I5
to hel run the family bakery. He soon went to sea

Antialiasing is there from the start. If you zoom in, you can see all the grey that was injected. Getting rid of it with GIMP (Thresholded.png) gave this OCR:

Code:

Born in l923 in the small ﬁshing village of Stanley.
Tasmania, Bill Mollison left school at the age of 15
to hel run the family bakery. He soon went to sea
as a s ﬁsherman and seaman bringing vessels

Should I give up, or is there hope? Has anyone any 'convert', 'mogrify', or other magic they would recommend? I'm using tesseract-3.02 for OCR. Cuneiform-1.1 returns floating point exceptions 100% of the time on Slackware-14.1.
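
For reference, the chain being asked about (rasterize, threshold, OCR) could be scripted roughly as below. This is only a sketch, not a known-good recipe: the 60% threshold is a guess that would need tuning per document, and every file name is a placeholder. The helpers only build the command lines; the page image names that pdftoppm writes depend on the page count, so they are passed in explicitly rather than guessed.

```python
import shlex


def rasterize_cmd(pdf_path, out_prefix, dpi=400):
    """pdftoppm: render each PDF page to a TIFF at the given resolution."""
    return ["pdftoppm", "-r", str(dpi), "-tiff", pdf_path, out_prefix]


def threshold_cmd(src_image, dst_image, percent=60):
    """ImageMagick convert: go to greyscale, then force pixels to black or white.

    The percent value is an assumption, not a tested setting.
    """
    return ["convert", src_image, "-colorspace", "Gray",
            "-threshold", f"{percent}%", dst_image]


def ocr_cmd(image, out_base):
    """tesseract: OCR the image; the text goes to out_base.txt."""
    return ["tesseract", image, out_base]


if __name__ == "__main__":
    # Placeholder names; substitute the real PDF and the files pdftoppm produces.
    steps = [
        rasterize_cmd("book.pdf", "page"),
        threshold_cmd("page-1.tif", "page-1-bw.tif"),
        ocr_cmd("page-1-bw.tif", "page-1"),
    ]
    for step in steps:
        print(shlex.join(step))
```

Each list can be handed to subprocess.run(step, check=True) once the printed commands look right for the files at hand.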

business_kid · 04-13-2016, 10:20 AM

Sorry - Double post.