LinuxQuestions.org
Old 04-13-2016, 11:14 AM   #1
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, RPi OS, Mint & Android
Posts: 13,397

Rep: Reputation: 1831
OCR pre-processing help


I am trying to OCR some pre-scanned, antialiased PDF page images, and I'm looking for advice. I need GOOD OCR, because I would like to zoom the text to read it instead of reading the (low-res) images. To extract the pictures, I tried various utilities:
  • pdfimages returns only junk.
  • pdftoppm -r 400 -tiff sort of does the job, but leaves a grey mess around the print no matter which antialias & font options are used.
  • GIMP was used to set thresholds; that got rid of the light grey (Thresholded.png), but it couldn't be automated and gave varied results on the same page.
  • ImageMagick has endless option permutations, and Ghostscript (gs) likewise. No winning combo was found.
Using the sample below (Original.png), converted to TIFF with various options, I can't do much better than this:
Code:
Born in i923 in the small fishing village of Stanley;
Tasinania,Iiilll!vloliisonleftsci1oolattlieagteot'I5
to hel run the family bakery. He soon went to sea
The antialiasing is there from the start. If you zoom in on the image, you can see all the grey that was injected. Getting rid of it with GIMP (Thresholded.png) gave this OCR:
Code:
Born in l923 in the small fishing village of Stanley.
Tasmania, Bill Mollison left school at the age of 15
to hel run the family bakery. He soon went to sea
as a s fisherman and seaman bringing vessels
Should I give up, or is there hope? Does anyone have any 'convert', 'mogrify', or other magic they would recommend? I'm using tesseract-3.02 for OCR. Cuneiform-1.1 returns floating point exceptions 100% of the time on Slackware-14.1.
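To make the "couldn't be automated" part of the threshold step concrete: one common way to pick a threshold automatically is Otsu's method, which chooses the grey level that best separates dark ink from light paper. The following is only a minimal Python sketch of that idea, not something that has been tried on these scans. It assumes the page has already been reduced to a flat list of 8-bit grey values (for example, read from a PGM written by pdftoppm -gray), and the names otsu_threshold and binarize are made up for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximises between-class variance.

    Pixels <= t are treated as ink, pixels > t as paper.
    """
    # Histogram of grey levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate level
    sum_dark = 0  # sum of their grey values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue          # nothing on the dark side yet
        w_light = total - w_dark
        if w_light == 0:
            break             # nothing left on the light side
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Toy "page": black ink, a few antialiasing greys, lots of white paper.
    page = [0] * 40 + [90] * 5 + [180] * 10 + [255] * 200
    t = otsu_threshold(page)
    clean = binarize(page, t)
    print("threshold:", t, "distinct output levels:", sorted(set(clean)))
```

Because the threshold is computed from each page's own histogram, the same script gives the same result for the same page every time, which is the property the manual GIMP step lacked. Whether a single global threshold is good enough for these particular scans is an open question.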
Attached Thumbnails
Original.png (33.6 KB)
Thresholded.png (3.7 KB)
 
Old 04-13-2016, 11:20 AM   #2
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, RPi OS, Mint & Android
Posts: 13,397

Original Poster
Rep: Reputation: 1831
Sorry - Double post.
 
  

