LinuxQuestions.org
04-13-2016, 10:14 AM   #1
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,294

OCR pre processing help


I am trying to OCR some pre-scanned, antialiased PDF images, and I'm looking for advice. I need GOOD OCR, because I would like to zoom the text to read it instead of reading the (low-res) images. To extract the page images, I tried various utilities:
  • pdfimages returns only junk.
  • pdftoppm -r400 -tiff sort of does the job, but leaves a grey mess around the print no matter which antialias and font options are used.
  • Gimp was used to set thresholds; that got rid of the light grey (Thresholded.png), but it couldn't be automated and gave varied results on the same page.
  • ImageMagick has endless option permutations, and gs likewise. No winning combo was found.
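For illustration only, here is one way the render, threshold, and OCR steps listed above might be chained so the manual Gimp step is scripted. This is a sketch, not a known winning combo: the function name, the file names, and the 60% threshold are hypothetical, and the commands have not been run against these scans. It only builds and prints the command lines; it assumes pdftoppm (poppler), ImageMagick's convert, and tesseract are installed.

```python
import shlex


def ocr_pipeline_commands(pdf_path, page, out_prefix, dpi=400, threshold_pct=60):
    """Build the three command lines for one PDF page (hypothetical helper).

    Nothing is executed here; the caller decides whether to run them.
    """
    grey_png = out_prefix + ".png"
    bw_png = out_prefix + "-bw.png"

    # Render a single page as greyscale PNG; -singlefile stops pdftoppm
    # from appending a page number to the output name.
    render = ["pdftoppm", "-r", str(dpi), "-gray", "-png",
              "-f", str(page), "-l", str(page), "-singlefile",
              pdf_path, out_prefix]

    # Force every pixel to pure black or white. The percentage is a guess
    # and would need tuning per document.
    binarize = ["convert", grey_png, "-threshold", f"{threshold_pct}%", bw_png]

    # tesseract writes its text to <out_prefix>.txt
    ocr = ["tesseract", bw_png, out_prefix]

    return [render, binarize, ocr]


if __name__ == "__main__":
    # "book.pdf" and "page-001" are placeholder names.
    for argv in ocr_pipeline_commands("book.pdf", 1, "page-001"):
        print(shlex.join(argv))
```

Each list can be handed to subprocess.run() once the printed commands look right for the files in question.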
Using the sample below (attached as Original.png) in TIFF format with various options, I can't do much better than this:
Code:
Born in i923 in the small fishing village of Stanley;
Tasinania,Iiilll!vloliisonleftsci1oolattlieagteot'I5
to hel run the family bakery. He soon went to sea
The anti-aliasing is there from the start. If you zoom in, you can see all the grey injected. Getting rid of it with Gimp (Thresholded.png) gave this OCR:
Code:
Born in l923 in the small fishing village of Stanley.
Tasmania, Bill Mollison left school at the age of 15
to hel run the family bakery. He soon went to sea
as a s fisherman and seaman bringing vessels
Should I give up, or is there hope? Does anyone have any 'convert', 'mogrify', or other magic they would recommend? I'm using tesseract-3.02 for OCR. Cuneiform-1.1 returns floating point exceptions 100% of the time on Slackware-14.1.
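To make the thresholding step concrete: what the Gimp pass does, in effect, is map every grey pixel to pure black or pure white. The sketch below shows that idea in plain Python, with the cut-off picked automatically by Otsu's method so it does not have to be set by hand for each page. This is an assumption-laden illustration, not what Gimp or tesseract do internally: the function names and the tiny sample grid are made up, and a real page would first have to be loaded as rows of 0-255 grey values.

```python
def otsu_threshold(pixels):
    """Return a grey level t; pixels <= t are treated as ink, the rest as paper.

    Otsu's method: try every cut-off and keep the one that maximises the
    between-class variance of the two groups.
    """
    hist = [0] * 256
    for row in pixels:
        for v in row:
            hist[v] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate cut-off
    sum_dark = 0    # sum of their grey values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue        # nothing dark yet
        if w_light == 0:
            break           # everything is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        score = w_dark * w_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(pixels, t):
    """Map each pixel to 0 (black) if it is <= t, otherwise 255 (white)."""
    return [[0 if v <= t else 255 for v in row] for row in pixels]


if __name__ == "__main__":
    # Made-up sample: 10 = ink, 240 = paper, 180/200 = anti-aliasing grey.
    sample = [[10, 10, 200, 240],
              [10, 180, 240, 240]]
    t = otsu_threshold(sample)
    for row in binarize(sample, t):
        print(row)
```

On this sample the grey fringe values end up on the paper side, which is the effect wanted before OCR. Whether a single global cut-off is enough for a whole scanned page is a separate question; unevenly lit scans may need a per-region threshold instead.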
Attached Thumbnails
Original.png (33.6 KB)
Thresholded.png (3.7 KB)

04-13-2016, 10:20 AM   #2
business_kid
Sorry - Double post.
 
  

