LinuxQuestions.org
Old 01-22-2015, 02:21 PM   #1
stf92
Senior Member
 
Registered: Apr 2007
Location: Buenos Aires.
Distribution: Slackware
Posts: 4,442

Rep: Reputation: 76
Optical character recognition software.


Hi: is there any known good OCR program for Linux? The goal is to produce an assembler source file from source code that is printed in a book.
 
Old 01-22-2015, 02:34 PM   #2
bassmadrigal
LQ Guru
 
Registered: Nov 2003
Location: West Jordan, UT, USA
Distribution: Slackware
Posts: 8,792

Rep: Reputation: 6656
A quick search of "OCR linux" on Google returned this as the first result.

https://help.ubuntu.com/community/OCR

And there are several on slackbuilds.org:

http://slackbuilds.org/result/?search=ocr&sv=14.1

Have you looked into any of these? If you've tried some and they didn't work for your needs, that could help.
 
Old 01-22-2015, 02:41 PM   #3
stf92
Senior Member
 
Registered: Apr 2007
Location: Buenos Aires.
Distribution: Slackware
Posts: 4,442

Original Poster
Rep: Reputation: 76
I did the same and was guided to tesseract on slackbuilds. The question now is: is it any good with a PDF file? That is, I begin with a PDF of some pages of the book, which I can have made three blocks from home. Does tesseract convert the PDF into a plain ASCII text file?
 
Old 01-22-2015, 02:52 PM   #4
ttk
Senior Member
 
Registered: May 2012
Location: Sebastopol, CA
Distribution: Slackware64
Posts: 1,038
Blog Entries: 27

Rep: Reputation: 1484
When I last looked at it (a few years ago), gocr was the best, with ocrad a distant second.
 
Old 01-22-2015, 03:25 PM   #5
bassmadrigal
LQ Guru
 
Registered: Nov 2003
Location: West Jordan, UT, USA
Distribution: Slackware
Posts: 8,792

Rep: Reputation: 6656
Quote:
Originally Posted by stf92
Does tesseract convert the PDF into a plain ASCII text file?
No, the input has to be an image.

Quote:
...it can read a wide variety of image formats and convert them to text in over 60 languages

SOURCE: http://code.google.com/p/tesseract-ocr/
You can convert the pdf to an image using imagemagick (included in a FULL Slackware install).

Code:
convert -density 600 input.pdf output.tif
I'm not sure about gocr since the site is blocked at work.
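The two steps discussed in this post (rasterize the PDF, then hand the images to tesseract) can be chained in a small shell sketch. The file names, the 300 dpi density, and the helper names pdf_to_pages and ocr_page are assumptions for illustration, not something from the posts above; "-l eng" assumes the English language data is installed.

```shell
#!/bin/sh
# Sketch only: helper names and file names are hypothetical.

# pdf_to_pages INPUT.pdf PREFIX
# Rasterize every page to PREFIX-000.png, PREFIX-001.png, ...
# (ImageMagick relies on Ghostscript to read PDFs.)
pdf_to_pages() {
    convert -density 300 "$1" "$2-%03d.png"
}

# ocr_page IMAGE
# Run tesseract on one image; the text lands in a file named after
# the image with its extension replaced by ".txt".
ocr_page() {
    tesseract "$1" "${1%.*}" -l eng
}

# Example use (not run here):
#   pdf_to_pages input.pdf page
#   for img in page-*.png; do ocr_page "$img"; done
#   cat page-*.txt > listing.asm
```

One page per image keeps each OCR run small and makes it easy to redo a single bad page. The result will most likely still need proofreading before it assembles, since OCR tends to confuse characters such as 0/O and 1/l.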
 
Old 01-23-2015, 09:37 AM   #6
AlleyTrotter
Member
 
Registered: Jun 2002
Location: Coal Township PA
Distribution: Slackware64-15.0
Posts: 783

Rep: Reputation: 479
This recently posted article about using Google Drive with PDF files may be of interest.
http://www.makeuseof.com/tag/10-tips...-google-drive/
It describes some interesting ways to OCR PDFs.
HTH
John
 
Old 01-23-2015, 10:09 AM   #7
TobiSGD
Moderator
 
Registered: Dec 2009
Location: Germany
Distribution: Whatever fits the task best
Posts: 17,148
Blog Entries: 2

Rep: Reputation: 4886
If you are lucky and the PDF contains actual text instead of images of the text, you can extract the code directly without having to rely on OCR software.
 
Old 01-23-2015, 11:54 AM   #8
aikempshall
Member
 
Registered: Nov 2003
Location: Bristol, Britain
Distribution: Slackware
Posts: 900

Rep: Reputation: 153
I've tried ocrad, gocr and tesseract.

Tesseract beats the other two by miles.

Alex
 
Old 01-23-2015, 02:28 PM   #9
stf92
Senior Member
 
Registered: Apr 2007
Location: Buenos Aires.
Distribution: Slackware
Posts: 4,442

Original Poster
Rep: Reputation: 76
But is it possible that all shops that transfer a book into a computer file do it in PDF or some other non-ASCII format? By ASCII I mean plain ASCII text. All I want is to assemble the source!

EDIT: everything depends on whether the output PDF contains actual text, as Tobi says. What if I pay the shop and bring back a file that, say, pdftotext does not render well, i.e., not in a form the assembler can understand?

Last edited by stf92; 01-23-2015 at 02:52 PM.
 
Old 01-23-2015, 03:31 PM   #10
metaschima
Senior Member
 
Registered: Dec 2013
Distribution: Slackware
Posts: 1,982

Rep: Reputation: 492
Try 'pdftotext' first; it will extract the text if it is there. If not, use tesseract plus some image preprocessing to align the image and adjust levels.

As for which is best:
http://www.splitbrain.org/blog/2010-...are_comparison
It's a few years old, and they have all improved since then. Still, tesseract is the only serious OCR for Linux. In fact, it can be used to crack weak captchas.
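The "pdftotext first, OCR only if needed" order can be sketched as a small check. The function name has_text_layer is made up for this example, and "-layout" is used on the assumption that keeping column alignment helps with assembler listings.

```shell
#!/bin/sh
# has_text_layer FILE.pdf
# Succeeds (exit 0) if pdftotext extracts any non-whitespace character,
# i.e. the PDF carries real text and OCR is probably unnecessary.
has_text_layer() {
    # "-" as the output name makes pdftotext write to stdout.
    pdftotext -layout "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Example use (not run here):
#   if has_text_layer book.pdf; then
#       pdftotext -layout book.pdf listing.txt
#   else
#       echo "no text layer: convert to images and run tesseract"
#   fi
```

A scanned PDF typically yields nothing but page breaks from pdftotext, which this check treats as "no text".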
 
Old 01-23-2015, 06:31 PM   #11
stf92
Senior Member
 
Registered: Apr 2007
Location: Buenos Aires.
Distribution: Slackware
Posts: 4,442

Original Poster
Rep: Reputation: 76
Post LEFT BLANK by the author.

Last edited by stf92; 01-23-2015 at 08:52 PM.
 
Old 01-23-2015, 08:51 PM   #12
stf92
Senior Member
 
Registered: Apr 2007
Location: Buenos Aires.
Distribution: Slackware
Posts: 4,442

Original Poster
Rep: Reputation: 76
What about this PDF?

http://i1249.photobucket.com/albums/...ps150a2807.png

This is what I got at the shop. What would Tesseract make of my PDF? I'm installing it in the meantime, but I presume that will not be done in a day.
Attached thumbnail: Screenshot - 01232015 - 11:08:59 PM.jpg (38.0 KB)
 
Old 01-23-2015, 09:02 PM   #13
metaschima
Senior Member
 
Registered: Dec 2013
Distribution: Slackware
Posts: 1,982

Rep: Reputation: 492
Split it into two images, one for each page. Rotate the images so that the text is perfectly horizontal. Adjust the levels using GIMP so that the image is black text on a white background. Then run tesseract and you may get good results. I hope this is not the original resolution; it probably isn't.

Also see:
https://github.com/Flameeyes/unpaper
https://code.google.com/p/linux-inte...-ocr-solution/
http://symmetrica.net/cuneiform-linux/yagf-en.html
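The preprocessing steps in this post (split, straighten, adjust levels) can also be scripted with ImageMagick instead of done by hand in GIMP. This is only a sketch under assumptions: the function names are hypothetical, the spread is assumed to divide exactly down the middle, and the "-deskew" and "-level" values are starting points to tune by eye.

```shell
#!/bin/sh
# Sketch only: function names and parameter values are assumptions.

# split_spread SCAN PREFIX
# Cut a two-page spread into left and right halves:
# PREFIX-0.png (left page) and PREFIX-1.png (right page).
split_spread() {
    convert "$1" -crop 50%x100% +repage "$2-%d.png"
}

# clean_page IN OUT
# Convert to grayscale, straighten slightly rotated text, and stretch
# the contrast so the page tends toward black text on white.
clean_page() {
    convert "$1" -colorspace Gray -deskew 40% -level 20%,80% "$2"
}

# Example use (not run here):
#   split_spread scan.png half
#   clean_page half-0.png left.png
#   clean_page half-1.png right.png
```

Checking the cleaned images by eye before running tesseract is worthwhile, since an overly aggressive level adjustment can erase thin strokes such as commas and periods.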
 
Old 01-24-2015, 02:08 AM   #14
stf92
Senior Member
 
Registered: Apr 2007
Location: Buenos Aires.
Distribution: Slackware
Posts: 4,442

Original Poster
Rep: Reputation: 76
I read the following:
Quote:
The build script defaults to use English, but this is easily
changed by passing an alternate value on the command line.
in the slackbuilds README:

http://slackbuilds.org/repository/14...ics/tesseract/

Is the default language, which is English, already included in the package, or should I download the language package separately?
 
Old 01-24-2015, 02:31 AM   #15
Didier Spaier
LQ Addict
 
Registered: Nov 2008
Location: Paris, France
Distribution: Slint64-15.0
Posts: 11,064

Rep: Reputation: Disabled
There are several ways to answer your question yourself:
  • Read the README, including the part you quoted
  • Look at the SlackBuild to see what it does
  • After installation, to check what was installed, type:
    Code:
     less /var/log/packages/tesseract*
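The third suggestion can be narrowed down to the language files only. A minimal sketch, assuming the package log lists tesseract's language data as files ending in ".traineddata"; the function name installed_langs and its optional directory argument are invented for this example.

```shell
#!/bin/sh
# installed_langs [LOGDIR]
# Print the language codes found in the tesseract package log(s),
# one per line. LOGDIR defaults to Slackware's /var/log/packages.
installed_langs() {
    grep -h 'traineddata$' "${1:-/var/log/packages}"/tesseract* 2>/dev/null |
        sed -e 's|.*/||' -e 's|\.traineddata$||'
}

# Example use (not run here):
#   installed_langs
```

An empty result would suggest that the language data has to be fetched separately. Recent tesseract releases can also report this themselves with "tesseract --list-langs".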

Last edited by Didier Spaier; 01-24-2015 at 10:47 AM.
 
  

