Optical character recognition software.

stf92 · 01-22-2015, 02:21 PM

Hi: is there any known good OCR program for Linux? The goal is to produce an assembler source file from source code that is printed in a book.

bassmadrigal · 01-22-2015, 02:34 PM

A quick search of "OCR linux" on Google returned this as the first result.

https://help.ubuntu.com/community/OCR

And there are several on slackbuilds.org:

http://slackbuilds.org/result/?search=ocr&sv=14.1

Have you looked into any of these? If you've tried some and they didn't work for your needs, that could help.

stf92 · 01-22-2015, 02:41 PM

I did the same and was guided to the slackbuilds tesseract. The question now is: is it any good with a PDF file? That is, I begin with a PDF of some pages of the book, which I can have made three blocks from home. Does tesseract convert the PDF into a plain ASCII text file?

ttk · 01-22-2015, 02:52 PM

When I last looked at it (a few years ago), gocr was the best, with ocrad a distant second.

bassmadrigal · 01-22-2015, 03:25 PM

Quote:

Originally Posted by stf92

Does tesseract convert the PDF into a plain ASCII text file?

No, the input has to be an image.

Quote:

...it can read a wide variety of image formats and convert them to text in over 60 languages

SOURCE: http://code.google.com/p/tesseract-ocr/

You can convert the PDF to an image using ImageMagick (included in a full Slackware install).

Code:

convert -density 600 input.pdf output.tif

I'm not sure about gocr since the site is blocked at work.
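
The two steps above (rasterize the PDF, then hand the image to tesseract) could be chained roughly like this. It is only a sketch: `pdf_to_text` is a made-up helper name, the 600 dpi value is the one from the convert command above, and it assumes the English data is installed and that the tesseract build can read the TIFF that convert writes.

```shell
# Sketch: rasterize a PDF with ImageMagick, then OCR the image with tesseract.
# pdf_to_text is a hypothetical helper name, not part of either tool.
pdf_to_text() {
    pdf=$1                 # input PDF
    base=${2:-output}      # output base name; tesseract appends ".txt" itself
    # Render the PDF into a TIFF at 600 dpi (same options as the command above).
    convert -density 600 "$pdf" "$base.tif" || return 1
    # tesseract takes an image and an output base name and writes "$base.txt".
    tesseract "$base.tif" "$base" -l eng
}
```

Called as `pdf_to_text input.pdf listing`, this should leave the recognized text in `listing.txt`.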

AlleyTrotter · 01-23-2015, 09:37 AM

This recently posted article about using Google Drive with PDF files may be of interest.
http://www.makeuseof.com/tag/10-tips...-google-drive/
It has some interesting ways to OCR PDFs.
HTH
John

TobiSGD · 01-23-2015, 10:09 AM

If you are lucky and the PDF contains actual text instead of images of the text, you can extract the code directly without having to rely on OCR software.

aikempshall · 01-23-2015, 11:54 AM

I've tried ocrad, gocr and tesseract.

Tesseract beats the other two by miles.

Alex

stf92 · 01-23-2015, 02:28 PM

But is it possible that all shops that transfer a book into a computer file do it in PDF or some other non-ASCII format? By ASCII I mean plain ASCII text. All I want is to assemble the source!

EDIT: everything depends on whether the output PDF contains actual text, as Tobi says. What if I pay the shop and bring back a file that, say, pdftotext does not render well, i.e., not in a form the assembler can understand?

metaschima · 01-23-2015, 03:31 PM

Try 'pdftotext' first; it will extract the text if it is there. If not, use tesseract plus some image preprocessing to align the image and adjust levels.

As for which is best:
http://www.splitbrain.org/blog/2010-...are_comparison
It's a few years old, but they have all improved since then. Still, tesseract is the only serious OCR for Linux. In fact, it can be used to crack weak captchas.
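
The "pdftotext first, OCR only if needed" order could look something like the sketch below. `extract_or_ocr` is a hypothetical name, the 600 dpi value is borrowed from earlier in the thread, and the test for "text is there" (any non-whitespace character in the output) is just one simple heuristic.

```shell
# Sketch: prefer embedded text; fall back to OCR only when the PDF has none.
# extract_or_ocr is a hypothetical helper name.
extract_or_ocr() {
    pdf=$1
    base=${2:-output}
    # pdftotext writes any embedded text to "$base.txt"; -layout tries to
    # keep the column alignment, which matters for assembler listings.
    pdftotext -layout "$pdf" "$base.txt" || return 1
    # At least one non-whitespace character means real text was embedded.
    if grep -q '[^[:space:]]' "$base.txt"; then
        echo "embedded text found: $base.txt"
        return 0
    fi
    # Otherwise the pages are images: rasterize and OCR them.
    convert -density 600 "$pdf" "$base.tif" || return 1
    tesseract "$base.tif" "$base"
}
```

Either way the result ends up in `$base.txt`, so the next step (feeding it to the assembler) does not need to know which path was taken.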

stf92 · 01-23-2015, 06:31 PM

Post LEFT BLANK by the author.

stf92 · 01-23-2015, 08:51 PM

What about this PDF?

http://i1249.photobucket.com/albums/...ps150a2807.png

This is what I got at the shop. What would Tesseract make of my PDF? I'm installing it in the meantime, but I presume it won't be a one-day job.

metaschima · 01-23-2015, 09:02 PM

Split it into two images, one for each page. Rotate the images so that the text is perfectly horizontal. Adjust the levels using GIMP so that the image is black text on a white background. Use tesseract and you may get good results. I hope this is not the original resolution, but it's probably not.

Also see:
https://github.com/Flameeyes/unpaper
https://code.google.com/p/linux-inte...-ocr-solution/
http://symmetrica.net/cuneiform-linux/yagf-en.html
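
For a scriptable version of the split / rotate / levels steps, something like the following might do. Note that it swaps GIMP for ImageMagick, assumes the two pages sit side by side in the scan, and uses 40% and 50% only as starting guesses to tune by eye. `prep_scan` is a made-up name.

```shell
# Sketch: split a two-page scan, straighten each half, and force it to
# black text on a white background. prep_scan is a hypothetical helper name.
prep_scan() {
    scan=$1
    base=${2:-page}
    # Cut the scan into two equal side-by-side tiles:
    # writes "$base-0.png" (left page) and "$base-1.png" (right page).
    convert "$scan" -crop 2x1@ +repage "$base-%d.png" || return 1
    for half in "$base-0.png" "$base-1.png"; do
        # Grayscale, automatic deskew, then a hard black/white threshold.
        convert "$half" -colorspace Gray -deskew 40% -threshold 50% "$half" || return 1
    done
}
```

After `prep_scan scan.png page`, each of `page-0.png` and `page-1.png` can be given to tesseract separately.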

stf92 · 01-24-2015, 02:08 AM

I read the following:

Quote:

The build script defaults to use English, but this is easily
changed by passing an alternate value on the command line.

in the slackbuilds README:

http://slackbuilds.org/repository/14...ics/tesseract/

Is the default language, English, already included in the package, or do I have to download it separately?

Didier Spaier · 01-24-2015, 02:31 AM

There are several ways to answer your question yourself:

Read the README, including the part you quoted
Look at the SlackBuild to see what it does
After installation, to check what was installed, type:
Code:
```
 less /var/log/packages/tesseract*
```
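
To narrow that listing down to the language data only, a small filter along these lines could help. It is a sketch that assumes the language files are the ones ending in `.traineddata`, which is how tesseract names them; `list_langs_in_pkg` is a hypothetical name.

```shell
# Sketch: list the tesseract languages recorded in a Slackware package log.
# list_langs_in_pkg is a hypothetical helper name.
list_langs_in_pkg() {
    # Package logs list one installed path per line; keep the language data
    # files and strip the directory and the ".traineddata" suffix.
    grep -h '\.traineddata$' "$@" | sed 's|.*/||; s|\.traineddata$||'
}
# Example: list_langs_in_pkg /var/log/packages/tesseract*
```

If `eng` shows up in the output, the English data came with the package. If the installed tesseract supports it, `tesseract --list-langs` reports the same thing from the program's side.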