Slackware: This forum is for the discussion of Slackware Linux.
I did the same and was guided to the tesseract SlackBuild. The question now is: is it any good with a PDF file? That is, I would begin with a PDF of some pages of the book, which I can have made three blocks from home. Does tesseract convert the PDF into a plain ASCII text file?
If you are lucky and the PDF contains actual text instead of images of the text, you can extract the code directly without having to rely on OCR software.
But is it possible that every shop that transfers a book into a computer file does it as PDF or some other non-ASCII format? By ASCII I mean plain ASCII text. All I want is to assemble the source!
EDIT: everything depends on whether the output PDF contains actual text, as Tobi says. What if I pay the shop and bring back a file that, say, pdftotext does not render well, i.e., not into something the assembler can understand?
Try 'pdftotext' first; it will extract the text if it is there. If not, use tesseract plus some image preprocessing to align the image and adjust levels.
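A minimal sketch of that "pdftotext first" check, in Python. It assumes pdftotext from poppler is on the PATH; the function names and the 20-character threshold are made up for illustration and are not part of any tool. If it returns None, the PDF is probably just scanned images and needs OCR.

```python
import shutil
import subprocess
from typing import Optional


def looks_like_text(extracted: str, min_chars: int = 20) -> bool:
    """Heuristic: True if the output has at least min_chars non-whitespace characters."""
    # split() with no arguments also discards the form feeds pdftotext emits between pages.
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def extract_text(pdf_path: str) -> Optional[str]:
    """Return the PDF's embedded text, or None if there seems to be none."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler) is not installed")
    # "-layout" tries to keep column alignment, which helps with source listings;
    # "-" sends the result to stdout instead of writing a file.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout if looks_like_text(result.stdout) else None
```

Calling extract_text("book.pdf") (a hypothetical file name) and checking for None tells you which route to take before paying for anything more.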
As for which is best: http://www.splitbrain.org/blog/2010-...are_comparison
It's a few years old, and they have all improved since then. Still, tesseract is the only serious OCR for Linux. In fact, it can be used to crack weak captchas.
Split it into two images, one for each page. Rotate the images so that the text is perfectly horizontal. Adjust the levels using GIMP so that the image is black text on a white background. Then run tesseract and you may get good results. I hope this is not the original resolution; it probably isn't.
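If you would rather script those steps than do them by hand, here is a rough Python sketch. Note that it swaps GIMP for ImageMagick's convert (on ImageMagick 7 the command may be magick instead), which is my substitution and not what was suggested above. The file names, the 40% deskew threshold and the 25%,75% levels are only starting-point assumptions; expect to tune them for your scan.

```python
import subprocess
from pathlib import Path
from typing import List


def split_cmd(scan: str, prefix: str) -> List[str]:
    # Crop into 50%-wide tiles: one image per page of a two-page spread.
    return ["convert", scan, "-crop", "50%x100%", "+repage", f"{prefix}_%d.png"]


def clean_cmd(page: str, cleaned: str) -> List[str]:
    # Grayscale, straighten the text, then stretch levels toward black on white.
    return ["convert", page, "-colorspace", "Gray", "-deskew", "40%",
            "-level", "25%,75%", cleaned]


def ocr_cmd(cleaned: str, out_base: str) -> List[str]:
    # tesseract appends ".txt" to the output base name itself.
    return ["tesseract", cleaned, out_base, "-l", "eng"]


def ocr_spread(scan: str, workdir: str = ".") -> List[Path]:
    """Split a two-page scan, clean each half, OCR it, and return the .txt paths."""
    prefix = str(Path(workdir) / "page")
    subprocess.run(split_cmd(scan, prefix), check=True)
    outputs = []
    for i in (0, 1):  # left page, right page
        page = f"{prefix}_{i}.png"
        cleaned = f"{prefix}_{i}_clean.png"
        out_base = f"{prefix}_{i}"
        subprocess.run(clean_cmd(page, cleaned), check=True)
        subprocess.run(ocr_cmd(cleaned, out_base), check=True)
        outputs.append(Path(out_base + ".txt"))
    return outputs
```

Running ocr_spread("scan.png") should leave page_0.txt and page_1.txt in the working directory. Proofread the result against the book before feeding it to the assembler, since OCR tends to confuse characters such as 0/O and 1/l.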