Distribution: Ubuntu 16.04 lts desk; Ubuntu 14.04 server
Posts: 366
Original Poster
Rep:
ilnli--
Thanks!
Glad you found this thread! I am still looking.
What I have done in the interim is use a WinXP machine for both the scanning and the OCR work, via ABBYY FineReader. That has been satisfactory for me, but it is about the only daily reason I have to keep any Windows-based machine on the premises.
I'll have to check how reliable tesseract-ocr is these days. I cannot use the online service because of confidentiality needs. Do you use either?
I've used tesseract-ocr, which is good and runs on Linux, but you have to tweak your images to get good results from it. The stuff I work on is confidential, so I mainly used ocrconvert.com, which works for me.
The best out-of-the-box solution I've found is WatchOCR. It's a liveCD distro whose sole purpose is OCR. You put your images in a watch directory, and then a little script converts them into searchable PDFs. With some tweaking, it ought to be possible to save the text as well as the searchable PDF. For OCR it uses Cuneiform, and layout analysis is done with ExactCode.
It's presumably possible to get Cuneiform and ExactCode installed on an existing system, though my understanding is that Cuneiform is difficult to get working.
Otherwise, there's OCRopus, which I haven't used but which seems promising.
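For anyone who wants to try the watch-directory idea on an existing system, here is a rough sketch in Python. It is not WatchOCR's own script: it swaps in the tesseract command line instead of Cuneiform, and it assumes a recent Tesseract release where the stock `pdf` and `txt` output configs are available. The function names, the polling interval and the directory layout are all made up for illustration.

```python
import subprocess
import time
from pathlib import Path
from typing import List

# Assumption: these are the image types the scanner produces.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}


def pending_images(watch_dir: Path, done_dir: Path) -> List[Path]:
    """Return images in watch_dir that have no matching PDF in done_dir yet."""
    return sorted(
        p for p in watch_dir.iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
        and not (done_dir / (p.stem + ".pdf")).exists()
    )


def tesseract_command(image: Path, done_dir: Path) -> List[str]:
    """Build the tesseract call: 'pdf' gives a searchable PDF, 'txt' the plain text."""
    return ["tesseract", str(image), str(done_dir / image.stem), "pdf", "txt"]


def watch(watch_dir: Path, done_dir: Path, interval: float = 5.0) -> None:
    """Poll watch_dir forever and OCR every image that has no PDF yet."""
    done_dir.mkdir(parents=True, exist_ok=True)
    while True:
        for image in pending_images(watch_dir, done_dir):
            # check=False: a failed page is simply retried on the next pass.
            subprocess.run(tesseract_command(image, done_dir), check=False)
        time.sleep(interval)
```

Calling `watch(Path("scans/in"), Path("scans/out"))` (both paths hypothetical) would leave a `.pdf` and a `.txt` next to each other for every scanned page. A real setup would also want to skip files that the scanner is still writing.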
I've used tesseract and ocrad in the past, and you can get decent quality out of them if the input quality is good. Also check unpaper: http://unpaper.berlios.de/
It will help the OCR work better. Sometimes you can also help it by using image filters like white balance and auto-levels, etc.
I don't think you can get as good as, say, ABBYY, but it can be close if the input is good.
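To give an idea of what an auto-levels filter does before OCR, here is a minimal sketch in plain Python. It assumes the common meaning of the term, a linear contrast stretch on grayscale values; it is not how unpaper or any particular image editor implements it, and the function name is made up.

```python
def auto_levels(pixels, low=0, high=255):
    """Stretch grayscale values so the darkest maps to low and the brightest to high."""
    darkest, brightest = min(pixels), max(pixels)
    if darkest == brightest:
        return list(pixels)  # flat image: nothing to stretch
    scale = (high - low) / (brightest - darkest)
    return [round(low + (p - darkest) * scale) for p in pixels]


# A washed-out scan: "black" text is only mid-gray, the paper is light gray.
faded = [90, 120, 150, 180]
print(auto_levels(faded))  # [0, 85, 170, 255]
```

After the stretch the text is truly black and the paper truly white, which gives the OCR engine a cleaner edge to work with. Real filters usually ignore a small percentage of outlier pixels first so that one speck of dust does not set the range.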
H_TeXMeX_H--
Thanks!
I had not heard of unpaper before. I see it is in the repos for Ubuntu.
All of this stuff together still looks a little much for our production environment. We scan some pages every day, maybe only a dozen or two on most days, but then there are some days when we need to scan a hundred or so in an hour. It is important to be able to do reliable searches on the scanned documents.
So far, it sounds like having the scanner attached to a WinXP machine running ABBYY is still the easiest thing for a nontechnical person to use: she merely feeds the paper in, chooses in the GUI whether to scan one side or two, then lets it rip. When all pages are scanned, she comes back to the ABBYY main screen and saves them to a file after the OCR does its work. Pretty simple, and it allows some turning of pages and rearranging of the page order.
I think CLI would blow a couple of my people away!
Oh well, another reason to keep at least one Windows box on the system for another year or two....
I tried tesseract but it was a disappointment. It couldn't OCR a .png screenshot.
ABBYY FineReader is probably the best, but it's not free. They even charge by the page!
qrange--
Thanks for that tip about tesseract, qrange!
I have never had a per-page charge from ABBYY, so I'm not sure what you're experiencing. It is a really good program. I just wish it were available for Linux. Hopefully some day soon--they have an SDK for Linux.
qrange--
OIC! Thanks!
It does appear to actually be an ABBYY site. But I agree it is pretty pricey. At 12,000 pages per year it prices out to 1.75 cents per page at current exchange rates.
That's a lot, particularly since you can buy it for Windows and have it forever for $400--about 2 years' cost of the Linux version.
Knowing that there is a Linux version gives me hope that there will be reasonable pricing and perhaps some other commercial products soon. And maybe a GUI Linux version!
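For anyone who wants to check the price comparison, the arithmetic works out roughly like this. The only inputs are the figures quoted in this post (12,000 pages per year, 1.75 cents per page, $400 for the Windows version); the variable names are made up, and the per-page figure is assumed to be already converted to US cents.

```python
PAGES_PER_YEAR = 12_000
CENTS_PER_PAGE = 1.75       # as quoted above, assumed to be US cents
WINDOWS_LICENSE_USD = 400   # one-time purchase

annual_cost_usd = PAGES_PER_YEAR * CENTS_PER_PAGE / 100
break_even_years = WINDOWS_LICENSE_USD / annual_cost_usd

print(f"Linux version: ${annual_cost_usd:.2f} per year")                  # $210.00 per year
print(f"Windows version pays for itself in {break_even_years:.1f} years")  # 1.9 years
```

So the per-page pricing comes to about $210 a year, and the one-time Windows purchase costs the same as a little under two years of it.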