Are files produced by gscan2pdf suitably searchable?
I have been asked to send files as "searchable PDF files".
I do not know much about PDF files or, apart from faxing and printing them, what one can do with them. It looks like the gscan2pdf installed on my Debian 9 system produces searchable files: when I type a unique word into the search field of the Atril document viewer, targeting a PDF created with gscan2pdf, Atril tells me the word is found, although it does not point to the word or expression on the page. I expect that whoever reads the files I send will have more elaborate software to search, comment, etc. (I only have a vague idea of the possibilities). My question is: will gscan2pdf produce files suitable for that job, or should I look for other software? Thank you for your help.
Quote:
A scan is a picture image. You can have an image of every page in a pdf, but it's not text searchable. The first thing I would suggest is to get the real files, not prints of them.

If you have to work with printed pages, the workflow you need is: scan --> high-res image (600 dpi+) --> Optical Character Recognition (OCR) --> save in some text format. Then edit the text and correct the errors with a word processor. Finally, if you must use pdfs, export to pdf.

In Open Source, tesseract is probably the best OCR. High resolution makes a huge difference. You need tesseract-4.x, and the later the better: tesseract-4.0 introduced a new OCR engine. In M$Windoze, the best closed source app is Abbyy, and it's by far the best overall. But you pay.

The last time I needed OCR, tesseract-4.0 was in beta, and Abbyy had released a Linux version which they gave out on a one-month free trial. I got my work done inside a month, so that was ok. It was clunky but it did the business. I was working from photographs then; I have a 1200 dpi scanner now, so I'm sure tesseract would do it. Scanning takes huge space at high res, so make space.

I had 40 pages of my Dad's play, typed on an old Underwood in the 60s, and everything (even Abbyy) performed pretty poorly on it. Editing was slow. But I was able to send a pdf to my family.
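To put a rough number on "make space", here is a back-of-the-envelope sketch of the uncompressed size of one scanned page. The A4 dimensions in inches and the 3 bytes per pixel (24-bit colour) are my own illustrative assumptions, and real files will usually be smaller once the scanner software compresses them.

```python
# Rough, uncompressed size of one scanned page.
# Page dimensions and bytes per pixel are illustrative assumptions.


def scan_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in a page scanned at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


def uncompressed_bytes(width_in: float, height_in: float, dpi: int,
                       bytes_per_pixel: int) -> int:
    """Raw image size before any compression is applied."""
    return scan_pixels(width_in, height_in, dpi) * bytes_per_pixel


if __name__ == "__main__":
    a4_inches = (8.27, 11.69)  # assumed page size
    for dpi in (300, 600, 1200):
        size = uncompressed_bytes(*a4_inches, dpi, 3)  # 3 bytes = 24-bit colour
        print(f"{dpi} dpi, 24-bit colour: about {size / 1e6:.0f} MB per page")
```

Doubling the resolution quadruples the pixel count, so an A4 page at 600 dpi in colour is already on the order of 100 MB before compression.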
In a word... yes
Gscan2pdf does the job for you, provided you install Tesseract and the right language packs from the repositories. Gscan2pdf lets you choose whether to OCR and with which program. I also installed GOCR, which then shows up as another choice in Gscan2pdf. Tesseract gives me the best results, by far. Gscan2pdf is a Perl script (you can read and modify it...) that does a) the scan job (based on SANE, with a choice of resolution in dpi), b) optional document cleaning, c) optional OCR (with a choice of language) and d) saving as PDF (machine readable or picture only), TIFF, PNG, text... You can even open a PDF that is not machine readable, add the OCR output as a layer and save it as a machine readable PDF, but there are much easier utilities for that last task. Tesseract can compete with the best OCR software. It is used in lots of commercial products. The quality depends heavily on the input, the scanner and the settings (experiment a bit).
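If you want to see what happens outside the GUI, here is a minimal sketch that calls the `tesseract` command line to write a PDF with a text layer, then uses `pdftotext` (from poppler-utils) to check that a phrase can actually be found in it. This is not what gscan2pdf does internally; the file names `page.png` and `page.pdf` and the helper function names are made up for the example, and both tools are assumed to be installed.

```python
import os
import shutil
import subprocess


def searchable_pdf_command(image, output_base, lang="eng"):
    """Command line asking tesseract to write OUTPUT_BASE.pdf with a text layer."""
    return ["tesseract", image, output_base, "-l", lang, "pdf"]


def extract_text_command(pdf_path):
    """Command line asking pdftotext to print the PDF's text to stdout ("-")."""
    return ["pdftotext", pdf_path, "-"]


def contains_phrase(text, phrase):
    """Case-insensitive search that ignores line breaks and repeated spaces."""
    def normalise(s):
        return " ".join(s.split()).casefold()
    return normalise(phrase) in normalise(text)


def pdf_contains(pdf_path, phrase):
    """True if the phrase is found in the text extracted from the PDF."""
    result = subprocess.run(extract_text_command(pdf_path),
                            capture_output=True, text=True, check=True)
    return contains_phrase(result.stdout, phrase)


if __name__ == "__main__":
    # "page.png" is a hypothetical scan; nothing runs unless it and the tools exist.
    tools_present = shutil.which("tesseract") and shutil.which("pdftotext")
    if tools_present and os.path.exists("page.png"):
        subprocess.run(searchable_pdf_command("page.png", "page"), check=True)
        print(pdf_contains("page.pdf", "unique words"))
```

If `pdftotext` prints nothing for a PDF, the file holds only images, and the recipient will not be able to search it either.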
https://unix.stackexchange.com/quest...within-the-pdf
However, I've tried a number of free tools on one not-for-profit book I wanted to search through. I ended up using the office multifunction device, which gave the best results. Whatever you choose, you will have to check the output word for word, unless the original was printed in an ocrx type of font.
Quote:
It is much cheaper to buy a (very) good scanner. That's what makes the difference here. Office multifunction devices do not necessarily do better OCR.
Just notes.
I agree, but I had access to a medium office HP model. It took a few moments for the file to be sent, so I thought I had messed up. The results were the best of all my efforts. I even tried my phone. Oddly, my old Windows phone worked best. You'd think that Google would have offered more support for their product. It may be possible for some folks to use a local print and copy store if all else fails. I'll agree that almost all newish home scanners can easily scan a document in high enough quality.

I've worked with character recognition for a number of decades. There are generally a few ways that these programs perform their tasks. The best starting point is fixed, standard fonts whose characters are clearly separated. If the system was designed to read an ocrx type of font, then that may give the highest quality. Most programs work at it three ways: one is bitmap matching, another is profiles, and the third is testing against what the characters could be within a word and sentence. Each engine reports its result with a percentage score, and generally the result with the best percentage is selected; for example, the whole-word reading wins if its score is higher than the score of each individual letter. The programs have to find the text, locate it line by line and then find each character. The ability to correct skew and kerning is important on common documents. Images may contain features that end up as text.
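The "best percentage wins" idea can be sketched in a few lines. This is a toy illustration only, not how any particular OCR product is implemented; the method labels, readings and scores below are invented.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    method: str        # which recognition approach produced this reading
    text: str          # the proposed reading of the scanned word
    confidence: float  # score as a percentage, 0 to 100


def pick_best(candidates):
    """Return the candidate with the highest confidence score."""
    return max(candidates, key=lambda c: c.confidence)


if __name__ == "__main__":
    # Invented readings of one smudged word.
    readings = [
        Candidate("bitmap", "rnodern", 61.0),
        Candidate("profile", "modem", 72.5),
        Candidate("word-context", "modern", 88.0),
    ]
    best = pick_best(readings)
    print(f"{best.text} (from {best.method}, {best.confidence:.1f}%)")
```

Here the word-level reading is selected because its score is higher than those of the letter-by-letter approaches.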
Quote:
You simply need to convert these text files to PDF. Various programs are available, e.g. html2pdf, wkhtmltopdf, pandoc...
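As one possible way to do that conversion, here is a minimal sketch that drives pandoc from a script. The file name `notes.txt` and the helper names are hypothetical, and pandoc is assumed to be installed together with a PDF engine (by default it uses pdflatex, so a LaTeX installation is needed).

```python
import os
import shutil
import subprocess
from pathlib import Path


def pdf_name_for(source):
    """Same file name as the source, with a .pdf extension."""
    return str(Path(source).with_suffix(".pdf"))


def pandoc_pdf_command(source, target):
    """Command line asking pandoc to convert the source file to a PDF."""
    return ["pandoc", source, "-o", target]


if __name__ == "__main__":
    source = "notes.txt"  # hypothetical OCR output that has been proofread
    if shutil.which("pandoc") and os.path.exists(source):
        subprocess.run(pandoc_pdf_command(source, pdf_name_for(source)), check=True)
```

A PDF made this way is searchable because it is generated from real text rather than from a page image.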