LinuxQuestions.org - Are files produced by gscan2pdf suitably searchable?

rblampain

04-14-2020 09:28 AM

Are files produced by gscan2pdf suitably searchable?

I am asked to send files as "searchable PDF files".
I do not know much about PDF files or what one can do with them beyond faxing and printing. My installed "gscan2pdf" on Debian 9 appears to produce searchable files: when I type a unique word into the search field of "Atril document viewer" for a PDF created with "gscan2pdf", Atril reports that the word is found, although it does not point to where the word or expression is.
I expect that whoever reads the files I send will have more elaborate software for searching, commenting, etc. (I only have a vague idea of the possibilities). My question is: will "gscan2pdf" produce files suitable for that job, or should I look for other software?

Thank you for your help.

business_kid

04-14-2020 11:31 AM

Quote:

Originally Posted by rblampain

my question is: will "gscan2pdf" produce files suitable for such job?

In a word, no.

A scan is a picture image. You can have an image of every page in a pdf, but it's not text searchable.

The first thing I would suggest is to get the original files, not printouts of them. If you have to work with printed pages, the workflow you need is:

Scan --> high-res image (600 dpi+) --> Optical Character Recognition (OCR), saved in some text format.

Then edit the text and correct the errors with a word processor. Finally, if you must use PDFs, export to PDF.

In open source, Tesseract is probably the best OCR engine. High resolution makes a huge difference. You need Tesseract 4.x, and the later the better: Tesseract 4.0 introduced a new OCR engine.
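As a rough illustration of the OCR step described above, here is a minimal Python sketch that builds and runs a Tesseract command line. It assumes the `tesseract` binary and the relevant language data are installed; the function names and the file name `page-001.png` are made up for the example.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, lang="eng", dpi=600, output="txt"):
    """Build the argument list for one Tesseract run.

    `output` is a Tesseract config name: "txt" writes <output_base>.txt,
    "pdf" writes <output_base>.pdf with a text layer behind the page image.
    """
    return [
        "tesseract",
        str(image),
        str(output_base),
        "-l", lang,          # language data must be installed separately
        "--dpi", str(dpi),   # resolution hint, available in Tesseract 4.x
        output,
    ]


def run_ocr(image, output_base, **options):
    """Run Tesseract if it is installed; raise a clear error otherwise."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(tesseract_command(image, output_base, **options), check=True)
    return Path(f"{output_base}.{options.get('output', 'txt')}")
```

Calling `run_ocr("page-001.png", "page-001")` would be expected to leave the recognised text in `page-001.txt`, which can then be corrected by hand as described above.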

In M$Windoze, the best closed source app is Abbyy, and it's by far the best overall. But you pay. The last time I needed OCR, Tesseract 4.0 was in beta, and Abbyy had released a Linux version which they gave out on a one-month free trial. I got my work done inside a month, so that was ok. It was klunky but it did the business. I was working off photographs then; I've a 1200 dpi scanner now, so I'm sure Tesseract would do it. High-res scans take up huge amounts of disk space, so make room.
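To put a number on the disk-space remark, here is a back-of-the-envelope calculation in Python. It assumes an A4 page (210 x 297 mm), 24-bit colour and no compression; real scans saved as PNG, JPEG or compressed TIFF will be smaller, and the function name is just for illustration.

```python
MM_PER_INCH = 25.4


def raw_scan_bytes(width_mm, height_mm, dpi, bits_per_pixel=24):
    """Uncompressed size in bytes of one scanned page."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bits_per_pixel // 8


for dpi in (300, 600, 1200):
    size = raw_scan_bytes(210, 297, dpi)
    print(f"A4 at {dpi} dpi, 24-bit colour: {size / 1e6:.0f} MB uncompressed")
```

Under these assumptions a single A4 page at 600 dpi comes to roughly 104 MB before compression, and doubling the resolution roughly quadruples that figure.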

I had 40 pages of my Dad's play typed on an old Underwood in the 60s, and everything (even Abbyy) performed pretty poorly on it. Editing was slow. But I was able to send a pdf to my family.

remmilou

04-14-2020 02:18 PM

In a word... yes.
Gscan2pdf does the job for you, provided you install Tesseract and the right language packs from the repositories.
Gscan2pdf lets you choose whether to OCR or not, and with which program. I also installed GOCR, and that is offered as a choice in gscan2pdf too.
Tesseract gives me the best results, by far.
Gscan2pdf is a Perl script (you can read and modify it...) that does a) the scan job (based on SANE, with a choice of resolution in dpi), b) optional document cleaning, c) optional OCR (with a choice of language) and d) saving as PDF (machine-readable or image-only), TIFF, PNG, text...
You can even open a non-machine-readable PDF, add OCR as a layer and save it as a machine-readable PDF. But there are much easier utilities for that last task.
Tesseract can compete with the best OCR software. It is used in lots of commercial products. The quality depends highly on the input, the scanner and the settings (experiment a bit).
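The post does not name the "easier utilities" for adding an OCR layer to an existing image-only PDF. One candidate, and this is an assumption rather than something the poster mentioned, is OCRmyPDF, which uses Tesseract underneath. A minimal Python sketch of how it might be called (function names and file names are made up):

```python
import shutil
import subprocess


def ocrmypdf_command(src_pdf, dst_pdf, lang="eng", skip_existing_text=True):
    """Build an OCRmyPDF command that adds a text layer to a scanned PDF."""
    cmd = ["ocrmypdf", "-l", lang]
    if skip_existing_text:
        # Leave pages that already contain text untouched.
        cmd.append("--skip-text")
    cmd += [str(src_pdf), str(dst_pdf)]
    return cmd


def add_text_layer(src_pdf, dst_pdf, **options):
    """Run OCRmyPDF if it is installed; raise a clear error otherwise."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found on PATH")
    subprocess.run(ocrmypdf_command(src_pdf, dst_pdf, **options), check=True)
```

For example, `add_text_layer("scan.pdf", "scan-searchable.pdf", lang="eng")` would be expected to write a copy of the scan with a searchable text layer.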

jefro

04-14-2020 07:24 PM

https://unix.stackexchange.com/quest...within-the-pdf

However, I've tried a number of free tools on one not-for-profit book I wanted to search through. I ended up using the office multifunction, which gave the best results.

You will have to check word for word whichever option you choose, unless you used ocrx types of font.

remmilou

04-15-2020 02:07 AM

Quote:

Originally Posted by jefro (Post 6111770)

Yes, "office multifunctions" generally do a very good job. But they are not affordable for the average home user.
It is much cheaper to buy a (very) good scanner. That's what makes the difference here. Office multifunctions do not necessarily do better OCR.

jefro

04-15-2020 03:54 PM

Just notes.

I agree, but I had access to a medium office HP model. It took a few moments for the file to be sent, and I thought I had messed up. The results were the best of all my efforts. I even tried my phone. Oddly, my old Windows phone worked best. You'd think that Google would have offered more support for their product.

It may be possible for some folks to access print and copy stores locally if all else fails.

I'll agree that almost all newish home scanners can easily scan a document in high enough quality. I've worked with character recognition for a number of decades. There are generally a few ways that these programs perform their task. The best starting point is fixed, standard fonts with clearly separated characters. If the system was designed to read ocrx type fonts, then that may give the highest quality.

Most programs work at it three ways: one is bitmap matching, another is profiles, and the third is testing candidates against what could plausibly appear in a word and sentence. Each engine's result carries a percentage score, and generally the best percentage is selected, for example when the word score is higher than the individual letter scores. The program has to find the text, locate it line by line and then find each character. The ability to correct skew and kerning is important on common documents. Images may contain features that end up as text.
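The "best percentage wins" idea described above can be shown with a toy Python sketch. This is not how any particular OCR engine is implemented; the method names, candidate readings and scores below are invented purely to illustrate the selection step.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    method: str        # which approach produced this candidate
    text: str          # the candidate reading of one word
    confidence: float  # score from 0 to 100


def pick_best(readings):
    """Return the candidate with the highest confidence score."""
    return max(readings, key=lambda r: r.confidence)


# Invented scores for three readings of the same printed word.
readings = [
    Reading("bitmap", "rnodern", 71.0),
    Reading("profile", "modem", 64.5),
    Reading("dictionary", "modern", 88.0),
]

best = pick_best(readings)
print(f"{best.text} ({best.method}, {best.confidence:.1f}%)")
```

Here the word-level reading wins because it scores higher than the two character-level guesses.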

TB0ne

04-15-2020 04:59 PM

Quote:

Originally Posted by rblampain (Post 6111538)

My first question is: how are you originally producing the content? LibreOffice can export/print to a PDF file directly, and there are several utilities to create PDFs from various electronic formats. Do you *HAVE* to scan the actual, physical pages?

rblampain

04-19-2020 12:59 AM

Quote:

Do you *HAVE* to scan the actual, physical pages?

No. The required pages are (or will be) typed and saved as .txt or .html, but the receiver (government) does not accept files in those formats. Under this scheme, I need to check that any PDF file I create is "searchable", which at the moment I can only guess at.
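One way to check this rather than guess is to try extracting text from the PDF: if a reasonable amount of text comes out, the file has a text layer that viewers can search. A minimal Python sketch, assuming `pdftotext` from poppler-utils is installed; the function names and the 20-character threshold are arbitrary choices for the example.

```python
import shutil
import subprocess


def has_text_layer(extracted_text, min_chars=20):
    """True if the extracted text has at least `min_chars` visible characters."""
    visible = "".join(extracted_text.split())  # drop spaces, newlines, form feeds
    return len(visible) >= min_chars


def extract_text(pdf_path):
    """Return the text that pdftotext finds in the PDF."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) not found on PATH")
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def is_searchable(pdf_path, min_chars=20):
    return has_text_layer(extract_text(pdf_path), min_chars)
```

An image-only scan typically yields little more than page breaks, so `is_searchable` would return False for it. A PDF exported from text or HTML, or one with an OCR layer, should return True.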

ondoho

04-19-2020 03:54 AM

Quote:

Originally Posted by rblampain (Post 6113341)

The pages required are (or are to be) typed and saved as .txt or .html

In that case everything that was said about OCR & tesseract is moot, and gscan2pdf is the wrong tool.
You simply need to convert these text files to PDF. Various programs are available, e.g. html2pdf, wkhtmltopdf, pandoc...
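A minimal Python sketch of that conversion, assuming `wkhtmltopdf` or `pandoc` is installed. A plain .txt file is first wrapped in a bare HTML page so that either tool can handle it. The function names are made up, html2pdf is left out because its command line is not covered here, and pandoc's PDF output normally also needs a LaTeX installation.

```python
import html
import subprocess
from pathlib import Path


def text_to_html(text, title="Document"):
    """Wrap plain text in a minimal HTML page, escaping special characters."""
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body><pre>{html.escape(text)}</pre></body></html>\n"
    )


def wkhtmltopdf_command(src_html, dst_pdf):
    return ["wkhtmltopdf", str(src_html), str(dst_pdf)]


def pandoc_command(src, dst_pdf):
    return ["pandoc", str(src), "-o", str(dst_pdf)]


def txt_to_pdf(txt_path, pdf_path):
    """Convert a .txt file to PDF by way of a temporary HTML file."""
    txt_path = Path(txt_path)
    html_path = txt_path.with_suffix(".html")
    html_path.write_text(text_to_html(txt_path.read_text(encoding="utf-8"),
                                      title=txt_path.stem), encoding="utf-8")
    subprocess.run(wkhtmltopdf_command(html_path, pdf_path), check=True)
```

Because the text goes into the PDF as real text rather than as an image, the result should be searchable without any OCR step.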

All times are GMT -5.