LinuxQuestions.org
Linux - Software This forum is for Software issues.
Old 04-14-2020, 09:28 AM   #1
rblampain
Senior Member
 
Registered: Aug 2004
Location: Western Australia
Distribution: Debian 11
Posts: 1,291

Rep: Reputation: 52
Are files produced by gscan2pdf suitably searchable?


I am asked to send files as "searchable PDF files".
I know little about PDF files beyond faxing and printing them. My installed "gscan2pdf" on Debian 9 appears to produce searchable files: when I type a distinctive word into the search field of "Atril document viewer" for a PDF created with "gscan2pdf", Atril reports that the word is found, although it does not show where the word or expression occurs.
I expect that whoever reads the files I send will have more elaborate software for searching, commenting, etc. (I have only a vague idea of the possibilities). My question is: will "gscan2pdf" produce files suitable for that job, or should I look for other software?

Thank you for your help.
 
Old 04-14-2020, 11:31 AM   #2
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,411

Rep: Reputation: 2338
Quote:
Originally Posted by rblampain
My question is: will "gscan2pdf" produce files suitable for that job?
In a word, no.

A scan is an image. You can have an image of every page in a PDF, but images alone are not text-searchable.

First, I would suggest getting the original files rather than printouts of them. If you have to work from printed pages, the workflow you need is:

Scan --> high-res image (600 dpi+) --> Optical Character Recognition (OCR) --> save in some text format.

Then edit the text and correct the errors in a word processor. Finally, if you must use PDFs, export to PDF.

In open source, tesseract is probably the best OCR engine. High resolution makes a huge difference. You need tesseract 4.x, and the later the better: tesseract 4.0 introduced a new OCR engine.
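
As a rough illustration of the OCR step in the workflow above, here is a minimal Python sketch that calls the tesseract command line on one scanned page. It assumes tesseract is installed with the "eng" language data, and it uses tesseract's "pdf" output config, which writes a PDF with a text layer instead of a plain text file. The function names and the file name are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def tesseract_pdf_command(image: Path, lang: str = "eng") -> List[str]:
    """Build a tesseract call that writes <image stem>.pdf with a text layer."""
    output_base = image.with_suffix("")  # tesseract appends ".pdf" itself
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image: Path, lang: str = "eng") -> Path:
    """Run tesseract on one scanned page and return the resulting PDF path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(tesseract_pdf_command(image, lang), check=True)
    return image.with_suffix(".pdf")


# Show the command that would be run for a hypothetical scan "page1.png".
print(" ".join(tesseract_pdf_command(Path("page1.png"))))
```

Dropping the trailing "pdf" argument makes tesseract write plain text instead, which matches the "save in some text format" variant of the workflow.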

In M$Windoze, the best closed source app is Abbyy, and it's by far the best overall. But you pay. The last time I needed OCR, tesseract 4.0 was in beta, and Abbyy had released a Linux version which they gave out on a one-month free trial. I got my work done inside the month, so that was OK. It was clunky, but it did the business. I was working from photographs then; I have a 1200 dpi scanner now, so I'm sure tesseract would do it. High-resolution scans take up a huge amount of disk space, so make room.

I had 40 pages of my Dad's play typed on an old Underwood in the 60s, and everything (even Abbyy) performed pretty poorly on it. Editing was slow. But I was able to send a pdf to my family.

Last edited by business_kid; 04-14-2020 at 11:34 AM.
 
1 members found this post helpful.
Old 04-14-2020, 02:18 PM   #3
remmilou
Member
 
Registered: Mar 2010
Location: Amsterdam
Distribution: MX Linux (21)/ XFCE
Posts: 212

Rep: Reputation: 69
In a word... yes
Gscan2pdf does the job for you, provided you install Tesseract and the right language packs from the repositories.
Gscan2pdf lets you choose whether to run OCR and with which program. I also installed GOCR, and that is then also a choice in Gscan2pdf.
Tesseract gives me the best results, by far.
Gscan2pdf is a Perl script (you can read and modify it...) that handles:
a) the scan job (based on SANE, with a choice of resolution in dpi)
b) optional document cleaning
c) optional OCR (with a choice of language)
d) saving as PDF (machine-readable or image-only), TIFF, PNG, text...
You can even open a PDF that is not machine-readable, add OCR as a layer and save it as a machine-readable PDF, although there are much easier utilities for that last task.
Tesseract can compete with the best OCR software. It is used in lots of commercial products. The quality depends heavily on the input, the scanner and the settings (experiment a bit).
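
For adding an OCR layer to an existing image-only PDF, one candidate is OCRmyPDF, which uses Tesseract internally. It is not named in this thread, so treat it as an assumption about what an "easier utility" could look like. A minimal Python sketch with hypothetical function and file names:

```python
import shutil
import subprocess
from typing import List


def ocrmypdf_command(src_pdf: str, dst_pdf: str, lang: str = "eng") -> List[str]:
    """Build an ocrmypdf call; --skip-text leaves pages that already have text alone."""
    return ["ocrmypdf", "-l", lang, "--skip-text", src_pdf, dst_pdf]


def add_ocr_layer(src_pdf: str, dst_pdf: str, lang: str = "eng") -> None:
    """Write dst_pdf as a copy of src_pdf with a searchable text layer."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(ocrmypdf_command(src_pdf, dst_pdf, lang), check=True)


# Show the command for hypothetical file names.
print(" ".join(ocrmypdf_command("scan.pdf", "scan_searchable.pdf")))
```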

Last edited by remmilou; 04-15-2020 at 02:02 AM. Reason: small mistake...
 
1 members found this post helpful.
Old 04-14-2020, 07:24 PM   #4
jefro
Moderator
 
Registered: Mar 2008
Posts: 22,008

Rep: Reputation: 3629
https://unix.stackexchange.com/quest...within-the-pdf

However, I tried a number of free tools on a not-for-profit book I wanted to search through, and ended up using the office multifunction device, which gave the best results.

Whichever you choose, you will have to check the result word for word unless the document uses ocrx types of font.
 
1 members found this post helpful.
Old 04-15-2020, 02:07 AM   #5
remmilou
Member
 
Registered: Mar 2010
Location: Amsterdam
Distribution: MX Linux (21)/ XFCE
Posts: 212

Rep: Reputation: 69
Quote:
Originally Posted by jefro
https://unix.stackexchange.com/quest...within-the-pdf

However, I tried a number of free tools on a not-for-profit book I wanted to search through, and ended up using the office multifunction device, which gave the best results.

Whichever you choose, you will have to check the result word for word unless the document uses ocrx types of font.
Yes, "office multifunctions" generally do a very good job, but they are not affordable for the average home user.
It is much cheaper to buy a (very) good scanner, and the scanner is what makes the difference here. Office multifunctions do not necessarily do better OCR.
 
1 members found this post helpful.
Old 04-15-2020, 03:54 PM   #6
jefro
Moderator
 
Registered: Mar 2008
Posts: 22,008

Rep: Reputation: 3629
Just notes.

I agree, but I had access to a medium-sized office HP model. It took a few moments for the file to be sent, so I thought I had messed up, but the results were the best of all my attempts. I even tried my phone. Oddly, my old Windows phone worked best. You'd think that Google would have offered more support for their product.

It may be possible for some folks to access print and copy stores locally if all else fails.

I'll agree that almost all newish home scanners can easily scan a document in high enough quality. I've worked with character recognition for a number of decades, and there are generally a few ways that these programs perform their task. The best starting point is a fixed, standard font with clearly separated characters. If the system was designed to read ocrx types of font, then that may give the highest quality.

Most programs work at it three ways: bitmap matching, character profiles, and testing candidates against what the text could be within a word and sentence. Each engine's result comes with a percentage score, and generally the result with the best percentage is selected, for example the word-level result if its score is higher than the individual letter scores.

They have to find the text, locate it line by line and then find each character. The ability to correct skew and kerning is important on common documents. Images may contain features that end up being read as text.
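
The "best percentage wins" step described above can be sketched in a few lines of Python. This is only a toy model: the candidate words, scores and method labels are invented for illustration and do not come from any real OCR engine.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str          # what the recognizer thinks the word is
    confidence: float  # score in percent, 0-100
    method: str        # which strategy produced it (labels are illustrative)


def pick_best(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the highest confidence score."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    return max(candidates, key=lambda c: c.confidence)


# Invented example: three strategies disagree about one scanned word.
candidates = [
    Candidate("rnodern", 61.0, "bitmap"),
    Candidate("modem", 74.5, "profile"),
    Candidate("modern", 92.0, "dictionary"),
]
best = pick_best(candidates)
print(best.text, best.method)
```

Here the word-level (dictionary) reading wins because its score is higher than the scores of the letter-by-letter readings.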
 
1 members found this post helpful.
Old 04-15-2020, 04:59 PM   #7
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,712

Rep: Reputation: 7972
Quote:
Originally Posted by rblampain
I am asked to send files as "searchable PDF files".
I know little about PDF files beyond faxing and printing them. My installed "gscan2pdf" on Debian 9 appears to produce searchable files: when I type a distinctive word into the search field of "Atril document viewer" for a PDF created with "gscan2pdf", Atril reports that the word is found, although it does not show where the word or expression occurs.
I expect that whoever reads the files I send will have more elaborate software for searching, commenting, etc. (I have only a vague idea of the possibilities). My question is: will "gscan2pdf" produce files suitable for that job, or should I look for other software?

Thank you for your help.
My first question is: how are you originally producing the content? LibreOffice can export/print to a PDF file directly, and there are several utilities to create PDFs from various electronic formats. Do you *HAVE* to scan the actual, physical pages?
 
1 members found this post helpful.
Old 04-19-2020, 12:59 AM   #8
rblampain
Senior Member
 
Registered: Aug 2004
Location: Western Australia
Distribution: Debian 11
Posts: 1,291

Original Poster
Rep: Reputation: 52
Quote:
Do you *HAVE* to scan the actual, physical pages?
No. The pages required are (or are to be) typed and saved as .txt or .html, but the receiver (government) does not accept files in those formats. Under this scheme, I need to check that any PDF file I create is "searchable", which at present I can only guess at.
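
One way to check this without guessing is to try to extract text from the PDF: if real text comes out, a text layer exists and search tools can use it. A minimal Python sketch, assuming the pdftotext tool from poppler-utils is installed; the function names and the 20-character threshold are arbitrary choices for illustration.

```python
import shutil
import subprocess


def has_text_layer(extracted_text: str, min_chars: int = 20) -> bool:
    """Treat a PDF as searchable if extraction yields a reasonable amount of text."""
    visible = "".join(extracted_text.split())  # drop whitespace and form feeds
    return len(visible) >= min_chars


def extract_pdf_text(pdf_path: str) -> str:
    """Return the text pdftotext extracts ("-" sends the output to stdout)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) is not installed")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], check=True, capture_output=True, text=True
    )
    return result.stdout
```

With a hypothetical file, `has_text_layer(extract_pdf_text("form.pdf"))` returning True suggests the PDF is searchable. An image-only scan typically yields nothing but page-break characters, which the check rejects.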
 
Old 04-19-2020, 03:54 AM   #9
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Quote:
Originally Posted by rblampain
The pages required are (or are to be) typed and saved as .txt or .html
In that case everything that was said about OCR & tesseract is moot, and gscan2pdf is the wrong tool.
You simply need to convert these text files to PDF. Various programs are available, e.g. html2pdf, wkhtmltopdf, pandoc...
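
As a rough sketch of that conversion step, the Python below picks one of the tools named above based on the file extension. It assumes wkhtmltopdf and pandoc are installed; pandoc's PDF output also needs a PDF engine, which by default means a LaTeX installation. The function names and the tool-per-extension mapping are illustrative choices, not a recommendation from the thread.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def convert_command(source: Path, target: Path) -> List[str]:
    """Choose a converter command line based on the source file extension."""
    suffix = source.suffix.lower()
    if suffix in {".html", ".htm"}:
        return ["wkhtmltopdf", str(source), str(target)]
    if suffix == ".txt":
        # pandoc reads the file as Markdown unless told otherwise
        return ["pandoc", str(source), "-o", str(target)]
    raise ValueError(f"no converter configured for {suffix!r}")


def convert_to_pdf(source: Path) -> Path:
    """Convert source to a PDF next to it and return the PDF path."""
    target = source.with_suffix(".pdf")
    command = convert_command(source, target)
    if shutil.which(command[0]) is None:
        raise RuntimeError(f"{command[0]} is not installed or not on PATH")
    subprocess.run(command, check=True)
    return target
```

PDFs produced this way contain real text rather than page images, so they are searchable without any OCR.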
 
  


All times are GMT -5. The time now is 11:02 AM.
