[SOLVED] Turning a bunch of PDF files into HTML + JPG?
I received several presentations as PDF files, each made by a different author using different tools (mostly LibreOffice), and I need to turn them all into HTML + JPG so the contents are easily available through a website.
What open-source tool would you recommend to batch-process PDF files?
Thank you.
Code:
PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 33,87 x 19,05 cm
PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm
PDF Producer: Microsoft® PowerPoint® 2010
PDF Version: 1.5
Page Size: 25,4 x 19,05 cm
PDF Producer: GPL Ghostscript 9.22
PDF Version: 1.4
Page Size: 12,8 x 9,6 cm
PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm
PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm
PDF Producer: LibreOffice 5.1
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm
PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 21,01 x 29,7 cm (A4)
"pdf2htmlEX is no longer under active development. New maintainers are wanted."
Latest commit f12fc15 on Jan 16, 2017
?
--
Edit: It's unfortunate that the project is currently on hold, since I'm not the only one hitting a problem where displayed text is sometimes missing characters ("applicaon" instead of "application"), while the text is fine when copied and pasted elsewhere.
I tried "--tounicode 0 --optimize-text 1 --space-as-offset 0 --correct-text-visibility 1", but some words are still wrongly displayed ("constitue" is displayed as "constue", "association" becomes "associaon", etc.)
It's hard to tell how much of this problem comes from the input PDFs and how much from pdf2htmlEX. YMMV.
Also, the package is named "pdf2htmlex" while the application is "pdf2htmlEX" :-/
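For anyone wanting to batch-process a directory with the flags quoted above, here is a minimal sketch in Python. It only wraps the pdf2htmlEX command line. The text flags are the ones from this post; `--bg-format jpg` and `--dest-dir` are my assumptions about the installed build and should be checked against `pdf2htmlEX --help`. The function and directory names are made up for illustration.

```python
import subprocess
from pathlib import Path

# Flags quoted in the post above; they did not fully fix the missing characters.
TEXT_FLAGS = [
    "--tounicode", "0",
    "--optimize-text", "1",
    "--space-as-offset", "0",
    "--correct-text-visibility", "1",
]


def build_command(pdf_path, dest_dir, binary="pdf2htmlEX"):
    """Return the argument list for converting one PDF (nothing is executed)."""
    return [
        binary,
        *TEXT_FLAGS,
        "--bg-format", "jpg",         # assumption: this build supports JPG backgrounds
        "--dest-dir", str(dest_dir),  # assumption: verify with `pdf2htmlEX --help`
        str(pdf_path),
    ]


def convert_all(src_dir, dest_dir):
    """Convert every *.pdf in src_dir and return the names of files that failed."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        result = subprocess.run(build_command(pdf, dest))
        if result.returncode != 0:
            failed.append(pdf.name)
    return failed
```

Calling `convert_all("slides", "html_out")` (both directory names hypothetical) would run one conversion per PDF and return the list of files that pdf2htmlEX rejected, so a single bad input does not stop the batch.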
OK, so you tried it.
Could this kind of error mean that the PDF doesn't store text internally, and that the text is actually some sort of image that gets translated into text via OCR?
"The above issue occurs only in - webkit web browsers like chrome and safari - which provides support for ligatures - whereas browser like firefox does not. A ligature is a combination of two or more letters joined as a single glyph. This issue with missing characters is due to ligature support provided by these modern browsers" https://github.com/coolwanglu/pdf2htmlEX/issues/634
Regardless, it was pretty good for a first try, considering the contributors were not told their work could be merged into a common document, and hence, used different tools in different ways. I'll write a post-mortem and recommend best practises for the next time.
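To make the ligature explanation quoted above more concrete: at the character level, a ligature can be a single code point standing for several letters. The sketch below uses Python's standard `unicodedata` module to spot and expand the Latin ligatures in the Unicode range U+FB00 to U+FB06. This only illustrates the concept. It is an assumption on my part that it relates to the pdf2htmlEX output at all, since the embedded fonts may map a "ti" ligature to a glyph outside this range. The function names are made up.

```python
import unicodedata

# Latin ligature code points in the Alphabetic Presentation Forms block.
LIGATURE_RANGE = range(0xFB00, 0xFB07)


def find_ligatures(text):
    """Return (index, Unicode name) for every ligature code point in text."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ord(ch) in LIGATURE_RANGE
    ]


def expand_ligatures(text):
    """Replace each ligature code point with its component letters."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if ord(ch) in LIGATURE_RANGE else ch
        for ch in text
    )
```

For example, `expand_ligatures("\ufb01le")` turns the single "fi" ligature character followed by "le" into the four letters "file". If a browser or font fails to render such a glyph, two letters vanish at once, which matches the "applicaon" symptom.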
You should be aware that a PDF can encapsulate all sorts of content; if the text is stored as an image, translating it into HTML text will never work very well.
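A rough way to check whether a given PDF carries real text or only page images is to extract its text and see if anything comes out. The sketch below assumes poppler's `pdftotext` is installed and uses its `-` argument to write to standard output. The 20-character threshold and the function names are arbitrary choices for illustration. In this thread the copied text was fine, which suggests those PDFs do have a text layer.

```python
import subprocess


def looks_like_text_pdf(extracted_text, min_chars=20):
    """Heuristic: True if the text has at least min_chars non-space characters."""
    return sum(1 for ch in extracted_text if not ch.isspace()) >= min_chars


def extract_text(pdf_path):
    """Run poppler's pdftotext and return what it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Something like `looks_like_text_pdf(extract_text("talk.pdf"))` (file name hypothetical) returning False would point to scanned or image-only pages, where OCR rather than pdf2htmlEX is the real bottleneck.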
From now on, I'll try to get people to hand over the original files, in Word or LibreOffice format, so I can merge them and turn the whole thing into HTML + JPG with less chance of glitches.
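If the originals are available, one option is LibreOffice's headless converter. This is a sketch under assumptions, not something tested in this thread: it relies on the `soffice --headless --convert-to html --outdir` command line, and the quality of the HTML and images it produces for presentations should be checked on a sample first. The suffix list and function names are made up for illustration.

```python
from pathlib import Path

# Source formats worth passing to LibreOffice (illustrative list).
OFFICE_SUFFIXES = {".odp", ".ppt", ".pptx", ".odt", ".doc", ".docx"}


def pick_office_files(names):
    """Keep only file names whose extension looks like an office document."""
    return sorted(n for n in names if Path(n).suffix.lower() in OFFICE_SUFFIXES)


def build_soffice_command(src_path, out_dir, binary="soffice"):
    """Return the argument list for a headless HTML conversion (nothing is executed)."""
    return [
        binary,
        "--headless",
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(src_path),
    ]
```

Each command returned by `build_soffice_command` can be passed to `subprocess.run`, one file at a time, after filtering a directory listing with `pick_office_files`.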