[SOLVED] Turning a bunch of PDF files into HTML + JPG?

littlebigman · 05-10-2018, 05:06 PM

Hello,

I have several presentations as PDF files, each made by a different author using different tools (mostly LibreOffice), and I need to turn them all into HTML + JPG to make the contents easily available through a web site.

What open-source tool would you recommend to batch-process PDF files?

Thank you.

Code:

PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 33,87 x 19,05 cm

PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm

PDF Producer: Microsoft® PowerPoint® 2010
PDF Version: 1.5
Page Size: 25,4 x 19,05 cm

PDF Producer: GPL Ghostscript 9.22
PDF Version: 1.4
Page Size: 12,8 x 9,6 cm

PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm

PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm

PDF Producer: LibreOffice 5.1
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm

PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 21,01 x 29,7 cm (A4)

goldennuggets · 05-10-2018, 05:10 PM

pdf2htmlEX
http://coolwanglu.github.io/pdf2htmlEX/

littlebigman · 05-10-2018, 05:17 PM

"pdf2htmlEX is no longer under active development. New maintainers are wanted."

Latest commit f12fc15 on Jan 16, 2017

?

--

Edit: It's unfortunate that development is currently on hold, since I'm not the only one having a problem, where displayed text is missing some characters sometimes ("applicaon" instead of "application"), while the text is fine when copy/pasting it elsewhere.

I tried "--tounicode 0 --optimize-text 1 --space-as-offset 0 --correct-text-visibility 1", but some words are still wrongly displayed ("constitue" is displayed as "constue", "association" becomes "associaon", etc.)

Hard to tell how much of this problem is due to issues in the input PDF or pdf2htmlex. YMMV.

Also, the package is "pdf2htmlex" while the application is "pdf2htmlEX" :-/

Code:

~# apt-get install pdf2htmlEX
E: Unable to locate package pdf2htmlEX

~# apt-get install pdf2htmlex
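
For the batch part, a small wrapper can call pdf2htmlEX once per file. This is only a minimal sketch, not a tested recipe: the function names are made up for illustration, the text options are simply the ones quoted above, and `--dest-dir` is how I understand the output folder is set, so check it against `pdf2htmlEX --help` on your version.

```python
import subprocess
from pathlib import Path

# Text-related options quoted earlier in this post; adjust or drop as needed.
TEXT_OPTIONS = [
    "--tounicode", "0",
    "--optimize-text", "1",
    "--space-as-offset", "0",
    "--correct-text-visibility", "1",
]


def build_command(pdf, dest_dir):
    """Return the pdf2htmlEX argument list for one input file."""
    # --dest-dir is assumed to select the output folder (verify with --help).
    return ["pdf2htmlEX", *TEXT_OPTIONS, "--dest-dir", str(dest_dir), str(pdf)]


def convert_all(src_dir, dest_dir):
    """Run pdf2htmlEX on every *.pdf in src_dir; return the files that failed."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        result = subprocess.run(build_command(pdf, dest))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Something like `convert_all("slides", "html")` would then process a folder in one go, where both directory names are placeholders. Collecting the failures instead of stopping at the first one seemed more useful for a mixed bag of PDFs.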

Thank you.

ondoho · 05-11-2018, 03:07 AM

Quote:

Originally Posted by littlebigman

"pdf2htmlEX is no longer under active development. New maintainers are wanted."

you can still try it.
it's not a security mission critical app, is it.

there's also pandoc, but i don't see an easy way to include images.
what did YOUR research reveal so far?

ondoho · 05-11-2018, 03:09 AM

Quote:

Originally Posted by littlebigman

I'm not the only one having a problem, where displayed text is missing some characters sometimes ("applicaon" instead of "application"), while the text is fine when copy/pasting it elsewhere.

I tried "--tounicode 0 --optimize-text 1 --space-as-offset 0 --correct-text-visibility 1", but some words are still wrongly displayed ("constitue" is displayed as "constue", "association" becomes "associaon", etc.)

ok, so you tried it.
this sort of error would suggest that the pdf doesn't use text internally, but that the text is actually some sort of image, and is translated into text via OCR?

littlebigman · 05-11-2018, 08:32 AM

Apparently, it's a browser issue:

"The above issue occurs only in - webkit web browsers like chrome and safari - which provides support for ligatures - whereas browser like firefox does not. A ligature is a combination of two or more letters joined as a single glyph. This issue with missing characters is due to ligature support provided by these modern browsers"
https://github.com/coolwanglu/pdf2htmlEX/issues/634
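
If ligatures really are the cause, one possible workaround is to tell the browser not to apply them in the generated pages. The sketch below patches a style block into the HTML. It assumes that switching ligatures off through the standard CSS properties `font-variant-ligatures` and `font-feature-settings` is enough, which nothing in this thread confirms, and the function names are only illustrative.

```python
import re
from pathlib import Path

# Standard CSS asking the browser not to apply ligatures anywhere on the page.
NO_LIGATURES_STYLE = (
    "<style>\n"
    '* { font-variant-ligatures: none; font-feature-settings: "liga" 0; }\n'
    "</style>\n"
)


def disable_ligatures(html):
    """Insert the style block just before </head>; leave other markup untouched."""
    if NO_LIGATURES_STYLE in html:
        return html  # already patched
    match = re.search(r"</head>", html, re.IGNORECASE)
    if match is None:
        return html  # no head section found, so do nothing
    return html[:match.start()] + NO_LIGATURES_STYLE + html[match.start():]


def patch_file(path):
    """Rewrite one generated HTML file in place."""
    page = Path(path)
    original = page.read_text(encoding="utf-8")
    page.write_text(disable_ligatures(original), encoding="utf-8")
```

Running `patch_file` over each page produced by the converter would apply the change. It is worth comparing the result in Chrome and Firefox before relying on it.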

Regardless, it was pretty good for a first try, considering the contributors were not told their work could be merged into a common document, and hence, used different tools in different ways. I'll write a post-mortem and recommend best practises for the next time.

Thank you.

ondoho · 05-11-2018, 01:58 PM

you should be aware that PDF can encapsulate all sorts of stuff; if the text is captured as an image, translating that into html text will never work very well.
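
one way to check this before converting is to see whether a text layer comes out at all. a rough sketch, assuming `pdftotext` from poppler-utils is installed; the helper names and the 20-character threshold are arbitrary choices for illustration, not a rule.

```python
import subprocess


def extract_text(pdf_path):
    """Return the text layer of a PDF via poppler's pdftotext ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def looks_image_only(text, min_chars=20):
    """Heuristic: almost no non-whitespace characters means no usable text layer."""
    visible = "".join(text.split())
    return len(visible) < min_chars
```

calling `looks_image_only(extract_text("talk.pdf"))` on each file (the file name is a placeholder) would flag the ones that probably need OCR rather than a straight conversion.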

littlebigman · 05-12-2018, 06:19 AM

Good to know.

From now on, I'll try to get people to hand over the original file, in Word or LibreOffice format, so I can merge the files and turn the whole thing into HTML+JPG with less chance of glitches.
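
For the conversion step, LibreOffice can be driven from the command line without its GUI. Below is a minimal sketch using `soffice --headless --convert-to html`. The function names are hypothetical, merging the files is not covered, and I have not checked which image format the HTML export produces for slides.

```python
import subprocess
from pathlib import Path


def build_soffice_command(source, out_dir):
    """Return the LibreOffice command line that converts one document to HTML."""
    return [
        "soffice", "--headless",
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_sources(sources, out_dir):
    """Convert each source document in turn; stop at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for source in sources:
        subprocess.run(build_soffice_command(source, out_dir), check=True)
```

A call such as `convert_sources(["talk1.odp", "talk2.pptx"], "html")` would handle a list of originals, with the file and folder names being placeholders.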