LinuxQuestions.org
Old 05-10-2018, 05:06 PM   #1
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 660

Rep: Reputation: 35
Turning a bunch of PDF files into HTML + JPG?


Hello,

I have several presentations as PDF files, each made by a different author using different tools (mostly LibreOffice), and I need to turn them all into HTML + JPG so the contents are easily available through a website.

What open-source tool would you recommend to batch-process PDF files?

Thank you.

Code:
PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 33,87 x 19,05 cm

PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm

PDF Producer: Microsoft® PowerPoint® 2010
PDF Version: 1.5
Page Size: 25,4 x 19,05 cm

PDF Producer: GPL Ghostscript 9.22
PDF Version: 1.4
Page Size: 12,8 x 9,6 cm

PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm

PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm

PDF Producer: LibreOffice 5.1
PDF Version: 1.4
Page Size: 25,4 x 19,05 cm

PDF Producer: LibreOffice 6.0
PDF Version: 1.4
Page Size: 21,01 x 29,7 cm (A4)
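
For anyone who wants to collect the same kind of details for a batch of files, here is a minimal sketch. It assumes pdfinfo from poppler-utils, which is not necessarily the tool that produced the listing above; pdfinfo reports the page size in points rather than centimetres. The function name pdf_summary and the PDFINFO override variable are made up for this example.

Code:
```shell
#!/bin/sh
# pdf_summary FILE...: print producer, PDF version and page size for each file.
# Assumes pdfinfo (poppler-utils) is installed.
# PDFINFO is a hypothetical hook to swap in another command, e.g. for testing.
pdf_summary() {
    for pdf in "$@"; do
        printf '%s\n' "$pdf"
        # Keep only the three fields of interest from pdfinfo's output.
        "${PDFINFO:-pdfinfo}" "$pdf" 2>/dev/null |
            grep -E '^(Producer|PDF version|Page size):'
        echo
    done
}
```

Usage would be something like: pdf_summary *.pdf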
 
Old 05-10-2018, 05:10 PM   #2
goldennuggets
Member
 
Registered: Feb 2003
Location: USA
Distribution: Kubuntu, Manjaro
Posts: 239

Rep: Reputation: 24
pdf2htmlEX
http://coolwanglu.github.io/pdf2htmlEX/
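
A rough sketch of how a batch run might look, assuming pdf2htmlEX is installed and on the PATH. The --embed-image 0 and --bg-format jpg options are my reading of what "HTML + JPG" would need (images written as separate JPG files next to the HTML); check pdf2htmlEX --help on your version before relying on them. The function name convert_all and the PDF2HTML override variable are made up for this example.

Code:
```shell
#!/bin/sh
# convert_all SRC_DIR OUT_DIR: convert every *.pdf in SRC_DIR, one HTML per PDF.
# PDF2HTML is a hypothetical hook (e.g. PDF2HTML=echo gives a dry run).
convert_all() {
    src=$1
    out=$2
    mkdir -p "$out" || return 1
    for pdf in "$src"/*.pdf; do
        [ -e "$pdf" ] || continue          # the glob matched nothing
        name=$(basename "$pdf" .pdf)
        # Assumed options: keep images as separate files, in JPG format.
        "${PDF2HTML:-pdf2htmlEX}" --embed-image 0 --bg-format jpg \
            --dest-dir "$out" "$pdf" "$name.html"
    done
}
```

Usage would be something like: convert_all ./slides ./site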
 
Old 05-10-2018, 05:17 PM   #3
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 660

Original Poster
Rep: Reputation: 35
"pdf2htmlEX is no longer under active development. New maintainers are wanted."

Latest commit f12fc15 on Jan 16, 2017

?

--

Edit: It's unfortunate that the project is currently on standby, since I'm not the only one having this problem: displayed text is sometimes missing characters ("applicaon" instead of "application"), while the text is fine when copied and pasted elsewhere.

I tried "--tounicode 0 --optimize-text 1 --space-as-offset 0 --correct-text-visibility 1", but some words are still wrongly displayed ("constitue" is displayed as "constue", "association" becomes "associaon", etc.)

It's hard to tell how much of this problem is due to the input PDFs and how much to pdf2htmlEX. YMMV.

Also, the package is "pdf2htmlex" while the application is "pdf2htmlEX":-/

Code:
~# apt-get install pdf2htmlEX
E: Unable to locate package pdf2htmlEX

~# apt-get install pdf2htmlex

Thank you.
Attachment: pdf2htmlex.display.bugs.png (238.3 KB)

Last edited by littlebigman; 05-11-2018 at 03:07 AM.
 
Old 05-11-2018, 03:07 AM   #4
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Quote:
Originally Posted by littlebigman
"pdf2htmlEX is no longer under active development. New maintainers are wanted."
you can still try it.
it's not a security- or mission-critical app, is it?

there's also pandoc, but i don't see an easy way to include images.
what did YOUR research reveal so far?
 
Old 05-11-2018, 03:09 AM   #5
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Quote:
Originally Posted by littlebigman
I'm not the only one having this problem: displayed text is sometimes missing characters ("applicaon" instead of "application"), while the text is fine when copied and pasted elsewhere.

I tried "--tounicode 0 --optimize-text 1 --space-as-offset 0 --correct-text-visibility 1", but some words are still wrongly displayed ("constitue" is displayed as "constue", "association" becomes "associaon", etc.)
ok, so you tried it.
this sort of error would suggest that the pdf doesn't use text internally, but that the text is actually some sort of image, and is translated into text via OCR?
 
Old 05-11-2018, 08:32 AM   #6
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 660

Original Poster
Rep: Reputation: 35
Apparently, it's a browser issue:

"The above issue occurs only in - webkit web browsers like chrome and safari - which provides support for ligatures - whereas browser like firefox does not. A ligature is a combination of two or more letters joined as a single glyph. This issue with missing characters is due to ligature support provided by these modern browsers"
https://github.com/coolwanglu/pdf2htmlEX/issues/634
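
If ligature handling really is the cause, one thing that might be worth trying is to switch ligatures off in the generated pages with the standard CSS property font-variant-ligatures. This is only a sketch and not something confirmed in the issue above: it assumes the generated HTML contains a lowercase </head> tag, and the function name disable_ligatures is made up for this example.

Code:
```shell
#!/bin/sh
# disable_ligatures FILE...: insert a <style> rule just before </head>
# in each HTML file, asking the browser not to apply font ligatures.
disable_ligatures() {
    for html in "$@"; do
        # Write to a temporary file first, then replace the original.
        sed 's|</head>|<style>* { font-variant-ligatures: none; }</style></head>|' \
            "$html" > "$html.tmp" && mv "$html.tmp" "$html"
    done
}
```

Usage would be something like: disable_ligatures out/*.html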

Regardless, it was pretty good for a first try, considering the contributors were not told their work could be merged into a common document, and hence, used different tools in different ways. I'll write a post-mortem and recommend best practises for the next time.

Thank you.
 
Old 05-11-2018, 01:58 PM   #7
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
you should be aware that PDF can encapsulate all sorts of stuff; if the text is captured as an image, translating that into html text will never work very well.
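
A quick way to check this for a given file, as a minimal sketch: it assumes pdftotext from poppler-utils is installed, and treats a PDF as image-only when no non-whitespace text can be extracted. The function name has_text_layer and the PDFTOTEXT override variable are made up for this example.

Code:
```shell
#!/bin/sh
# has_text_layer FILE: succeed if pdftotext extracts any non-whitespace text.
# A failure suggests the pages are images (e.g. scans) with no real text.
# PDFTOTEXT is a hypothetical hook to swap in another command for testing.
has_text_layer() {
    # "-" sends the extracted text to stdout instead of a .txt file.
    "${PDFTOTEXT:-pdftotext}" "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

Usage would be something like: has_text_layer talk.pdf || echo "talk.pdf looks image-only"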
 
Old 05-12-2018, 06:19 AM   #8
littlebigman
Member
 
Registered: Aug 2008
Location: France
Posts: 660

Original Poster
Rep: Reputation: 35
Good to know.

From now on, I'll try to get people to hand over the original files, in Word or LibreOffice format, so I can merge them and turn the whole thing into HTML+JPG with less chance of glitches.
 
  


All times are GMT -5.
