Is it possible to print an HTML document to PDF while preserving all links and anchors?

dariyoosh · 02-11-2013, 07:44 AM

Hello everybody,

OS: Fedora 17 (x86_64)

I have a very big HTML file which I want to print/convert into a PDF file. With the file open in the browser (Firefox), I go to File > Print and choose the PDF option to print the content into a PDF file. This works pretty well and fast (less than 40 seconds to produce a big PDF file of about 36 MB).

The only problem is that there are a lot of hypertext links (or rather anchors, as it is one big single HTML document), which obviously make navigating the document much easier. None of these links are preserved once the PDF file has been created.

After a lot of Googling I found a tool called wkhtmltopdf which apparently does what I expect and preserves the links. I installed it successfully, yet when I launched the program from the command line to create the PDF file, it had been running (on the 4th step, resolving links) for more than 2 hours. I was wondering whether it would ever finish the job, and even if it does, such a delay doesn't seem reasonable for future use, even for big documents (I will have many big HTML documents to export to PDF in the near future).

Consequently, I would like to ask your opinion: do you know any practical way under Linux to print a single HTML document into a PDF file while preserving all the links and anchors?

Thanks in advance,
Dariyoosh
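
A side note on the wkhtmltopdf run described above: one way to avoid waiting indefinitely is to wrap the conversion in a time limit. This is only a sketch, assuming GNU coreutils timeout is installed; run_limited, big.html and big.pdf are made-up names. It does not make wkhtmltopdf any faster, it only stops it after the limit.

```shell
# run_limited SECONDS COMMAND [ARGS...]
# Run a command with a wall-clock limit and report when the limit was hit.
# run_limited is a hypothetical helper name, not part of any tool.
run_limited() {
    local limit="$1"
    shift
    local status=0
    # timeout (GNU coreutils) exits with 124 when it had to stop the command
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "gave up after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Example with placeholder file names, 30 minute limit:
# run_limited 1800 wkhtmltopdf big.html big.pdf
```

An exit status of 124 means the limit was reached; anything else is the status of the wrapped command itself.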

lykwydchykyn · 02-12-2013, 04:51 PM

You might try something like pandoc, though I don't know for certain if it preserves links. Alternatively, you could maybe just open the HTML file in LibreOffice/OpenOffice and export it to a PDF.

sag47 · 02-12-2013, 08:22 PM

I forgot about LibreOffice. You could try something like...

Code:

libreoffice --headless --convert-to pdf *.html

Usually that command is used for ODF documents (*.odt and the like), but it's worth a try.
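
When the one-shot command is run on several files, it may help to convert them one at a time and check that a PDF actually appeared, since the exit status alone may not tell you. A rough sketch under a few assumptions: --outdir is an existing LibreOffice option, while OFFICE_BIN and convert_each are hypothetical names used only for illustration.

```shell
# OFFICE_BIN is an illustrative variable (not something LibreOffice reads);
# it only lets you point the function at soffice or another binary name.
OFFICE_BIN="${OFFICE_BIN:-libreoffice}"

# convert_each OUTDIR FILE...
# Convert files one by one and verify that each PDF was written.
convert_each() {
    local outdir="$1"
    shift
    local f pdf failed=0
    mkdir -p "$outdir"
    for f in "$@"; do
        pdf="$outdir/$(basename "${f%.*}").pdf"
        # Remove a stale result so the check below reflects this run only
        rm -f "$pdf"
        # Exit status is ignored here; the file check below decides
        "$OFFICE_BIN" --headless --convert-to pdf --outdir "$outdir" "$f" || true
        if [ -f "$pdf" ]; then
            echo "ok: $f -> $pdf"
        else
            echo "no PDF produced for $f" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example: convert_each ./pdf *.html
```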

**EDIT**

I did some testing and found the conversion to be a little bit buggy. It cuts off the first word in my simple HTML document (test.html).

Code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>This is a test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>This is my title</h1>
<p>This is some text in the page</p>
<p><a href="http://www.gleske.net/">Visit Gleske Homepage</a></p>
<ul>
<li><a href="http://www.tldp.org/">Linux Documentation Project</a></li>
<li>This is some text in a bullet.</li>
<li><a href="http://www.gimp.org/">GIMP, An image manipulation program!</a></li>
</ul>
</body>
</html>

However, I was able to convert the document successfully using the LibreOffice API, without any problems at all. Basically, you start LibreOffice as a service and it stays open in a headless environment. Then you use the unoconv client to connect to the service for the conversion.

Headless file conversion using a LibreOffice API as a service

Start LibreOffice as a foreground service.

Code:

soffice --nologo --headless --nofirststartwizard --accept="socket,host=127.0.0.1,port=2220,tcpNoDelay=1;urp"

Then use unoconv to connect to that service and use the API to convert the HTML file.

Code:

unoconv --connection "socket,host=127.0.0.1,port=2220,tcpNoDelay=1;urp;StarOffice.ComponentContext" -f pdf *.html

Links were preserved in my experiments.

One neat thing about this approach is that the conversion is a little faster than in my original example, because LibreOffice stays open as a service. The above example should work with OpenOffice using soffice.bin/soffice.
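
The two steps above could be tied together in one script, roughly as follows. This is a sketch and not a tested recipe: it assumes bash (for the /dev/tcp redirection), reuses the same port 2220 and connection string as above, and wait_for_port and convert_via_service are made-up helper names. On some installs soffice is a wrapper script, so killing its PID may not stop soffice.bin.

```shell
# bash is assumed here: /dev/tcp is a bash feature, not a real device file.

# wait_for_port HOST PORT [TRIES]
# Poll once per second until HOST:PORT accepts a TCP connection.
wait_for_port() {
    local host="$1" port="$2" tries="${3:-30}" i=0
    while [ "$i" -lt "$tries" ]; do
        # The subshell opens and immediately drops a test connection
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# convert_via_service FILE...
# Start the listener, convert the files with unoconv, then stop the listener.
convert_via_service() {
    local accept="socket,host=127.0.0.1,port=2220,tcpNoDelay=1;urp"
    local status=1
    soffice --nologo --headless --nofirststartwizard --accept="$accept" &
    local pid=$!
    if wait_for_port 127.0.0.1 2220 60; then
        unoconv --connection "$accept;StarOffice.ComponentContext" -f pdf "$@" && status=0
    else
        echo "nothing is listening on port 2220 after 60s" >&2
    fi
    # May leave soffice.bin running if soffice is only a wrapper script
    kill "$pid" 2>/dev/null || true
    return "$status"
}

# Example: convert_via_service *.html
```

Waiting for the port before calling unoconv avoids connecting while LibreOffice is still starting up.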

**EDIT2**

I made a blog post about this if you want to see some extra info about this method.

SAM

dariyoosh · 02-14-2013, 07:19 AM

Hi,

Thanks a lot for the help.

I exported a test HTML file to PDF with LibreOffice and it did work (the links were preserved). However, for very big files (almost 10,000 pages) it hangs and after three hours has still done nothing.

The same happens with the command line method you provided. In addition, running unoconv gives me a Segmentation Fault error after running the script (probably a dependency problem or a corrupted LibreOffice installation). I also did a test with a small HTML file and the links were not preserved, although they were when I exported the file directly from the LibreOffice GUI.

Anyway, thank you both very much for your help and your time.

Regards,
Dariyoosh
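
For comparing results like the ones above, a crude way to check whether a PDF kept any clickable links is to count its link annotations, which PDF marks with /Subtype /Link. This is only a sketch: count_link_annots is a made-up name, and the check only sees uncompressed objects, so a count of 0 suggests the links are missing but does not prove it.

```shell
# count_link_annots FILE.pdf
# Count uncompressed "/Subtype /Link" entries (link annotations) in a PDF.
# count_link_annots is a hypothetical helper name.
count_link_annots() {
    # -a treats the binary PDF as text, -o prints each match on its own line
    grep -a -o '/Subtype */Link' "$1" | wc -l | tr -d ' '
}

# Example with a placeholder file name:
# count_link_annots test.pdf
```

Running it on the PDF exported from the GUI and on the one produced from the command line would show whether the two differ.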