LinuxQuestions.org
02-11-2013, 07:44 AM   #1
dariyoosh
LQ Newbie
 
Registered: Mar 2009
Location: Iran (Tehran)
Posts: 12

Rep: Reputation: 0
Is it possible to print an HTML document to PDF while preserving all links and anchors?


Hello everybody,


OS: Fedora Core 17 (x86_64)


I have a very big HTML file which I want to print/convert into a PDF file. With the browser (Firefox) open, I just go to File > Print and then choose the PDF option to print the content into a PDF file. This works well and fast (less than 40 seconds to produce a big PDF file of about 36 MB).

The only problem is that there are a lot of hypertext links (or rather anchors, as it is one big single HTML document), which obviously make navigating the document considerably easier. None of these links are preserved once the PDF file has been created.

After a lot of Googling I found a tool called wkhtmltopdf, which apparently does what I expect and preserves the links. I installed it successfully, yet when I launched it from the command line to create the PDF file, it ran (on the 4th step, resolving links) for more than 2 hours. I was wondering whether it would ever finish the job, and even if it does, such a delay doesn't seem reasonable for future use, even for big documents (I will have many big HTML documents to export to PDF in the near future).
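For a batch of large files, one way to keep a stalled run from blocking everything is to cap its run time. The following is only a rough sketch: the helper name convert_with_limit and the 1800-second default are made up for illustration, and it assumes timeout from GNU coreutils is installed. The two link options are wkhtmltopdf's defaults and are spelled out only to make the intent visible.

```shell
#!/bin/sh
# Sketch only: cap the run time of wkhtmltopdf so one stalled job
# does not block a whole batch.
# convert_with_limit and the 1800-second default are hypothetical choices.

convert_with_limit() {
    input="$1"
    output="$2"
    limit="${3:-1800}"   # time limit in seconds

    # timeout(1) from GNU coreutils stops the command after $limit seconds
    # and then exits with status 124.
    # --enable-internal-links and --enable-external-links are already the
    # wkhtmltopdf defaults; they are listed here to make the intent explicit.
    if timeout "$limit" wkhtmltopdf \
            --enable-internal-links --enable-external-links \
            "$input" "$output"; then
        echo "wrote $output"
    else
        status=$?
        if [ "$status" -eq 124 ]; then
            echo "gave up on $input after ${limit}s" >&2
        else
            echo "wkhtmltopdf failed on $input (exit $status)" >&2
        fi
        return "$status"
    fi
}
```

Usage would look like `convert_with_limit big.html big.pdf 3600`. This does not make the conversion faster; it only makes a hang visible and recoverable.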

Consequently, I would like to ask your opinion: do you know any practical way under Linux to print a single HTML document to a PDF file while preserving all the links and anchors?

Thanks in advance,
Dariyoosh
 
02-12-2013, 04:51 PM   #2
lykwydchykyn
Member
 
Registered: Mar 2006
Location: Tennessee, USA
Distribution: Debian, Ubuntu
Posts: 135

Rep: Reputation: 36
You might try something like pandoc, though I don't know for certain if it preserves links. Alternatively, you could maybe just open the HTML file in Libre/OpenOffice and export it to a PDF.
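A minimal sketch of the pandoc route, for anyone who wants to try it. The helper name html_to_pdf_pandoc is made up for illustration. pandoc produces PDF by handing the document to a LaTeX engine, so this assumes a LaTeX installation is present, and whether links survive would still need to be checked on the result.

```shell
#!/bin/sh
# Sketch only: convert one HTML file to PDF with pandoc.
# html_to_pdf_pandoc is a hypothetical helper name.

html_to_pdf_pandoc() {
    input="$1"
    # Default output name: same as the input with .html replaced by .pdf.
    output="${2:-${input%.html}.pdf}"

    # -f html names the input format; the .pdf extension on the output
    # tells pandoc to produce a PDF (through a LaTeX engine).
    pandoc -f html -o "$output" "$input"
}
```

Usage would look like `html_to_pdf_pandoc big.html`, which writes big.pdf next to the input.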
 
02-12-2013, 08:22 PM   #3
sag47
Senior Member
 
Registered: Sep 2009
Location: Raleigh, NC
Distribution: Ubuntu, PopOS, Raspbian
Posts: 1,899
Blog Entries: 36

Rep: Reputation: 477
I forgot about LibreOffice. You could try something like...
Code:
libreoffice --headless --convert-to pdf *.html
Usually that command is for *.odf but it's worth a try.
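If there are many files, converting them one at a time makes it easier to see which one fails or stalls. A rough sketch along those lines follows; the helper name convert_each_html and the directory arguments are made up for illustration, while --outdir is the LibreOffice option that sets where the PDF is written.

```shell
#!/bin/sh
# Sketch only: convert every .html file in a directory, one file per
# LibreOffice run, and report which ones fail.
# convert_each_html is a hypothetical helper name.

convert_each_html() {
    src_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir"

    for f in "$src_dir"/*.html; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -e "$f" ] || continue

        # --outdir sets the directory the resulting PDF is written to.
        if libreoffice --headless --convert-to pdf --outdir "$out_dir" "$f"; then
            echo "ok: $f"
        else
            echo "failed: $f" >&2
        fi
    done
}
```

Usage would look like `convert_each_html ./pages ./pdf`.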

**EDIT**

I did some testing and found the conversion to be a little bit buggy. It cuts off the first word in my simple html document (test.html).

Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>This is a test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>This is my title</h1>
<p>This is some text in the page</p>
<p><a href="http://www.gleske.net/">Visit Gleske Homepage</a></p>
<ul>
<li><a href="http://www.tldp.org/">Linux Documentation Project</a></li>
<li>This is some text in a bullet.</li>
<li><a href="http://www.gimp.org/">GIMP, An image manipulation program!</a></li>
</ul>
</body>
</html>
However, I was able to successfully convert the document using the LibreOffice API without any problems at all. Basically, you start libreoffice as a daemon and it will stay open in a headless environment. Then use the unoconv client to connect to the service for the conversion.

Headless file conversion using a LibreOffice API as a service

Start libreoffice as a foreground service.
Code:
soffice --nologo --headless --nofirststartwizard --accept="socket,host=127.0.0.1,port=2220,tcpNoDelay=1;urp"
Then use unoconv to connect to that service and use the API to convert the HTML file.
Code:
unoconv --connection "socket,host=127.0.0.1,port=2220,tcpNoDelay=1;urp;StarOffice.ComponentContext" -f pdf *.html
Links were preserved in my experiments.

One thing that is neat about that little experiment is that the conversion is a little faster than my original example because LibreOffice remains open as a service. The above example should work with OpenOffice using soffice.bin/soffice.
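When scripting the two steps above, the client can be launched before the service is ready to accept connections. A small retry wrapper is one way to handle that. This is only a sketch: the helper names retry and convert_via_service, the 5 attempts and the 2-second pause are made up for illustration; the connection string is the one from the unoconv command above, and the service is assumed to be started separately.

```shell
#!/bin/sh
# Sketch only: retry the unoconv client a few times in case the
# LibreOffice service is not accepting connections yet.
# retry and convert_via_service are hypothetical helper names.

# Connection string copied from the unoconv command above.
CONNECTION='socket,host=127.0.0.1,port=2220,tcpNoDelay=1;urp;StarOffice.ComponentContext'

# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times,
# sleeping DELAY seconds between attempts.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Convert the given files through the already-running service.
convert_via_service() {
    retry 5 2 unoconv --connection "$CONNECTION" -f pdf "$@"
}
```

Usage would look like `convert_via_service test.html`, run after the soffice command has been started in another terminal.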

**EDIT2**

I made a blog post about this if you want to see some extra info about this method.

SAM

Last edited by sag47; 02-12-2013 at 10:17 PM.
 
02-14-2013, 07:19 AM   #4
dariyoosh
LQ Newbie
 
Registered: Mar 2009
Location: Iran (Tehran)
Posts: 12

Original Poster
Rep: Reputation: 0
Hi,

Thanks a lot for the help.

I exported a test HTML file to PDF with LibreOffice and it did work (the links were preserved). However, for very big files (almost 10,000 pages) it halts after three hours without doing anything.

The same happens with the command-line method you provided. In addition, running unoconv gives me a Segmentation Fault error after running the script (probably a dependency problem or a corrupted LibreOffice installation). I also did a test with a small HTML file and the links were not preserved (yet they were when I exported the file directly using the LibreOffice GUI).
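To compare the command-line and GUI exports without clicking through each PDF, a rough check is to count URL link actions in the file. This is only a sketch with a made-up helper name, count_uri_links. It counts `/URI (` entries, which is how external URL links are normally stored; internal anchors are stored differently and are not counted, and a PDF whose objects are compressed can report 0 even when links are present.

```shell
#!/bin/sh
# Sketch only: count external URL link actions in a PDF.
# count_uri_links is a hypothetical helper name.
# Limitations: internal anchors are not counted, and compressed
# PDF objects are invisible to grep.

count_uri_links() {
    # -a treats the binary PDF as text; -o prints each match on its own line.
    # The pattern matches the /URI key followed by a "(" string, so the
    # action type name "/S /URI" is not counted a second time.
    # "|| true" keeps a no-match result from being reported as an error.
    { grep -a -o '/URI[[:space:]]*(' "$1" || true; } | wc -l | tr -d ' '
}
```

Usage would look like `count_uri_links test.pdf`, run once on the GUI export and once on the command-line export.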

Anyway, thank you both very much for your help and your time.

Regards,
Dariyoosh
 
  

