LinuxQuestions.org
Old 11-16-2013, 03:46 AM   #1
fargris
LQ Newbie
 
Registered: Sep 2013
Distribution: Debian
Posts: 15

Rep: Reputation: Disabled
PDF to html conversion?


Before you jump to conclusions: I have already looked and found tools like pdftohtml.

I am looking for a tool that can do something similar to pdftohtml, but simpler: I want 1 long html page that looks like it had been designed for html, not a true (-ish) representation in html of what the PDF document looks like.

So, I want an html page that has the pictures in the right places. PDF paragraphs should become HTML paragraphs (<p> ... </p>) - with no hard line breaks and no fonts etc. Is there any such tool?
 
Old 11-16-2013, 05:00 PM   #2
jefro
Moderator
 
Registered: Mar 2008
Posts: 22,001

Rep: Reputation: 3629
I'd think the only perfect tool is from Adobe. Every other tool might need manual inspection.
 
Old 11-17-2013, 02:10 AM   #3
fargris
LQ Newbie
 
Registered: Sep 2013
Distribution: Debian
Posts: 15

Original Poster
Rep: Reputation: Disabled
Hi jefro - thanks for replying. I feared as much - but still, even an imperfect tool might do.

Or I could knit something together myself; this is open source, after all. pdftotext almost does it - one just has to run it through a filter to add HTML and detect paragraphs. The only problem is to put the pictures in the right place.
 
Old 11-17-2013, 05:51 PM   #4
jefro
Moderator
 
Registered: Mar 2008
Posts: 22,001

Rep: Reputation: 3629
I've never seen a perfect tool. You always have to go back and inspect. Maybe OpenOffice one at a time but that might take forever?

Maybe others have some thoughts?
 
Old 11-17-2013, 05:59 PM   #5
John VV
LQ Muse
 
Registered: Aug 2005
Location: A2 area Mi.
Posts: 17,627

Rep: Reputation: 2651
I take it you have not used LibreOffice?

Right-click the PDF and select "open with libreoffice4",
then export as HTML (the export is customizable).

Then, in your browser, open the HTML file that has the same name as the PDF.


But you might want to use an ODT document as an intermediate step.

Last edited by John VV; 11-17-2013 at 06:13 PM.
 
Old 11-18-2013, 12:58 AM   #6
fargris
LQ Newbie
 
Registered: Sep 2013
Distribution: Debian
Posts: 15

Original Poster
Rep: Reputation: Disabled
I do use LibreOffice. However, when I open a PDF file, it insists that it must be a drawing, or a collection of drawings, and the export is exactly all the things I don't want: a collection of PNGs that represent a graphically accurate, but basically useless slide-show.

What I really want is much simpler: a representation of the PDF as it would have looked if its content had been created as a simple web page:

- it would be mostly text
- the text layout will depend on the shape and size of your browser
- only paragraphs should be preserved, text inside a paragraph should be just one, long line (as it is in HTML)
- images are OK, but only the foreground ones, and there doesn't have to be fancy handling, like the text flowing around them etc;
- tables would be good too

And that's it! And I can live without the images and tables - they can be hand-edited in later, but I think they may be fairly easy to handle.

The tools I have seen so far seem to concentrate on producing a precise rendering of the PDF document, wrapped in an HTML framework so it can be displayed in a browser. But that defeats the purpose of using a web browser - the browser should do the hard work of formatting the document, otherwise, why not just use a PDF viewer?
 
Old 11-18-2013, 01:12 AM   #7
John VV
LQ Muse
 
Registered: Aug 2005
Location: A2 area Mi.
Posts: 17,627

Rep: Reputation: 2651
Quote:
- the text layout will depend on the shape and size of your browser
That is done by keeping the text in a database table
and rendering it dynamically when you visit the site.


Well, you can run OCR on the PNGs (well, PNG to PPM to text file)
and use the text as you need.

Standard fonts are no problem for even the basic OCR that is in most distros.

Quote:
only paragraphs should be preserved, text inside a paragraph should be just one, long line (as it is in HTML)
I take it you use MS Windows and Notepad sees it as one long line, instead of the HTML code formatting.

Try "SciTE" or the non-Microsoft program "Notepad++" (THIS IS NOT AN MS PRODUCT).

Last edited by John VV; 11-18-2013 at 01:16 AM.
 
Old 11-18-2013, 01:34 AM   #8
fargris
LQ Newbie
 
Registered: Sep 2013
Distribution: Debian
Posts: 15

Original Poster
Rep: Reputation: Disabled
Quote:
I take it you use MS Windows and Notepad sees it as one long line, instead of the HTML code formatting.
Up to a point, up to a point ;-) - I use Debian. My preferred editor is vi - I emphasized the 'one, long line' because when the text is extracted using pdftotext (one of the poppler utils), each line in a paragraph in the PDF comes out as a separate line.
 
Old 11-18-2013, 12:59 PM   #9
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,689

Rep: Reputation: 7972
Quote:
Originally Posted by fargris View Post
Up to a point, up to a point ;-) - I use Debian. My preferred editor is vi - I emphasized the 'one, long line' because when the text is extracted using pdftotext (one of the poppler utils), each line in a paragraph in the PDF comes out as a separate line.
I take it you mean "text that's in two separate columns comes out as one line"...otherwise, one line coming OUT as one line isn't really a problem, is it?

And I'll agree with the others and say there's not a really good way to do this, programmatically. I've been playing with this for about six months now, and there's no good, foolproof way to do it that I've found. I *HAVE* found some ways to sidestep some issues, but it's a total kludge....but, it DOES work, and gives decent results. It's a multi-step process, but one that can be scripted:
  • I use pdftotext to rip the text into a file
  • From there, I have a separate routine to get it 'formatted' a bit better; longer lines (since multi-column text isn't good), and stripping out any errant tags/control characters
  • After that, I replace line breaks with the correct HTML tags, and do other such formatting tasks.
  • I use pdfimages to extract any image files from the PDF, then insert the image tag into the HTML output somewhere. This I usually need to adjust manually later, to put the image in the right place on the page.
It's VERY tedious, and the results are spotty at best. A PDF is already formatted, but your browser can be ANYTHING, and can override any settings the document has.
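
For anyone who wants to script the extraction steps above, here is a minimal Python sketch. It is not the routine described in this post, just one possible way to wire the two poppler tools together. The function names, the `img` file prefix, and the output layout are made up for illustration, and the `-png` option assumes a poppler-utils version whose pdfimages supports it.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the pdftotext and pdfimages command lines as argument lists."""
    pdf = Path(pdf_path)
    out = Path(out_dir)
    text_file = out / (pdf.stem + ".txt")
    image_prefix = out / "img"  # hypothetical prefix for extracted images
    return [
        # Step 1: rip the text into a file.
        ["pdftotext", str(pdf), str(text_file)],
        # Step 2: dump embedded images as img-000.png, img-001.png, ...
        ["pdfimages", "-png", str(pdf), str(image_prefix)],
    ]


def image_tags(out_dir):
    """Return one <img> tag per extracted PNG, sorted by file name."""
    files = sorted(Path(out_dir).glob("img-*.png"))
    return ['<img src="%s" alt="">' % f.name for f in files]


def extract(pdf_path, out_dir):
    """Run both tools (poppler-utils must be on PATH) and list the image tags."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(pdf_path, out_dir):
        subprocess.run(cmd, check=True)  # raises if a tool fails
    return image_tags(out_dir)
```

Calling `extract("report.pdf", "out")` would write `out/report.txt` plus the extracted images and return the `<img>` tags, which still have to be moved by hand to the right place in the page.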

Depending on your needs, it may be better to store the text of the article in a database, with the PDF re-ripped to be images (the convert utility can dump a PDF to a JPG file, one page per image, and you can re-create the PDF using those JPGs). That way, the PDF is searchable, but the image in the browser is consistent with the publisher's needs.
 
Old 11-19-2013, 04:11 AM   #10
fargris
LQ Newbie
 
Registered: Sep 2013
Distribution: Debian
Posts: 15

Original Poster
Rep: Reputation: Disabled
Hi TB0ne, good enough is good enough for me :-)

I am not looking to create an HTML replica of a PDF document. Imagine that you were the author of the document, but instead of creating a PDF document, you created a web page. It would have the same text, and the same pictures, but because the medium is different, it will look very different from what a PDF document looks like.

The process I have in mind is something like what pdftotext does: run through the PDF file, extract the text; but in my version each paragraph will be surrounded by <p> ... </p> tags, pictures will be dumped as files, and <img ...> tags will be inserted in the right places. That's exactly what I want - nothing more, nothing less. I very emphatically do not want anything that tries to create an exact replica of the PDF document, whether it is in the form of 1 image per page or anything else (which is what pdftohtml does). I just want a simple, plain, pure HTML file, no CSS, no frames, no fonts ... Something that is just one step from plain ASCII text.

Quote:
A PDF is already formatted, but your browser can be ANYTHING, and can override any settings the document has.
That is precisely what I want: something that is not formatted, but simply lets the browser decide.

Quote:
I take it you mean "text that's in two separate columns comes out as one line"...otherwise, one line coming OUT as one line isn't really a problem, is it?
No, what I meant was that I find it slightly irritating that the output from pdftotext comes out as "...line from PDF file ...<line break>", "...line from PDF file ...<line break>", etc - it would be better if the lines in a paragraph were joined up into just 1 long line, or alternatively, that paragraphs were separated with 2 line breaks; that would make it easier to script afterwards.
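
A small filter can do the joining described here. The sketch below is only an illustration and rests on two assumptions: paragraphs in the extracted text are separated by at least one blank line, and page breaks show up as form-feed characters. The function name is made up. It does not rejoin words hyphenated across line ends, and a paragraph that runs across a page break will be split in two.

```python
import re


def join_paragraphs(raw_text):
    """Turn hard-wrapped text into a list of one-line paragraphs."""
    # Assumption: a form feed marks a page break; treat it as a paragraph break.
    text = raw_text.replace("\f", "\n\n")
    # Split wherever one or more blank (or whitespace-only) lines occur.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        # Collapse line breaks and runs of spaces into single spaces.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append(joined)
    return paragraphs
```

Writing the result back out with `"\n\n".join(paragraphs)` gives one long line per paragraph, with paragraphs separated by two line breaks.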
 
Old 11-19-2013, 08:25 AM   #11
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,689

Rep: Reputation: 7972
Quote:
Originally Posted by fargris View Post
Hi TB0ne, good enough is good enough for me :-)

I am not looking to create an HTML replica of a PDF document. Imagine that you were the author of the document, but instead of creating a PDF document, you created a web page. It would have the same text, and the same pictures, but because the medium is different, it will look very different from what a PDF document looks like.

The process I have in mind is something like what pdftotext does: run through the PDF file, extract the text; but in my version each paragraph will be surrounded by <p> ... </p> tags, pictures will be dumped as files, and <img ...> tags will be inserted in the right places. That's exactly what I want - nothing more, nothing less. I very emphatically do not want anything that tries to create an exact replica of the PDF document, whether it is in the form of 1 image per page or anything else (which is what pdftohtml does). I just want a simple, plain, pure HTML file, no CSS, no frames, no fonts ... Something that is just one step from plain ASCII text.

That is precisely what I want: something that is not formatted, but simply lets the browser decide.
That's exactly what I wound up doing. I typically shove the image references into the file at the bottom, knowing full well I'll have to change the position later. Another low-tech way of doing it would be to just shove a "See image below" line into the file.... This is a thorny issue, and one that I've not seen a good solution to.
Quote:
No, what I meant was that I find it slightly irritating that the output from pdftotext comes out as "...line from PDF file ...<line break>", "...line from PDF file ...<line break>", etc - it would be better if the lines in a paragraph were joined up into just 1 long line, or alternatively, that paragraphs were separated with 2 line breaks; that would make it easier to script afterwards.
Gotcha...and that's one of the problems, since there's no good way to determine where a paragraph ends. Single-column text you could break on the existing whitespace. As you're processing the 'raw' pdftotext output into the HTML, if you find a blank line you could just try to replace it with a paragraph delimiter. But multi-column text could have one paragraph that spans TWO (or more) columns, so counting lines is not going to help, since the lines are short.
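
For the single-column case, the last step of wrapping each paragraph in plain <p> tags could look roughly like the Python sketch below. It assumes the text has already been split into paragraphs (for example on blank lines); the function name and the default title are hypothetical. It does nothing for the multi-column problem.

```python
import html


def paragraphs_to_html(paragraphs, title="Converted document"):
    """Wrap each paragraph in <p> tags inside a bare HTML page (no CSS, no fonts)."""
    # Escape &, < and > so text from the PDF cannot be mistaken for markup.
    body = "\n".join("<p>" + html.escape(p) + "</p>" for p in paragraphs)
    return (
        "<html>\n"
        "<head><title>" + html.escape(title) + "</title></head>\n"
        "<body>\n" + body + "\n</body>\n"
        "</html>\n"
    )
```

Image tags can then be appended before `</body>` and moved into position by hand afterwards.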
 