PDF to html conversion?

fargris · 11-16-2013, 03:46 AM

Before you jump to conclusions: I have already looked and found tools like pdftohtml.

I am looking for a tool that can do something similar to pdftohtml, but simpler: I want 1 long html page that looks like it had been designed for html, not a true (-ish) representation in html of what the PDF document looks like.

So, I want an html page that has the pictures in the right places. PDF paragraphs should become HTML paragraphs (<p> ... </p>) - with no hard line breaks and no fonts etc. Is there any such tool?

jefro · 11-16-2013, 05:00 PM

I'd think the only perfect tool is from Adobe. Every other tool might need manual inspection.

fargris · 11-17-2013, 02:10 AM

Hi jefro - thanks for replying. I feared as much - but still, even an imperfect tool might do.

Or I could knit something together myself; this is open source, after all. pdftotext almost does it - one just has to run it through a filter to add HTML and detect paragraphs. The only problem is to put the pictures in the right place.
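A minimal sketch of such a filter, assuming the text has already been extracted and that blank lines separate paragraphs (which pdftotext output does not always guarantee). The function name `text_to_html` is made up for illustration; it does nothing about pictures.

```python
import html
import re


def text_to_html(text: str) -> str:
    """Turn plain text into bare HTML, one <p> per blank-line-separated block."""
    # Assumption: a run of one or more blank lines marks a paragraph boundary.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        # Collapse the hard line breaks inside a paragraph into single spaces.
        joined = " ".join(block.split())
        if joined:
            # Escape &, < and > so the text cannot be mistaken for markup.
            paragraphs.append("<p>" + html.escape(joined) + "</p>")
    return "\n".join(paragraphs)


sample = "First line\nsecond line\n\nA & B\n"
print(text_to_html(sample))
```

Each paragraph comes out as a single long line, so the browser decides where to wrap it.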

jefro · 11-17-2013, 05:51 PM

I've never seen a perfect tool. You always have to go back and inspect. Maybe OpenOffice, one at a time, but that might take forever?

Maybe others have some thoughts?

John VV · 11-17-2013, 05:59 PM

I take it you have not used LibreOffice ?

r-click the pdf and select "open with libreoffice4"
then export as html (the export is customizable)

then in your browser open the html file named after the pdf

but you might want to use an odt document as an intermediate step

fargris · 11-18-2013, 12:58 AM

I do use LibreOffice. However, when I open a PDF file, it insists that it must be a drawing, or a collection of drawings, and the export is exactly all the things I don't want: a collection of PNGs that represent a graphically accurate, but basically useless, slide-show.

What I really want is much simpler: a representation of the PDF as it would have looked if its content had been created as a simple web page:

- it would be mostly text
- the text layout will depend on the shape and size of your browser
- only paragraphs should be preserved; text inside a paragraph should be just one long line (as it is in HTML)
- images are OK, but only the foreground ones, and there doesn't have to be fancy handling, like the text flowing around them etc;
- tables would be good too

And that's it! And I can live without the images and tables - they can be hand-edited in later, but I think they may be fairly easy to handle.

The tools I have seen so far seem to concentrate on producing a precise rendering of the PDF document, wrapped in an HTML framework so it can be displayed in a browser. But that defeats the purpose of using a web browser - the browser should do the hard work of formatting the document, otherwise, why not just use a PDF viewer?

John VV · 11-18-2013, 01:12 AM

Quote:

- the text layout will depend on the shape and size of your browser

that is done by having the text in a database table
and dynamically drawn when you visit the site

well you can use an OCR on the PNGs (well, png to ppm to text file)
and use the text as you need

standard fonts are no problem for even the basic OCR that is in most distros

Quote:

only paragraphs should be preserved; text inside a paragraph should be just one long line (as it is in HTML)

I take it you use MS Windows and Notepad sees it as one long line, instead of the HTML code formatting.

try "SciTE" or the non-Microsoft program "notepad++" (THIS IS NOT A MS PRODUCT)

fargris · 11-18-2013, 01:34 AM

Quote:

I take it you use MS Windows and Notepad sees it as one long line, instead of the HTML code formatting.

Up to a point, up to a point ;-) - I use Debian. My preferred editor is vi - I emphasized the 'one, long line' because when the text is extracted using pdftotext (one of the poppler utils), each line in a paragraph in the PDF comes out as a separate line.

TB0ne · 11-18-2013, 12:59 PM

Quote:

Originally Posted by fargris

Up to a point, up to a point ;-) - I use Debian. My preferred editor is vi - I emphasized the 'one, long line' because when the text is extracted using pdftotext (one of the poppler utils), each line in a paragraph in the PDF comes out as a separate line.

I take it you mean "text that's in two separate columns comes out as one line"...otherwise, one line coming OUT as one line isn't really a problem, is it?

And I'll agree with the others and say there's not a really good way to do this, programmatically. I've been playing with this for about six months now, and there's no good, foolproof way to do it that I've found. I *HAVE* found some ways to sidestep some issues, but it's a total kludge....but, it DOES work, and gives decent results. It's a multi-step process, but one that can be scripted:

- I use pdftotext to rip the text into a file
- From there, I have a separate routine to get it 'formatted' a bit better; longer lines (since multi-column text isn't good), and stripping out any errant tags/control characters
- After that, I replace line breaks with the correct HTML tags, and do other such formatting tasks.
- I use pdfimages to extract any image files from the PDF, then insert the image tag into the HTML output somewhere. This I usually need to adjust manually later, to put the image in the right place on the page.
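The extraction and clean-up steps above could be scripted roughly as follows. This is only a sketch, not the actual routine described in the post: the helper names (`clean_text`, `image_tags`, `extract`) are invented, it assumes the poppler tools pdftotext and pdfimages are on the PATH, and it assumes the installed pdfimages accepts `-png`.

```python
import re
import subprocess
from pathlib import Path
from typing import List, Tuple


def clean_text(raw: str) -> str:
    """Drop control characters, keeping tabs and newlines."""
    # Form feeds (emitted by pdftotext at page ends) become blank lines first.
    raw = raw.replace("\f", "\n\n")
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", raw)


def image_tags(names: List[str]) -> str:
    """One bare <img> tag per extracted file, to be moved by hand later."""
    return "\n".join(f'<img src="{name}">' for name in sorted(names))


def extract(pdf: str, workdir: str) -> Tuple[str, List[str]]:
    """Run the poppler tools and return (cleaned text, image file names)."""
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    # "-" as the output name makes pdftotext write to stdout.
    text = subprocess.run(["pdftotext", pdf, "-"], check=True,
                          capture_output=True, text=True).stdout
    # pdfimages writes files named <prefix>-NNN.<ext> under the given prefix.
    subprocess.run(["pdfimages", "-png", pdf, str(out / "img")], check=True)
    return clean_text(text), [p.name for p in out.glob("img-*.png")]
```

Calling `extract("article.pdf", "out")` (hypothetical file names) gives the text to feed into the HTML step, and `image_tags(...)` gives the tags to append at the bottom for later repositioning.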

It's VERY tedious, and the results are spotty at best. A PDF is already formatted, but your browser can be ANYTHING, and can override any settings the document has.

Depending on your needs, it may be better to store the text of the article in a database, with the PDF re-ripped to be images (the convert utility can dump a PDF to a JPG file, one page per image, and you can re-create the PDF using those JPGs). That way, the PDF is searchable, but the image in the browser is consistent with the publisher's needs.

fargris · 11-19-2013, 04:11 AM

Hi TB0ne, good enough is good enough for me :-)

I am not looking to create an HTML replica of a PDF document. Imagine that you were the author of the document, but instead of creating a PDF document, you created a web page. It would have the same text, and the same pictures, but because the medium is different, it will look very different from what a PDF document looks like.

The process I have in mind is something like what pdftotext does: run through the PDF file, extract the text; but in my version each paragraph will be surrounded by <p> ... </p> tags, pictures will be dumped as files, and <img ...> tags will be inserted in the right places. That's exactly what I want - nothing more, nothing less. I very emphatically do not want anything that tries to create an exact replica of the PDF document, whether it is in the form of 1 image per page or anything else (which is what pdftohtml does). I just want a simple, plain, pure HTML file, no CSS, no frames, no fonts ... Something that is just one step from plain ASCII text.
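To make the target output concrete, here is a rough sketch of the rendering half of that process. It does not solve the hard part (finding out where each picture belongs); it assumes a marker line such as `@@IMG img-000.png@@` has been placed in the extracted text, by hand or by some other tool. The marker syntax and the `render` function are invented for this example.

```python
import html
import re

# Hypothetical marker, sitting on a line of its own between blank lines.
MARKER = re.compile(r"^@@IMG\s+(\S+)@@$")


def render(text: str) -> str:
    """Emit <p> for text blocks and <img> where a marker block stands alone."""
    parts = []
    for block in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        match = MARKER.match(lines[0]) if len(lines) == 1 else None
        if match:
            # A marker block becomes a bare image tag, with no styling.
            src = html.escape(match.group(1), quote=True)
            parts.append(f'<img src="{src}">')
        else:
            # Anything else is a paragraph, joined into one long line.
            parts.append("<p>" + html.escape(" ".join(lines)) + "</p>")
    return "\n".join(parts)


print(render("Intro text\ncontinues.\n\n@@IMG img-000.png@@\n\nAfter."))
```

The result is plain HTML with no CSS and no fonts, so the browser makes all the layout decisions.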

Quote:

A PDF is already formatted, but your browser can be ANYTHING, and can override any settings the document has.

That is precisely what I want: something that is not formatted, but simply lets the browser decide.

Quote:

I take it you mean "text that's in two separate columns comes out as one line"...otherwise, one line coming OUT as one line isn't really a problem, is it?

No, what I meant was that I find it slightly irritating that the output from pdftotext comes out as "...line from PDF file ...<line break>", "...line from PDF file ...<line break>", etc - it would be better if the lines in a paragraph were joined up into just 1 long line, or alternatively, that paragraphs were separated with 2 line breaks; that would make it easier to script afterwards.

TB0ne · 11-19-2013, 08:25 AM

Quote:

Originally Posted by fargris

Hi TB0ne, good enough is good enough for me :-)

I am not looking to create an HTML replica of a PDF document. Imagine that you were the author of the document, but instead of creating a PDF document, you created a web page. It would have the same text, and the same pictures, but because the medium is different, it will look very different from what a PDF document looks like.

The process I have in mind is something like what pdftotext does: run through the PDF file, extract the text; but in my version each paragraph will be surrounded by <p> ... </p> tags, pictures will be dumped as files, and <img ...> tags will be inserted in the right places. That's exactly what I want - nothing more, nothing less. I very emphatically do not want anything that tries to create an exact replica of the PDF document, whether it is in the form of 1 image per page or anything else (which is what pdftohtml does). I just want a simple, plain, pure HTML file, no CSS, no frames, no fonts ... Something that is just one step from plain ASCII text.

That is precisely what I want: something that is not formatted, but simply lets the browser decide.

That's exactly what I wound up doing. I typically shove the image references into the file at the bottom, knowing full well I'll have to change the position later. Another low-tech way of doing it would be to just shove a "See image below" line into the file....

This is a thorny issue, and one that I've not seen a good solution to.

Quote:

No, what I meant was that I find it slightly irritating that the output from pdftotext comes out as "...line from PDF file ...<line break>", "...line from PDF file ...<line break>", etc - it would be better if the lines in a paragraph were joined up into just 1 long line, or alternatively, that paragraphs were separated with 2 line breaks; that would make it easier to script afterwards.

Gotcha...and that's one of the problems, since there's no good way to determine where a paragraph ends. Single-column text you could break on the existing whitespace. As you're processing the 'raw' pdftotext output into the HTML, if you find a blank line you could just try to replace it with a paragraph delimiter. But multi-column could have one paragraph that spans TWO (or more) columns, so counting lines is not going to help, since they're short.
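For the single-column case, the blank-line rule can be combined with a guess for paragraphs that run together without a blank line. The sketch below is a heuristic and nothing more: it assumes a paragraph's last line is noticeably shorter than a typical full line and ends in sentence punctuation. The function name and the 0.8 ratio are arbitrary choices, and it will misfire on headings, lists, and the multi-column layouts described above.

```python
from typing import List


def split_paragraphs(lines: List[str], ratio: float = 0.8) -> List[str]:
    """Group lines into paragraphs using blank lines plus a short-line guess."""
    widths = sorted(len(l.strip()) for l in lines if l.strip())
    if not widths:
        return []
    # Treat a width near the upper quartile as the "full line" width.
    full = widths[(3 * len(widths)) // 4]
    paragraphs, current = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line always closes the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
        # Guess: a short line ending in punctuation is the end of a paragraph.
        short = len(stripped) < ratio * full
        if short and stripped[-1] in ".!?:":
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Each returned string is one paragraph joined into a single long line, ready to be wrapped in paragraph tags.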