PDF to html conversion?
Before you jump to conclusions: I have already looked and found tools like pdftohtml.
I am looking for a tool that can do something similar to pdftohtml, but simpler: I want 1 long html page that looks like it had been designed for html, not a true (-ish) representation in html of what the PDF document looks like. So, I want an html page that has the pictures in the right places. PDF paragraphs should become HTML paragraphs (<p> ... </p>) - with no hard line breaks and no fonts etc. Is there any such tool? |
I'd think the only perfect tool is from Adobe. Every other tool might need manual inspection.
|
Hi jefro - thanks for replying. I feared as much - but still, even an imperfect tool might do.
Or I could knit something together myself; this is open source, after all. pdftotext almost does it - one just has to run it through a filter to add HTML and detect paragraphs. The only problem is to put the pictures in the right place. |
I've never seen a perfect tool. You always have to go back and inspect. Maybe OpenOffice one at a time but that might take forever?
Maybe others have some thoughts? |
I take it you have not used LibreOffice?
Right-click the PDF and select "open with libreoffice4", then export as HTML (the export is customizable). Then, in your browser, open the HTML file that has the same name as the PDF. You might want to use an ODT document as an intermediate step. |
I do use LibreOffice. However, when I open a PDF file, it insists that it must be a drawing, or a collection of drawings, and the export is exactly all the things I don't want: a collection of PNGs that represent a graphically accurate, but basically useless, slide show.
What I really want is much simpler: a representation of the PDF as it would have looked if its content had been created as a simple web page:
- it would be mostly text
- the text layout would depend on the shape and size of your browser
- only paragraphs should be preserved; text inside a paragraph should be just one long line (as it is in HTML)
- images are OK, but only the foreground ones, and there doesn't have to be fancy handling, like the text flowing around them etc.
- tables would be good too
And that's it! And I can live without the images and tables - they can be hand-edited in later, but I think they may be fairly easy to handle.
The tools I have seen so far seem to concentrate on producing a precise rendering of the PDF document, wrapped in an HTML framework so it can be displayed in a browser. But that defeats the purpose of using a web browser - the browser should do the hard work of formatting the document; otherwise, why not just use a PDF viewer? |
Quote:
and dynamically drawn when you visit the site
Well, you can run OCR on the PNGs (well, PNG to PPM to text file) and use the text as you need. Standard fonts are no problem for even the basic OCR that is in most distros.
Quote:
Try "SciTE" or the non-Microsoft program "Notepad++" (THIS IS NOT AN MS PRODUCT). |
And I'll agree with the others and say there's not a really good way to do this programmatically. I've been playing with this for about six months now, and I haven't found a good, foolproof way to do it. I *HAVE* found some ways to sidestep some issues. It's a total kludge... but it DOES work, and gives decent results. It's a multi-step process, but one that can be scripted:
Depending on your needs, it may be better to store the text of the article in a database, with the PDF re-ripped to be images (the convert utility can dump a PDF to JPG files, one page per image, and you can re-create the PDF using those JPGs). That way, the text is searchable, but the image in the browser is consistent with the publisher's needs. |
Hi TB0ne, good enough is good enough for me :-)
I am not looking to create an HTML replica of a PDF document. Imagine that you were the author of the document, but instead of creating a PDF document, you created a web page. It would have the same text, and the same pictures, but because the medium is different, it would look very different from what a PDF document looks like.
The process I have in mind is something like what pdftotext does: run through the PDF file and extract the text; but in my version each paragraph will be surrounded by <p> ... </p> tags, pictures will be dumped as files, and <img ...> tags will be inserted in the right places. That's exactly what I want - nothing more, nothing less. I very emphatically do not want anything that tries to create an exact replica of the PDF document, whether it is in the form of 1 image per page or anything else (which is what pdftohtml does). I just want a simple, plain, pure HTML file, no CSS, no frames, no fonts ... Something that is just one step from plain ASCII text.
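For what it's worth, the text half of that idea could be sketched roughly like this in Python. This is only a minimal sketch, not an existing tool: the function name text_to_html_paragraphs is made up for illustration, and it assumes the input is pdftotext-style plain text in which blank lines separate paragraphs and form feeds separate pages. It does nothing about pictures or tables.

```python
import html
import re


def text_to_html_paragraphs(text: str) -> str:
    """Wrap blank-line-separated blocks of plain text in <p> ... </p> tags.

    Assumption: paragraphs are separated by blank lines and pages by
    form feeds, as in typical pdftotext output.
    """
    # Treat a page break (form feed) as a paragraph break.
    text = text.replace("\f", "\n\n")
    paragraphs = []
    # One or more blank (or whitespace-only) lines end a paragraph.
    for block in re.split(r"\n\s*\n", text):
        # Drop the hard line breaks: join everything with single spaces.
        joined = " ".join(block.split())
        if joined:
            # Escape &, < and > so the text is safe inside HTML.
            paragraphs.append("<p>" + html.escape(joined) + "</p>")
    return "\n".join(paragraphs)
```

To try it, one could first run pdftotext on the PDF to get a plain text file, then pass that file's contents to the function and write the result out as an .html file. A real document would probably need more heuristics, since a blank line in the extracted text does not always mean a paragraph break (headers, footers and columns get in the way).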
|