Before you jump to conclusions: I have already looked and found tools like pdftohtml.
I am looking for a tool that can do something similar to pdftohtml, but simpler: I want 1 long html page that looks like it had been designed for html, not a true (-ish) representation in html of what the PDF document looks like.
So, I want an html page that has the pictures in the right places. PDF paragraphs should become HTML paragraphs (<p> ... </p>) - with no hard line breaks and no fonts etc. Is there any such tool?
Hi jefro - thanks for replying. I feared as much - but still, even an imperfect tool might do.
Or I could knit something together myself; this is open source, after all. pdftotext almost does it - one just has to run it through a filter to add HTML and detect paragraphs. The only problem is to put the pictures in the right place.
I do use LibreOffice. However, when I open a PDF file, it insists that it must be a drawing, or a collection of drawings, and the export is exactly all the things I don't want: a collection of PNGs that represent a graphically accurate, but basically useless slide-show.
What I really want is much simpler: a representation of the PDF as it would have looked if its content had been created as a simple web page:
- it would be mostly text
- the text layout will depend on the shape and size of your browser
- only paragraphs should be preserved, text inside a paragraph should be just one, long line (as it is in HTML)
- images are OK, but only the foreground ones, and there doesn't have to be fancy handling, like the text flowing around them etc;
- tables would be good too
And that's it! And I can live without the images and tables - they can be hand-edited in later, but I think they may be fairly easy to handle.
The tools I have seen so far seem to concentrate on producing a precise rendering of the PDF document, wrapped in an HTML framework so it can be displayed in a browser. But that defeats the purpose of using a web browser - the browser should do the hard work of formatting the document, otherwise, why not just use a PDF viewer?
I take it you use MS Windows, and Notepad shows it as one long line instead of the HTML code formatting?
Up to a point, up to a point ;-) - I use Debian. My preferred editor is vi - I emphasized the 'one, long line' because when the text is extracted using pdftotext (one of the poppler utils), each line in a paragraph in the PDF comes out as a separate line.
I take it you mean "text that's in two separate columns comes out as one line"...otherwise, one line coming OUT as one line isn't really a problem, is it?
And I'll agree with the others and say there's not a really good way to do this, programmatically. I've been playing with this for about six months now, and there's no good, foolproof way to do it that I've found. I *HAVE* found some ways to sidestep some issues, but it's a total kludge....but, it DOES work, and gives decent results. It's a multi-step process, but one that can be scripted:
I use pdftotext to rip the text into a file
From there, I have a separate routine to get it 'formatted' a bit better; longer lines (since multi-column text isn't good), and stripping out any errant tags/control characters
After that, I replace line breaks with the correct HTML tags, and do other such formatting tasks.
I use pdfimages to extract any image files from the PDF, then insert the image tag into the HTML output somewhere. This I usually need to adjust manually later, to put the image in the right place on the page.
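For anyone wanting to script the two extraction steps above, here is a rough sketch, not the poster's actual routine. It only builds and runs the pdftotext and pdfimages command lines; the function names, the output layout and the "img" file prefix are made up for illustration, and it assumes the poppler utilities are on the PATH.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the pdftotext and pdfimages command lines for one PDF.

    The output file names are an arbitrary choice for this sketch.
    """
    out = Path(out_dir)
    stem = Path(pdf_path).stem
    # -nopgbrk stops pdftotext from writing form feeds between pages.
    text_cmd = ["pdftotext", "-nopgbrk", str(pdf_path), str(out / f"{stem}.txt")]
    # -png asks pdfimages for PNG output; the last argument is the file-name prefix.
    image_cmd = ["pdfimages", "-png", str(pdf_path), str(out / "img")]
    return text_cmd, image_cmd


def rip(pdf_path, out_dir):
    """Run both tools; raises CalledProcessError if either one fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(pdf_path, out_dir):
        subprocess.run(cmd, check=True)
```

Calling rip("report.pdf", "out") would leave out/report.txt plus the extracted image files in the out directory; placing the images in the text is still a manual job.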
It's VERY tedious, and the results are spotty at best. A PDF is already formatted, but your browser can be ANYTHING, and can override any settings the document has.
Depending on your needs, it may be better to store the text of the article in a database, with the PDF re-ripped to be images (the convert utility can dump a PDF to a JPG file, one page per image, and you can re-create the PDF using those JPGs). That way, the PDF is searchable, but the image in the browser is consistent with the publisher's needs.
I am not looking to create an HTML replica of a PDF document. Imagine that you were the author of the document, but instead of creating a PDF document, you created a web page. It would have the same text, and the same pictures, but because the medium is different, it will look very different from what a PDF document looks like.
The process I have in mind is something like what pdftotext does: run through the PDF file, extract the text; but in my version each paragraph will be surrounded by <p> ... </p> tags, pictures will be dumped as files, and <img ...> tags will be inserted in the right places. That's exactly what I want - nothing more, nothing less. I very emphatically do not want anything that tries to create an exact replica of the PDF document, whether it is in the form of 1 image per page or anything else (which is what pdftohtml does). I just want a simple, plain, pure HTML file, no CSS, no frames, no fonts ... Something that is just one step from plain ASCII text.
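The output side of that idea might look something like the sketch below. It assumes the paragraphs have already been recovered as a list of strings, and that somebody (probably by hand) supplies a mapping from paragraph index to image file name; both the function name and that mapping are hypothetical, not part of any existing tool.

```python
import html


def paragraphs_to_html(paragraphs, images=None, title="Converted document"):
    """Build a bare HTML page: one <p> per paragraph, no CSS, no fonts.

    images maps a paragraph index to an image file name; the <img> tag
    is written directly after that paragraph.
    """
    images = images or {}
    parts = ["<html>", "<head><title>%s</title></head>" % html.escape(title), "<body>"]
    for index, text in enumerate(paragraphs):
        # Escape <, > and & so stray characters cannot break the markup.
        parts.append("<p>" + html.escape(text) + "</p>")
        if index in images:
            parts.append('<img src="%s" alt="">' % html.escape(images[index], quote=True))
    parts += ["</body>", "</html>"]
    return "\n".join(parts)
```

With this shape the browser does all the layout, and moving an image is just a matter of changing its index in the mapping.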
Quote:
A PDF is already formatted, but your browser can be ANYTHING, and can override any settings the document has.
That is precisely what I want: something that is not formatted, but simply lets the browser decide.
Quote:
I take it you mean "text that's in two separate columns comes out as one line"...otherwise, one line coming OUT as one line isn't really a problem, is it?
No, what I meant was that I find it slightly irritating that the output from pdftotext comes out as "...line from PDF file ...<line break>", "...line from PDF file ...<line break>", etc - it would be better if the lines in a paragraph were joined up into just 1 long line, or alternatively, that paragraphs were separated with 2 line breaks; that would make it easier to script afterwards.
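Joining the lines up afterwards is not hard to script, provided the text already has blank lines between paragraphs. A minimal sketch under that assumption (it also treats a form feed, which pdftotext writes between pages, as a paragraph boundary, which will wrongly split a paragraph that runs across a page break):

```python
import re


def join_paragraph_lines(raw_text):
    """Turn hard-wrapped text into a list of one-line paragraphs."""
    paragraphs = []
    # Assumption: a form feed (page break) also ends a paragraph.
    text = raw_text.replace("\f", "\n\n")
    # One or more blank (whitespace-only) lines separate paragraphs.
    for block in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return paragraphs
```

Writing the result back with "\n\n".join(...) gives the "paragraphs separated with 2 line breaks" form, which is easy to feed into later steps.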
That's exactly what I wound up doing. I typically shove the image references into the file at the bottom, knowing full well I'll have to change the position later. Another low-tech way of doing it would be to just shove a "See image below" line into the file.... This is a thorny issue, and one that I've not seen a good solution to.
Quote:
No, what I meant was that I find it slightly irritating that the output from pdftotext comes out as "...line from PDF file ...<line break>", "...line from PDF file ...<line break>", etc - it would be better if the lines in a paragraph were joined up into just 1 long line, or alternatively, that paragraphs were separated with 2 line breaks; that would make it easier to script afterwards.
Gotcha...and that's one of the problems, since there's no good way to determine where a paragraph ends. Single-column text you could break on the existing whitespace. As you're processing the 'raw' pdftotext output into the HTML, if you find a blank line you could just try to replace it with a paragraph delimiter. But multi-column could have one paragraph that spans TWO (or more) columns, so counting lines is not going to help, since they're short.
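When there are no blank lines to go on, one possible guess (purely a heuristic, not something anyone in this thread has tested) is that a line which ends a sentence and is clearly shorter than the widest line in the block is the last line of a paragraph. The 0.75 ratio and the list of sentence-ending characters below are arbitrary assumptions, and this does nothing to untangle text whose columns have been interleaved.

```python
def split_on_short_lines(lines, ratio=0.75, enders=(".", "!", "?", ":")):
    """Guess paragraph ends in text from a single column.

    A line closes a paragraph if it is shorter than ratio * the widest
    line and ends with one of the characters in enders.
    """
    lines = [line.strip() for line in lines if line.strip()]
    if not lines:
        return []
    width = max(len(line) for line in lines)
    paragraphs, current = [], []
    for line in lines:
        current.append(line)
        if len(line) < ratio * width and line.endswith(enders):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Whatever is left over becomes the final paragraph.
        paragraphs.append(" ".join(current))
    return paragraphs
```

It will misfire on short headings, list items and paragraphs whose last line happens to fill the column, so the output still needs a read-through.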