[SOLVED] Change pdftoppm output to 16 bits or 24 bits
I am working to make a script that automatically converts webpages to png or jpg files. First of all, I try to convert webpages to pdf files. I have managed it using wkhtmltopdf. After this conversion, I tried several programs to convert pdf to png or jpg.
As you all know, there is the imagemagick library and its convert command. This is too slow for me: it takes at least 20 seconds to convert a PDF to a PNG.
I tried NetPBM with the "pdftops" command, but I failed to properly convert the PDF file to a PNG file.
Later on, a friend on LQ suggested that I use "pdftoppm". This works perfectly: I managed to convert PDF to PPM and then used NetPBM to convert to PNG or JPG (I also used "cjpeg" instead of NetPBM).
After a few hours, I found that pdftoppm can directly convert a PDF file to a PNG file! That is awesome, because I can do it with a single command. However, the produced PNGs are very low quality, even after setting the resolution to 600 dpi (using the -r switch).
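As a sketch (not from the thread), the pdftoppm invocation can be wrapped in a small Python helper. The file name site.pdf and the output prefix are just examples, and actually running the command of course requires poppler-utils to be installed:

```python
import subprocess

def build_pdftoppm_cmd(pdf, prefix, dpi=300, fmt="png"):
    # pdftoppm writes one file per page, named <prefix>-<page>.<fmt>;
    # -r sets the rasterization resolution in DPI.
    return ["pdftoppm", "-" + fmt, "-r", str(dpi), pdf, prefix]

cmd = build_pdftoppm_cmd("site.pdf", "site", dpi=600)
# subprocess.run(cmd, check=True)  # uncomment if poppler-utils is installed
```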
I have searched for the cause of this issue, and found that the produced PNG file is 8 bits in all of the cases above.
Is it possible to convert pdftoppm output to 16 or 24 bits?
Last edited by Cyrolancer; 01-22-2012 at 09:12 AM.
generates othername-1.png directly, and is about 6% faster, too.
Carefully comparing and checking the othername-1.png and basename-1.png images shows that they contain exactly the same data, and are almost exactly the same size. This means that the PNG output option in pdftoppm works very well, for me at least. Even eyeballing the resulting image in Gimp shows that all elements are nicely antialiased (no jaggies), color gradients are smooth (no visible steps), and so on: very satisfactory quality. And I'm very particular about my image quality.
Using the pnmcolormap tool to analyze the intermediate PPM image I created with the first command above,
Code:
pnmcolormap all basename-1.ppm >/dev/null
pnmcolormap: making histogram...
pnmcolormap: too many colors!
pnmcolormap: scaling colors from maxval=255 to maxval=127 to improve clustering...
pnmcolormap: making histogram...
pnmcolormap: 21287 colors found
we can see that the image does have a lot of colors: 21287 in this case, even after considering only the highest 7 bits per component.
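For illustration (a sketch I wrote, not part of the thread), a distinct-color count like pnmcolormap's can be reproduced in pure Python for small binary PPM files (P6 with maxval up to 255):

```python
def count_ppm_colors(data: bytes) -> int:
    # Parse the P6 header: magic, width, height, maxval, separated by
    # whitespace; '#' starts a comment that runs to the end of the line.
    fields = []
    i = 0
    while len(fields) < 4:
        while data[i:i+1].isspace():
            i += 1
        if data[i:i+1] == b"#":
            while data[i] != 0x0A:
                i += 1
            continue
        j = i
        while j < len(data) and not data[j:j+1].isspace():
            j += 1
        fields.append(data[i:j])
        i = j
    assert fields[0] == b"P6", "only binary PPM (P6) is handled here"
    width, height = int(fields[1]), int(fields[2])
    i += 1  # exactly one whitespace byte separates the header from the raster
    pixels = data[i:i + 3 * width * height]
    # Count distinct (R, G, B) triples in the raster.
    return len({pixels[k:k+3] for k in range(0, len(pixels), 3)})
```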
I am using netpbm-2:10.0-12.2 and poppler-utils-0.16.7-2ubuntu2 (the latter providing the pdftoppm command).
Somehow the quality increased a lot after setting the DPI. But there are problems with the gradients, shadows and some colors. Setting the aaVector variable to "no" corrected some of these problems, especially with gradients and colors. There are still some problems with the shadows, but I think there is no way to fix that short of using imagemagick.
After that, I tried
Code:
convert site.pdf site.jpg
and found that the execution time is approximately the same as the above script's.
The main problem is that using only the "convert" command produces a very low resolution jpg image, while the script above produces a well-defined, readable jpg image of acceptable resolution. I think it has something to do with the parameters of the "convert" command.
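As a hedged aside (not from the thread; the 150 dpi value is just an example): with ImageMagick, the usual fix is to pass -density before the input file, since that option controls the resolution at which the PDF is rasterized in the first place. A small helper to build such a command line:

```python
import subprocess

def build_convert_cmd(pdf, out, dpi=150, quality=90):
    # -density must come before the input file: it sets the DPI used to
    # rasterize the PDF; options after the input act on the raster image.
    return ["convert", "-density", str(dpi), pdf, "-quality", str(quality), out]

# subprocess.run(build_convert_cmd("site.pdf", "site.jpg"), check=True)
```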
Thank you for the detailed instructions Nominal Animal. I am using such a script to convert a webpage to JPG.
Have you tried using wkhtmltoimage instead of wkhtmltopdf? The conversion to PDF seems like unnecessary complication to me. The options are described in the README_WKHTMLTOIMAGE file.
I know about wkhtmltoimage, but it is not available in the Debian repos (at least for squeeze). Due to my company's policy, it is not possible to compile programs. When the Debian repos are updated with wkhtmltoimage, I will probably change everything and use it for my purpose.
You could use the Debian webkit libraries with Python bindings to render the HTML using xvfb-run, then save it to a PNG or JPEG file.
Assuming you have python and python-webkit installed, the webkitscreenshot.py script might be a good starting point. I think it restricts the images to the top 1024x768 of the page, though. (I think it uses 1024x768 to render the page, then scales it to the desired size.)
I think it might be better to have the script "tile" the page, saving each tile as a PPM image (to avoid compression overhead). Then, stitch them back together into a single image (possibly cropping out any overlap). That way you wouldn't need to worry about the page size either, you'd always get the entire page. I guess it depends on whether you want "thumbnails" of the web pages, or entire web pages as images.
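The stitching step could look roughly like this (a sketch of my own, assuming equal-width binary P6 tiles with maxval 255, captured top to bottom; cropping out overlap is left as an exercise):

```python
def stitch_ppm_rows(tiles):
    # tiles: list of (width, height, pixel_bytes) tuples.
    # P6 rasters are stored row by row, so vertical concatenation is just
    # joining the pixel data under a new header with the summed height.
    width = tiles[0][0]
    assert all(w == width for w, _, _ in tiles), "tiles must share one width"
    total_height = sum(h for _, h, _ in tiles)
    header = b"P6\n%d %d\n255\n" % (width, total_height)
    return header + b"".join(px for _, _, px in tiles)
```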
I can use Python scripts on the servers and it is probably possible to install any python libraries / extensions to the server.
I think I will not use wkhtmltopdf for security reasons (thanks to unSpawn and okcomputer44 for their support), and I need another option, one that does not use an X server, to process HTML pages into PDF. I am not really sure whether python-webkit needs an X server.
After executing "apt-get install python-webkit" on an OpenVZ Debian-based virtual machine built from the official image provided on the OpenVZ website, these packages show up to be installed:
There are some libraries related to X, but it doesn't seem to install a full X server. That will be good for me, and your suggestion is probably the one I am going to use.
I got curious; I've used browsershots.org and similar before, and having a utility to grab entire pages automatically might come in handy.
I hacked away at the above-linked Python code. As you can see from the timestamps, I only did a quick hack to get it working. I think it has some bugs, or at least wrinkles, left.
The script itself calls Xvfb to provide a virtual X server. It does need the various X libraries mentioned above to be installed, but it does not need a running X server, or a physical display at all. I worked on it in a virtual machine running Debian with a text console only, no desktop environment installed at all, so I am sure of that. Remember to install some fonts (ttf-* packages) to get nicer web pages. Most web pages do not fall back from their named fonts to plain ones very gracefully, so having as many fonts installed as possible will render the pages closer to their designers' intent.
Specifically, on top of a clean, minimal Debian 6.0.3 (Squeeze) install, (graphical desktop environment explicitly unselected and thus not installed at all), I installed
The script calls Xvfb itself internally, so there is no need to use xvfb-run. It will not help; it will just slow things down. I also switched the default Xvfb server number to 2, so that if you do run it on a workstation with an X server, it'll still use Xvfb and not your real X server.
Edited: This version uses urlparse from urlparse to make sure the URL is correctly escaped. If you wish to refer to a local file, use the absolute or relative path (i.e. start the file name or path with /, ./ or ../). Thanks to Cyrolancer for pointing it out!
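In Python 3 terms (the thread's script uses the Python 2 urlparse module, so the function name here is my own invention), the path-versus-URL convention might look like:

```python
import os
from urllib.parse import urlparse

def normalize_target(arg):
    # Arguments starting with /, ./ or ../ are treated as local files,
    # following the convention described above; everything else is a URL.
    if arg.startswith(("/", "./", "../")):
        return "file://" + os.path.abspath(arg)
    if urlparse(arg).scheme == "":
        return "http://" + arg  # assume http for bare host names
    return arg
```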
Run without parameters to see the usage. In a nutshell, the usage is
Code:
python script -o image.png URL
It is pretty fast, too. Linuxquestions Forums page:
Code:
time python url2png -x 1920 -y 1080 -o lq.png http://www.linuxquestions.org/questions/
Saving 1920 x 3081 PNG image 'lq.png'
lq.png: Image saved successfully
real: 0m6.821s
user: 0m2.136s
sys: 0m0.536s

ls -l lq.png
-rw-r--r-- 1 user group 719220 2012-01-25 07:08 lq.png
Most of the time is spent loading the page; a local test page renders in less than a second. (If the Linuxquestions Forums page were local, it would have rendered in less than three seconds, and it's a pretty complex page.)
If you want the script to be silent, omit the print lines.
Note that the screen width and height (specified using the -x and -y options) define the browser window size. Since the image is of the contents, the image size may be larger. If the page is taller or wider, then the image will be taller or wider, too. The layout on most pages depends on the browser window size, though.
Selecting the font size does not matter much on typical webpages, since they define their font sizes in points, not relative to the user default. Changing the DPI (option -z DPI for the script) reported by Xvfb affects only pages that use points (as opposed to pixels) to define the font size. For others, we could use the page zoom feature, but my initial tests showed it was too buggy: sometimes only part of the page would render. I think the page would need a reload or something to render properly when zoom is used.
The main difference from the original script I linked to is that this one renders the window to a pixmap first, then converts the pixmap to a pixbuf, and finally saves the pixbuf as a PNG image directly using the pixbuf save function. The pixmap is necessary to get all page content, not just the part that is "visible" to Xvfb. If you prefer JPEG output (over PNG as used now), change the line to
Security-wise, you could run the script using a dedicated user account, with very limited access to files. You could even wipe the user's home directory clean after each invocation, to make sure that even if a malicious webpage manages to cut through webkit, anything it might manage to save on your server would be wiped out anyway.
If you want to develop the script further, just start a new thread in the Programming forum -- perhaps including the above script and whatever you feel pertinent from this post, as a starting point. Feel free to use the script in any way you like.
Hope you find this useful,
Last edited by Nominal Animal; 01-26-2012 at 02:01 PM.
Reason: Replaced the url = args[0] line.
I think I have found an error in this script. When I try to open a page with a link like "jquery.js?m", I get errors like:
Ah, I forgot webkit does not escape the URL.
I think it is better to fix the URL a bit earlier, in the main function. I think the urlparse function works better, too. The fixed version above will even detect local file paths correctly, if you start the local file reference with /, ./ or ../. (After all, ./filename is always the same as filename.)
Quote:
Originally Posted by Cyrolancer
Well, I really don't know Python much. Maybe I have done something wrong, but after this small change, it works.
Well spotted! I used urlparse() instead so that both local (paths) and remote URLs work.
Quote:
Originally Posted by Cyrolancer
Small edit: I am also getting a warning. I am pretty sure that X has loaded the RANDR extension. Maybe this warning comes from not using a real X server?
Yup, it is a harmless warning. The XRANDR extension is just not enabled for Xvfb. The extension is used to manage resolution changes, display rotation, and that sort of stuff.
The issue has already been reported with a suggested fix (it needs all of eleven lines changed), but somebody would have to send an email to the xorg-devel mailing list and ask for it to be reviewed and included.
Thank you for your corrections to the script. I am a person who knows only the name "python", not Python coding itself. I will gladly accept your solution to this problem.
Just a simple question; because I don't know much about Python, this seemed easy to me, but if it is complicated or hard to do, please ignore the suggestion: is it possible to make Xvfb run silently, without printing errors? Or maybe it could pass the errors to a script, or append all errors and warnings to a text file for later investigation? I couldn't understand the part that runs Xvfb; I tried some changes, but always got errors.
Is it possible to make xvfb run silently, without giving out errors?
Of course. The trick is that it is webkit (actually some Xlib function called by webkit) that outputs the error, not Xvfb. In other words, you need to redirect the standard output and standard error streams elsewhere in the Python code.
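One way to do that (a sketch of my own, not the thread's script): redirect at the file-descriptor level with os.dup2, because warnings printed by C libraries such as Xlib bypass Python's sys.stderr entirely:

```python
import os

def redirect_stderr_to(path):
    # Reassigning sys.stderr only affects Python-level writes; Xlib and
    # webkit write straight to file descriptor 2, so dup2 is required.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.dup2(fd, 2)  # fd 2 now points at the log file
    os.close(fd)    # the duplicate descriptor is no longer needed
```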
This version of the script supports two additional command line arguments: -v and -q.
By default, it will output the image file name to standard output if successful, or an error message to standard error otherwise.
If you use the -q option, it will never print anything.
If you use the -v option (at least once), it will include the size and format of the image in the standard output message.
If the URL cannot be loaded, or if the image file cannot be saved, it will exit with exit status 1. Otherwise, it will exit with exit status 0.
.. your shell will complain about the & characters, because it thinks you want to run five commands in parallel (third one being lf=plcp) instead of one command.
Use single quotes when supplying the URLs by hand, i.e.
This is not specific to this script in any way; this is what you always have to do when using a command-line shell.
Please read the Quoting chapter in the Bash Reference Manual for details and explanations. It's not long! After that, just remember that every command you type or write in a Bash script is first interpreted by the shell; the actual input the command receives is what remains after shell processing.
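As a small illustration (using Python's shlex.quote, which did not come up in the thread), single-quoting is exactly what programmatic quoting does as well:

```python
from shlex import quote

url = "http://www.example.com/?page=1&lf=plcp&x=2"
# quote() wraps the string in single quotes whenever it contains shell
# metacharacters such as &, so the shell passes it through verbatim.
print(quote(url))
```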