LinuxQuestions.org
Just notes on little "how-tos", so that I can find how to do something I've already done when I need to do it again and no longer remember, which is not unlikely. Hopefully they can be useful to others, but I can't guarantee that anything here will work, or that it won't even make things worse.

Gruesome PDF to JPG converter script

Posted 03-28-2014 at 08:57 PM by the dsc
Updated 04-10-2014 at 12:11 PM by the dsc (fixing a bug)
Tags pdf

Requires gs, imagemagick, jpegoptim, cpulimit, and "coolloop" (or maybe not).

Has no options or anything.

You run it like:

howeveryounameit.sh appropriate-filename portable-document-file-to-convert.pdf

The result will be:

appropriate-filename-001.jpg, appropriate-filename-002.jpg and so on.

So don't add numbers or an extension to the first parameter ($1). I guess it's safer to avoid spaces in either argument as well, even between quotes. The number padding is arbitrarily three digits, as set by "%03d" on the gs line. Change the 3 to a 4 for an extra 0, and so on.
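To see what the padding does without running gs, bash's own printf accepts the same C-style "%03d" format. A small sketch; the function name page_name is made up for illustration and is not part of the script:

```shell
#!/bin/bash
# page_name is a hypothetical helper, only here to illustrate the zero padding.
# $1 = base name, $2 = page number, $3 = number of digits (defaults to 3)
page_name() {
    # %0*d takes the field width from the argument list, then the number itself
    printf '%s-%0*d.jpg\n' "$1" "${3:-3}" "$2"
}

page_name appropriate-filename 1      # appropriate-filename-001.jpg
page_name appropriate-filename 12 4   # appropriate-filename-0012.jpg
```

With three digits the names sort correctly up to page 999; a longer PDF would need "%04d".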


I found this "gs" line googling around, and for some reason I found it better than how imagemagick's own "convert" deals with PDF (I don't even remember why). It creates somewhat large images (JPG already, though PNG would probably be better at this stage; I've got to fix that sometime) that may have ugly "borders"/leftovers that PDF readers automatically crop on display. Once all pages are converted to JPG in /tmp/name-of-choice/, imagemagick's convert trims and resizes them (shrinking pages taller than 1300px) and saves the results in /dev/shm. In some tests I did once, that was somewhat faster than mogrify, perhaps because my HDD is old and slow (and still surviving longer than some newer ones I had). Finally, jpegoptim optimizes those and saves them in the current folder.

Code:
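#!/bin/bash
# Usage: howeveryounameit.sh appropriate-filename portable-document-file-to-convert.pdf
# Refuse to run without both arguments; an empty $1 would make the rm lines hit /tmp itself.
[ $# -eq 2 ] || { echo "usage: $0 output-basename input.pdf" >&2 ; exit 1 ; }

# Scratch directories: create them, or empty them if left over from an earlier run.
mkdir -p "/tmp/$1" "/dev/shm/$1"
rm -f "/tmp/$1"/* "/dev/shm/$1"/*

echo gs...
# Throttle any process named gs to 25% CPU; cpulimit stays in the background until killed below.
cpulimit -l 25 -e gs &> /dev/null &
cpulpid1=$!

# Step 1: render each page to /tmp/$1/$1-NNN.jpg at 130 dpi.
gs -q -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=200000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=jpeg -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r130 "-sOutputFile=/tmp/$1/$1-%03d.jpg" "$2"

printf "converting..."
# Step 2: trim borders, shrink pages taller than 1300px, write the result to /dev/shm.
for i in "/tmp/$1"/*.jpg ; do convert -interlace none -strip -filter Lanczos -trim +repage -sampling-factor 1x1 -quality 75 -resize x1300\>  --  "$i" "/dev/shm/$1/$(basename "$i")" ; printf "... " ; coolloop ; done

# Step 3: optimize into the current folder, then copy any page that did not end up there.
for i in "/dev/shm/$1"/*.jpg ; do jpegoptim -d./ -m75 --strip-all --all-progressive "$i" ; coolloop ; done
for i in "/dev/shm/$1"/*.jpg ; do if [ ! -f "./$(basename "$i")" ] ; then cp "$i" ./ ; fi ; done

kill -kill $cpulpid1 &> /dev/null

# Clean up the scratch directories.
rm -f "/tmp/$1"/* "/dev/shm/$1"/*
rmdir "/tmp/$1" "/dev/shm/$1"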
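
For a feel of the sizes involved: -r130 means 130 pixels per inch, so the pixel size of each page depends on the PDF's page size, and -resize x1300\> only touches pages that come out taller than 1300px. A rough sketch of the arithmetic, assuming a US Letter page (8.5 x 11 in) as the example and ignoring whatever -trim cuts off first. The helper name shrunk_size is made up, and the width ImageMagick actually writes may differ by a pixel because of rounding:

```shell
#!/bin/bash
# shrunk_size is a hypothetical helper that mimics what "-resize x1300>" aims for:
# pages taller than the limit are scaled down to the limit, keeping the aspect
# ratio; shorter pages are left alone. Integer math, so the width is rounded down.
# $1 = width in px, $2 = height in px, $3 = height limit (defaults to 1300)
shrunk_size() {
    local w=$1 h=$2 max=${3:-1300}
    if [ "$h" -gt "$max" ] ; then
        echo "$(( w * max / h ))x$max"
    else
        echo "${w}x${h}"
    fi
}

# US Letter at 130 dpi: 8.5 in * 130 = 1105 px wide, 11 in * 130 = 1430 px tall
shrunk_size 1105 1430   # 1004x1300
shrunk_size 800 1000    # 800x1000 (already short enough, left untouched)
```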

EDIT: For some PDFs, it's far better to do the first step with pdftoppm instead of gs. Pdftoppm somehow generates nicely "rendered" images (in PPM format, obviously), whereas gs produces somewhat dithered or pure black-and-white images that lose lots of gradations and the text "anti-aliasing". I think pdftoppm also generates the images more or less as a "screen capture" of the PDF page, whereas gs seems to deal more directly with the "raw source" of the PDF, which isn't always a literal image but has text and images as separate elements. The result is that gs sometimes won't generate any image for some pages, and will also "disassemble" the images and text, more or less like a decompressed epub, I guess. Except that the text is gone.
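
A minimal sketch of that swap, assuming poppler's pdftoppm is installed; it has not been tried against the rest of the script. The function name render_pages is made up. Note that the output files are .ppm, so the convert loop would have to read *.ppm instead of *.jpg, and that pdftoppm picks the number padding by itself:

```shell
#!/bin/bash
# render_pages is a hypothetical replacement for the gs line.
# $1 = output base name, $2 = PDF file
render_pages() {
    # -r 130 keeps the same resolution as the gs line; the files come out as
    # /tmp/$1/$1-<page number>.ppm, with the padding chosen by pdftoppm itself.
    pdftoppm -r 130 "$2" "/tmp/$1/$1"
}

# Only does something when called with both arguments, e.g.: thisfile.sh name file.pdf
if [ $# -eq 2 ] ; then
    mkdir -p "/tmp/$1"
    render_pages "$1" "$2"
fi
```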

One may want this temporary step to take place in /tmp or /dev/shm, if there is enough RAM. PPM files can be rather large, like 15 MB a page, which is ridiculously disproportionate when the entire PDF itself is smaller than 15 MB. Perhaps it's possible to pipe the output of individual pages so that there are no temporary files at all.
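
A hedged sketch of the piping idea, assuming poppler's pdfinfo and pdftoppm: when pdftoppm is given no output name it is commonly used to write the page to standard output, and convert can read an image from standard input as "-". Both function names are made up, only a subset of the convert options from the script is used, and cpulimit and jpegoptim are left out:

```shell
#!/bin/bash
# Both function names below are hypothetical, for illustration only.

# Reads pdfinfo output on stdin and prints the number after "Pages:".
page_count() {
    awk '/^Pages:/ { print $2 }'
}

# $1 = output base name, $2 = PDF file
convert_piped() {
    local n total
    total=$(pdfinfo "$2" | page_count)
    for (( n = 1 ; n <= total ; n++ )) ; do
        # -f and -l select a single page; with no output name the PPM goes to stdout
        pdftoppm -r 130 -f "$n" -l "$n" "$2" |
            convert - -trim +repage -quality 75 -resize x1300\> "$(printf '%s-%03d.jpg' "$1" "$n")"
    done
}

# Only does something when called with both arguments, e.g.: thisfile.sh name file.pdf
if [ $# -eq 2 ] ; then
    convert_piped "$1" "$2"
fi
```

This starts pdftoppm once per page, so it trades the temporary files for the cost of reopening the PDF each time.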
All times are GMT -5.
