Solved: Question about ImageMagick's convert utility and high quality output

Dogs · 06-22-2011, 02:45 PM

Hello,

I am trying to split a PDF into component pages that are of equal quality to the original.

If I use display mypdf.pdf, then the PDF is split into pages of acceptable quality. The problem here is that I have to save each page individually.

If, however, I use convert mypdf.pdf mypdf.bmp, I get the individual pages of the PDF in .BMP format (which is fine, but not exactly what I want), but the quality is substantially less than the original.

I've tried dozens of combinations of commands to try to increase this quality, but to no avail.

Even if I do convert mypdf.pdf mypdfagain.pdf, there is a big loss of quality.

Anyone familiar with splitting a PDF into individual pages without suffering a loss in quality?
Ideally, I would just save all the "scenes/frames" from display, but that feature unfortunately does not seem to exist (though I may try to add it myself if no existing solution turns up).

NOTE: I think part of my problem might be this: identify mypdf.pdf reports the resolution of the original, and after I convert it the resolution of the output is much lower. This could be the source of the quality loss, but I'm not familiar enough with image conversion to say that for sure.
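
A minimal sketch of the resolution point above, assuming the quality loss comes from ImageMagick rasterizing the PDF at its low default density (72 dpi unless told otherwise). The function name pdf_to_images, the 300 dpi default, and the output name pattern are illustrative assumptions, not something from the original post.

```shell
# Rasterize every page of a PDF at a chosen resolution.
# Assumption: 300 dpi is "good enough"; raise or lower it to taste.
pdf_to_images() {
    local infile=$1
    local dpi=${2:-300}
    local base=${infile%.pdf}
    # -density must come BEFORE the input file so that it applies
    # while the PDF is being rendered, not after.
    convert -density "$dpi" "$infile" "${base}-%03d.png"
}

# Usage (one PNG per page, e.g. mypdf-000.png, mypdf-001.png, ...):
#   pdf_to_images mypdf.pdf 300
```

The output is still a set of raster images, so this trades file size for sharpness rather than preserving the original page content.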

Solution----------------

Oh, might help to read the man-page all the way through.

display -write outfile.pdf infile.pdf

It will do an entire book at once.

Whatever this command does, it removes the extra layer (or whatever it is) that prevents OCR from succeeding. I'd really like to understand that technology. What is it about a PDF that lets someone embed metadata in every page so that the only thing OCR or a text search function sees is the embedded text?

Vrajgh · 06-22-2011, 03:18 PM

Do the individual pages need to be in an image format, or would a PDF of each page be acceptable? If a PDF is fine, pdftk (http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/) might be a better tool for the job.
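
A rough sketch of what the pdftk route could look like, using its burst operation to write one PDF per page. The function name burst_pdf and the page_%03d.pdf output pattern are illustrative assumptions.

```shell
# Split a PDF into one single-page PDF per page with pdftk.
burst_pdf() {
    local infile=$1
    # "burst" splits the input into single pages; the output argument
    # is a printf-style pattern that receives the page number.
    pdftk "$infile" burst output page_%03d.pdf
}

# Usage (page_001.pdf, page_002.pdf, ...):
#   burst_pdf mypdf.pdf
```

Because the pages stay PDFs, nothing is rasterized, so this approach should not cost any quality.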

smoker · 06-22-2011, 04:17 PM

You could try this command, put into a looping bash script.

Code:

gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dFirstPage=n -dLastPage=m -sOutputFile=outfile.pdf infile.pdf

where n and m are page numbers; set them to the same value to export a single page.
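
One way the looping script might look, as a sketch only. The function name split_pdf, the page-%03d.pdf naming, and passing the page count as an argument are all assumptions made for illustration.

```shell
# Export each page of a PDF to its own file by calling gs once per page.
# Arguments: input file, number of pages.
split_pdf() {
    local infile=$1
    local pages=$2
    local n=1
    while [ "$n" -le "$pages" ]; do
        # FirstPage and LastPage are set to the same number,
        # so each run writes exactly one page.
        gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH \
           -dFirstPage="$n" -dLastPage="$n" \
           -sOutputFile="$(printf 'page-%03d.pdf' "$n")" "$infile"
        n=$((n + 1))
    done
}

# Usage, for a three-page document:
#   split_pdf infile.pdf 3
# If poppler's pdfinfo is installed, the page count could be read with
#   pdfinfo infile.pdf | awk '/^Pages:/ {print $2}'
```

Each iteration re-reads the input file, so a long document will take a while; the loop is simple rather than fast.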

http://centaur.maths.qmw.ac.uk/Info/pdf-faq.html

H_TeXMeX_H · 06-23-2011, 12:22 PM

I would use 'pdftoppm' to convert PDFs to images.

However, why do you need to do this? I don't even deal with PDFs; I just convert them to DjVu, delete the PDF, and work with the DjVu.
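
A small sketch of the pdftoppm approach, assuming PNG output at 300 dpi is wanted. The function name pdf_to_png, the 300 dpi default, and the "page" output prefix are illustrative choices.

```shell
# Render every page of a PDF to a PNG file with pdftoppm (poppler-utils).
pdf_to_png() {
    local infile=$1
    local dpi=${2:-300}
    # -r sets the rendering resolution in dpi; -png writes PNG files
    # instead of the default PPM. "page" is the output file prefix.
    pdftoppm -r "$dpi" -png "$infile" page
}

# Usage (writes numbered files starting with "page-"):
#   pdf_to_png mypdf.pdf 300
```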

Dogs · 06-24-2011, 08:05 PM

I bought an ebook that requires DRM software to use it. I have found a way to get around the DRM software, but the quality issue prevents me from satisfactorily using OCR software to turn the page images into text.

My current point is: I have a free PDF that is of high quality, but I am unable to OCR the PDF directly because of some kind of layering mechanism...

As far as I can tell, this layer is the only thing the OCR software is able to "see", and the only thing on it is an embedded e-mail address. As a result, OCR gives me pages upon pages that contain only an e-mail address, when what I'm looking at is clearly the pages of the book I purchased (whose seller conveniently left out the part about DRM until AFTER the purchase; it is only available from the publisher anyway, so it's not like I have a choice if I want an ebook)...

However, if I split the PDF into pages and/or flatten it and/or convert it to image files, then I can OCR that just fine if the quality is sufficient.

What's cool is: If I open the PDF in the ghostscript viewer, I can save individual pages as excellent copies with the layering mechanism mitigated. Now just to figure out how to automatically split 675 pages...

the gs command provided by Mr. Smoker seems to be just what the doctor ordered, however, I haven't had time to figure out which device to use if pdfwrite isn't available.

H_TeXMeX_H · 06-25-2011, 03:19 AM

Quote:

Originally Posted by Dogs

What's cool is: If I open the PDF in the ghostscript viewer, I can save individual pages as excellent copies with the layering mechanism mitigated. Now just to figure out how to automatically split 675 pages...

the gs command provided by Mr. Smoker seems to be just what the doctor ordered, however, I haven't had time to figure out which device to use if pdfwrite isn't available.

Just write a script for it that will extract all those pages.

Tinkster · 06-25-2011, 04:40 PM

Quote:

Originally Posted by Dogs

the gs command provided by Mr. Smoker seems to be just what the doctor ordered, however, I haven't had time to figure out which device to use if pdfwrite isn't available.

This one ;}

Cheers,
Tink