LinuxQuestions.org
Linux - Software This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.

Old 06-22-2011, 02:45 PM   #1
Dogs
Member
 
Registered: Aug 2009
Location: Houston
Distribution: Slackware 13.37 x64
Posts: 105

Rep: Reputation: 25
Solved: Question about ImageMagick's convert utility and high quality output


Hello,

I am trying to split a PDF into component pages that are of equal quality to the original.

If I use display mypdf.pdf, then the PDF is split into pages of acceptable quality. The problem here is that I have to save each page individually.

If, however, I use convert mypdf.pdf mypdf.bmp, I get the individual pages of the PDF in .BMP format (which is fine, but not exactly what I want), but the quality is substantially lower than that of the original.

I've tried dozens of combinations of commands to try to increase this quality, but to no avail.

Even if I do convert mypdf.pdf mypdfagain.pdf, there is a big loss of quality.


Anyone familiar with splitting a PDF into individual pages without suffering a loss in quality?
Ideally, I would just save all the "scenes/frames" from display, but that feature unfortunately does not exist (though I may endeavor to add it myself if no formal solution exists).


NOTE: I think part of my problem might be: by using identify mypdf.pdf I can see that the resolution is specified, and when I convert it the resolution is much lower. This could be a source of quality loss, but I'm not familiar enough with image conversion to say that for sure.
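
A likely explanation for the note above: ImageMagick rasterizes a PDF at 72 DPI unless told otherwise, so each page comes out with far fewer pixels than the original can supply. A minimal sketch, assuming a US Letter page (612 points wide) and a target of 300 DPI; the function names and the output file pattern are made up for illustration:

```shell
# Pixel width of a page at a given DPI: points * dpi / 72
# (a PDF point is 1/72 inch).
px_at_dpi() {
    echo $(( $1 * $2 / 72 ))
}

# Rasterize every page at a chosen density. -density has to come
# before the input file so that it applies when the PDF is read.
# Usage: pdf_to_bmp_pages mypdf.pdf 300
pdf_to_bmp_pages() {
    convert -density "${2:-300}" "$1" "${1%.pdf}-%03d.bmp"
}

px_at_dpi 612 72    # prints 612  (width at the default density)
px_at_dpi 612 300   # prints 2550 (width at 300 DPI)
```

Higher densities give larger files and slower conversions, so the value is worth tuning against what the OCR step actually needs.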




Solution----------------

Oh, might help to read the man-page all the way through.

display -write outfile.pdf infile.pdf

It will do an entire book at once.

Whatever this command does, it removes the extra layer, or whatever it is, that prevents OCR from succeeding. I'd really like to understand that technology. What is it about a PDF that allows someone to embed metadata into every page so that the only thing seen by, say, OCR or a text search function is the embedded text?

Last edited by Dogs; 06-29-2011 at 12:59 AM. Reason: SOLVED
 
Old 06-22-2011, 03:18 PM   #2
Vrajgh
Member
 
Registered: Aug 2005
Posts: 68

Rep: Reputation: 33
Do these multiple pages need to be in image formats or would a pdf of each page be acceptable? If so, pdftk (http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/) might be a better tool for the job.
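
For reference, a sketch of what that could look like with pdftk's burst operation, which writes one PDF per page without rasterizing anything. The wrapper name burst_pdf and the page_%04d.pdf pattern are illustrative choices, not something taken from this thread:

```shell
# Split a PDF into one-page PDFs using pdftk (assumes pdftk is installed).
# Usage: burst_pdf infile.pdf
burst_pdf() {
    if [ ! -f "$1" ]; then
        echo "burst_pdf: no such file: $1" >&2
        return 1
    fi
    # burst writes each page to its own file; %04d is the page number.
    pdftk "$1" burst output page_%04d.pdf
}
```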
 
Old 06-22-2011, 04:17 PM   #3
smoker
Senior Member
 
Registered: Oct 2004
Distribution: Fedora Core 4, 12, 13, 14, 15, 17
Posts: 2,279

Rep: Reputation: 250
You could try this command, put into a looping bash script.

Code:
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dFirstPage=n -dLastPage=m -sOutputFile=outfile.pdf infile.pdf
where n and m are the first and last page numbers; set them to the same value to export a single page.

http://centaur.maths.qmw.ac.uk/Info/pdf-faq.html
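
A minimal sketch of such a loop. It assumes pdfinfo (from poppler or xpdf) is available to read the page count; the function names and the page_NNNN.pdf naming are hypothetical:

```shell
# Output file name for page N, zero-padded so the files sort correctly.
page_name() {
    printf 'page_%04d.pdf' "$1"
}

# Export every page of a PDF to its own file, one gs call per page.
# Usage: split_pdf infile.pdf
split_pdf() {
    local infile pages n
    infile=$1
    # pdfinfo prints a line like "Pages:          675"
    pages=$(pdfinfo "$infile" | awk '/^Pages:/ {print $2}')
    [ -n "$pages" ] || return 1
    n=1
    while [ "$n" -le "$pages" ]; do
        gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH \
           -dFirstPage="$n" -dLastPage="$n" \
           -sOutputFile="$(page_name "$n")" "$infile" || return 1
        n=$((n + 1))
    done
}
```

Each iteration re-reads the input file, so a long book may take a while; the loop stops at the first page that gs fails on.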
 
Old 06-23-2011, 12:22 PM   #4
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
I would use 'pdftoppm' to convert pdfs to images.
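
A sketch of how that might look, assuming poppler's pdftoppm; 300 DPI is only a guess at what OCR needs, and the helper names are made up:

```shell
# Derive an output prefix from the PDF name: /path/book.pdf -> book
output_prefix() {
    basename "$1" .pdf
}

# Render each page to a PNG at the given resolution (default 300 DPI).
# Files are named <prefix>-<page number>.png.
# Usage: pdf_to_png book.pdf 300
pdf_to_png() {
    pdftoppm -r "${2:-300}" -png "$1" "$(output_prefix "$1")"
}
```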

However, why do you need to do this? I don't even deal with PDFs; I just convert them to djvu, delete the PDF, and work with the djvu.
 
Old 06-24-2011, 08:05 PM   #5
Dogs
Member
 
Registered: Aug 2009
Location: Houston
Distribution: Slackware 13.37 x64
Posts: 105

Original Poster
Rep: Reputation: 25
I bought an ebook that requires DRM software to use. I have found a way to get around the DRM software, but the quality issue prevents me from satisfactorily using OCR software to turn the images into text.

My current point is: I have a free PDF that is of high quality, but I am unable to OCR the PDF directly because of some kind of layering mechanism...

As far as I can tell, this layer is the only thing the OCR software is able to "see", and the only thing on it is an embedded e-mail address. As a result, OCR gives me pages upon pages that contain only an e-mail address, when what I'm looking at is clearly pages from the book I purchased (which conveniently left out the part about DRM until AFTER the purchase. It is only available from the publisher anyway, so it's not like I have a choice if I want an ebook)...

However, if I split the PDF into pages and/or flatten it and/or convert it to image files, then I can OCR that just fine if the quality is sufficient.

What's cool is: If I open the PDF in the ghostscript viewer, I can save individual pages as excellent copies with the layering mechanism mitigated. Now just to figure out how to automatically split 675 pages...


The gs command provided by Mr. Smoker seems to be just what the doctor ordered; however, I haven't had time to figure out which device to use if pdfwrite isn't available.
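
One way to check which devices a local Ghostscript build supports is to look at the list that gs -h prints. If pdfwrite is missing, an image device such as png16m is one alternative that still suits OCR. A sketch under those assumptions; device_listed, the 300 DPI setting, and the page_%04d.png pattern are illustrative:

```shell
# Succeed if the device name appears as a whole word on stdin.
# Usage: gs -h | device_listed pdfwrite
device_listed() {
    tr -s ' ' '\n' | grep -qx -- "$1"
}

# Render all pages to 24-bit PNGs in a single gs run; %04d in the
# output name expands to the page number.
# Usage: pdf_to_png_gs infile.pdf
pdf_to_png_gs() {
    gs -sDEVICE=png16m -r300 -dNOPAUSE -dQUIET -dBATCH \
       -sOutputFile=page_%04d.png "$1"
}
```

Because the output name contains a page-number pattern, this handles the whole book in one call rather than one call per page.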

Last edited by Dogs; 06-25-2011 at 12:51 AM.
 
Old 06-25-2011, 03:19 AM   #6
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Quote:
Originally Posted by Dogs
What's cool is: If I open the PDF in the ghostscript viewer, I can save individual pages as excellent copies with the layering mechanism mitigated. Now just to figure out how to automatically split 675 pages...


The gs command provided by Mr. Smoker seems to be just what the doctor ordered; however, I haven't had time to figure out which device to use if pdfwrite isn't available.
Just write a script for it that will extract all those pages.
 
Old 06-25-2011, 04:40 PM   #7
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
Originally Posted by Dogs
The gs command provided by Mr. Smoker seems to be just what the doctor ordered; however, I haven't had time to figure out which device to use if pdfwrite isn't available.
This one ;}




Cheers,
Tink
 
  

