LinuxQuestions.org
Old 03-07-2009, 03:43 PM   #1
J_Szucs
Senior Member
 
Registered: Nov 2001
Location: Budapest, Hungary
Distribution: SuSE 6.4-11.3, Dsl linux, FreeBSD 4.3-6.2, Mandrake 8.2, Redhat, UHU, Debian Etch
Posts: 1,126

Rep: Reputation: 58
Convert pdf to tif


I have a PDF file that contains one A4 page of text and was prepared with OpenOffice.
The letters in the PDF look great even after zooming in by a factor of 4. The file is small (about 50 kB), so I think it does not contain bitmap fonts.

When I create a 300 dpi, B&W TIFF image of the PDF with this command:

convert arial.normal.pdf -monochrome -resample 300x300 arial.normal.tif

the result is terrible: the letters have no distinguishable outlines and are just made up of scattered points. Inside the would-be outlines of the letters, black points are only slightly denser than white ones. The letters are hardly legible at 300 dpi, and many letters with similar shapes are indistinguishable. I expected a much better result at such a high resolution.

How can I convert the text PDF to a good-quality, 300x300 dpi, B&W image?
Or is it possible that the input PDF itself has a low-resolution bitmap font, despite its small file size?
Actually, I have the text in .odt files, so if the intermediate PDF format could be avoided when creating high-quality images, that would be a much better solution.

Last edited by J_Szucs; 03-07-2009 at 04:04 PM.
 
Old 03-08-2009, 10:10 AM   #2
b0uncer
Guru
 
Registered: Aug 2003
Distribution: CentOS, OS X
Posts: 5,131

Rep: Reputation: Disabled
Yup, it seems convert does an ugly job (with a few PDF files I had lying around). But did you try Ghostscript? At least for me it did a better job with the same files that convert handled badly. I tried it with a command like this:

Code:
gs -sDEVICE=tiffg4 -r300x300 -sOutputFile=output.tif -- input.pdf
where the "device" option tells Ghostscript to produce a TIFF file. The second option is the resolution, as you may guess. If it works, then also see

Code:
man gs
for more options if you need them. Note that the files I tried this with were single-page PDFs, converted one at a time. If you have multi-page originals, want to work with several files at once (batch processing), or need something else fancy, see the man page for details on how to get the desired result (so you won't end up with anything insane, like 100 pages of PDF in one huge one-page TIFF).
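
For multi-page originals, one possible approach (a sketch, not something tested in this thread) is to let Ghostscript number the output files itself: a printf-style `%03d` in `-sOutputFile` is replaced by the page number. The function name `pdf_pages_to_tiff`, the `GS` override variable and the `page` prefix are made up for illustration; `-dNOPAUSE` and `-dBATCH` keep gs from waiting for input between pages and at the end.

```shell
#!/bin/sh
# Sketch: render each page of a PDF to its own bilevel (CCITT G4) TIFF at 300 dpi.
# GS is a hypothetical override so the command can be swapped out; it defaults to gs.
pdf_pages_to_tiff() {
    input=$1     # e.g. input.pdf
    prefix=$2    # hypothetical output prefix, e.g. "page"
    # %03d is expanded by Ghostscript to the page number (001, 002, ...),
    # so every page ends up in a separate file.
    "${GS:-gs}" -q -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=tiffg4 -r300x300 \
        "-sOutputFile=${prefix}-%03d.tif" "$input"
}
```

Calling `pdf_pages_to_tiff input.pdf page` would then be expected to write page-001.tif, page-002.tif and so on.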

Last edited by b0uncer; 03-08-2009 at 10:16 AM.
 
Old 03-08-2009, 10:33 AM   #3
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1947
I believe the problem is your "-monochrome" option. I just tested it myself with and without it, and the results were very different. For some reason when you use the monochrome setting the antialiasing half-tone pixels are getting lost. Try using "-colorspace gray" instead. That works for me.
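
To repeat this with-and-without comparison on your own file, a small helper along these lines might do. It is only a sketch: `compare_mono_gray` and the `CONVERT` override are hypothetical names, and the options are simply those from post #1 with `-monochrome` swapped for `-colorspace gray` in the second run.

```shell
#!/bin/sh
# Sketch: produce two TIFFs from the same PDF so they can be viewed side by side.
# CONVERT is a hypothetical override; it defaults to ImageMagick's convert.
compare_mono_gray() {
    pdf=$1
    base=${pdf%.pdf}    # strip the .pdf suffix for the output names
    # Variant 1: the command from post #1, unchanged.
    "${CONVERT:-convert}" "$pdf" -monochrome -resample 300x300 "${base}.mono.tif"
    # Variant 2: the same command with -colorspace gray instead of -monochrome.
    "${CONVERT:-convert}" "$pdf" -colorspace gray -resample 300x300 "${base}.gray.tif"
}
```

`compare_mono_gray arial.normal.pdf` would leave arial.normal.mono.tif and arial.normal.gray.tif next to the input. Note that the gray variant is a grayscale image, not a strictly two-colour one.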
 
Old 03-08-2009, 11:36 AM   #4
b0uncer
Guru
 
Registered: Aug 2003
Distribution: CentOS, OS X
Posts: 5,131

Rep: Reputation: Disabled
Quote:
Originally Posted by David the H.
For some reason when you use the monochrome setting the antialiasing half-tone pixels are getting lost. Try using "-colorspace gray" instead. That works for me.
Interesting. I tried that (and without -colorspace too), but I still get bad results using the same files as earlier. Could it be that my viewer (eog 2.24.1; EDIT: probably not, other programs show the same) is just not showing it right, or could it depend on the original PDF file, so that some work and others don't?

Last edited by b0uncer; 03-08-2009 at 11:37 AM.
 
Old 03-08-2009, 03:51 PM   #5
J_Szucs
Senior Member
 
Registered: Nov 2001
Location: Budapest, Hungary
Distribution: SuSE 6.4-11.3, Dsl linux, FreeBSD 4.3-6.2, Mandrake 8.2, Redhat, UHU, Debian Etch
Posts: 1,126

Original Poster
Rep: Reputation: 58
What I finally settled on (before reading your posts) is:

gs -q -dBATCH -dMaxBitmap=300000000 -dNOPAUSE -dSAFER -sDEVICE=tiffpack -g3600x3600 -r300x300 -dFirstPage=1 -dLastPage=1 -sOutputFile=hun.arial.normal.tif hun.arial.normal.pdf -c quit
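
A note on the numbers in that command: `-g` gives the canvas size in pixels, so `-g3600x3600` together with `-r300x300` is a 12 x 12 inch canvas, which is larger than an A4 page. If a canvas that matches A4 exactly is wanted, the pixel size can be worked out from the paper size (210 x 297 mm) and the resolution. The helper names below are made up for this sketch.

```shell
#!/bin/sh
# Sketch: compute the pixel geometry of an A4 page at a given resolution.
# mm_to_px MM DPI -> pixels, rounded to the nearest integer (1 inch = 25.4 mm).
mm_to_px() {
    # Work in tenths of a millimetre to stay in integer arithmetic;
    # adding 127 (half of 254) rounds to nearest instead of truncating.
    echo $(( ($1 * $2 * 10 + 127) / 254 ))
}

# a4_geometry DPI -> WIDTHxHEIGHT string in the form -g expects.
a4_geometry() {
    dpi=$1
    echo "$(mm_to_px 210 "$dpi")x$(mm_to_px 297 "$dpi")"
}

# a4_geometry 300 prints 2480x3508
```

The result could then be passed as `-g$(a4_geometry 300)`. Whether the larger canvas makes any difference to the OCR step is not something this thread establishes.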

Actually, I suspect that tiffpack is not the best device and that the tiffg4 device you proposed would have been better, provided that it also generates B&W images. I suspect that others used tiffg4 images for training the Tesseract OCR engine.

Anyway, Tesseract could be trained for the Hungarian language with the tiffpack images too, though it threw a non-fatal error when reading those images. The result with the trained Tesseract is great now: 99.9% accuracy on images containing Hungarian text with Arial 10pt accented characters, generated from PDF by Ghostscript (and half of the errors are English words not in the Hungarian dictionary).

Previously I was rather frustrated by the difficulties of creating support for new languages, but now that I have succeeded and seen the results, I must admit that Tesseract might be very usable as is. (For training, one just has to use the pre-2.04 svn sources rather than the <=2.03 Tesseract source releases, as the latter have unpatched bugs that make them unusable.)

As for convert: I had already tried "-colorspace gray -depth 2" to create B&W images, but the result was nowhere near that produced by Ghostscript.
This is a bit annoying, because I planned to use convert to pre-process images to the quality required by the Tesseract OCR engine. Now I have doubts about that...

Edit:
Changing the order of parameters on the command line like this:
convert -density 300 input.pdf -resample 300 -monochrome output.pdf
gives a better result, but it is still much worse than that from gs, and it takes 20 times longer. It is said that -density 900 ... -resample 300 would give better results, but convert is so slow even at a density of 300 that it would be unusable.
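
The reason the order matters: `-density` placed before the input file sets the resolution at which the PDF is rasterized, while `-resample` after the input only rescales pixels that have already been rendered. A sketch of the "render high, then scale down" idea follows. The function name, the `CONVERT` override and the default of 600 are assumptions, and `-threshold 50%` is swapped in for `-monochrome` here as a plain cut-off without dithering; it is not what was run above.

```shell
#!/bin/sh
# Sketch: rasterize a PDF above the target resolution, scale down to 300 dpi,
# then reduce to pure black and white with a fixed threshold.
# CONVERT is a hypothetical override; it defaults to ImageMagick's convert.
pdf_to_bw_supersampled() {
    src=$1
    dst=$2
    render_dpi=${3:-600}    # assumed default; post #5 mentions 900 as an option
    # -density comes BEFORE the input so it controls the rendering resolution.
    "${CONVERT:-convert}" -density "$render_dpi" "$src" \
        -resample 300 -colorspace gray -threshold 50% \
        -compress Group4 "$dst"
}
```

For example, `pdf_to_bw_supersampled input.pdf output.tif 900` would try the -density 900 variant. Expect it to be slow, as noted above.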

Last edited by J_Szucs; 03-09-2009 at 04:25 AM.
 
Old 07-07-2009, 10:31 AM   #6
Marel
Member
 
Registered: May 2005
Location: Serbia
Distribution: Debian, Ubuntu, Red Hat, Gentoo
Posts: 64

Rep: Reputation: 15
Has anyone found a way to do high-resolution conversion of PDF into images and preserve colour?
 
Old 07-07-2009, 12:52 PM   #7
Marel
Member
 
Registered: May 2005
Location: Serbia
Distribution: Debian, Ubuntu, Red Hat, Gentoo
Posts: 64

Rep: Reputation: 15
I found it.

Code:
-sDEVICE=tiff24nc
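
Put into a complete command, that device option might be used as follows. This is a sketch that combines it with the options from the earlier posts; the function name and the `GS` override are made up for illustration.

```shell
#!/bin/sh
# Sketch: render a PDF to a 24-bit RGB TIFF at 300 dpi with Ghostscript.
# GS is a hypothetical override; it defaults to gs.
pdf_to_color_tiff() {
    src=$1    # input PDF
    dst=$2    # output TIFF
    "${GS:-gs}" -q -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=tiff24nc -r300x300 \
        "-sOutputFile=$dst" "$src"
}
```

Usage would be `pdf_to_color_tiff input.pdf output.tif`. If the output is written uncompressed, an A4 page at 300 dpi comes to roughly 26 MB (2480 x 3508 pixels x 3 bytes), so the files can get large.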
 
  


All times are GMT -5.
