Linux - Software
This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
I have a pdf file that contains one A4 page of text and was prepared with OpenOffice.
The letters in the pdf look great even when zoomed in by a factor of 4. The file is small (ca. 50 kB), so I don't think it contains bitmap fonts.
When I create a 300 dpi, B&W tif image of the pdf, with this command:
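(The exact command is not preserved in the post; judging from the reordered variant quoted later in the thread, it was presumably something along these lines, with placeholder file names:)

convert input.pdf -density 300 -resample 300 -monochrome output.tif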
the result is terrible: the letters have no distinguishable outlines. They are composed of scattered dots, with the density of black dots only slightly higher than that of the white dots inside what should be the outlines of the letters. The text is hardly legible at 300 dpi, and many similarly shaped letters are indistinguishable. I expected a much better result at such a high resolution.
How could I convert the text pdf to a good quality, 300x300 dpi, B&W image?
Or is it possible that even the input pdf file contains a low-resolution bitmap font, despite its small file size?
Actually, I have the text in .odt files, so if the intermediate pdf format could be avoided when creating high-quality images, that would be a much better solution.
Yup, it seems convert does an ugly job (with a few pdf files I had lying around). But did you try Ghostscript? At least for me it did a better job with the same files convert failed on. I tried it with a command like this:
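(The original command is missing from the post; based on the description below, which names the device option first and the resolution second, it was presumably something like the following, with tiffg4 assumed as the B&W TIFF device and placeholder file names:)

gs -sDEVICE=tiffg4 -r300x300 -dBATCH -dNOPAUSE -sOutputFile=output.tif input.pdf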
where the "device" option tells Ghostscript to produce a tif file. The second option is the resolution, as you may guess. If it works, then also see
for more options you can set, if you need them. Note that the files I tried this with were single-page pdfs, one at a time; so if you have multipage originals, want to work with several files at once (batch processing), or need something else fancy, see the man page for details on how to get the desired result (so you won't end up with anything insane, like 100 pages of pdf in one huge single-page tiff).
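(One detail worth adding: for multipage pdfs, Ghostscript can write one TIFF per page if you put a %d format directive in the output file name, e.g.:)

gs -sDEVICE=tiffg4 -r300x300 -dBATCH -dNOPAUSE -sOutputFile=page_%03d.tif input.pdf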
I believe the problem is your "-monochrome" option. I just tested it myself with and without it, and the results were very different. For some reason when you use the monochrome setting the antialiasing half-tone pixels are getting lost. Try using "-colorspace gray" instead. That works for me.
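(For example, keeping the rest of the command the same and swapping in the suggested option, it would look something like this, with placeholder file names:)

convert -density 300 input.pdf -colorspace gray output.tif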
Quote:
For some reason when you use the monochrome setting the antialiasing half-tone pixels are getting lost. Try using "-colorspace gray" instead. That works for me.
Interesting... I tried with that (and without -colorspace too), but I still get bad results with the same files as earlier. Could it be that my viewer (eog 2.24.1; EDIT: probably not, other programs show the same) is just not rendering it right, or could it depend on the original pdf file, so that some worked and others didn't?
Actually, I suspect that tiffpack is not the best device, and that the tiffg4 device you proposed would have been better, provided that it also generates B&W images. I suspect that others used tiffg4 images for training the tesseract OCR engine.
Anyway, tesseract could be trained for the Hungarian language with the tiffpack images too, though it threw a non-fatal error when reading those images. The result with the trained tesseract is great now: 99.9% accuracy on images containing Hungarian text with Arial 10pt accented characters, generated from pdf by Ghostscript (and half of the errors are English words that are not in the Hungarian dictionary).
Previously I was rather frustrated by the difficulties of creating support for new languages, but now that I have succeeded and can see the results, I must admit that tesseract might be very usable as is. (For training, one just has to use the pre-2.04 svn sources, and not the <=2.03 tesseract source releases, as the latter have unpatched bugs that make them unusable.)
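(For reference, once a language pack is installed, running the engine on a page is a single command; the language code hun for Hungarian and the file names here are assumptions:)

tesseract page.tif output -l hun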
As for convert: I already tried "-colorspace gray -depth 2" to create B&W images, but the result was nowhere near as good as the one produced by Ghostscript.
This is a bit annoying, because I had planned to use convert to pre-process images to the required quality for the tesseract OCR engine. Now I have my doubts in this respect...
Changing the order of parameters on the command line like this:
convert -density 300 input.pdf -resample 300 -monochrome output.pdf
gives a better result (-density only affects how the pdf is rasterized if it comes before the input file name), but it is still much worse than the gs output, and it takes 20 times longer. Supposedly -density 900 ... -resample 300 would give even better results, but convert is so slow even with a density of 300 that it would be unusable.
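(Spelled out with placeholder file names, that suggestion would presumably look like this:)

convert -density 900 input.pdf -resample 300 -monochrome output.tif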