02-14-2009, 08:12 AM
|
#1
|
Senior Member
Registered: Nov 2001
Location: Budapest, Hungary
Distribution: SuSE 6.4-11.3, Dsl linux, FreeBSD 4.3-6.2, Mandrake 8.2, Redhat, UHU, Debian Etch
Posts: 1,126
Rep:
|
Convert pdf to txt problems
I have a pdf file that has text with Hungarian accented characters, and is:
- correctly displayed in xpdf and
- in kpdf,
- but Adobe Reader 7.0 throws a "Cannot find or generate TimesNewRoman, Bold font" error and replaces all non-accented characters with a dot on the screen. (The MS TrueType fonts are installed in KDE.)
I want to extract the text from this pdf.
- pdftotext version 3.01:
Extracts the text, but very often inserts spaces inside a word, dividing it into two or more words. Sometimes it mixes up the order of syllables, too, so the resulting text is unacceptable.
- pdftohtml version 0.36:
The same errors.
- Adobe Reader: cannot even display the PDF correctly, so it is of no use.
Is there another way, even a manual one, to extract the text correctly?
Last edited by J_Szucs; 02-14-2009 at 08:14 AM.
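For the split words and shuffled syllables, it may be worth comparing pdftotext's different extraction modes. What follows is a minimal sketch, assuming a pdftotext build that accepts the `-raw`, `-layout` and `-enc` options (the xpdf 3.x and poppler man pages document them). The function names and the output file naming are made up for illustration. It converts every PDF in a folder once per mode so the results can be compared side by side; whether any mode actually helps with this particular file is not known.

```python
import subprocess
from pathlib import Path

# Extraction modes to compare. "raw" keeps content-stream order,
# "layout" tries to preserve the physical page layout.
MODES = {"default": [], "raw": ["-raw"], "layout": ["-layout"]}


def build_command(pdf, out, mode):
    """Return the pdftotext argument list for one file and one mode."""
    return ["pdftotext", "-enc", "UTF-8", *MODES[mode], str(pdf), str(out)]


def convert_all(folder):
    """Run pdftotext in every mode on every *.pdf found in folder."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        for mode in MODES:
            # Hypothetical naming scheme: book.pdf -> book.raw.txt, etc.
            out = pdf.with_name(f"{pdf.stem}.{mode}.txt")
            result = subprocess.run(
                build_command(pdf, out, mode), capture_output=True, text=True
            )
            if result.returncode != 0:
                print(f"{pdf.name} [{mode}] failed: {result.stderr.strip()}")


# Example call (requires pdftotext on the PATH):
# convert_all("/path/to/pdfs")
```

Running all modes in one pass keeps the comparison cheap when there are many 300-page files to get through.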
|
|
|
02-14-2009, 11:41 AM
|
#2
|
LQ Guru
Registered: Dec 2008
Location: Westgate-on-Sea, Kent, UK
Distribution: Debian Testing Amd64
Posts: 5,465
Rep:
|
What about just copying the whole text from, say, kpdf and then pasting into an editor? Just go to the "Tools" menu in kpdf and choose the "Select Tool" and copy the whole text.
Cheers,
jdk
|
|
|
02-15-2009, 09:40 AM
|
#3
|
Senior Member
Registered: Nov 2001
Location: Budapest, Hungary
Distribution: SuSE 6.4-11.3, Dsl linux, FreeBSD 4.3-6.2, Mandrake 8.2, Redhat, UHU, Debian Etch
Posts: 1,126
Original Poster
Rep:
|
The text is 300+ pages, and I have many such files, whilst kpdf only makes it possible to copy one page at a time. It would take days to process them all.
Another issue: I also have some files that are copy-protected but allow printing. So I print them to PDF (from kpdf and acroread) and try to copy the text out of the newly generated PDF.
Here is what I get:
"De miért itt? $ Np] JpSLHVHQ EHFV~V]LN D ]DNy DOi D IHJ\YHUWRNEDQ WDUWRWW %HUHWWához. Hirtelen feltámad a szél, és végigsodor a szokatlanul kihalt utcán néhány IDOHYHOHW PHJ YDODPL HOGRERWW SDStUIHFQLW 0LQWKD HJ\ NLFVLW V|WpWHEE LV OHQQH GH KiW SHUV]H HVWH YDQ PiU KRYi WHWWH D] HV]pW $ V]pO YLV]RQW téliesen jeges – IXUFVDtJ\PiMXVN|]HSpQ0HJERU]RQJ"
The text of the newly generated PDF is correct in the viewer, but incorrect when copied out. pdftotext gives the same result.
What the heck is this?
P.S.:
PSRESOURCEPATH is set for acroread, and the same path is given on the command line (-sFONTPATH) to the gs backend that works as the PDF printer.
|
|
|
02-15-2009, 10:08 AM
|
#4
|
LQ Guru
Registered: Dec 2008
Location: Westgate-on-Sea, Kent, UK
Distribution: Debian Testing Amd64
Posts: 5,465
Rep:
|
Quote:
Originally Posted by J_Szucs
What the heck is this?
|
It looks like incorrectly rendered Unicode characters to me. It is possible to drag the rectangle in kpdf over more than one page. I wouldn't fancy doing it for 300 pages though. I'm out of ideas.
Good luck.
jdk
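A guess rather than a diagnosis: in the quoted sample, the garbled runs look like glyph indices emitted as character codes. Adding 29 to each code turns "%HUHWW" into "Berett", and the accented letters line up with the Mac Roman table at an offset of 30. That would be consistent with an embedded font in the standard Macintosh glyph order and no usable character map, but that is an assumption checked only against this one sample. The sketch below uses a hypothetical helper, `decode_glyph_run`, and applies that mapping. It must be fed only the garbled runs, since the correctly extracted words in the same text would be ruined by it, and letters outside that table (ő, ű) are not covered.

```python
def decode_glyph_run(garbled: str) -> str:
    """Decode a run whose character codes are assumed to be glyph indices.

    Assumed mapping (inferred from one sample, not from any specification
    of the file): glyphs 3..97 are ASCII at code + 29, glyphs 98..225 are
    Mac Roman bytes at code + 30.
    """
    out = []
    for ch in garbled:
        code = ord(ch)
        if ch == " ":
            # Spaces in the sample are real spaces, so keep them as-is.
            out.append(ch)
        elif 3 <= code <= 97:
            out.append(chr(code + 29))
        elif 98 <= code <= 225:
            out.append(bytes([code + 30]).decode("mac_roman"))
        else:
            # Anything else is left untouched.
            out.append(ch)
    return "".join(out)


print(decode_glyph_run("$ Np] JpSLHVHQ EHFV~V]LN D ]DNy DOi D"))
# A kéz gépiesen becsúszik a zakó alá a
```

If the mapping holds for a whole file, the remaining work is telling garbled runs apart from the words that were extracted correctly.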
|
|
|
02-15-2009, 10:50 AM
|
#5
|
LQ 5k Club
Registered: May 2003
Location: London, UK
Distribution: Fedora40
Posts: 6,152
|
|
|
|
02-15-2009, 11:40 AM
|
#6
|
Senior Member
Registered: Nov 2001
Location: Budapest, Hungary
Distribution: SuSE 6.4-11.3, Dsl linux, FreeBSD 4.3-6.2, Mandrake 8.2, Redhat, UHU, Debian Etch
Posts: 1,126
Original Poster
Rep:
|
The OCR solutions would be cool, but gocr s*cks, whilst tesseract has no Hungarian language support yet. (I plan to add it one day, though.)
So, I will try the OpenOffice plugin now. Thx.
|
|
|
02-15-2009, 11:54 AM
|
#7
|
Senior Member
Registered: Nov 2001
Location: Budapest, Hungary
Distribution: SuSE 6.4-11.3, Dsl linux, FreeBSD 4.3-6.2, Mandrake 8.2, Redhat, UHU, Debian Etch
Posts: 1,126
Original Poster
Rep:
|
Doh. That extension is for OO 3.0. There is no 3.0 version for SuSE 10.1.
So it is the good old "Please upgrade your operating system and buy new hardware to view this file" case again.
I did it once, when I badly needed a Firefox plugin, but I am in no mood to do it again.
Last edited by J_Szucs; 02-15-2009 at 11:57 AM.
|
|
|
02-15-2009, 02:02 PM
|
#8
|
Senior Member
Registered: Nov 2001
Location: Budapest, Hungary
Distribution: SuSE 6.4-11.3, Dsl linux, FreeBSD 4.3-6.2, Mandrake 8.2, Redhat, UHU, Debian Etch
Posts: 1,126
Original Poster
Rep:
|
I have access to another system, running SuSE 11.0. I installed OO 3.0 there, and also the plugin. Of course the plugin install was not smooth: a little googling here, a little source downloading there, searching for and loading libraries here and there, and the pdfimport plugin was installed in no more than 2 hours.
Now it failed to import one of the PDFs in question, because it was encrypted. At least it was fast to fail with that file. But then it started to import another PDF half an hour ago, and the progress bar is now at 20% on a P4 2.4 GHz CPU.
These "tools" are only good for wasting time...
Edit:
The pdfimport plugin finally finished. Of course the result is good for nothing, as it found the one approach that preserves neither the text nor the layout of the PDF file: a hundred thousand text boxes are scattered over the pages of the converted file. These text boxes often contain just a single character. The resulting file can neither be saved as text, nor can its contents be selected and copied to the clipboard.
Last edited by J_Szucs; 02-15-2009 at 03:46 PM.
|
|
|
All times are GMT -5.