LinuxQuestions.org
Linux - Software This forum is for Software issues.
02-14-2009, 07:12 AM   #1
J_Szucs
Senior Member
 
Registered: Nov 2001
Location: Budapest, Hungary
Distribution: SuSE 6.4-11.3, Dsl linux, FreeBSD 4.3-6.2, Mandrake 8.2, Redhat, UHU, Debian Etch
Posts: 1,126

Rep: Reputation: 58
Convert pdf to txt problems


I have a PDF file containing text with Hungarian accented characters. It is:
- correctly displayed in xpdf and
- in kpdf,
- but Adobe Reader 7.0 throws a "Cannot find or generate TimesNewRoman, Bold font" error and replaces all non-accented characters with a dot on the screen. (The MS TrueType fonts are installed in KDE.)

I want to extract the text from this pdf.

- pdftotext version 3.01:
Extracts the text, but very often inserts spaces inside a word, dividing it into two or more words. Sometimes it also mixes up the order of syllables, so the resulting text is unusable.
- pdftohtml version 0.36:
The same errors.
- Adobe Reader: cannot even display the PDF correctly, so it is of no use.

Is there another way, even a manual one, to extract the text correctly?

Last edited by J_Szucs; 02-14-2009 at 07:14 AM.
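
One thing that may be worth trying before giving up on pdftotext is its other extraction modes, since the stray spaces and reordered syllables could come from how it reassembles positioned glyphs into lines. Below is a minimal Python sketch that runs every PDF in a directory through three modes so the outputs can be compared. It assumes a pdftotext binary on the PATH that accepts -layout, -raw and -enc; the helper names, the directory arguments and the output naming are made up for illustration, and there is no guarantee that any mode fixes this particular file.

```python
import subprocess
from pathlib import Path

# Extraction modes to compare; "default" passes no extra flag.
MODES = {"default": [], "layout": ["-layout"], "raw": ["-raw"]}


def build_command(pdf, mode, out_dir):
    """Return the pdftotext argument list for one file and one mode."""
    pdf = Path(pdf)
    out = Path(out_dir) / f"{pdf.stem}.{mode}.txt"
    return ["pdftotext", "-enc", "UTF-8", *MODES[mode], str(pdf), str(out)]


def extract_all(pdf_dir, out_dir):
    """Run every *.pdf in pdf_dir through every mode (hypothetical layout)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        for mode in MODES:
            # check=False: keep going even if one file fails to convert
            subprocess.run(build_command(pdf, mode, out_dir), check=False)
```

Calling extract_all("pdfs", "txt") would leave one text file per PDF and mode in txt/, which makes it easy to diff the variants of a single page.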
 
02-14-2009, 10:41 AM   #2
jdkaye
LQ Guru
 
Registered: Dec 2008
Location: Westgate-on-Sea, Kent, UK
Distribution: Debian Testing Amd64
Posts: 5,465

Rep: Reputation: Disabled
What about just copying the whole text from, say, kpdf and then pasting into an editor? Just go to the "Tools" menu in kpdf and choose the "Select Tool" and copy the whole text.
Cheers,
jdk
 
02-15-2009, 08:40 AM   #3
J_Szucs
Original Poster
The text is 300+ pages, and I have many such files, whilst kpdf only lets me copy one page at a time. It would take days to process them all.

Another issue: I also have some files that are copy-protected but allow printing. So I print them to PDF (from kpdf and acroread) and try to copy the text out of the newly generated PDF.
Here is what I get:
"De miért itt? $ Np] JpSLHVHQ EHFV~V]LN D ]DNy DOi D IHJ\YHUWRNEDQ WDUWRWW %HUHWWához. Hirtelen feltámad a szél, és végigsodor a szokatlanul kihalt utcán néhány IDOHYHOHW PHJ YDODPL HOGRERWW SDStUIHFQLW 0LQWKD HJ\ NLFVLW V|WpWHEE LV OHQQH GH KiW SHUV]H HVWH YDQ PiU KRYi WHWWH D] HV]pW $ V]pO YLV]RQW téliesen jeges – IXUFVDtJ\PiMXVN|]HSpQ0HJERU]RQJ"

The text of the newly generated PDF is correct in the viewer, but incorrect when copied out. pdftotext gives the same result.

What the heck is this?

P.S.:
PSRESOURCEPATH is set for acroread, and the same path is given on the command line (-sFONTPATH) to the gs backend that works as the PDF printer.
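
The garbled runs in the sample above look systematic rather than random: each character appears to be a glyph index copied out as if it were a character code, following the standard Macintosh glyph order used by TrueType fonts (ASCII characters shifted down by 29, the accented Mac Roman letters by 30). This is an inference from this one sample, not something confirmed for the files in question. Under that assumption, a short Python sketch can shift a garbled run back; decode_glyph_run is a made-up name, and only the ranges that can be checked against the sample are handled.

```python
def decode_glyph_run(garbled: str) -> str:
    """Undo the assumed glyph-index shift for one garbled run of text."""
    out = []
    for ch in garbled:
        code = ord(ch)
        if ch == " ":
            # Spaces in the sample are real word gaps, not shifted glyphs.
            out.append(ch)
        elif 3 <= code <= 97:
            # Assumed: glyph index = ASCII code - 29.
            out.append(chr(code + 29))
        elif 98 <= code <= 129:
            # Assumed: glyph index = Mac Roman byte - 30 (accented letters).
            out.append(bytes([code + 30]).decode("mac_roman"))
        else:
            # Outside the ranges checked against the sample: leave as-is.
            out.append(ch)
    return "".join(out)


print(decode_glyph_run("$ Np] JpSLHVHQ EHFV~V]LN"))
```

For the first garbled run of the sample this prints "A kéz gépiesen becsúszik". Runs that already copied correctly, such as "De miért itt?", must not be passed through the function, so real use would still need some way to decide which runs to decode.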
 
02-15-2009, 09:08 AM   #4
jdkaye
Quote:
Originally Posted by J_Szucs
What the heck is this?

It looks like incorrectly rendered Unicode characters to me. It is possible to drag the selection rectangle in kpdf over more than one page. I wouldn't fancy doing it for 300 pages though. I'm out of ideas.
Good luck.
jdk
 
02-15-2009, 09:50 AM   #5
tredegar
LQ 5k Club
 
Registered: May 2003
Location: London, UK
Distribution: Debian "Testing"
Posts: 6,116

Rep: Reputation: 416
This thread http://www.linuxquestions.org/questi...erters-702769/ may help you.
 
02-15-2009, 10:40 AM   #6
J_Szucs
Original Poster
The OCR solutions would be cool, but gocr s*cks, whilst tesseract has no Hungarian language support yet. (I plan to add it one day, though.)
So I will try the OpenOffice plugin now. Thx.
 
02-15-2009, 10:54 AM   #7
J_Szucs
Original Poster
Doh. That extension is for OO 3.0, and there is no 3.0 version for SuSE 10.1.

So it is the good old "Please upgrade your operating system and buy new hardware to view this file" case again.
I did that once, when I badly needed a Firefox plugin, but I am in no mood to do it again.

Last edited by J_Szucs; 02-15-2009 at 10:57 AM.
 
02-15-2009, 01:02 PM   #8
J_Szucs
Original Poster
I have access to another system, running SuSE 11.0. I installed OO 3.0 there, and the plugin too. Of course the plugin install was not smooth: a little googling here, a little source downloading there, hunting down and loading libraries here and there, and the pdfimport plugin was installed in no more than 2 hours.

It failed to import one of the PDFs in question because the file was encrypted. At least it failed quickly with that file. But then it started importing another PDF half an hour ago, and the progress bar is now at 20% on a PIV 2.4 GHz CPU.

These "tools" are only good for wasting time...

Edit:
The pdfimport plugin finally finished. Of course the result is good for nothing: it found the one approach that preserves neither the text nor the layout of the PDF file. A hundred thousand text boxes are scattered over the pages of the converted file, often containing just a single character. The resulting file cannot be saved as text, and selecting everything and copying it to the clipboard is not possible either.

Last edited by J_Szucs; 02-15-2009 at 02:46 PM.
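
Output like this, with one text box per character, can in principle still be turned back into lines of text by sorting the boxes by position. Here is a rough Python sketch, assuming the box coordinates can be obtained somehow (getting them out of the converted OpenOffice file is not covered here). The Box type, boxes_to_text and both thresholds are hypothetical and would need tuning per document.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: float      # left edge of the box
    y: float      # vertical position, growing down the page
    width: float  # horizontal extent of the box
    text: str     # usually a single character


def boxes_to_text(boxes, line_tol=2.0, gap=1.5):
    """Group boxes into lines by y, order them by x, add spaces at wide gaps."""
    lines = []  # list of (y of the line's first box, boxes on that line)
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if lines and abs(lines[-1][0] - box.y) <= line_tol:
            lines[-1][1].append(box)
        else:
            lines.append((box.y, [box]))

    out = []
    for _, row in lines:
        row.sort(key=lambda b: b.x)
        parts = [row[0].text]
        for prev, cur in zip(row, row[1:]):
            # A gap wider than the threshold is taken to be a word break.
            if cur.x - (prev.x + prev.width) > gap:
                parts.append(" ")
            parts.append(cur.text)
        out.append("".join(parts))
    return "\n".join(out)
```

The gap threshold is exactly the kind of guess that goes wrong with unusual letter spacing, which may also be why extractors sometimes insert spaces in the middle of a word.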
 
  


All times are GMT -5.
