LinuxQuestions.org
Old 09-01-2009, 05:40 AM   #1
oskbur
LQ Newbie
 
Registered: Sep 2009
Posts: 3

Rep: Reputation: 0
undo / replace ligature


Hi

Is it possible to replace all ligatures, such as fi and fl, with the actual letters in a PDF file? When I copy and paste text from some PDF files, the ligatures come out as the codes 001E and 001F, and it is impossible to tell which letters they stand for.

So I am thinking that if I could convert the ligatures before I copy and paste, my problem would be solved. pdftotext can extract the real letters, but I need the PDF structure intact. In evince the letters are correct, but evince cannot copy the text in the right order. I do the copying on Win XP, but all the PDF pages are on Linux servers, so it would be easy to convert the pages before they are shared via Samba (if I knew how).
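As a stopgap for text that has already been pasted, the stray codes could be swapped back for letters afterwards. This is only a minimal sketch: the thread does not say which code stands for which ligature, so the mapping below (001E for fi, 001F for fl) is an assumption to verify against the actual PDF, and `expand_ligatures` is a made-up helper name. It does not touch the PDF itself.

```python
import unicodedata

# ASSUMPTION: the thread does not say which code stands for which ligature.
# This mapping (U+001E -> "fi", U+001F -> "fl") is hypothetical and must be
# checked against the actual PDF before use.
PASTED_CODE_TO_LETTERS = {"\x1e": "fi", "\x1f": "fl"}


def expand_ligatures(text, code_map=PASTED_CODE_TO_LETTERS):
    """Replace ligature stand-ins in pasted text with plain letters."""
    # First replace the font-specific codes using the (assumed) mapping.
    for code, letters in code_map.items():
        text = text.replace(code, letters)
    # Then expand real Unicode Latin ligature characters (U+FB00..U+FB06)
    # via compatibility decomposition, leaving all other characters alone.
    return "".join(
        unicodedata.normalize("NFKD", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


if __name__ == "__main__":
    print(expand_ligatures("\x1enal \x1foor plan"))
```

If the two codes turn out to be the other way round in a given font, only the dictionary needs to change.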

The font that is used in the pdf is OpenType.

If it is possible, I would guess that Ghostscript is the tool that should do it, but I can't find out how.

/Oskar
 
Old 09-30-2009, 09:32 PM   #2
lsaffre
LQ Newbie
 
Registered: Jul 2008
Location: Estonia
Distribution: Debian, Ubuntu
Posts: 8

Rep: Reputation: 0
Hi Oskar,

I don't believe that Ghostscript can do it.
I'd rather use pdftotext on the server to create a .txt version of each .pdf file. Clients can then choose for themselves which one they want.
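A rough sketch of that idea, assuming pdftotext is installed on the server and on the PATH. The function names and the directory in the usage note are made up for illustration. The `-layout` option asks pdftotext to keep the physical page layout; whether that helps on multi-column pages would need to be tested on the real files.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext command line for one PDF file."""
    pdf_path = Path(pdf_path)
    command = ["pdftotext"]
    if keep_layout:
        # -layout: keep the original physical layout of the page text.
        command.append("-layout")
    # The output file sits next to the PDF, with a .txt extension.
    command += [str(pdf_path), str(pdf_path.with_suffix(".txt"))]
    return command


def convert_directory(directory):
    """Write a .txt file next to every .pdf file in `directory`."""
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext fails.
        subprocess.run(pdftotext_command(pdf_path), check=True)
```

Calling `convert_directory("/srv/samba/pdf")` (a hypothetical share path) from a cron job would keep the .txt files next to the PDFs that clients see over Samba.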

Luc
 
Old 10-01-2009, 01:42 AM   #3
oskbur
LQ Newbie
 
Registered: Sep 2009
Posts: 3

Original Poster
Rep: Reputation: 0
No, that is not an option. pdftotext sometimes mixes lines from different columns, so the result is impossible to follow.

Any other ideas?
 
Old 10-02-2009, 12:48 AM   #4
lsaffre
LQ Newbie
 
Registered: Jul 2008
Location: Estonia
Distribution: Debian, Ubuntu
Posts: 8

Rep: Reputation: 0
I still think that Ghostscript cannot solve this problem. What you need is something that analyzes a PDF, similar to OCR software. pdftotext should basically do exactly this. If it fails in some cases, you should go to http://poppler.freedesktop.org/, find the people who maintain the pdftotext module, and submit some cases where it mixes up lines.

But note that I'm not a PDF expert, so don't take my advice as gospel...

Luc
 
Old 10-04-2009, 11:06 AM   #5
oskbur
LQ Newbie
 
Registered: Sep 2009
Posts: 3

Original Poster
Rep: Reputation: 0
Ok, thank you for your reply. I will consider that.

/Oskar
 
  

