
oskbur 09-01-2009 05:40 AM

undo / replace ligature

Is it possible to remove all ligatures like fi and fl in a PDF file and replace them with the actual letters? When I copy/paste text from some PDF files, the ligatures come out as 001E and 001F, and it is impossible to know which letters they should be.

So I am thinking that if I could convert the ligatures before I copy/paste, my problem would be solved. pdftotext can extract the real letters, but I need the PDF structure intact. If I use evince the letters are correct, but evince cannot copy the text in the right order. I use Win XP when I do the copying, but all the PDF pages are on Linux servers, so it would be easy to convert the pages before they are shared with Samba (if I knew how).

The font that is used in the pdf is OpenType.

If it is possible I would guess that it is ghostscript that should do it but I can't find out how.
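
As a stopgap that works on the pasted text rather than on the PDF itself, the stray codes could be mapped back to letters after copying. The following is only a sketch: the thread does not say which of 001E and 001F stands for fi and which for fl, so the `CONTROL_TO_LETTERS` table is an assumption that should be checked against a known word and swapped if needed. The function name `expand_ligatures` is made up for illustration. The second table covers the standard Unicode ligature characters (U+FB00 to U+FB04), in case a viewer exports those instead.

```python
# Assumed mapping for the control codes seen when pasting. The thread only
# says that fi/fl come out as 001E and 001F, not which is which, so verify
# this against a word you know and swap the two entries if necessary.
CONTROL_TO_LETTERS = {
    0x001E: "fi",  # assumption
    0x001F: "fl",  # assumption
}

# Standard Unicode ligature characters (Alphabetic Presentation Forms).
UNICODE_LIGATURES = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
}


def expand_ligatures(text, control_map=CONTROL_TO_LETTERS):
    """Return text with ligature codes replaced by plain letters."""
    table = dict(UNICODE_LIGATURES)
    table.update(control_map)
    # str.translate accepts a dict mapping code points to replacement strings.
    return text.translate(table)
```

For example, `expand_ligatures("\u001erst \u001foor")` gives `"first floor"` under the assumed mapping. This does not touch the PDF, so the page layout stays as it is; it only cleans up what lands on the clipboard.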


lsaffre 09-30-2009 09:32 PM

Hi Oskar,

I don't believe that GhostScript can do it.
I'd rather try to use pdftotext on the server to create a .txt version of each .pdf file. Clients can then choose for themselves which version they want.
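
A rough sketch of how that could be scripted on the server, assuming `pdftotext` is installed and on the PATH. The helper names `txt_path_for` and `convert_tree` are made up for illustration. The `-layout` option asks pdftotext to keep the physical layout of the page; whether that helps with multi-column pages has not been tested here.

```python
import subprocess
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the .txt path that sits next to the given .pdf file."""
    return Path(pdf_path).with_suffix(".txt")


def convert_tree(root, extra_args=("-layout",)):
    """Run pdftotext on every .pdf below root; return the .txt files written."""
    converted = []
    for pdf in sorted(Path(root).rglob("*.pdf")):
        txt = txt_path_for(pdf)
        # Skip files whose text version is already at least as new as the PDF.
        if txt.exists() and txt.stat().st_mtime >= pdf.stat().st_mtime:
            continue
        # check=True raises CalledProcessError if pdftotext fails on a file.
        subprocess.run(["pdftotext", *extra_args, str(pdf), str(txt)], check=True)
        converted.append(txt)
    return converted
```

Calling `convert_tree("/srv/samba/pdfs")` (a hypothetical share path) from a cron job would keep a .txt file next to each PDF, so Windows clients see both through Samba.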


oskbur 10-01-2009 01:42 AM

No, that is not an option. Pdftotext sometimes mixes lines from different columns so it is impossible to follow.

Any other ideas?

lsaffre 10-02-2009 12:48 AM

I still think that GhostScript cannot solve this problem. What you need is something that analyzes a PDF, similar to OCR software. pdftotext basically should do exactly this. And if it fails in some cases, you should go to, find the people who maintain the pdftotext module, and submit some cases where it mixes up lines.

But note that I'm not a PDF expert, so don't take my advice as gospel...


oskbur 10-04-2009 11:06 AM

Ok, thank you for your reply. I will consider that.


All times are GMT -5.