LinuxQuestions.org


oskbur 09-01-2009 05:40 AM

undo / replace ligature
 
Hi

Is it possible to replace all ligatures, such as fi and fl, with the actual letters in a PDF file? When I copy and paste text from some PDF files, the ligatures come out as the codes 001E and 001F, and it is impossible to tell which letters they stand for.

So I am thinking that if I could convert the ligatures before I copy and paste, my problem would be solved. pdftotext can extract the real letters, but I need the PDF structure intact. In evince the letters are correct, but evince cannot copy the text in the right order. I do the copying on Win XP, but all the PDF pages are on Linux servers, so it would be easy to convert the pages before they are shared with Samba (if I knew how).

The font that is used in the pdf is OpenType.

If it is possible, I would guess that Ghostscript is the tool that should do it, but I can't find out how.

/Oskar
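
As a stopgap on the text side, the pasted text could be cleaned up after the fact. Below is a minimal Python sketch of that idea. The name `expand_ligatures` and the table `ASSUMED_LIGATURES` are made up for illustration, and the assignment of 001E to "fi" and 001F to "fl" is only an assumption; the codes depend on the font in the PDF and would have to be checked, for example against pdftotext output for the same page.

```python
# Hypothetical mapping for the codes seen in the pasted text.
# ASSUMPTION: 001E is "fi" and 001F is "fl". Verify this per document,
# because the codes are font-specific and may differ between PDFs.
ASSUMED_LIGATURES = {
    "\u001e": "fi",
    "\u001f": "fl",
}

# Standard Unicode ligature code points (Alphabetic Presentation Forms).
UNICODE_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text, extra=ASSUMED_LIGATURES):
    """Replace ligature characters in text with their plain letters.

    extra holds the document-specific (assumed) codes; pass an empty
    dict to expand only the standard Unicode ligatures.
    """
    table = str.maketrans({**UNICODE_LIGATURES, **extra})
    return text.translate(table)
```

For example, `expand_ligatures("\u001ele")` gives `"file"` under the assumed mapping. This only repairs text after copying; it does not change the PDF itself.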

lsaffre 09-30-2009 09:32 PM

Hi Oskar,

I don't believe that Ghostscript can do it.
I'd rather use pdftotext on the server to create a .txt version of each .pdf file. Clients can then choose for themselves which one they want.

Luc
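
The server-side conversion suggested above could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, it assumes pdftotext is installed and on the PATH, and the `-layout` option (which asks pdftotext to keep the physical layout of the page) is an optional choice that has not been tested against these particular files.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path):
    """Return the pdftotext command that writes a .txt next to the PDF."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    # -layout asks pdftotext to maintain the original physical layout.
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_all(directory):
    """Run pdftotext on every .pdf file directly inside directory."""
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext fails.
        subprocess.run(build_pdftotext_command(pdf_path), check=True)
```

Calling `convert_all("/srv/share/pdf")` (a hypothetical path for the Samba share) would leave a .txt file beside each .pdf file, so clients can pick either one.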

oskbur 10-01-2009 01:42 AM

No, that is not an option. pdftotext sometimes mixes lines from different columns, so the text is impossible to follow.

Any other ideas?

lsaffre 10-02-2009 12:48 AM

I still think that Ghostscript cannot solve this problem. What you need is something that analyzes a PDF, similar to OCR software. pdftotext should basically do exactly this. If it fails in some cases, you should go to http://poppler.freedesktop.org/, find the people who maintain the pdftotext module, and submit some cases where it mixes up lines.

But note that I'm not a PDF expert, so don't take my advice as gospel...

Luc

oskbur 10-04-2009 11:06 AM

Ok, thank you for your reply. I will consider that.

/Oskar


All times are GMT -5.