LinuxQuestions.org

gacl 07-20-2010 08:12 AM

xpdf Won't Play Nice With UTF-8
 
Hello,

I use Vector Linux 6 (based on Slackware 12.1) and I set the encoding to UTF-8 because I use Spanish characters. The problem is that when I try to open files with accent marks, for instance, with xpdf they all look garbled. How can I get xpdf to display the names correctly? Thanks.

Gus

selfprogrammed 07-23-2010 04:28 PM

Describe garbled.

The usual cause is that there are substitute glyphs for the wanted characters, which would mean that the font xpdf is using does not have the Spanish characters in it. It is possible to have Spanish fonts on your system and yet have programs that do not know enough to use them.
I think many PDFs define their own fonts in the file, and it may be the fault of the PDF file.

sneakyimp 07-23-2010 05:19 PM

I think most Spanish chars are available in the Latin 1 charset. I also think most accented Latin chars are two-byte chars in UTF-8 encoding and only one byte in Latin 1 (plain ASCII does not include them at all). It may be that xpdf doesn't understand UTF-8 encoded chars. xpdf may have an INI setting or preference you can change, or maybe there's a flag you can set somewhere to use UTF-8 encoding. If not, you should probably set your encoding to something xpdf understands.

gacl 07-24-2010 08:08 AM

The content of the files is displayed correctly but not the names. For instance, a file named résumé will be displayed as rÄ©sumÄ© in the open box and in the title bar. I've played around with the .xpdfrc file without success.

sneakyimp 07-24-2010 08:48 AM

Quote:

Originally Posted by gacl (Post 4043894)
The content of the files is displayed correctly but not the names. For instance, a file named résumé will be displayed as rÄ©sumÄ© in the open box and in the title bar. I've played around with the .xpdfrc file without success.

Sounds to me like xpdf can play nice with UTF-8 but that your file system (or whatever system translates a filename into something that appears on your screen) might not. "é" is a 2-byte char when UTF-8 encoded. It's a one-byte char in Latin-1, so when these systems that don't understand UTF-8 look at the filename, they think each é is two other chars.
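
The byte counts described above can be checked with a minimal Python sketch. The filename is written with escapes so that the script does not depend on how the source file itself is encoded, and Latin-1 is only an assumption about what the misbehaving component expects. A different wrong encoding would produce different substitute glyphs.

```python
# Sketch: the same filename bytes read as UTF-8 versus Latin-1.
name = "r\u00e9sum\u00e9"  # "résumé", with é as the single code point U+00E9

utf8_bytes = name.encode("utf-8")      # each é becomes two bytes: 0xC3 0xA9
latin1_bytes = name.encode("latin-1")  # each é is the single byte 0xE9

# A program that assumes Latin-1 (an assumption for this sketch) turns
# every UTF-8 byte into a character of its own.
misread = utf8_bytes.decode("latin-1")

print(len(utf8_bytes), len(latin1_bytes))  # 8 6
print(misread)                             # each é shows up as two characters
```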

selfprogrammed 07-24-2010 01:58 PM

The sensors program prints out a degree symbol. On a console it prints some garbage character, but in an X terminal it prints a degree sign.
Man pages have had some strange character in them (for years) that does not print on consoles.

XPDF is displaying file names (open selection) using its own window and font.
For filenames, it is likely choosing one of the default fonts set up by the KDE system controls.
KDE has a control setup program in the main KDE menu.
Hunt down the KDE display properties and the fonts that are set up.
There are different fonts for different uses and various sizes.
Make sure KDE is set up with fonts that have all the characters you need.
If you use Gnome or any other window manager, do the same thing there.

Still cannot find it?
Write down the settings for all the KDE fonts.
Set them all to some weird easily recognizable font, different for each one.
See if any of the weird fonts show up in XPDF.

Still cannot find it?
Look at the XPDF font closely and write down its characteristics:
serif or sans-serif, how the 'm' is made, the 'g', the 'j',
the 'ae' spacing, and what glyph it displays for the special Spanish characters.
Use the font selector program and go through all the fonts looking for an identical font.
If you find one, disable it.
If XPDF is then forced to use a different font, you have found it.

If you have disabled all the candidates and still cannot change the font XPDF uses, then it must be using an internal font. Some programs do that, but it would be very strange for an X Window program. Such programs cannot be fixed except by getting an updated version.

Get a copy of the XPDF source. Many distributions have them.
Go into the source and find the font used to display filenames.
Fixing it depends upon your programming skills and how badly it is built-in, and you may find something entirely different.

Try a different PDF viewer; there is more than one.

Addendum: Ran strings on XPDF last night and did NOT see any font names, but did see font function calls.
Looked at the KDE fonts, and the file listing font looks like half of them. Was not motivated enough to mess up my own fonts trying to find out which one was being used.

There is a KDE tool to look at all UTF characters. Check the Spanish characters and see if they are one- or two-byte encodings.
I think all of ASCII and the Latin extensions to it encode as one byte. UTF-8 only goes to 2 bytes (and more) for extension pages for the eastern, African, Asian, oriental, Arabic, and other non-Latin languages (and Klingon).

selfprogrammed 07-26-2010 11:54 AM

Checked the KDE character map last night. Searched for some Spanish characters and found four. I was surprised to see that they are giving a UTF-8 encoding using two bytes, with one having three bytes.
They have UTF-16 values that are well under 256.
Having written a document on the coding of UTF-8, from what I remember, there are multiple ways to encode a particular character under UTF-8, with one being considered canonical. Unfortunately, some canonical systems differ.
It does not matter: the creator of the filenames decided which UTF-8 encoding was used, and you are seeing one glyph or two glyphs.
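
For reference, UTF-8 as standardized gives each code point exactly one valid byte sequence: one byte up to U+007F and two bytes from U+0080 to U+07FF. That is why characters with values under 256 but above 127 still take two bytes. One possible explanation for a character that looks the same but takes three bytes is a decomposed form, a base letter followed by a combining accent. This is only a guess about what the character map showed. A minimal Python sketch of that difference:

```python
import unicodedata

composed = "\u00e9"                                  # é as one code point, U+00E9
decomposed = unicodedata.normalize("NFD", composed)  # "e" followed by U+0301

composed_bytes = composed.encode("utf-8")      # two bytes: 0xC3 0xA9
decomposed_bytes = decomposed.encode("utf-8")  # three bytes: 0x65 0xCC 0x81

print(composed_bytes, decomposed_bytes)
print(composed == decomposed)  # False: same glyph, different code points
```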

My Character map showed the Spanish character glyphs, so my default fonts
in Slackware Linux 2.6.33 have those characters.

Going through the XPDF docs (/usr/docs/xpdf-*) there is a long Changelog file that lists many Unicode documents. They were very systematic in their use of Unicode and clearly knowledgeable about it. But it is possible that they did not consider Unicode in filenames.

I am a little suspicious of that filename list. It looks as if they might be using some tool (from KDE or gtk) to do the open file. Many of the tools in KDE display similar boxes for open-file.
This does not help you much, but it points out that it might not be XPDF that is messing up the filenames.

If you post some of the bad filenames then we could play with them too.
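
When posting filenames, the raw bytes are more useful than the rendered text, because the rendering is exactly what is in doubt. A small Python sketch for dumping them follows. The helper name `describe` is made up for this example, and the separator argument to `bytes.hex` needs Python 3.8 or later.

```python
import os

def describe(name: bytes) -> str:
    """Return the raw bytes in hex, then a UTF-8 decoding of them."""
    # errors="replace" keeps the dump going if a name is not valid UTF-8.
    return f"{name.hex(' ')}  {name.decode('utf-8', errors='replace')}"

if __name__ == "__main__":
    # Passing bytes to os.listdir returns the names as raw bytes.
    for entry in os.listdir(b"."):
        print(describe(entry))
```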
But we are unlikely to solve this without examining the XPDF source code.
A bug report to the XPDF support team might be in order, because they would know how they got the filenames displayed. See their /usr/docs for contact info.

gacl 07-30-2010 03:23 PM

Sneakyimp, I think the filesystem is OK because Thunar displays the characters in question just fine.

Selfprogrammed, I do use Evince but XPDF is much lighter and faster. VectorLinux uses XFCE. When I go to "Keyboard Preferences" I can easily type accented characters in the test area.

It seems that this is a recognized bug? (Link: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=422346)

Thanks.


Gus

selfprogrammed 07-31-2010 03:13 PM

I read the bug report. The substitute characters with ~ must be part of the Latin extension applied to many character sets, which means that the glyphs are in the font, but it is using the wrong encoding to get to them.
It looks like it is trying to decode UTF-8 using an IBM locale code page, producing double characters because it is not UTF-8 decoding the two bytes.
I could probably track down which IBM locale it is using, but it would not be of much use. Changing your locale would not help either, because what is missing is the UTF-8 decoding.
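
One way to narrow down which single-byte encoding is being assumed is to decode the UTF-8 bytes of a known name under several candidates and compare the results with what appears on screen. The sketch below is Python, and the candidate list is an arbitrary assumption for illustration, not something taken from the bug report.

```python
# UTF-8 bytes of "résumé", written with escapes to avoid source-encoding issues.
raw = "r\u00e9sum\u00e9".encode("utf-8")

# Candidate single-byte encodings a UTF-8-unaware program might assume.
codecs_to_try = ("latin-1", "cp1252", "cp437", "cp850")
candidates = {codec: raw.decode(codec) for codec in codecs_to_try}

for codec, text in candidates.items():
    print(f"{codec:8} {text}")
```

Each candidate yields eight characters instead of six, since every byte is treated as a character of its own.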

I don't think I can be of much more use, and it looks like a job for the xpdf dev team. Sorry. Disconnecting from thread.

gacl 07-31-2010 06:07 PM

Thank you anyway.


Gus


All times are GMT -5.