LinuxQuestions.org


jot 09-01-2008 11:23 AM

Perl: handling of UTF-8 in XML and HTML
 
Not sure where to start.

Overview: I have a big file of messages in a Western encoding (you could call it some sort of non-standard blog), marked up with HTML. I am now trying to clean things up by storing the messages as XML (more specifically, as RSS 2.0) and displaying them as HTML. The encoding should change to UTF-8.
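For the re-encoding step itself, a minimal sketch using the core Encode module might look like the following. It assumes the "Western" encoding is ISO-8859-1 (it could just as well be cp1252 if the file came from Windows software), and the sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the legacy data is ISO-8859-1; swap in 'cp1252' if needed.
my $legacy_bytes = "Gr\xFC\xDFe";    # "Gruesse" with u-umlaut and sharp s, as ISO-8859-1 bytes

# Step 1: legacy bytes -> Perl character string.
my $text = decode('ISO-8859-1', $legacy_bytes);

# Step 2: character string -> UTF-8 bytes, ready to be written out.
my $utf8_bytes = encode('UTF-8', $text);

printf "%d legacy bytes, %d characters, %d UTF-8 bytes\n",
    length($legacy_bytes), length($text), length($utf8_bytes);
# prints "5 legacy bytes, 5 characters, 7 UTF-8 bytes"
```

Doing the decode once on input and the encode once on output keeps everything in between working on characters rather than bytes.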

Most things work; only the UTF-8 encoding of special characters as entities in the XML drives me nuts. I have been using the XML::RSS Perl module, which, after testing it, may or may not have been a good idea. For example, the encode_output switch is sometimes ignored, depending on which server I run the script on.

It also seems XML::RSS does not correctly support the common way of encoding/decoding UTF-8 entities. In those cases where the "encode_output" switch of XML::RSS does work, it produces something like this for the lower-case 'a' with two dots on top:
Code:

&#xE4;
When XML::RSS reads in entities like this, they are decoded correctly. But common RSS readers do not swallow them.

I have read _thousands_ of websites on the topic, and it _seems_ that the encoding for the above example should have been:
Code:

&#xC3A4;
When XML::RSS reads entities like this, something goes wrong and I see funny characters as a result. After some frustration I wrote the encoding myself to produce the second version, which solves the encoding part, but XML::RSS does not like it.
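As a rough illustration of how the two forms differ (a sketch only, not a statement about what XML::RSS does internally): a numeric character reference in XML names a Unicode code point, whereas C3 A4 is the two-byte UTF-8 encoding of the code point U+00E4. The variable names below are made up for illustration, and only the core Encode module is assumed.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{E4}";    # lower-case a with diaeresis, code point U+00E4

# A numeric character reference is built from the code point.
my $ref = sprintf('&#x%X;', ord($char));

# The UTF-8 encoding of the same character is two bytes: C3 A4.
my $utf8_hex = uc(unpack('H*', encode('UTF-8', $char)));

# Gluing those bytes into a single reference names another
# code point entirely (U+C3A4), not U+00E4.
my $byte_based_ref = "&#x$utf8_hex;";

print "code-point reference: $ref\n";             # &#xE4;
print "UTF-8 bytes (hex):    $utf8_hex\n";        # C3A4
print "byte-based reference: $byte_based_ref\n";  # &#xC3A4;
```

If this reading is right, a parser that decodes the byte-based form to an unexpected character is behaving as the XML rules require.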

Question 1: Are both ways above correct when encoding UTF-8 in XML?

Question 2: Is using XML::RSS a bad idea? Any alternatives?

Question 3: How to best encode those entities in HTML for output?

Question 4: Could it be that RSS readers better support decimal encoding, e.g.
Code:


than the hexadecimal one, e.g.
Code:

&#xC3A4;
?
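On the decimal versus hexadecimal question: the XML specification defines both forms of numeric character reference as naming the same code point, so a conforming parser should treat them identically. Below is a minimal sketch; `decode_numeric_refs` is a hypothetical helper written for this example, not part of XML::RSS or any other module.

```perl
use strict;
use warnings;

# Hypothetical helper: expand decimal (&#NNN;) and hexadecimal (&#xHHH;)
# numeric character references in a single pass. Named entities such as
# &amp; are deliberately left untouched.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

my $from_decimal = decode_numeric_refs('&#228;');
my $from_hex     = decode_numeric_refs('&#xE4;');

print $from_decimal eq $from_hex ? "same character\n" : "different characters\n";
printf "U+%04X\n", ord($from_decimal);
# prints "same character" and then "U+00E4"
```

Either form should therefore work as long as the number is the code point; alternatively, a feed declared as UTF-8 can carry the raw UTF-8 bytes with no reference at all.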



Thanks for any hints!! Am stuck here and feel oblivious.

chrism01 09-02-2008 06:31 PM

Those are some fairly specialized questions. While you're waiting here, you may also want to ask at www.perlmonks.org. It's where the Perl gurus hang out.
But do post the solution here when you get it so we all benefit.

jot 07-26-2009 01:14 AM

too old Perl
 
Btw, it slowly turned out this was all due to an old version of Perl and the corresponding old Perl modules. Things have probably been fixed in newer Perl releases.

As I had to run things on some provider's server, I could not influence the Perl version deployed. So I gave up on making this work. I might just switch my software to PHP, which hosting providers usually keep more up to date. This type of thing (UTF encoding etc.) usually works more smoothly in PHP in my experience.

Sergei Steshenko 07-26-2009 02:15 AM

Quote:

Originally Posted by jot (Post 3620447)
Btw, it slowly turned out this was all due to an old version of Perl and the corresponding old Perl modules. Things have probably been fixed in newer Perl releases.

As I had to run things on some provider's server, I could not influence the Perl version deployed. So I gave up on making this work. I might just switch my software to PHP, which hosting providers usually keep more up to date. This type of thing (UTF encoding etc.) usually works more smoothly in PHP in my experience.

You could.

perl-5.10.0 is relocatable/portable (when built accordingly), i.e. you could build your own copy of perl-5.10.0 in any directory you have write permission to, and run it either from where it was built or from any other directory you copy it to.

jot 07-26-2009 05:04 AM

Thanks Sergei, an option one could consider. And in any case, I still am a Perl fan.

Sergei Steshenko 07-26-2009 02:53 PM

Quote:

Originally Posted by jot (Post 3620588)
Thanks Sergei, an option one could consider. And in any case, I still am a Perl fan.

I am using a self-built perl-5.10.0 more and more on my own SUSE 10.3 box. I find it more convenient, and sooner or later I will have to deliver a self-contained tarball (Perl + my app), etc.

