LinuxQuestions.org
Old 09-01-2008, 11:23 AM   #1
jot
LQ Newbie
 
Registered: Aug 2004
Location: Singapore
Distribution: Ubuntu and Fedora
Posts: 25

Rep: Reputation: 0
Perl: handling of UTF-8 in XML and HTML


Not sure where to start.

Overview: I have a big file of Western-encoded messages (you could call it a sort of non-standard blog), marked up with HTML. I am now trying to clean things up, storing the messages as XML (more specifically, as RSS 2.0) and displaying them as HTML. The encoding should change to UTF-8.

Most things work; only the UTF-8 encoding of special characters for the XML drives me nuts. I have been using the XML::RSS Perl module, which, after testing it, may or may not have been a good idea. For example, the encode_output switch is sometimes ignored, depending on which server I run the script on.

It also seems that XML::RSS does not correctly support the common way of encoding/decoding UTF-8 entities. In those cases where the "encode_output" switch of XML::RSS does work, it produces something like this for the lower-case 'a' with two dots on top:
Code:
ä
When XML::RSS reads in entities like this, they are decoded correctly. But common RSS readers do not accept them.

I have read _thousands_ of websites on the topic, and it _seems_ that the encoding for the above example should have been:
Code:
쎤
When XML::RSS reads entities like this, something goes wrong and I see funny characters as a result. After some frustration I wrote the encoding myself to produce the second version, which solves the encoding part, but XML::RSS does not like it.
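A minimal Perl sketch of the distinction that seems to be at issue here. A numeric character reference in XML or HTML names a Unicode code point, not the UTF-8 bytes of the character. The 'a' with two dots is U+00E4, and its UTF-8 encoding is the two bytes C3 A4. The sketch assumes that the second form above was built by joining those two bytes into a single reference; the raw text of the examples is no longer visible in the post, so that is a guess. Only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The lower-case 'a' with two dots is the single code point U+00E4.
my $char = "\x{E4}";

# A numeric character reference names the code point itself.
my $hex_ref = sprintf('&#x%X;', ord($char));    # &#xE4;
my $dec_ref = sprintf('&#%d;',  ord($char));    # &#228;

# Encoding the character as UTF-8 gives two bytes, 0xC3 and 0xA4.
my $bytes     = encode('UTF-8', $char);
my $hex_bytes = uc(unpack('H*', $bytes));       # C3A4

# Assumed mistake: gluing the byte values into one reference.
# This names code point U+C3A4, which is a different character.
my $wrong_ref = "&#x$hex_bytes;";               # &#xC3A4;

# prints: &#xE4; &#228; C3A4 &#xC3A4;
print "$hex_ref $dec_ref $hex_bytes $wrong_ref\n";
```

If that assumption holds, the first form is the valid reference for this character. In a document that declares UTF-8, the character can also be written directly as its two bytes with no reference at all.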

Question 1: Are both ways above correct when encoding UTF-8 in XML?

Question 2: Is using XML::RSS a bad idea? Any alternatives?

Question 3: How to best encode those entities in HTML for output?

Question 4: Could it be that RSS readers better support decimal encoding, e.g.
Code:
than the hexadecimal one, e.g.
Code:
쎤
?
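On Questions 1, 3 and 4, a small Perl sketch under stated assumptions: decimal and hexadecimal numeric character references are two spellings of the same code point, so a conforming XML or HTML parser should decode them identically. Whether a particular RSS reader does so is not something this sketch can show. The helper names `to_numeric_refs` and `from_numeric_refs` are hypothetical and not part of XML::RSS. They expect decoded character strings rather than raw bytes, and they do not escape `&`, `<` or `>`.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character of a decoded
# string with a numeric character reference ('hex' by default, or 'dec').
sub to_numeric_refs {
    my ($text, $style) = @_;
    my $fmt = (defined $style && $style eq 'dec') ? '&#%d;' : '&#x%X;';
    $text =~ s/([^\x00-\x7F])/sprintf($fmt, ord($1))/ge;
    return $text;
}

# Hypothetical helper: turn hex or decimal references back into characters.
# One pass, so text produced by a replacement is never decoded a second time.
sub from_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

my $word = "Gr\x{FC}\x{DF}e";                   # 'u' with two dots, sharp s

print to_numeric_refs($word), "\n";             # Gr&#xFC;&#xDF;e
print to_numeric_refs($word, 'dec'), "\n";      # Gr&#252;&#223;e

# Both spellings decode to the same character.
print "same\n" if from_numeric_refs('&#xE4;') eq from_numeric_refs('&#228;');
```

The output is plain ASCII, so it can be written to an XML or HTML file regardless of the declared encoding.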



Thanks for any hints!! Am stuck here and feel oblivious.

Last edited by jot; 09-01-2008 at 11:35 AM. Reason: my example entities got translated
 
Old 09-02-2008, 06:31 PM   #2
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,232

Rep: Reputation: 2024
Those are some fairly specialized questions. While you're waiting here, you may also want to ask at www.perlmonks.org. It's where the Perl gurus hang out.
But do post the solution here when you get it so we all benefit.
 
Old 07-26-2009, 01:14 AM   #3
jot
LQ Newbie
 
Registered: Aug 2004
Location: Singapore
Distribution: Ubuntu and Fedora
Posts: 25

Original Poster
Rep: Reputation: 0
too old Perl

Btw, it slowly turned out this was all due to an old version of Perl and the corresponding old Perl modules. Things have probably been fixed in newer Perl releases.

As I had to run things on some provider's server, I could not influence the Perl version deployed. So I gave up on making this work. I might just switch my software to PHP, which hosting providers usually keep more up to date. This kind of thing (UTF encoding etc.) usually works more smoothly in PHP in my experience.
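A minimal sketch of how one might check which Perl and which module versions a provider's server actually has, assuming a script can be run there. The helper name `module_version` is hypothetical. XML::RSS and Encode are simply the two modules relevant to this thread, and the script reports a module as missing rather than failing.

```perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# or undef if the module cannot be loaded on this machine.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;     # XML::RSS -> XML/RSS.pm
    my $ok = eval { require $file; 1 };
    return unless $ok;
    return $module->VERSION;
}

# $] is the interpreter version as a number, $^X the path to the binary.
print "perl $] ($^X)\n";

for my $module (qw(XML::RSS Encode)) {
    my $version = module_version($module);
    printf "%-10s %s\n", $module, defined $version ? $version : 'not installed';
}
```

Running the same script on each server makes it easier to see whether differing behaviour lines up with differing versions.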
 
Old 07-26-2009, 02:15 AM   #4
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Quote:
Originally Posted by jot
Btw, it slowly turned out this was all due to an old version of Perl and the corresponding old Perl modules. Things have probably been fixed in newer Perl releases.

As I had to run things on some provider's server, I could not influence the Perl version deployed. So I gave up on making this work. I might just switch my software to PHP, which hosting providers usually keep more up to date. This kind of thing (UTF encoding etc.) usually works more smoothly in PHP in my experience.
You could.

perl-5.10.0 is relocatable/portable (when you build it accordingly), i.e. you could build your own copy of perl-5.10.0 in any directory you have write permission to and use it either from the place where it was built or from any other directory you copy it to.
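A small Perl sketch for checking whether a given interpreter was built this way. It assumes the build used the Configure option -Duserelocatableinc, which perl 5.10.0 introduced and which is recorded in the core Config module. The exact build commands are not part of this thread and are not shown.

```perl
use strict;
use warnings;
use Config;

# 'userelocatableinc' is 'define' when perl locates its @INC
# directories relative to the perl binary at run time.
my $relocatable = defined $Config{userelocatableinc}
    && $Config{userelocatableinc} eq 'define';

print "interpreter : $^X\n";
print "prefix      : $Config{prefix}\n";
print "relocatable : ", ($relocatable ? 'yes' : 'no'), "\n";
print "\@INC entry  : $_\n" for @INC;
```

Run with the self-built binary, this shows which installation and library directories are actually in use after the tree has been copied elsewhere.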
 
Old 07-26-2009, 05:04 AM   #5
jot
LQ Newbie
 
Registered: Aug 2004
Location: Singapore
Distribution: Ubuntu and Fedora
Posts: 25

Original Poster
Rep: Reputation: 0

Thanks Sergei, an option one could consider. And in any case, I still am a Perl fan.
 
Old 07-26-2009, 02:53 PM   #6
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Quote:
Originally Posted by jot
Thanks Sergei, an option one could consider. And in any case, I still am a Perl fan.
I am using a self-built perl-5.10.0 more and more on my own SUSE 10.3 box; I find it more convenient, and sooner or later I will have to deliver a self-contained tarball (Perl + my app) anyway.
 
  



Tags
encoding, html, perl, rss, utf8, xml

