LinuxQuestions.org
Linux - Software This forum is for Software issues.
11-15-2017, 02:01 PM   #1
RandomTroll
Senior Member
Registered: Mar 2010
Distribution: Slackware
Posts: 1,328
Rep: Reputation: 223
Is there a utility to convert UTF to HTML metacharacters?


I convert the pages I download from UTF-8 to HTML entities (I prefer ISO-8859). I use utf8trans, but it relies on a character table, and that table is incomplete. I've been adding custom entries for years, but that's lame. Converting UTF's 23f0 to &#x23f0; (for example) would be good enough for me; I could write a script that adds all those entries to my character table, but that would be inelegant and would make the table huge. Hmmmm... there's probably a C function to do that, which would make for an easy program.
 
11-15-2017, 02:24 PM   #2
Turbocapitalist
Senior Member
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 4,364
Blog Entries: 3
Rep: Reputation: 2178
There's a CPAN module for perl that does that already. You just need to specify the range of characters to be encoded. Here's a one-liner to encode x0100 through x2fa1f:

Code:
perl -MHTML::HTML5::Entities -p \
 -e 'print encode_entities($_,"\x{0100}..\x{2FA1F}")' \
 < input.html > output.html
On APT-based distros the package for that module is libhtml-html5-entities-perl; RPM-based distros should have something similar.
 
11-15-2017, 11:34 PM   #3
RandomTroll
Senior Member
Registered: Mar 2010
Distribution: Slackware
Posts: 1,328
Original Poster
Rep: Reputation: 223
Thanks.

I have 2 problems: output.html has everything in input.html doubled, like the problem -n cures for sed.

It converts periods to &#x2e, which is correct, but unwanted, and an easy fix with sed.

It isn't in Slackware, but I'm used to building my own Perl modules.

Last edited by RandomTroll; 11-16-2017 at 12:57 AM.
 
11-16-2017, 04:29 AM   #4
Turbocapitalist
Senior Member
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 4,364
Blog Entries: 3
Rep: Reputation: 2178
Sorry. The mistake was on my end. map() is needed to build the replacement string. The -C is needed to force I/O as UTF-8. Unicode is still a bit unfamiliar.

Also the -p should be a -n instead:

Code:
perl -CSD -MHTML::HTML5::Entities -n \
      -e '$unicode=join("",map ({chr} 0x100 .. 0x2FA1F));
          print encode_entities($_,$unicode)' \
< input.html > output.html
 
11-16-2017, 11:45 AM   #5
RandomTroll
Senior Member
Registered: Mar 2010
Distribution: Slackware
Posts: 1,328
Original Poster
Rep: Reputation: 223
Thanks. I have 1 last question: how do I set an environment variable? The man page for this function says:
Quote:
$HTML::HTML5::Entities::hex

This variable controls whether numeric entities will use hexadecimal or decimal notation. It is TRUE (hexadecimal) by default, but can be set to FALSE.
I'd like to switch to decimal, but every attempt I have made to set this variable to false has been rejected.
 
11-16-2017, 11:55 AM   #6
Turbocapitalist
Senior Member
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 4,364
Blog Entries: 3
Rep: Reputation: 2178
It's not an environment variable, just a regular Perl variable that lives in the namespace of that module. Set it at the beginning of the one-liner.

Code:
perl -CSD -MHTML::HTML5::Entities -n \
  -e '$HTML::HTML5::Entities::hex=0;
      $unicode=join("",map ({chr} 0x100 .. 0x2FA1F));
      print encode_entities($_,$unicode)' \
  < input.html > output.html
 
11-16-2017, 03:15 PM   #7
RandomTroll
Senior Member
Registered: Mar 2010
Distribution: Slackware
Posts: 1,328
Original Poster
Rep: Reputation: 223
While the perl script worked okay for a short test file, it took minutes on a real-life 80K file (that had only 6 UTF characters to convert). I wrote a program to emit all the entries for a complete character table for utf8trans, and with that table utf8trans only takes a second longer than before. I had feared that a 65K-entry table would slow utf8trans down a lot, but I don't notice any slowdown.
 
11-16-2017, 03:27 PM   #8
Turbocapitalist
Senior Member
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 4,364
Blog Entries: 3
Rep: Reputation: 2178
I also only tested on a small data set. The -n and -p options make perl wrap the code in a loop over the input lines, and the join() and map() calls were inside that loop, so the character string was rebuilt for every line.

Out of curiosity, does moving them out of the loop speed things up noticeably for the larger data set?

Code:
#!/usr/bin/perl -CSD

use HTML::HTML5::Entities;

use strict;
use warnings;

$HTML::HTML5::Entities::hex=0;

my $unicode=join("",map({chr} 0x100 .. 0x2FA1F));
my $file = shift || '/dev/stdin';

open(my $in, "<", $file)
    or die("Cannot open '$file' : $!\n");

while (my $line = <$in>) {
    print encode_entities($line, $unicode);
}

close($in);

exit(0);

Last edited by Turbocapitalist; 11-16-2017 at 03:28 PM.
 
11-17-2017, 09:56 PM   #9
RandomTroll
Senior Member
Registered: Mar 2010
Distribution: Slackware
Posts: 1,328
Original Poster
Rep: Reputation: 223
Quote:
Too late for "-CSD" option at bin/TestIt line 1.
It took 26 seconds, which is a lot faster than the 255 seconds the previous version took, but still a lot more than 1. Thanks for trying. The test file was https://www.nytimes.com/2017/11/16/b...me-warner.html
 
11-18-2017, 02:06 PM   #10
Turbocapitalist
Senior Member
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 4,364
Blog Entries: 3
Rep: Reputation: 2178
No problem. Scripts are a quick solution in terms of writing time, not necessarily in terms of run time. Can you go into more detail about how you solved it? For some people, C is comfortable and they can write something equivalent quickly.
 
11-18-2017, 02:44 PM   #11
RandomTroll
Senior Member
Registered: Mar 2010
Distribution: Slackware
Posts: 1,328
Original Poster
Rep: Reputation: 223
utf8trans translates UTF-8 characters according to a table. Each entry in the table has 2 fields separated by a tab: the number of the UTF character to replace, written as a hex string (e.g., 2d0a), and the string to translate it into. The Slackware package comes with a translation table that doesn't have all the characters I want to translate. I had added individual characters as I encountered them. I got tired of this, so I wrote a C program that emitted a complete set of entries and appended its output to my table. I didn't do this in the first place because I feared that so large a table would slow utf8trans down. That turned out not to be the case.
 
All times are GMT -5.