LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 11-17-2019, 10:24 PM   #1
gacl
Member
 
Registered: Feb 2004
Posts: 64

Rep: Reputation: 23
Searching a doc. for UTF-8 hex instances and converting


I'm trying to write a script that searches a document for UTF-8 hex and converts each instance found to its corresponding character. Does this exist already in Bash or, more broadly, in Linux?

If it doesn't, is there a library that I can use or a CSV file that I can use as a database to feed to sed?

Thanks.
 
Old 11-17-2019, 10:38 PM   #2
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,862
Blog Entries: 1

Rep: Reputation: 1869
What is 'UTF-8 hex'? Please give an example of the input and output file.
 
Old 11-18-2019, 07:21 AM   #3
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
Quote:
Originally Posted by NevemTeve View Post
What is 'UTF-8 hex'? Please give an example of the input and output file.
https://en.wikipedia.org/wiki/UTF-8
Quote:
Originally Posted by gacl View Post
I'm trying to write a script that searches a document for UTF-8 hex and converts each instance found to its corresponding character. Does this exist already in Bash or, more broadly, in Linux?

If it doesn't, is there a library that I can use or a CSV file that I can use as a database to feed to sed?
I usually use sed and awk for text processing; there is also the tr command, which I've not used much. You can of course call any of them from a script. I believe you have to tell your environment that you are working in UTF-8. Sorry, you'll have to search for things like "using UTF-8 and sed" (or awk, or tr); I've not done this type of conversion myself.
 
Old 11-18-2019, 10:50 AM   #4
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,862
Blog Entries: 1

Rep: Reputation: 1869
Hi, if you know what the OP means by "UTF-8 hex", please give some examples.
 
Old 11-18-2019, 11:29 AM   #5
michaelk
Moderator
 
Registered: Aug 2002
Posts: 25,700

Rep: Reputation: 5895
Yes, they exist in Linux and are called hex editors. I'm not sure which ones are UTF-8 aware, but there are both command-line and GUI programs. There are lots of hex editors; some examples are:
Command line
hexedit
xxd

GUI
wxMEdit
ghex
bless

vim has the capability of displaying a file as hex code.

If you look at the posted link there is a table that shows the hex code for each character. Since UTF-8 is backwards compatible with standard ASCII, the letter A (code point U+0041) is stored as the single byte 0x41.
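
To see the bytes behind a character without opening a hex editor, something like the following should work in a POSIX shell. This is only a minimal sketch: hex_bytes is a made-up helper name, and it assumes the script file and terminal both use UTF-8.

```shell
#!/bin/sh
# Print the bytes of a string as space-separated hex values.
# od does the dump; the unquoted $(...) lets the shell collapse od's padding.
hex_bytes() {
    echo $(printf '%s' "$1" | od -An -tx1)
}

hex_bytes 'A'    # prints: 41      (plain ASCII, one byte)
hex_bytes 'á'    # prints: c3 a1   (U+00E1 takes two bytes in UTF-8)
```

The second call shows why accented letters turn up as two hex pairs in a dump.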

Last edited by michaelk; 11-18-2019 at 11:32 AM.
 
1 members found this post helpful.
Old 11-18-2019, 11:52 AM   #6
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,263
Blog Entries: 24

Rep: Reputation: 4194
It would be helpful if the OP could provide an actual example of the file they wish to convert.

It is not clear to me whether the source file is Unicode, or perhaps ASCII with Unicode characters represented as hex digits, or a hex dump of a Unicode or ASCII file, formatted or raw. For example, I can imagine any of the following fitting the vague description given:

Code:
Some text including some unicode characters: ± « §.

Some text including some unicode character hex values: c2b1 c2ab c2a7.

00000000: 536f 6d65 2074 6578 7420 696e 636c 7564  Some text includ
00000010: 696e 6720 736f 6d65 2075 6e69 636f 6465  ing some unicode
00000020: 2063 6861 7261 6374 6572 733a 20c2 b120   characters: ..
00000030: c2ab 20c2 a72e 0a                        .. ....

536f6d65207465787420696e636c7564696e6720736f6d6520756e69636f
646520636861726163746572733a20c2b120c2ab20c2a72e0a
 
2 members found this post helpful.
Old 11-24-2019, 12:09 AM   #7
gacl
Member
 
Registered: Feb 2004
Posts: 64

Original Poster
Rep: Reputation: 23
Basically, I'm getting text files already encoded in UTF-8 that look like this:
Code:
Destino
-------
Alcal%C3%A1 de Henares
Legan%C3%A9s
San Sebasti%C3%A1n
Getafe
Burgos
When it should be this:
Code:
Destino
-------
Alcalá de Henares
Leganés
San Sebastián
Getafe
Burgos
I suppose that the file was UTF-8, at some point it was converted to ASCII, and then back to UTF-8(?)
 
Old 11-24-2019, 04:21 AM   #8
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,862
Blog Entries: 1

Rep: Reputation: 1869
This is 'urlencoded' (not entirely though)
https://www.php.net/manual/en/function.rawurldecode.php

Last edited by NevemTeve; 11-24-2019 at 04:32 AM.
 
1 members found this post helpful.
Old 11-24-2019, 06:22 PM   #9
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,263
Blog Entries: 24

Rep: Reputation: 4194
NevemTeve appears to be right: that is not raw UTF-8 or simple ASCII, it is URL-encoded (each non-ASCII byte is written as % followed by two hex digits).

PHP's urldecode() is probably the easiest single solution, but I found several interesting approaches here and here, although some have a few caveats on use, mostly related to '+' signs for spaces and text with embedded backslashes.

See if you can find one of those to meet your needs and let us know if you need more help!

Last edited by astrogeek; 11-24-2019 at 06:24 PM.
 
1 members found this post helpful.
Old 11-25-2019, 04:07 AM   #10
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,862
Blog Entries: 1

Rep: Reputation: 1869
Google offered a simple Perl solution; I made it into a little script (reads stdin, writes stdout):
Code:
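#!/usr/local/bin/perl
use strict;
use warnings;

# Replace every %XX escape with the single byte it encodes (%C3%A1 -> 0xC3 0xA1).
sub url_decode {
    my $rv = $_[0];
    # g: all matches, e: evaluate the replacement as code, i: any hex case, x: allow spacing
    $rv =~ s/\%([a-f\d]{2})/ pack 'C', hex $1 /geix;
    return $rv;
}

# Filter: read stdin (or files named as arguments) line by line, write to stdout.
while (<>) {
    print url_decode ($_);
}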
test-run:
Code:
echo '%c3%a1%72%76%c3%ad%7a%74%c5%b1%72%c5%91
%74%c3%bc%6b%c3%b6%72%66%c3%ba%72%c3%b3%67%c3%a9%70' |
perl rawurldecode.pl

árvíztűrő
tükörfúrógép

Last edited by NevemTeve; 11-25-2019 at 04:10 AM.
 
3 members found this post helpful.
Old 11-29-2019, 07:20 PM   #11
gacl
Member
 
Registered: Feb 2004
Posts: 64

Original Poster
Rep: Reputation: 23
The perl function works. Thanks.

Do pack and hex work because of the e flag?
 
Old 11-29-2019, 10:01 PM   #12
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,862
Blog Entries: 1

Rep: Reputation: 1869
Yes, 'e' means 'evaluate'.
https://perldoc.perl.org/perlre.html#Modifiers
 
Old 12-01-2019, 10:05 PM   #13
gacl
Member
 
Registered: Feb 2004
Posts: 64

Original Poster
Rep: Reputation: 23
Thanks for introducing me to Perl. I like it.

Problem solved!
 
  



All times are GMT -5.
