[SOLVED] Searching a doc. for UTF-8 hex instances and converting

gacl · 11-17-2019, 10:24 PM

I'm trying to write a script that searches a document for UTF-8 hex and converts each instance found to its corresponding character. Does this exist already in Bash or, more broadly, in Linux?

If it doesn't, is there a library that I can use or a CSV file that I can use as a database to feed to sed?

Thanks.

NevemTeve · 11-17-2019, 10:38 PM

What is 'UTF-8 hex'? Please give an example of an input file and an output file.

rtmistler · 11-18-2019, 07:21 AM

Quote:

Originally Posted by NevemTeve

What is 'UTF-8 hex'? Please give an example of an input file and an output file.

https://en.wikipedia.org/wiki/UTF-8

Quote:

Originally Posted by gacl

I'm trying to write a script that searches a document for UTF-8 hex and converts each instance found to its corresponding character. Does this exist already in Bash or, more broadly, in Linux?

If it doesn't, is there a library that I can use or a CSV file that I can use as a database to feed to sed?

I usually use sed and awk for text processing; there is also the tr command, which I've not used much. You can of course call any of them from a script. I believe you have to tell your environment that you are working in UTF-8. Sorry, you'll have to search for things like "using UTF-8 and sed" (or awk, or tr); I've not done this type of conversion myself.

NevemTeve · 11-18-2019, 10:50 AM

Hi, if you know what the OP means by "UTF-8 hex", please give some examples.

michaelk · 11-18-2019, 11:29 AM

Yes, they exist in Linux and are called hex editors. I'm not sure which ones can handle UTF-8, but there are both command-line and GUI programs. There are lots of hex editors; some examples are:
Command line
hexedit
xxd

GUI
wxMEdit
ghex
bless

vim has the capability of displaying a file as hex code.

If you look at the posted link, there is a table that shows the hex code for each character. Since UTF-8 is backwards compatible with standard ASCII, the letter A (code point U+0041) is represented by the single byte 0x41.
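
A quick way to compare a character's code point with the bytes UTF-8 actually stores for it is a few lines of Python. This is only an illustrative sketch; Python and the helper name `describe` are choices made for this example, not something from the linked page.

```python
# Show the Unicode code point next to the UTF-8 byte sequence (in hex).
def describe(ch: str) -> str:
    code_point = f"U+{ord(ch):04X}"       # the character's number in Unicode
    utf8_hex = ch.encode("utf-8").hex()   # the bytes stored in a UTF-8 file
    return f"{ch} {code_point} {utf8_hex}"

for ch in "Aáé":
    print(describe(ch))
```

ASCII characters such as A take one byte, while accented letters such as á take two.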

astrogeek · 11-18-2019, 11:52 AM

It would be helpful if the OP could provide an actual example of the file they wish to convert.

It is not clear to me whether the source file is Unicode, or perhaps ASCII with Unicode characters represented as hex digits, or a hex dump of a Unicode or ASCII file, formatted or raw. For example, I can imagine any of the following fitting the vague description given:

Code:

Some text including some unicode characters: ± « §.

Some text including some unicode character hex values: c2b1 c2ab c2a7.

00000000: 536f 6d65 2074 6578 7420 696e 636c 7564  Some text includ
00000010: 696e 6720 736f 6d65 2075 6e69 636f 6465  ing some unicode
00000020: 2063 6861 7261 6374 6572 733a 20c2 b120   characters: ..
00000030: c2ab 20c2 a72e 0a                        .. ....

536f6d65207465787420696e636c7564696e6720736f6d6520756e69636f
646520636861726163746572733a20c2b120c2ab20c2a72e0a

gacl · 11-24-2019, 12:09 AM

Basically, I'm getting text files already encoded in UTF-8 that look like this:

Code:

Destino
-------
Alcal%C3%A1 de Henares
Legan%C3%A9s
San Sebasti%C3%A1n
Getafe
Burgos

When it should be this:

Code:

Destino
-------
Alcalá de Henares
Leganés
San Sebastián
Getafe
Burgos

I suppose that the file was UTF-8, at some point it was converted to ASCII, and then back to UTF-8(?)

NevemTeve · 11-24-2019, 04:21 AM

This is 'urlencoded' (not entirely though)
https://www.php.net/manual/en/function.rawurldecode.php

astrogeek · 11-24-2019, 06:22 PM

NevemTeve appears to be right: that is not raw UTF-8 text. It is URL-encoded (percent-encoded), with each UTF-8 byte written as % followed by two hex digits.

PHP's urldecode() is probably the easiest single solution, but I found several interesting approaches here and here, although some have a few caveats on use, mostly related to '+' signs for spaces and text with embedded backslashes.

See if you can find one of those to meet your needs and let us know if you need more help!
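
As one more possible approach, Python's standard library has `urllib.parse.unquote` and `unquote_plus`. The sketch below is only an illustration, using a few lines from the sample posted earlier in the thread; it also shows the '+' caveat, since only `unquote_plus` turns '+' into a space.

```python
from urllib.parse import unquote, unquote_plus

# Sample lines taken from the file excerpt earlier in the thread.
lines = ["Alcal%C3%A1 de Henares", "San Sebasti%C3%A1n", "Getafe"]

# unquote decodes %XX escapes as UTF-8 by default and leaves '+' alone.
decoded = [unquote(line) for line in lines]
for line in decoded:
    print(line)

# The '+' caveat: form-style encoding uses '+' for a space.
print(unquote("1+1%3D2"))       # '+' is kept
print(unquote_plus("1+1%3D2"))  # '+' becomes a space
```

For plain text files like the one here, `unquote` is probably the safer choice, because a literal '+' in the data stays untouched.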

NevemTeve · 11-25-2019, 04:07 AM

Google offered a simple Perl solution; I made it into a little script (reads stdin, writes stdout):

Code:

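#!/usr/bin/env perl
use strict;
use warnings;

# Replace every %XX escape with the single byte whose value is hex XX.
sub url_decode {
    my ($text) = @_;
    # g: replace all matches, e: evaluate the replacement as Perl code,
    # i: accept upper- and lower-case hex digits.
    $text =~ s/%([0-9a-f]{2})/pack 'C', hex $1/gei;
    return $text;
}

# Read stdin (or files named on the command line) line by line.
while (my $line = <>) {
    print url_decode($line);
}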

Test run:

Code:

echo '%c3%a1%72%76%c3%ad%7a%74%c5%b1%72%c5%91
%74%c3%bc%6b%c3%b6%72%66%c3%ba%72%c3%b3%67%c3%a9%70' |
perl rawurldecode.pl

árvíztűrő
tükörfúrógép

gacl · 11-29-2019, 07:20 PM

The perl function works. Thanks.

Do pack and hex work because of the e flag?

NevemTeve · 11-29-2019, 10:01 PM

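Yes, 'e' means 'evaluate': the replacement part is run as Perl code for each match instead of being used as a literal string, so pack and hex are actually called.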
https://perldoc.perl.org/perlre.html#Modifiers
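
For comparison, the same idea (running code for every match) can be sketched in Python, where `re.sub` accepts a function as the replacement. This is only an illustrative analogue of the Perl script, not a replacement for it; it works on bytes so that multi-byte UTF-8 sequences come out intact.

```python
import re

# One %XX escape: a percent sign followed by two hex digits.
ESCAPE = re.compile(rb"%([0-9a-fA-F]{2})")

def url_decode(data: bytes) -> bytes:
    # The lambda is called once per match, much like the replacement
    # expression in Perl's s///e; it returns the byte with that hex value.
    return ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)

encoded = b"%c3%a1%72%76%c3%ad%7a%74%c5%b1%72%c5%91"
print(url_decode(encoded).decode("utf-8"))
```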

gacl · 12-01-2019, 10:05 PM

Thanks for introducing me to Perl. I like it.

Problem solved!