Help Reading Data in Hex Editors and Commands

HalfMadDad · 08-27-2016, 08:51 PM

Hi Everyone

I am trying to learn more about Unicode. I am working with a file that contains IPA (International Phonetic Alphabet) text. It seems to be a mess, with some characters stored as 1-byte ASCII and some as 2-byte Unicode, and I am just trying to sort things out.

My understanding is that the various Unicode sets all start with the ASCII characters, so it might just be a case of padding the 1-byte ASCII characters with leading zeros to make every character 2 bytes. For instance:

a in ASCII = 61

a in unicode = 0061

This is just a bit of rambling background, my real question is this.

If I have these unicode characters in a file:
ʌʌʌʌʌʌʌʌʌʌʌ
ʒʒʒʒʒʒʒʒʒʒʒ

Their Unicode values are:
U+028C
U+0292

but if I hexdump them or open them in ghex or bless I get this:

0000000 8cca 8cca 8cca 8cca 8cca 8cca 8cca 8cca
0000010 8cca 8cca 8cca ca0a ca92 ca92 ca92 ca92
0000020 ca92 ca92 ca92 ca92 ca92 ca92 2092 0a0a

I am in Canada, I don't know what all the extra ca characters are. Are they my locale? Why would they be there....

Could someone help me figure this out?

Thanks for reading my post-Patrick

syg00 · 08-27-2016, 09:27 PM

Did you bother to search online? A quick search got me this - I know naught of the innards of UTF, but that seems a reasonable explanation for such as I.

astrogeek · 08-27-2016, 09:41 PM

The link provided by syg00 is a good, brief history of the development of Unicode and how characters are encoded in the various flavors.

Probably what you are interested in is about mid-page in the section UTF-8.

When looking at UTF-8 as hex values, remember that a single character can be from one to four bytes long - they are not always the same length - that is one of the design goals of UTF-8 unicode encoding, to use the least storage possible.

A byte whose high bit is 1 (first hex digit >= 8) is part of a multi-byte Unicode character. A byte whose high bit is 0 is plain ASCII, which is also valid Unicode. For a byte that starts a multi-byte character, the leading bits tell you the length: 110 = 2 bytes, 1110 = 3 bytes, 11110 = 4 bytes. Bytes beginning with 10 are the trailing bytes of a multi-byte character, called continuation bytes.
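A minimal Python sketch of the bit rules above, for illustration only. The helper name `classify` is made up here and is not part of any library; the sample string is just a short stand-in for the file in the first post.

```python
def classify(byte: int) -> str:
    """Describe one UTF-8 byte by its leading bits (hypothetical helper)."""
    if (byte & 0b1000_0000) == 0:
        return "ascii"            # 0xxxxxxx: one-byte character
    if (byte & 0b1100_0000) == 0b1000_0000:
        return "continuation"     # 10xxxxxx: trailing byte
    if (byte & 0b1110_0000) == 0b1100_0000:
        return "lead-2"           # 110xxxxx: starts a 2-byte character
    if (byte & 0b1111_0000) == 0b1110_0000:
        return "lead-3"           # 1110xxxx: starts a 3-byte character
    if (byte & 0b1111_1000) == 0b1111_0000:
        return "lead-4"           # 11110xxx: starts a 4-byte character
    return "invalid"              # 11111xxx never appears in UTF-8


sample = "ʌ\nʒa".encode("utf-8")  # file bytes: ca 8c 0a ca 92 61
for b in sample:
    print(f"{b:02x} {b:08b} {classify(b)}")
```

Run on the sample, every ca shows up as the first byte of a 2-byte character and every 8c or 92 as a continuation byte, which is where all the "extra" ca values come from.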

You can figure it out from there!

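One final note: hexdump's default output groups the bytes into 16-bit words in the machine's byte order. Your example shows 8cca for the bytes ca 8c, so the pairs are swapped, which is what a little-endian machine such as x86 or x86-64 produces. Try hexdump -C to see the bytes in file order.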
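For anyone who wants to check the byte order question, here is a rough Python sketch that formats the same bytes as 16-bit words both ways. The function `words` is a hypothetical helper that only approximates hexdump's default display; the zero padding of an odd trailing byte is an assumption.

```python
import sys

data = "ʌʌ\nʒa".encode("utf-8")     # file bytes: ca 8c ca 8c 0a ca 92 61


def words(data: bytes, byteorder: str) -> str:
    """Format bytes as 16-bit hex words in the given byte order."""
    if len(data) % 2:
        data += b"\x00"             # assumption: pad an odd trailing byte
    pairs = (data[i:i + 2] for i in range(0, len(data), 2))
    return " ".join(f"{int.from_bytes(p, byteorder):04x}" for p in pairs)


print("file order   :", " ".join(f"{b:02x}" for b in data))
print("little-endian:", words(data, "little"))
print("big-endian   :", words(data, "big"))
print("this machine :", sys.byteorder)
```

The little-endian line gives the 8cca and ca0a pairs seen in the first post, while the big-endian line keeps the bytes in file order.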

ondoho · 08-28-2016, 04:34 AM

halfmaddad, what has hex got to do with unicode?
are you just curious, or why do you need to edit text files with a hex editor??? :scratchhead:

maybe if you tell us what the actual problem is (and not what you think might be an attempt at a solution), we might be able to help.

HalfMadDad · 08-28-2016, 06:15 AM

Thanks very much astrogeek!

This header part is what I was missing. Your post explains it nicely as does this youtube video:

https://www.youtube.com/watch?v=MijmeoH9LT4

Looking at the binary value in a hex editor it now makes perfect sense. If the first 3 bits are 110 there will be one more byte to follow, and the last 5 bits of the first byte will be part of the value. This makes sense, but it also means that the first byte will not match up nicely between the hex editor and a Unicode chart.
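The arithmetic for one character can be written out as a small Python sketch. It uses the two bytes ca 8c from the hexdump and only handles the 2-byte case; the variable names are just for illustration.

```python
lead, trail = 0xCA, 0x8C            # the two bytes of ʌ as stored in the file

assert lead >> 5 == 0b110           # 110xxxxx: one more byte follows
assert trail >> 6 == 0b10           # 10xxxxxx: continuation byte

high = lead & 0b0001_1111           # last 5 bits of the first byte: 01010
low = trail & 0b0011_1111           # last 6 bits of the second byte: 001100
code_point = (high << 6) | low      # 01010 001100 = 0x28c

print(f"U+{code_point:04X}", chr(code_point))
# Cross-check against Python's own UTF-8 decoder
print(bytes([lead, trail]).decode("utf-8") == chr(code_point))
```

The 11 payload bits are spread over both bytes, which is why ca 8c in the editor looks nothing like 02 8C on the chart.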

Have a great day-Patrick

HalfMadDad · 08-28-2016, 06:17 AM

Hi ondoho

I don't need to edit in a hex editor but I felt it was a good way to pick apart low level topics like this, thanks for your post

syg00 · 08-28-2016, 06:45 AM

Quote:

Originally Posted by HalfMadDad

This makes sense, but it also means that the first byte will not match up nicely between the hex editor and a Unicode chart.

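Don't neglect the byte reversal in little Endian - I found that article I linked very informative. Also *all* bytes of a multi-byte UTF-8 character have the high-order bit set - strip the marker bits (110 from the first byte, 10 from the second) from the values you see in hexedit and join what is left.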
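As a rough check of both points, this Python sketch takes the hexdump lines copied from the first post, undoes the swap inside each 16-bit word (assuming the dump came from a little-endian machine), and decodes the result as UTF-8.

```python
dump = """\
0000000 8cca 8cca 8cca 8cca 8cca 8cca 8cca 8cca
0000010 8cca 8cca 8cca ca0a ca92 ca92 ca92 ca92
0000020 ca92 ca92 ca92 ca92 ca92 ca92 2092 0a0a
"""

raw = bytearray()
for line in dump.splitlines():
    for word in line.split()[1:]:                    # skip the offset column
        raw += int(word, 16).to_bytes(2, "little")   # undo the word swap

text = raw.decode("utf-8")                           # strips the marker bits
print(repr(text))
print([f"U+{ord(c):04X}" for c in sorted(set(text))])
```

If the assumption about byte order were wrong, the decode step would most likely fail, since a continuation byte such as 8c cannot start a character.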