Help Reading Data in Hex Editors and Commands
Hi Everyone
I am trying to learn more about Unicode. I am trying to work with a file that has IPA (International Phonetic Alphabet) characters. It seems to be a mess, with some characters as 1-byte ASCII and some as 2-byte Unicode, and I am just trying to sort things out. My understanding is that the various Unicode sets start with the ASCII characters, so it might just be a case of padding the 1-byte ASCII values with leading zeros to make all characters 2 bytes. For instance:

a in ASCII = 61
a in Unicode = 0061

This is just a bit of rambling background; my real question is this. If I have these Unicode characters in a file:

ʌʌʌʌʌʌʌʌʌʌʌ
ʒʒʒʒʒʒʒʒʒʒʒ

Their Unicode values are:

U+028C
U+0292

but if I hexdump them or open them in ghex or bless I get this:

0000000 8cca 8cca 8cca 8cca 8cca 8cca 8cca 8cca
0000010 8cca 8cca 8cca ca0a ca92 ca92 ca92 ca92
0000020 ca92 ca92 ca92 ca92 ca92 ca92 2092 0a0a

I am in Canada, and I don't know what all the extra ca characters are. Are they my locale? Why would they be there?

Could someone help me figure this out?

Thanks for reading my post - Patrick |
Did you bother to search online? A quick search got me this. I know naught of the innards of UTF, but that seems a reasonable explanation for such as I.
|
The link provided by syg00 is a good, brief history of the development of Unicode and how characters are encoded in the various flavors.
Probably what you are interested in is about mid-page, in the UTF-8 section. When looking at UTF-8 as hex values, remember that a single character can be from one to four bytes long; characters are not always the same length. That is one of the design goals of the UTF-8 encoding: to use as little storage as possible. A byte with 0 in the high bit (hex value below 80) is a plain ASCII character, which is also valid Unicode. A byte with 1 in the high bit (first hex digit 8 or above) is part of a multi-byte character. The leading bits of the first byte tell you how many bytes the character uses: 110 = 2 bytes, 1110 = 3 bytes, 11110 = 4 bytes. Bytes beginning with 10 are the trailing bytes of a multi-byte character, called continuation (data) bytes. You can figure it out from there! One final note: plain hexdump groups the bytes into 16-bit words in the machine's byte order, so on little-endian hardware such as x86 or x86-64 the bytes ca 8c in the file are displayed swapped, as 8cca. Use hexdump -C to see the bytes in file order. |
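The lead-byte rule described above can be sketched in a few lines of Python. This is only an illustration, not a full validator: the function names `utf8_sequence_length` and `split_utf8` are made up for this example, and the sample bytes are the two IPA characters from the question followed by an ASCII "a" and a newline.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 character uses, judging by its first byte."""
    if lead_byte & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx: plain ASCII
    if lead_byte & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx: one continuation byte follows
    if lead_byte & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx: two continuation bytes follow
    if lead_byte & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx: three continuation bytes follow
    # 10xxxxxx is a continuation byte and cannot start a character
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#04x}")


def split_utf8(data: bytes) -> list:
    """Split raw bytes into one chunk per character, using only the lead bytes."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# Bytes in file order: ca 8c (U+028C), ca 92 (U+0292), 61 ("a"), 0a (newline)
sample = bytes.fromhex("ca8c ca92 61 0a")
for chunk in split_utf8(sample):
    print(chunk.hex(" "), "->", repr(chunk.decode("utf-8")))
```

In this sketch every "ca" in the dump is classified as a 2-byte lead byte (binary 11001010), which is why it shows up in front of each IPA character.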
halfmaddad, what has hex got to do with Unicode?
Are you just curious, or why do you need to edit text files with a hex editor? Maybe if you tell us what the actual problem is (and not what you think might be an attempt at a solution), we might be able to help. |
Thanks very much astrogeek!
This header part is what I was missing. Your post explains it nicely, as does this YouTube video: https://www.youtube.com/watch?v=MijmeoH9LT4 Looking at the binary value in a hex editor, it now makes perfect sense. If the first 3 bits are 110, there will be one more byte to follow, and the last 5 bits of the first byte will be part of the value. This makes sense, but it does mean that the raw bytes in the hex editor will not match up nicely with the value in a Unicode chart. Have a great day - Patrick |
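To make the bit arithmetic concrete, here is a minimal Python sketch for the two-byte case only. The helper name `decode_two_byte` is invented for this illustration; in practice `bytes.decode("utf-8")` does the same job and handles all lengths.

```python
def decode_two_byte(lead: int, trail: int) -> int:
    """Rebuild a code point from a 2-byte UTF-8 sequence (110xxxxx 10xxxxxx)."""
    if lead & 0b1110_0000 != 0b1100_0000:
        raise ValueError("first byte does not start with 110")
    if trail & 0b1100_0000 != 0b1000_0000:
        raise ValueError("second byte does not start with 10")
    # Keep the last 5 bits of the lead byte and the last 6 bits of the trail byte
    return ((lead & 0b0001_1111) << 6) | (trail & 0b0011_1111)


for lead, trail in [(0xCA, 0x8C), (0xCA, 0x92)]:
    code_point = decode_two_byte(lead, trail)
    print(f"{lead:02x} {trail:02x} -> U+{code_point:04X} {chr(code_point)}")
```

For ca 8c, the payload bits are 01010 from the first byte and 001100 from the second, which join to 01010001100, or hex 28C, the value listed in the Unicode chart.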
Hi ondoho
I don't need to edit in a hex editor, but I felt it was a good way to pick apart low-level topics like this. Thanks for your post. |