LinuxQuestions.org
Old 08-27-2016, 08:51 PM   #1
HalfMadDad
Member
 
Registered: Jun 2010
Location: Canada
Distribution: Slackware, systemd-garbage
Posts: 143

Rep: Reputation: 9
Help Reading Data in Hex Editors and Commands


Hi Everyone

I am trying to learn more about Unicode. I am trying to work with a file that contains IPA (International Phonetic Alphabet) text. It seems to be a mess, with some characters stored as 1-byte ASCII and some as 2-byte Unicode, and I am just trying to sort things out.

My understanding is that the various Unicode character sets begin with the ASCII characters, so it might just be a case of padding the 1-byte ASCII values with leading zeros to make all characters 2 bytes. For instance:

a in ASCII = 61

a in unicode = 0061

This is just a bit of rambling background, my real question is this.

If I have these unicode characters in a file:
ʌʌʌʌʌʌʌʌʌʌʌ
ʒʒʒʒʒʒʒʒʒʒʒ

Their Unicode values are:
U+028C
U+0292

but if I hexdump them or open them in ghex or bless I get this:

0000000 8cca 8cca 8cca 8cca 8cca 8cca 8cca 8cca
0000010 8cca 8cca 8cca ca0a ca92 ca92 ca92 ca92
0000020 ca92 ca92 ca92 ca92 ca92 ca92 2092 0a0a

I am in Canada, I don't know what all the extra ca characters are. Are they my locale? Why would they be there....

Could someone help me figure this out?

Thanks for reading my post-Patrick

Last edited by HalfMadDad; 08-27-2016 at 09:20 PM.
 
Old 08-27-2016, 09:27 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,133

Rep: Reputation: 4121
Did you bother to search online? A quick search got me this - I know naught of the innards of UTF, but that seems a reasonable explanation for such as I.
 
1 members found this post helpful.
Old 08-27-2016, 09:41 PM   #3
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,269
Blog Entries: 24

Rep: Reputation: 4196
The link provided by syg00 is a good, brief history of the development of Unicode and how characters are encoded in the various flavors.

Probably what you are interested in is about mid-page in the section UTF-8.

When looking at UTF-8 as hex values, remember that a single character can be from one to four bytes long; they are not always the same length. That is one of the design goals of UTF-8: plain ASCII text stays at one byte per character, and longer sequences are used only where needed.

Bytes with 1 in the high bit (first hex digit >= 8) belong to multi-byte Unicode characters. Bytes with 0 in the high bit are plain ASCII (which is also valid UTF-8). The leading bits of the first byte of a multi-byte character tell you how many bytes it has: 110 = 2 bytes, 1110 = 3 bytes, 11110 = 4 bytes. Bytes beginning with 10 are the trailing bytes of a multi-byte character, called continuation bytes.
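
To make the bit patterns concrete, here is a minimal Python sketch that labels each byte by its leading bits. The function name `classify_utf8_byte` and its labels are made up for illustration, and it only inspects the prefix bits; it does not reject lead bytes that are never valid in UTF-8.

```python
def classify_utf8_byte(b: int) -> str:
    """Label one byte (0-255) by its leading bits, as described above."""
    if b & 0b1000_0000 == 0:
        return "ascii"          # 0xxxxxxx: single-byte character
    if b & 0b1100_0000 == 0b1000_0000:
        return "continuation"   # 10xxxxxx: trailing byte
    if b & 0b1110_0000 == 0b1100_0000:
        return "lead-2"         # 110xxxxx: first byte of a 2-byte character
    if b & 0b1111_0000 == 0b1110_0000:
        return "lead-3"         # 1110xxxx: first byte of a 3-byte character
    if b & 0b1111_1000 == 0b1111_0000:
        return "lead-4"         # 11110xxx: first byte of a 4-byte character
    return "invalid"            # 11111xxx never appears in UTF-8


# The two IPA characters from the first post, separated by a newline.
data = "ʌ\nʒ".encode("utf-8")
for b in data:
    print(f"{b:02x}  {b:08b}  {classify_utf8_byte(b)}")
```

Run on the sample text, every `ca` comes out as a 2-byte lead byte and every `8c` or `92` as a continuation byte, which is where the repeated `ca` in the dump comes from.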

You can figure it out from there!

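One final note: hexdump's default output shows 16-bit words in the host's byte order, so on a little-endian machine such as x86 or x86-64 each pair of bytes appears swapped. The bytes ca 8c in the file are displayed as 8cca. Running hexdump -C prints the bytes one at a time in file order.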
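
On the byte-order question, the following Python sketch imitates how a two-byte word display relates to the bytes actually stored in the file. The helpers `byte_view` and `word_view` are hypothetical names, and padding an odd-length buffer with a zero byte is an assumption made here to keep the example simple.

```python
def byte_view(buf: bytes) -> str:
    """Bytes in file order, one at a time."""
    return " ".join(f"{b:02x}" for b in buf)


def word_view(buf: bytes, byteorder: str = "little") -> str:
    """Bytes grouped into 16-bit words, read in the given byte order."""
    if len(buf) % 2:
        buf += b"\x00"  # assumption: pad the final odd byte with zero
    words = [int.from_bytes(buf[i:i + 2], byteorder)
             for i in range(0, len(buf), 2)]
    return " ".join(f"{w:04x}" for w in words)


data = "ʌ\nʒʒ".encode("utf-8")
print(byte_view(data))   # ca 8c 0a ca 92 ca 92
print(word_view(data))   # 8cca ca0a ca92 0092
```

Note that after the newline the characters start at an odd offset, so a word such as `ca92` is built from the second byte of one character and the first byte of the next. The same thing happens after the `0a` in the dump from the first post.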

Last edited by astrogeek; 08-27-2016 at 10:00 PM.
 
1 members found this post helpful.
Old 08-28-2016, 04:34 AM   #4
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
halfmaddad, what has hex got to do with unicode?
are you just curious, or why do you need to edit text files with a hex editor??? :scratchhead:

maybe if you tell us what the actual problem is (and not what you think might be an attempt at a solution), we might be able to help.
 
Old 08-28-2016, 06:15 AM   #5
HalfMadDad
Member
 
Registered: Jun 2010
Location: Canada
Distribution: Slackware, systemd-garbage
Posts: 143

Original Poster
Rep: Reputation: 9
Thanks very much astrogeek!

This header part is what I was missing. Your post explains it nicely as does this youtube video:

https://www.youtube.com/watch?v=MijmeoH9LT4

Looking at the binary value in a hex editor it now makes perfect sense. If the first 3 bits are 110, one more byte follows, and the last 5 bits of the first byte are part of the value. This makes sense but it doesn't mean that the first value will match up nicely between the hex editor and a Unicode chart.
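
The arithmetic for a two-byte character can be sketched in Python as follows. `decode_two_byte` is an illustrative helper, not a library function; it takes the last 5 bits of the lead byte and the last 6 bits of the continuation byte and joins them.

```python
def decode_two_byte(lead: int, cont: int) -> int:
    """Recover the code point from a 2-byte UTF-8 sequence."""
    if lead >> 5 != 0b110:
        raise ValueError("lead byte does not start with 110")
    if cont >> 6 != 0b10:
        raise ValueError("continuation byte does not start with 10")
    payload_hi = lead & 0b0001_1111   # last 5 bits of the first byte
    payload_lo = cont & 0b0011_1111   # last 6 bits of the second byte
    return (payload_hi << 6) | payload_lo


# ca = 110 01010, 8c = 10 001100  ->  01010 001100 = 0x28c
cp = decode_two_byte(0xCA, 0x8C)
print(f"U+{cp:04X}", chr(cp))   # U+028C ʌ
```

This is why `ca 8c` in the hex editor and 028C in the chart look unrelated at first glance: the chart value is spread across the payload bits of both bytes.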

Have a great day-Patrick
 
Old 08-28-2016, 06:17 AM   #6
HalfMadDad
Member
 
Registered: Jun 2010
Location: Canada
Distribution: Slackware, systemd-garbage
Posts: 143

Original Poster
Rep: Reputation: 9
Hi ondoho

I don't need to edit in a hex editor, but I felt it was a good way to pick apart low-level topics like this. Thanks for your post.
 
Old 08-28-2016, 06:45 AM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,133

Rep: Reputation: 4121
Quote:
Originally Posted by HalfMadDad
This makes sense but it doesn't mean that the first value will match up nicely between the hex editor and a Unicode chart.
Don't neglect the byte reversal in little Endian - I found that article I linked very informative. Also *all* bytes of a multi-byte UTF-8 character have the high-order bit set - strip the marker bits (110 from the lead byte, 10 from each trailing byte) from the values you see in hexedit.
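
Going the other way, from the chart value to the bytes you should expect in the editor, might look like this in Python. `encode_two_byte` is a hypothetical helper that only covers the two-byte range U+0080 to U+07FF, and the result is checked against Python's built-in encoder.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the 2-byte UTF-8 range")
    lead = 0b1100_0000 | (code_point >> 6)           # 110 + top 5 bits
    cont = 0b1000_0000 | (code_point & 0b0011_1111)  # 10 + low 6 bits
    return bytes([lead, cont])


for cp in (0x028C, 0x0292):
    encoded = encode_two_byte(cp)
    # Cross-check against the standard library's UTF-8 encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```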
 
  



All times are GMT -5.
