Well, C382 is the hex for the byte pair (195, 130). To convert from UTF-8 to a Unicode code point you first need to know the number of bytes, in this case 2. Look at it in binary: 11000011 10000010. In two-byte UTF-8 this is converted using the following pattern:
UTF-8 110yyyyy 10zzzzzz
UNICODE yyyyyzzzzzz
Thus 11000011 10000010 will convert to
yyyyyzzzzzz
00011000010 = 0xC2 = 194
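The same bit-shuffling can be sketched in Python. This is only an illustration: the function name decode_two_byte is made up here, and it skips checks a real decoder would do (for example rejecting overlong sequences). The last line cross-checks the result against Python's built-in UTF-8 decoder.

```python
def decode_two_byte(b1, b2):
    """Decode a two-byte UTF-8 sequence (110yyyyy 10zzzzzz) to a code point."""
    if b1 & 0b11100000 != 0b11000000:
        raise ValueError("first byte is not a two-byte lead byte (110yyyyy)")
    if b2 & 0b11000000 != 0b10000000:
        raise ValueError("second byte is not a continuation byte (10zzzzzz)")
    y = b1 & 0b00011111   # keep the five y bits
    z = b2 & 0b00111111   # keep the six z bits
    return (y << 6) | z   # glue them together as yyyyyzzzzzz

code_point = decode_two_byte(0xC3, 0x82)
print(hex(code_point), code_point)                 # 0xc2 194

# Cross-check with the built-in decoder: both should give code point 194.
print(ord(bytes([0xC3, 0x82]).decode("utf-8")))    # 194
```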
My guess is that the original file is ISO-8859-1; open it with that encoding and see what it looks like.
Basically, there isn't any difference in how that specific character is displayed, whether through LC_ALL=en_US.ISO-8859-1 (latin-1)
or
LC_ALL=en_US.UTF8 (utf-8)
What's puzzling me is: if I am copying a specific portion of the file, including the diacritic characters, how is that converted from latin1 to utf8 format? (giving the value c382 ==> 195, 130)
And one more confirmation,
utf-8
latin-1
are character encodings,
and a Unicode code point is the number given to a character, whichever of those encodings it is stored in.
So, basically, UCS (Universal Character Set) comprises all of these character sets. (Therefore every character in them can be represented in UCS.)
The idea of Unicode is to have one scheme that can hold all known character sets. Every code point fits into 4 bytes, which allows for a huge number of characters. There would be a lot of waste if all documents were always stored using 4 bytes per character, so different encoding schemes have been developed to address this issue.
UTF-8 holds traditional ASCII (128 characters, 7 bits) in a single byte. Then come the more common character sets (European etc.), which fit into two bytes, then other scripts in three bytes (for example, the Tibetan script is 3 bytes). The most common Chinese characters also land in the 3-byte region; because of the script's sheer size, the rarer ones in the later extension blocks spill into the 4-byte region.
With this scheme comes a cost, in that bits are reserved to identify whether the character is one, two, three or more bytes long. If the leftmost bit is a zero then it is a single-byte character. For two-byte characters five bits are reserved: 110 on the first byte and then 10 on the next byte. For three-byte characters eight bits are required: 1110 on the first byte and then 10 on each of the two subsequent bytes.
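To make the length rule concrete, here is a small Python sketch that reads the sequence length off the first byte. The helper name utf8_sequence_length and the sample code points (Latin A, e-acute, Tibetan letter KA, a common Chinese ideograph, and one from CJK Extension B) are my own picks for illustration, not anything from the posts above.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence has, judged from its first byte."""
    if first_byte & 0b10000000 == 0b00000000:   # 0xxxxxxx
        return 1
    if first_byte & 0b11100000 == 0b11000000:   # 110xxxxx
        return 2
    if first_byte & 0b11110000 == 0b11100000:   # 1110xxxx
        return 3
    if first_byte & 0b11111000 == 0b11110000:   # 11110xxx
        return 4
    raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")

# Illustrative sample code points, chosen for this sketch only.
samples = (0x41, 0xE9, 0x0F40, 0x4E2D, 0x20000)

for cp in samples:
    encoded = chr(cp).encode("utf-8")
    length = utf8_sequence_length(encoded[0])
    # The length read from the lead byte should match the real encoded length.
    assert length == len(encoded)
    print(f"U+{cp:04X} -> {encoded.hex()} ({length} byte(s))")
```

The lengths come out as 1, 2, 3, 3 and 4 bytes respectively.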
ASCII and ISO-8859-1 predate Unicode, but there is a certain amount of backwards compatibility built in, which is useful (especially with pure ASCII) but can also be confusing!
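A quick Python sketch of both the useful part and the confusing part. It assumes the character in question is code point 194 (0xC2), as worked out earlier in the thread; the sample string "Linux" is just a placeholder.

```python
ascii_text = "Linux"   # placeholder text, any pure-ASCII string would do

# Useful part: pure ASCII text produces identical bytes in all three encodings.
assert ascii_text.encode("ascii") == ascii_text.encode("latin-1") == ascii_text.encode("utf-8")

ch = chr(0xC2)   # code point 194, assumed to be the character from the file

# Same code point, different bytes: one byte in Latin-1, two in UTF-8.
print(ch.encode("latin-1").hex())   # c2
print(ch.encode("utf-8").hex())     # c382

# Confusing part: UTF-8 bytes misread as Latin-1 turn into two characters.
misread = ch.encode("utf-8").decode("latin-1")
print([hex(ord(c)) for c in misread])   # ['0xc3', '0x82']
```

So a Latin-1 byte 0xC2 and the UTF-8 pair C3 82 are the same character, and which bytes you see depends on the encoding that wrote them out.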