Old 11-14-2012, 05:11 AM   #1
wakatana
Member
 
Registered: Jul 2009
Location: Slovakia
Posts: 133

Rep: Reputation: 16
Definitive guide to encoding


Hello gurus, I would like to get deep into the charset and encoding issue; I also tried to google it, but no luck. Please see below.

My configuration
Code:
[pista@HP-PC MULTIBOOT]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
I have file1, containing text. I am able to see this text correctly only on M$ Windows. If I just open the file with less, cat or vi, I get this:
Code:
[pista@HP-PC konvertovanie]$ cat file1 
- Prich�dzaj�.
- Kto prich�dza?
N�� svet okupuj
vyvinut� �udsk� druhy,

[pista@HP-PC konvertovanie]$ less file1 
- Prich<E1>dzaj<FA>.
- Kto prich<E1>dza?
N<E1><9A> svet okupuj<FA>
vyvinut<E9> <BE>udsk<E9> druhy,

[pista@HP-PC konvertovanie]$ vi file1 
- Prichádzajú.
- Kto prichádza?
Ná<9a> svet okupujú
vyvinuté ľudské druhy,
Under Linux I have to use iconv to see it correctly:
Code:
[pista@HP-PC konvertovanie]$ iconv -f WINDOWS-1250 -t UTF-8 file1
- Prichádzajú.
- Kto prichádza?
Náš svet okupujú
vyvinuté ľudské druhy,
I understand that this is because the file was encoded in one format (WINDOWS-1250) and decoded in another (UTF-8). But can you clarify the following?

1.) When I check the decimal ASCII value of each character, I get the following lines. So what do the negative values mean, and what is that code 341 (instead of á)? AFAIK ASCII goes from 0 to 127.
Code:
[pista@HP-PC konvertovanie]$ cat file1 | od -An -t dC -c
   45   32   80  114  105   99  104  -31  100  122   97  106   -6   46   13   10
    -         P    r    i    c    h  341    d    z    a    j  372    .   \r   \n
   45   32   75  116  111   32  112  114  105   99  104  -31  100  122   97   63
    -         K    t    o         p    r    i    c    h  341    d    z    a    ?
   13   10   78  -31 -102   32  115  118  101  116   32  111  107  117  112  117
   \r   \n    N  341  232         s    v    e    t         o    k    u    p    u
  106   -6   13   10   48   48   58   48   48   58   48   53   44   56   50   48
    j  372   \r   \n    0    0    :    0    0    :    0    5    ,    8    2    0
   32   45   45   62   32   48   48   58   48   48   58   48   55   44   54   53
         -    -    >         0    0    :    0    0    :    0    7    ,    6    5
   52   13   10  118  121  118  105  110  117  116  -23   32  -66  117  100  115
    4   \r   \n    v    y    v    i    n    u    t  351       276    u    d    s
  107  -23   32  100  114  117  104  121   44   13   10
    k  351         d    r    u    h    y    ,   \r   \n
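
For comparison, here is a minimal sketch (assuming GNU od) that prints the same bytes as unsigned decimal and as octal values next to the characters, which makes the numbers easier to compare:
Code:
# -t u1 = unsigned decimal bytes (0-255), -t o1 = octal bytes, -c = characters
$ od -An -t u1 -t o1 -c file1 | head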
2.) My assumption is that UTF-8 and WINDOWS-1250 use different "numbers" (code representations) for the same characters. So if some character is encoded using encoding1 (WINDOWS-1250), it gets the appropriate "code1" from the encoding1 table. If this encoded character (or rather its numeric representation, "code1") is then decoded using another encoding (UTF-8), the only thing that happens is that "code1" is looked up in the encoding2 (UTF-8) table and the appropriate character from the encoding2 table is assigned. Am I right? I think it will be clear after an example:

Please look at the following sites; they show what happens if you encode with one encoding and decode with another. It seems that up to the 127 (decimal) boundary it does not matter if you decode with the wrong encoding (this is why some characters in the example above were displayed correctly even when the wrong encoding was used).

from UTF-8 to WINDOWS-1250
http://www.string-functions.com/enco...&decoding=1250

from WINDOWS-1250 to UTF-8
http://www.string-functions.com/enco...decoding=65001

According to this site http://doc.infosnel.nl/extreme_utf-8.html, the "á" character is encoded in UTF-8 as 225. According to Wikipedia http://en.wikipedia.org/wiki/Windows-1250, "á" also has the value 225 in Windows-1250. So why is "á" not displayed correctly even when I use the wrong encoding? Check here and type "á": http://www.string-functions.com/encodedecode.aspx. Also an interesting observation: in the UTF-8 table the "š" character appears twice (once with code 154 and once with code 453). Why?
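
Here is a small sketch of what the 225 / "á" situation looks like at the byte level (assuming a UTF-8 terminal plus GNU iconv and od; the commands are only illustrative):
Code:
# "á" typed in a UTF-8 terminal is the two bytes c3 a1; the number 225 (0xE1)
# is the Unicode codepoint / Windows-1250 byte, not the UTF-8 byte sequence
$ printf 'á' | od -An -t x1
# re-encoded as Windows-1250 it becomes the single byte e1
$ printf 'á' | iconv -f UTF-8 -t WINDOWS-1250 | od -An -t x1
# reading the UTF-8 bytes as if they were Windows-1250 looks up each byte
# separately, so two unrelated characters come out instead of "á"
$ printf 'á' | iconv -f WINDOWS-1250 -t UTF-8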

3.) If I understand it right, there is no way to tell how a file was encoded (unless there is some header that specifies this, or you do some statistical language analysis, etc.). So why/how does the "file" command recognize the UTF-8 encoding but not WINDOWS-1250?
Code:
[pista@HP-PC konvertovanie]$ file -bi file1 
text/plain; charset=unknown-8bit
[pista@HP-PC konvertovanie]$ iconv -f WINDOWS-1250 -t UTF-8 file1 > file1.utf8
[pista@HP-PC konvertovanie]$ file -bi file1.utf8 
text/plain; charset=utf-8
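
One way to see the asymmetry is a sketch like this (assuming GNU iconv): "valid UTF-8" is a property you can test mechanically, while almost any byte stream passes as Windows-1250:
Code:
# UTF-8 multi-byte sequences must follow a strict lead-byte/continuation-byte
# pattern, so this check fails on file1 ...
$ iconv -f UTF-8 -t UTF-8 file1 > /dev/null
# ... while this succeeds, but it would also succeed for most other byte
# streams (nearly every byte value is assigned in Windows-1250), so it proves little
$ iconv -f WINDOWS-1250 -t WINDOWS-1250 file1 > /dev/null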
Thank you very much
 
Old 11-14-2012, 09:15 AM   #2
sundialsvcs
Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 5,324

Rep: Reputation: 1099
The thing that you're looking for is referred to as the locale. This is the set of information that describes where in the world you are located. It affects not only character representation but also things like commas vs. periods in numbers.

When you display a file in less, cat, or vi and see garbage characters, the actual problem is that the application (and the terminal) does not know which character-set table to use: how to interpret this stream of bytes as a stream of characters and draw the right symbols on the screen. There are (of course) different ways to do this, but most apps will look for and understand the locale setting, at least as their source of defaults.
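
A quick sketch of that, assuming the terminal itself is running with the en_US.UTF-8 locale shown above:
Code:
# cat/less only hand over bytes; the terminal decodes them using the locale.
# 0xC3 0xA1 is the UTF-8 encoding of "á", so it renders correctly:
$ printf '\303\241\n'
# the single byte 0xE1 ("á" in Windows-1250) is not a valid UTF-8 sequence,
# so the terminal shows a replacement/garbage symbol instead:
$ printf '\341\n'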

Web pages are supposed to identify the character set that has been used to encode their content, but once again there is a default that the browser will look for and/or allow you to set.
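
As a rough sketch (any URL will do; the exact headers depend on the server), the declared character set is usually visible in the HTTP Content-Type header and/or a <meta> tag in the page:
Code:
# the server may announce the encoding in the Content-Type response header ...
$ curl -sI http://www.linuxquestions.org/ | grep -i '^content-type'
# ... and/or the page may declare it in its <head>
$ curl -s http://www.linuxquestions.org/ | grep -io '<meta[^>]*charset[^>]*>'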

No, there isn't a definitive way for the computer to examine the binary contents of a file and somehow deduce what character encoding has been used. It certainly wouldn't be expected to know "American English" vs. "British English," or whether 12,345 means twelve thousand or twelve point three. This is purely contextual information, which must be provided. The locale concept is how this is done.
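
A small sketch of how much the locale changes, assuming the sk_SK.UTF-8 locale has been generated on the system (substitute any locale that locale -a lists):
Code:
# the same date formatted according to two different locales
$ LC_TIME=en_US.UTF-8 date -d 2012-11-14 +%x   # month/day/year ordering
$ LC_TIME=sk_SK.UTF-8 date -d 2012-11-14 +%x   # day.month.year ordering
# list the locales actually available on this machine
$ locale -a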

 
Old 11-14-2012, 09:17 AM   #3
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC
Posts: 618

Rep: Reputation: 136
"file" uses magic numbers from /usr/share/magic for redhat or http://linux.die.net/man/1/file. So it probably doesn't have the appropriate entries for WINDOZE-1250 whereas iconv probably has it.

It's actually more complex than just the magic numbers; the details are in the URL mentioned above.
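
As a sketch (the paths differ between distributions, so treat them as examples), file can tell you which compiled magic database it is using, and it can be asked for the character set directly:
Code:
# reports the version and the magic file(s) in use
# (e.g. /usr/share/misc/magic.mgc on many systems)
$ file --version
# ask only for the detected MIME encoding of the file
$ file -b --mime-encoding file1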

OK
 
Old 11-14-2012, 11:28 AM   #4
DavidMcCann
Senior Member
 
Registered: Jul 2006
Location: London
Distribution: CentOS, Salix
Posts: 3,008

Rep: Reputation: 774
There's something odd going on here. If the original file were encoded in Windows-1250, "á" would display correctly, but "ľ" would come out as a different character!

The question mark is used to show that the character number is meaningless in Unicode, like the negative numbers you have. I suspect the original was in the Windows implementation of Unicode, UTF-16, and something is going wrong as it's converted to UTF-8, generating those negative numbers.
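
If UTF-16 is the suspicion, that is easy to check with a sketch like this (assuming GNU head/od): UTF-16 text from Windows normally starts with a byte-order mark and stores plain Latin letters with interleaved zero bytes:
Code:
# a UTF-16LE file typically begins with the BOM ff fe, and every ASCII letter
# is followed by a 00 byte; neither shows up in file1's dump
$ head -c 32 file1 | od -An -t x1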
 
Old 11-14-2012, 10:25 PM   #5
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,426

Rep: Reputation: 821
Quote:
Originally Posted by wakatana View Post
1.) When I check the decimal ASCII value of each character, I get the following lines. So what do the negative values mean, and what is that code 341 (instead of á)? AFAIK ASCII goes from 0 to 127.
ASCII is indeed from 0-127, but od looks at one byte at a time, and a byte ranges from 0 to 255, or from -128 to 127 if you use a signed interpretation. 341 is the unsigned value printed in octal: 341 octal = 225 decimal, and 225 - 256 = -31 in the signed interpretation.
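
A sketch of the same byte shown three ways (assuming GNU od):
Code:
# one byte, 0xE1, printed as signed decimal, unsigned decimal and octal
$ printf '\341' | od -An -t dC -t u1 -t o1
# signed:   -31  (the high bit is set, so the signed view goes negative)
# unsigned: 225
# octal:    341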
Quote:
2.) My assumption is that UTF-8 and WINDOWS-1250 use different "numbers" (code representations) for the same characters. So if some character is encoded using encoding1 (WINDOWS-1250), it gets the appropriate "code1" from the encoding1 table. If this encoded character (or rather its numeric representation, "code1") is then decoded using another encoding (UTF-8), the only thing that happens is that "code1" is looked up in the encoding2 (UTF-8) table and the appropriate character from the encoding2 table is assigned. Am I right?
It's not quite that simple. The character-to-number mapping is different between UTF-8 and WINDOWS-1250, but UTF-8 also encodes the numbers differently. WINDOWS-1250 has a much smaller range of characters, so it can use a simple one-byte-is-one-character encoding. UTF-8 has to encode the whole Unicode character set, so it can't simply encode every character in one byte. The codepoints from 0-127 are encoded in 1 byte (for compatibility with ASCII), but higher codepoints take multiple bytes.
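
A sketch that makes the difference visible, using a character from the file (assuming a UTF-8 terminal plus GNU iconv and od):
Code:
# "ľ" is Unicode codepoint U+013E; Windows-1250 stores it as the single byte
# 0xBE (the 190/-66 in the dump above), while UTF-8 needs two bytes, c4 be
$ printf 'ľ' | iconv -f UTF-8 -t WINDOWS-1250 | od -An -t x1
$ printf 'ľ' | od -An -t x1
# only codepoints 0-127 are encoded identically in both, which is why the
# plain ASCII letters in file1 were readable even with the wrong charset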
 
Old 11-15-2012, 08:31 AM   #6
sundialsvcs
Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 5,324

Rep: Reputation: 1099
Quote:
Originally Posted by AnanthaP View Post
"file" uses magic numbers from /usr/share/magic for redhat or http://linux.die.net/man/1/file. So it probably doesn't have the appropriate entries for WINDOZE-1250 whereas iconv probably has it.

It's actually more complex than just the magic numbers; the details are in the URL mentioned above.
This command generally relies on the known "magic number" characteristics of, say, executables and images and library files of various sorts. It is a much more difficult, nay, intractable problem to figure out what character encoding (or human language) a text file might use. You recognize, yes, that there is a non-ASCII character sequence within the first few bytes of the file (too bad if there are none ...), but what is the correct interpretation of 0x010203? You can only know this if the file tells you, or if a system (locale ...) gives you an assumption that you can use.
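
A sketch of that asymmetry with two tiny throw-away files (the file names are made up; assuming bash's printf and a file version that reports MIME charsets, as used above):
Code:
# the same Slovak word written once as UTF-8 (two-byte sequences for á and ú)
# and once as Windows-1250 (single bytes 0xE1 and 0xFA)
$ printf 'Prich\303\241dzaj\303\272\n' > test-utf8.txt
$ printf 'Prich\341dzaj\372\n' > test-cp1250.txt
# the UTF-8 byte pattern is self-identifying, the lone high bytes are not;
# expect "charset=utf-8" for the first and "charset=unknown-8bit" for the second
$ file -bi test-utf8.txt
$ file -bi test-cp1250.txt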
 
  

