Hello gurus, I would like to dig deeper into a charset and encoding issue. I tried googling it, but no luck. Please see below.
My configuration
Code:
[pista@HP-PC MULTIBOOT]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
I have file1, containing text. I can see this text correctly only on MS Windows. If I just open the file with less, cat or vi, I get this:
Code:
[pista@HP-PC konvertovanie]$ cat file1
- Prich�dzaj�.
- Kto prich�dza?
N�� svet okupuj
vyvinut� �udsk� druhy,
[pista@HP-PC konvertovanie]$ less file1
- Prich<E1>dzaj<FA>.
- Kto prich<E1>dza?
N<E1><9A> svet okupuj<FA>
vyvinut<E9> <BE>udsk<E9> druhy,
[pista@HP-PC konvertovanie]$ vi file1
- Prichádzajú.
- Kto prichádza?
Ná<9a> svet okupujú
vyvinuté ľudské druhy,
Under Linux I have to use iconv to see it correctly:
Code:
[pista@HP-PC konvertovanie]$ iconv -f WINDOWS-1250 -t UTF-8 file1
- Prichádzajú.
- Kto prichádza?
Náš svet okupujú
vyvinuté ľudské druhy,
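The same conversion can be sketched outside iconv. This is only a minimal Python 3 sketch, assuming that Python's `cp1250` codec corresponds to what iconv calls WINDOWS-1250. The byte string is the third line of the sample, typed in by hand from the od dump further down, not read from file1.

```python
# Bytes of the third line of file1, copied by hand from the od dump
# (assumption: copied without mistakes).
raw = b"N\xe1\x9a svet okupuj\xfa\r\n"

# Decode: interpret each byte through the Windows-1250 table to get text.
text = raw.decode("cp1250")

# Encode: write the same text out again as UTF-8 bytes.
utf8 = text.encode("utf-8")

print(text.strip())          # Náš svet okupujú
print(len(raw), len(utf8))   # 18 21
```

The three accented letters take one byte each in the original and two bytes each after conversion, which would explain why the converted file is slightly larger.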
I understand that this happens because the file was encoded in one format (WINDOWS-1250) and is being decoded as another (UTF-8). But can you clarify the following?
1.) When I check the decimal value of each byte I get the following lines. So what do the negative values mean, and what is that code 341 (instead of á)? AFAIK ASCII goes from 0 to 127.
Code:
[pista@HP-PC konvertovanie]$ cat file1 | od -An -t dC -c
45 32 80 114 105 99 104 -31 100 122 97 106 -6 46 13 10
- P r i c h 341 d z a j 372 . \r \n
45 32 75 116 111 32 112 114 105 99 104 -31 100 122 97 63
- K t o p r i c h 341 d z a ?
13 10 78 -31 -102 32 115 118 101 116 32 111 107 117 112 117
\r \n N 341 232 s v e t o k u p u
106 -6 13 10 48 48 58 48 48 58 48 53 44 56 50 48
j 372 \r \n 0 0 : 0 0 : 0 5 , 8 2 0
32 45 45 62 32 48 48 58 48 48 58 48 55 44 54 53
- - > 0 0 : 0 0 : 0 7 , 6 5
52 13 10 118 121 118 105 110 117 116 -23 32 -66 117 100 115
4 \r \n v y v i n u t 351 276 u d s
107 -23 32 100 114 117 104 121 44 13 10
k 351 d r u h y , \r \n
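One reading that would be consistent with these numbers: `-t dC` shows each byte as a signed 8-bit value, and `-c` shows non-printable bytes as three-digit octal. The sketch below (Python 3, with a hypothetical helper `as_signed_char` that I made up for illustration) reproduces the pairs from the dump under that assumption.

```python
def as_signed_char(b: int) -> int:
    """Reinterpret an unsigned byte (0..255) as a signed 8-bit value (-128..127)."""
    return b - 256 if b > 127 else b

# The five non-ASCII byte values that appear in the dump above.
for b in (0xE1, 0xFA, 0x9A, 0xE9, 0xBE):
    print(f"hex {b:02X}  unsigned {b:3d}  signed {as_signed_char(b):4d}  octal {b:03o}")
```

Under this reading, -31 and 341 are the same byte: 225 decimal, E1 hex, which is also what less prints as `<E1>`.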
2.) My assumption is this: if UTF-8 and WINDOWS-1250 use different "numbers" (code representations) for the same characters, then a character encoded with encoding1 (WINDOWS-1250) gets the appropriate "code1" from the encoding1 table. If this encoded character (or rather its numeric representation, "code1") is then decoded with another encoding (UTF-8), the only thing that happens is that "code1" is looked up in the encoding2 (UTF-8) table and the corresponding character from that table is assigned. Am I right? I think an example will make it clear:
Please look at the following sites; they show what happens if you encode with one encoding and decode with another. It seems that up to the 127 (decimal) boundary it does not matter if you decode with the wrong encoding (this is why some characters in the example above were displayed correctly even though the wrong encoding was used).
from UTF-8 to WINDOWS-1250
http://www.string-functions.com/enco...&decoding=1250
from WINDOWS-1250 to UTF-8
http://www.string-functions.com/enco...decoding=65001
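The 127 boundary can be checked directly. A minimal Python 3 sketch, assuming Python's `cp1250` and `utf-8` codecs behave like the encodings on those sites: it decodes every byte from 0 to 127 both ways and compares, then counts how many single bytes from 128 to 255 are rejected by the UTF-8 decoder.

```python
# All byte values in the ASCII range.
ascii_bytes = bytes(range(128))

# Do both decoders produce the same text for these bytes?
same = ascii_bytes.decode("cp1250") == ascii_bytes.decode("utf-8")
print(same)   # True

# Count the bytes 128..255 that are not valid UTF-8 when they stand alone.
bad = 0
for b in range(128, 256):
    try:
        bytes([b]).decode("utf-8")
    except UnicodeDecodeError:
        bad += 1
print(bad)    # 128
```

So above 127 a lone byte is never a complete UTF-8 character, which does not look like a simple table lookup to me.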
According to this site
http://doc.infosnel.nl/extreme_utf-8.html the "á" character is encoded in UTF-8 as 225. According to Wikipedia
http://en.wikipedia.org/wiki/Windows-1250 "á" also has the value 225 in Windows-1250. So why is "á" not displayed correctly when I use the wrong encoding? Check here and type "á":
http://www.string-functions.com/encodedecode.aspx Also an interesting observation: in the UTF-8 table the "š" character appears twice (once with code 154 and once with code 453). Why?
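To compare the numbers myself I tried a small Python 3 sketch. It prints, for each character, the Unicode code point, the byte stored by Windows-1250 and the bytes stored by UTF-8. Whether the 225 in the first table is the code point rather than the stored byte is my assumption, and the sketch only shows what Python's codecs do.

```python
chars = "\u00e1\u0161"  # á and š, written as escapes to avoid source-encoding issues

for ch in chars:
    print(
        ch,
        "code point:", ord(ch),
        "cp1250 bytes:", ch.encode("cp1250").hex(),
        "utf-8 bytes:", ch.encode("utf-8").hex(),
    )

# The single byte E1 (225) taken from the Windows-1250 file, fed to a UTF-8 decoder.
try:
    b"\xe1".decode("utf-8")
except UnicodeDecodeError as exc:
    print("E1 alone is not valid UTF-8:", exc.reason)
```

For "á" the code point and the Windows-1250 byte are both 225, but UTF-8 stores it as the two bytes C3 A1, so a lone E1 byte does not decode.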
3.) If I understand it right, there is no way to tell how a file was encoded (unless there is some header that specifies this, or you do some statistical language analysis, etc.). So why/how does the "file" command recognize UTF-8 encoding but not WINDOWS-1250?
Code:
[pista@HP-PC konvertovanie]$ file -bi file1
text/plain; charset=unknown-8bit
[pista@HP-PC konvertovanie]$ iconv -f WINDOWS-1250 -t UTF-8 file1 > file1.utf8
[pista@HP-PC konvertovanie]$ file -bi file1.utf8
text/plain; charset=utf-8
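My guess is that UTF-8 can be recognized because its multi-byte sequences follow strict rules, so a file either fits them or not. The sketch below (Python 3) only illustrates that idea with a hypothetical helper `looks_like_utf8`; I am not claiming this is how `file` is actually implemented.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# First line of file1 as Windows-1250 bytes (copied by hand from the od dump).
cp1250_line = b"- Prich\xe1dzaj\xfa.\r\n"
# The same line after conversion, as iconv -t UTF-8 would write it.
utf8_line = cp1250_line.decode("cp1250").encode("utf-8")

print(looks_like_utf8(cp1250_line))  # False
print(looks_like_utf8(utf8_line))    # True
```

Note that pure ASCII text would also pass this check, so it can confirm "valid UTF-8" but says nothing about which single-byte code page a non-UTF-8 file uses.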
Thank you very much