Confused with the diacritic characters

kshkid · 03-09-2007, 12:27 AM

Hi All,

I am really confused by how diacritic characters are displayed.

Here is my problem!

Code:

Â ==> LATIN CAPITAL LETTER A WITH CIRCUMFLEX

I have many characters like the above in a huge file.

I am trying to find the characters that accompany each of these diacritic characters, for example:

Code:

Âe
Âu
Âi

and similar combinations.

When I paste a specific portion of the file on its own into a small file called 'sample',

I get the value of Â as "195 130", which is the correct value.

When I read from the huge file I get the value "194", which is incorrect.

These characters are encoded in 2 bytes, so "195 130" should be displayed when reading from the huge file as well.

Here's my script:

Code:

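#! /opt/third-party/bin/perl

use strict;
use warnings;
use IO::File;

my $filename = "/datarepos/hugefile";
my $source = IO::File->new($filename, 'r')
  or die "Cannot open $filename: $!\n";

# Read raw bytes, with no encoding layer applied.
binmode($source);

# getc returns undef at end of file. Testing ord($ch) == 0 instead
# would also stop early on a NUL byte in the data.
while (defined(my $ch = getc($source))) {
  print "ch is:$ch\n";
  print "ordval is:" . ord($ch) . "\n";
}

exit 0;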

For a small file "sample" consisting of just

Code:

CondÂe-sur

For the above small extract it works fine.

I am puzzled by this behaviour!

Many thanks in advance!

jlliagre · 03-09-2007, 02:03 AM

My first guess is these characters are coded in UTF-8 (or perhaps ANSEL) while your terminal is set to use ISO-8859.

kshkid · 03-09-2007, 03:07 AM

Yes, these are Unicode characters encoded in UTF-8.

But I only set the terminal to vt100 and nothing else.

What setting do I need to change to remove this difference?

Many thanks for the reply!

jlliagre · 03-09-2007, 05:29 AM

The encoding used by your terminal emulator is the one that was set through an environment variable when you launched it.

You can run a UTF-8 xterm like this:

Code:

LC_ALL=en_US.UTF-8 xterm&

Alternatively, you can convert your file from UTF-8 to ISO-8859 (check the syntax as I'm using Solaris and it may be different with your O/S):

Code:

iconv -f UTF-8 -t 8859-1 file.utf8 > file.iso

kshkid · 03-09-2007, 07:51 AM

I should have included this information before.

OS : RHEL3
Shell: zsh

Actually, the character encoding I am working with is not in a pure format. It should be UNIMARC, but it was badly corrupted, and I am trying to fix that.

Before that I just wanted to extract the characters and see their 2-byte encoded values, and that is where the problem showed up in this confusing way.

I tried your suggestion of setting LC_ALL and extracting the value directly from the huge file, but unfortunately that didn't work,

since the encoding I am working with is UNIMARC specific and needs to be converted to UTF-8,

but iconv -l doesn't list that conversion, so I see no way other than writing my own converter.

If you could shed some light on this, that would be great!

Many thanks for your reply!

graemef · 03-09-2007, 11:46 PM

Quote:

Originally Posted by kshkid

When I paste a specific portion of the file alone in a small file called 'sample'

I get the value of Â as "195 130", which is the correct value.

When I read from the huge file I get the value "194", which is incorrect.

These characters are encoded in 2 bytes, so "195 130" should be displayed when reading from the huge file as well.

In UTF-8, Â has the code C382 (195, 130).
In ISO-8859-1, Â has the code C2 (194).

So it looks as if your file is holding the characters in single-byte format.

By the way, UTF-8 C382 decodes to U+00C2, so could the file be holding Unicode code points rather than UTF-8?
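To make the two byte values concrete, here is a minimal Perl sketch using the core Encode module. It encodes the single character U+00C2 in both encodings and prints the resulting byte values in decimal. The helper name `byte_values` is made up for this illustration and is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00C2 is LATIN CAPITAL LETTER A WITH CIRCUMFLEX.
my $char = "\x{C2}";

# Hypothetical helper: the decimal byte values of $str once it has
# been encoded as $encoding.
sub byte_values {
    my ($encoding, $str) = @_;
    return join ' ', map { ord } split //, encode($encoding, $str);
}

print "ISO-8859-1: ", byte_values('ISO-8859-1', $char), "\n";  # 194
print "UTF-8:      ", byte_values('UTF-8',      $char), "\n";  # 195 130
```

The same character therefore shows up as one byte (194) or two bytes (195 130) depending only on the encoding of the file it is read from.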

kshkid · 03-10-2007, 11:49 PM

Quote:

By the way, UTF-8 C382 decodes to U+00C2, so could the file be holding Unicode code points rather than UTF-8?

I really don't know. How can I verify that?

The following is exactly the Unicode code point for the diacritic character under discussion:

Code:

U+00C2	Â	195 130	LATIN CAPITAL LETTER A WITH CIRCUMFLEX

Quote:

So it looks as if your file is holding the characters in single byte format.

How do I verify this?

Quote:

In UTF-8 Â will have the code C382(195,130)
In ISO-8859-1 Â will have the code C2 (194)

I understand the Unicode code point, but what are these C382 and C2 values?

graemef · 03-11-2007, 12:29 AM

Well, C382 is the hex form of (195, 130). To convert from UTF-8 to a Unicode code point you first need to know the number of bytes, in this case 2. Look at it in binary: 11000011 10000010. In two-byte UTF-8 this is converted using the following pattern:

UTF-8 110yyyyy 10zzzzzz
UNICODE yyyyyzzzzzz

Thus 11000011 10000010 will convert to
yyyyyzzzzzz
00011000010 = 0xC2 = 194
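The bit pattern above can be sketched in Perl with a mask and a shift. This is only an illustration of the two-byte case. The function name `decode_two_byte` is made up here, and it does not handle one-, three- or four-byte sequences.

```perl
use strict;
use warnings;

# Decode one two-byte UTF-8 sequence, given as two byte values,
# into a Unicode code point.
sub decode_two_byte {
    my ($b1, $b2) = @_;
    die "not a two-byte lead byte\n" unless ($b1 & 0xE0) == 0xC0;  # 110yyyyy
    die "not a continuation byte\n"  unless ($b2 & 0xC0) == 0x80;  # 10zzzzzz
    # Keep the five y bits and the six z bits, then join them.
    return (($b1 & 0x1F) << 6) | ($b2 & 0x3F);
}

my $cp = decode_two_byte(195, 130);
printf "U+%04X (%d)\n", $cp, $cp;   # U+00C2 (194)
```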

My guess is that the original file is ISO-8859-1. Open it with that encoding and see what it looks like.
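One way to check that guess is to test whether the raw bytes form valid UTF-8 at all. A lone byte such as C2 followed by an ASCII letter is not valid UTF-8, so a strict decode fails on single-byte data that contains accented characters. The sketch below uses the core Encode module. `classify_bytes` is a hypothetical helper name, and the optional file argument is slurped whole, which is an assumption that may not suit a really huge file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: make a rough guess at how raw bytes are encoded.
sub classify_bytes {
    my ($bytes) = @_;
    return 'ASCII only' unless $bytes =~ /[\x80-\xFF]/;
    # A strict decode dies on the first malformed sequence.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 'valid UTF-8' : 'not UTF-8 (probably a single-byte encoding)';
}

print classify_bytes("Cond\x{C3}\x{82}e-sur"), "\n";  # bytes 195 130
print classify_bytes("Cond\x{C2}e-sur"), "\n";        # byte 194 alone

# Optional: classify a file named on the command line.
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    local $/;                 # slurp mode
    my $data = <$fh>;
    print "$ARGV[0]: ", classify_bytes($data), "\n";
}
```

The first sample reports valid UTF-8 and the second does not. A valid result does not prove the file was meant as UTF-8, but an invalid one rules it out.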

kshkid · 03-11-2007, 01:38 AM

Many thanks again for the reply !

That helped me a lot!

Basically there isn't any difference in how that specific character is displayed, whether I use LC_ALL=en_US.ISO-8859-1 (latin-1)
or
LC_ALL=en_US.UTF8 (utf-8)

What's puzzling me is this: if I copy a specific portion of the file, including the diacritic characters, how does it get converted from latin1 to utf8 (giving the value c382 ==> 195, 130)?

And one more thing to confirm:

utf-8 and latin-1 are encoding character sets,

and a Unicode code point is a code point given to any character belonging to any of the encoding character sets.

So, basically, UCS (Universal Character Set) comprises all the encoding character sets, and therefore all of them can be represented in UCS.

Please correct me if I am wrong.

Many thanks again!

graemef · 03-11-2007, 07:40 AM

The idea of Unicode is to have a scheme that can hold all known character sets. This fits into 4 bytes, which allows for a huge number of characters. There would be a lot of wastage if all documents were always held using 4 bytes per character, so different encoding schemes have been developed to address this issue.

UTF-8 holds the traditional ASCII characters (128 of them, 7 bits each) in a single byte. Then come the more common scripts, European etc., which fit into two bytes, then further scripts in three bytes (for example the Tibetan script is three bytes). Most Chinese characters also sit in the three-byte range; as far as I know only the rarer ones, along with anything else above U+FFFF, need four bytes.

With this scheme comes a cost, in that bits are reserved to identify whether the character is one, two, three or more bytes long. If the leftmost bit is a zero then it is a single-byte character. For two-byte characters five bits are reserved: 110 on the first byte and 10 on the next byte. For three-byte characters I think eight bits are required: 1110 on the first byte and 10 on each of the two subsequent bytes.
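The reserved marker bits can be seen by encoding a few code points by hand. This is a rough Perl sketch, not a full encoder: the name `utf8_bytes` is made up for the example, and it does no validation of surrogates or of values above U+10FFFF. The sample code points are an ASCII letter, the Â from this thread, a Tibetan letter, a common Chinese character and one rarer Chinese character above U+FFFF.

```perl
use strict;
use warnings;

# Encode one Unicode code point as a list of UTF-8 byte values.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                      # 0xxxxxxx
    return (0xC0 | ($cp >> 6),                       # 110yyyyy
            0x80 | ($cp & 0x3F))                     # 10zzzzzz
        if $cp < 0x800;
    return (0xE0 | ($cp >> 12),                      # 1110xxxx
            0x80 | (($cp >> 6) & 0x3F),              # 10yyyyyy
            0x80 | ($cp & 0x3F))                     # 10zzzzzz
        if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),                      # 11110www
            0x80 | (($cp >> 12) & 0x3F),             # 10xxxxxx
            0x80 | (($cp >> 6) & 0x3F),              # 10yyyyyy
            0x80 | ($cp & 0x3F));                    # 10zzzzzz
}

for my $cp (0x41, 0xC2, 0x0F40, 0x4E2D, 0x20000) {
    my $hex = join ' ', map { sprintf '%02X', $_ } utf8_bytes($cp);
    printf "U+%04X -> %s\n", $cp, $hex;
}
# U+0041 -> 41
# U+00C2 -> C3 82
# U+0F40 -> E0 BD 80
# U+4E2D -> E4 B8 AD
# U+20000 -> F0 A0 80 80
```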

ASCII and ISO-8859-1 predate Unicode, but there is a certain amount of backwards compatibility built in, which is useful (especially with pure ASCII) but can also be confusing!