LinuxQuestions.org
Old 03-09-2007, 12:27 AM   #1
kshkid
Member
 
Registered: Dec 2005
Distribution: RHEL3, FC3
Posts: 383

Rep: Reputation: 30
Confused with the diacritic characters


Hi All,

I am really confused by how diacritic characters are displayed.

Here is my problem!
Code:
Â ==> which is LATIN CAPITAL LETTER A WITH CIRCUMFLEX
I have many characters like the above in a huge file.

I am trying to find the characters that accompany each special diacritic character, like:
Code:
Âe
Âu
Âi
and similar.

When I paste a specific portion of the file on its own into a small file called 'sample', I get the value of Â as "195 130", which is the expected value.

When I read from the huge file, I get the value "194", which is incorrect.

These characters are encoded in 2 bytes, so "195 130" should be displayed when reading from the huge file as well.

Here's my script:
Code:
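#! /opt/third-party/bin/perl

use strict;
use warnings;
use IO::File;

my $filename = "/datarepos/hugefile";
my $source = IO::File->new($filename, 'r')
    or die "Cannot open $filename: $!";

# Read raw bytes, with no encoding layer applied.
binmode($source);

# getc returns undef at end of file. Testing ord == 0 instead
# would also stop early at any NUL byte in the data.
while (defined(my $ch = getc($source))) {
  print "ch is:$ch\n";
  print "ordval is:" . ord($ch) . "\n";
}

$source->close;
exit 0;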

For a small file "sample" consisting of just
Code:
CondÂe-sur

For the above small extract it works fine.

I am puzzled by this behaviour!


Many thanks in advance!
 
Old 03-09-2007, 02:03 AM   #2
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris 11.4, Oracle Linux, Mint, Debian/WSL
Posts: 9,789

Rep: Reputation: 492
My first guess is these characters are coded in UTF-8 (or perhaps ANSEL) while your terminal is set to use ISO-8859.
 
Old 03-09-2007, 03:07 AM   #3
kshkid
Original Poster
Yes, these are Unicode characters in the UTF-8 encoding.

But I just set the terminal to vt100 and nothing else.

What setting do I need to change to remove this difference?

Many thanks for the reply!
 
Old 03-09-2007, 05:29 AM   #4
jlliagre
The encoding used by your terminal emulator is the one set through an environment variable when you launched it.

You can run a UTF-8 xterm like this:
Code:
LC_ALL=en_US.UTF-8 xterm&
Alternatively, you can convert your file from UTF-8 to ISO-8859-1 (check the syntax, as I'm using Solaris and it may differ on your OS; GNU iconv on Linux names the target ISO-8859-1):
Code:
iconv -f UTF-8 -t 8859-1 file.utf8 > file.iso
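
For anyone without iconv at hand, the same conversion can be sketched in Python. This is only an illustration and not part of the command above: the helper name `utf8_to_latin1` is made up here, and it assumes every character in the input exists in ISO-8859-1 (otherwise `encode` raises `UnicodeEncodeError`).

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-1 (hypothetical helper)."""
    # Decode the UTF-8 bytes to text, then encode the text as Latin-1.
    return data.decode("utf-8").encode("iso-8859-1")


# The sample string with the accented letter stored as two UTF-8 bytes (C3 82).
utf8_bytes = b"Cond\xc3\x82e-sur"
latin1_bytes = utf8_to_latin1(utf8_bytes)

print(len(utf8_bytes), len(latin1_bytes))  # the Latin-1 form is one byte shorter
print(list(latin1_bytes[4:5]))             # the single byte now holding the accented letter
```

For a real file you would read it in binary mode, pass the bytes through the helper, and write the result in binary mode.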
 
Old 03-09-2007, 07:51 AM   #5
kshkid
Original Poster
I should have included this information before.

OS : RHEL3
Shell: zsh


Actually, the data I am working with is not in a clean format. It should be in UNIMARC, but it was badly corrupted and I am trying to fix that.

Before that, I just wanted to extract the characters and find their 2-byte encoded values, and that is where the problem became confusing.

I tried your suggestion of setting LC_ALL and extracting the values directly from the huge file, but unfortunately that didn't work, since the encoding I am working with is UNIMARC-specific and needs to be converted to UTF-8.

But iconv -l does not list that conversion, so there seems to be no way except to write my own converter.

If you could shed some light on this, that would be great!

Many thanks for your reply!

 
Old 03-09-2007, 11:46 PM   #6
graemef
Senior Member
 
Registered: Nov 2005
Location: Hanoi
Distribution: Fedora 13, Ubuntu 10.04
Posts: 2,379

Rep: Reputation: 148
Quote:
Originally Posted by kshkid
When I paste a specific portion of the file on its own into a small file called 'sample', I get the value of Â as "195 130", which is the expected value.

When I read from the huge file, I get the value "194", which is incorrect.

These characters are encoded in 2 bytes, so "195 130" should be displayed when reading from the huge file as well.
In UTF-8, Â has the code C382 (195, 130).
In ISO-8859-1, Â has the code C2 (194).

So it looks as if your file is holding the characters in a single-byte format.

By the way, UTF-8 C382 decodes to U+00C2, so could the file be holding the Unicode code point value?
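
The two byte values are easy to reproduce. A minimal sketch in Python (my choice of language for illustration, the variable names are arbitrary):

```python
ch = "\u00c2"  # LATIN CAPITAL LETTER A WITH CIRCUMFLEX

utf8 = ch.encode("utf-8")         # two bytes
latin1 = ch.encode("iso-8859-1")  # one byte

print("UTF-8:     ", utf8.hex(), list(utf8))      # c382 [195, 130]
print("ISO-8859-1:", latin1.hex(), list(latin1))  # c2 [194]
print("code point: U+%04X" % ord(ch))             # U+00C2
```

Note that the ISO-8859-1 byte value and the Unicode code point are the same number, which is why a 194 read from the file fits a single-byte encoding.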
 
Old 03-10-2007, 11:49 PM   #7
kshkid
Original Poster
Quote:
By the way UTF-8 C382 will convert to U+00C2, so it could be in Unicode code point?
The thing is, I really don't know. How can I verify that?

The following is exactly the Unicode code point for the diacritic character under discussion:

Code:
U+00C2	Â	195 130	LATIN CAPITAL LETTER A WITH CIRCUMFLEX
Quote:
So it looks as if your file is holding the characters in single byte format.
How do I verify this?

Quote:
In UTF-8 Â will have the code C382(195,130)
In ISO-8859-1 Â will have the code C2 (194)
I understand Unicode code points, but what are these C382 and C2 values?
 
Old 03-11-2007, 12:29 AM   #8
graemef
Well, C382 is the hex of (195, 130). To convert from UTF-8 to a Unicode code point you first need to know the number of bytes, in this case 2. Look at it in binary: 11000011 10000010. In two-byte UTF-8 this is converted using the following pattern:

UTF-8 110yyyyy 10zzzzzz
UNICODE yyyyyzzzzzz

Thus 11000011 10000010 will convert to
yyyyyzzzzzz
00011000010 = 0xC2 = 194
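
The same bit manipulation can be sketched in code. This is only an illustration of the pattern above: the function name `decode_two_byte_utf8` is invented for the example, and it does not reject overlong forms the way a full UTF-8 decoder would.

```python
def decode_two_byte_utf8(b1: int, b2: int) -> int:
    """Return the code point for a two-byte UTF-8 sequence (hypothetical helper)."""
    # The first byte must look like 110yyyyy, the second like 10zzzzzz.
    if b1 & 0b11100000 != 0b11000000 or b2 & 0b11000000 != 0b10000000:
        raise ValueError("not a two-byte UTF-8 sequence")
    y = b1 & 0b00011111  # five payload bits from the first byte
    z = b2 & 0b00111111  # six payload bits from the second byte
    return (y << 6) | z


cp = decode_two_byte_utf8(195, 130)
print(cp, hex(cp))  # 194 0xc2
```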

My guess is that the original file is ISO-8859-1; open it with that encoding and see what it looks like.
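
A rough way to test that guess is to check whether the bytes are valid UTF-8 at all. The sketch below assumes nothing about the real file: `guess_encoding` is a made-up helper, and "single-byte" only means "not valid UTF-8", it cannot tell ISO-8859-1 apart from other single-byte encodings.

```python
def guess_encoding(data: bytes) -> str:
    """Very rough guess: 'ascii', 'utf-8' or 'single-byte' (hypothetical helper)."""
    if all(b < 0x80 for b in data):
        return "ascii"  # no high bytes, so the encodings agree
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # A lone high byte such as C2 followed by 'e' is not valid UTF-8.
        return "single-byte"


print(guess_encoding(b"Cond\xc3\x82e-sur"))  # accented letter stored as two bytes
print(guess_encoding(b"Cond\xc2e-sur"))      # accented letter stored as one byte
```

On a large file you would feed it the bytes read in binary mode, or just a chunk around one of the suspicious characters.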
 
Old 03-11-2007, 01:38 AM   #9
kshkid
Original Poster
Many thanks again for the reply !

That helped me a lot!

Basically, there isn't any difference in how the specific character is displayed, whether with LC_ALL=en_US.ISO-8859-1 (latin-1)
or
LC_ALL=en_US.UTF8 (utf-8).

What's puzzling me is this: if I copy a specific portion of the file, including the diacritic characters, how does it get converted from latin1 to utf8 (giving the value c382 ==> 195, 130)?

And one more thing to confirm:

utf-8
latin-1

are character encodings,

and a Unicode code point is a code point given to any character belonging to any of those character sets.

So, basically, UCS (Universal Character Set) comprises all the encoded character sets. (Therefore every character in any of them can be represented in UCS.)

Please correct me if I am wrong.

Many thanks again!
 
Old 03-11-2007, 07:40 AM   #10
graemef
The idea of Unicode is to have a single scheme that can hold all known character sets. Any code point fits into 4 bytes, which allows for a huge number of characters. There would be a lot of wastage if all documents were always held using 4 bytes per character, so different schemes have been developed to address this issue.

UTF-8 holds traditional ASCII (128 characters, 7 bits) in a single byte. Then come the more common alphabets (European etc.), which fit into two bytes, and then most other scripts, which take 3 bytes (for example the Tibetan script, and also the bulk of the Chinese characters despite their sheer number). The 4-byte region holds the code points above U+FFFF, such as the rarer Chinese characters. With this scheme comes a cost, in that bits are reserved to identify whether the character is one, two, three or more bytes long. If the leftmost bit is a zero then it is a single-byte character. For two-byte characters five bits are reserved: 110 on the first byte and then 10 on the next byte. For three-byte characters eight bits are required: 1110 for the first byte and then 10 for each subsequent byte.
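
To see the byte counts and the reserved prefix bits for yourself, here is a small sketch in Python. The sample characters are my own picks for illustration. Note that a common Chinese character comes out as 3 bytes; only code points above U+FFFF need 4.

```python
samples = {
    "A": "ASCII letter",
    "\u00c2": "Latin letter with circumflex",
    "\u0f40": "Tibetan letter KA",
    "\u4e2d": "common Chinese character",
    "\U00020000": "rare CJK character above U+FFFF",
}

for ch, label in samples.items():
    encoded = ch.encode("utf-8")
    # Show each byte in binary so the reserved prefix bits are visible.
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X} {label}: {len(encoded)} byte(s): {bits}")
```

The first byte of each sequence starts with 0, 110, 1110 or 11110 depending on the length, and every following byte starts with 10.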

ASCII and ISO-8859-1 predate Unicode, but there is a certain amount of backwards compatibility built in, which is useful (especially with pure ASCII) but can also be confusing!
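
A short sketch of what that compatibility means in practice (Python, with an arbitrary sample string of my own):

```python
text = "Conde-sur"  # pure ASCII

# ASCII text produces identical bytes in all three encodings.
same_bytes = text.encode("ascii") == text.encode("iso-8859-1") == text.encode("utf-8")
print(same_bytes)

# Every ISO-8859-1 byte value equals the Unicode code point of its character.
latin1_matches = all(ord(bytes([b]).decode("iso-8859-1")) == b for b in range(256))
print(latin1_matches)

# Above 127 the stored bytes differ: one byte in ISO-8859-1, two in UTF-8.
print(list("\u00c2".encode("iso-8859-1")), list("\u00c2".encode("utf-8")))
```

The confusing part is the last line: the same character, with the same code point, is stored as different bytes depending on the encoding.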
 
  

