unicode file

rajesh_b · 09-01-2005, 04:01 AM

Hi all,

The problem is related to 2-byte character representation. If a filename is Unicode I have to ignore it; otherwise I have to process the file, i.e. I have to process only ASCII files. I don't know how to check whether a filename is Unicode or not. Can you please tell me whether there is a C API or function that can be used for this, or give me some pointers? Thanks in advance.

Regards
Rajesh

spooon · 09-01-2005, 05:15 AM

I think you misunderstand what Unicode is. Unicode is just an abstract mapping that assigns an integer to each character or modifier. How it is actually represented in data depends on the encoding.

The two most common encodings are UTF-8 and UTF-16. UTF-8 is ASCII-compatible (any ASCII text is already valid UTF-8), almost universally used on Unix-like systems, and takes 1-4 bytes per character. UTF-16 takes 2 or 4 bytes per character. Either of these can use 2 bytes for a given character, but neither always does (the older fixed-width UCS-2 does, but it cannot represent every Unicode character), so it's incorrect to associate Unicode with "2 bytes".
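
To make the byte counts concrete, here is a small sketch in C that returns how many bytes a single code point needs in each encoding. The function names utf8_len and utf16_len are made up for this illustration; they are not part of any standard library.

```c
#include <stddef.h>

/* Hypothetical helper: bytes needed to encode code point cp in UTF-8.
 * Returns 0 for values that are not encodable (surrogates, > U+10FFFF). */
static size_t utf8_len(unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate range */
    if (cp < 0x80)      return 1;  /* ASCII */
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Hypothetical helper: bytes needed to encode code point cp in UTF-16. */
static size_t utf16_len(unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate range */
    if (cp < 0x10000)   return 2;  /* one 16-bit unit */
    if (cp <= 0x10FFFF) return 4;  /* surrogate pair */
    return 0;
}
```

For example, 'A' (U+0041) takes 1 byte in UTF-8 and 2 in UTF-16, while U+20AC (the euro sign) takes 3 bytes in UTF-8 and 2 in UTF-16.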

If your job is to distinguish ASCII from non-ASCII then that is easy: pure ASCII only uses byte values 0-127, so if the name contains any byte in the range 128-255 it is not pure ASCII.
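
A minimal sketch of that check in C, assuming the filename arrives as an ordinary NUL-terminated char string (as it does from readdir() on Unix). The name is_pure_ascii is just an illustrative choice, not an existing API.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if every byte of the NUL-terminated
 * string is in the ASCII range 0-127, false as soon as a byte >= 128 is seen. */
static bool is_pure_ascii(const char *name)
{
    /* Go through unsigned char so bytes >= 128 are not seen as negative
     * on platforms where plain char is signed. */
    for (const unsigned char *p = (const unsigned char *)name; *p != '\0'; ++p) {
        if (*p > 127)
            return false;
    }
    return true;
}
```

A caller would then skip any name for which is_pure_ascii() returns false and process the rest. Note that this only tells you the name is not plain ASCII; it says nothing about which encoding the other bytes are in.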

rajesh_b · 09-01-2005, 11:28 PM

Hi spooon,
Thanks for your reply. Yes, I misunderstood. What I have to do is this: if the filename contains a character that occupies two or more bytes, I have to ignore the file; otherwise I have to process it.

Rajesh

jlliagre · 09-02-2005, 01:07 AM

A filename, at least on Unix, is just an array of 8-bit characters with nothing recording which encoding is used, so there is no definitive way to figure out which encoding a file name is meant to be read in.

addy86 · 09-02-2005, 04:38 AM

Isn't the encoding saved somewhere in the description of the file system?

theYinYeti · 09-02-2005, 06:29 AM

1/ Filenames' encoding is set as an option for some filesystems in the /etc/fstab file.

2/ I can see good reasons for wanting to detect non-UTF-8 files (e.g. converting them to UTF-8), and no, detecting non-ASCII is not enough (e.g. ISO-8859-1 is not UTF-8) unless you really only use ASCII (in which case the UTF-8 files will be identical anyway).
I don't know of any 100%-reliable method for doing such detection. The best solution is probably to parse the file and check it for UTF-8 conformity.

Yves.
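
For what it's worth, a conformity check along those lines might look like the sketch below. It assumes "UTF" here means UTF-8, and the function name is_valid_utf8 is made up for the example. It works on any byte buffer, so it could be applied to a filename or to file contents. As noted above, this is not 100% reliable as a detector: passing the check only shows the bytes are well-formed UTF-8, not that they were written as UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the first len bytes of s are
 * well-formed UTF-8 (no stray or missing continuation bytes, no overlong
 * forms, no surrogates, nothing above U+10FFFF). */
static bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;           /* continuation bytes expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed at this length */

        if (c < 0x80) { i++; continue; }          /* plain ASCII byte */
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return false;   /* stray continuation byte or invalid lead byte */

        if (n > len - i - 1) return false;        /* sequence is truncated */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false; /* not 10xxxxxx */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return false;               /* overlong encoding */
        if (cp > 0x10FFFF) return false;          /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* surrogate */
        i += n + 1;
    }
    return true;
}
```

A name stored as ISO-8859-1 with accented letters will usually fail this check, because a lone byte such as 0xE9 is not followed by the continuation bytes UTF-8 requires.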