The problem is related to 2-byte character representation. What I have to do is: if the filename is Unicode I have to ignore the file, otherwise I have to process it, i.e. I have to process only ASCII files. I don't know how to check whether a filename is Unicode or not. Can you please tell me whether there is any C API or function that can be used for this, or give me some pointers? Thanks in advance.
I think you misunderstand what Unicode is. Unicode is just an abstract mapping that assigns an integer to each character or modifier. How it is actually represented in data depends on the encoding.
The two most common encodings are UTF-8 and UTF-16. UTF-8 is ASCII-compatible (meaning any ASCII text is also valid UTF-8), is almost universally used on Unix-like systems, and takes 1 to 4 bytes per character. UTF-16 takes 2 or 4 bytes per character. Either of these may use 2 bytes for a given character, but neither uses exactly 2 bytes for every character, so it's incorrect to associate Unicode with "2 bytes".
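To make the byte counts concrete, here is a small sketch that computes how many bytes one Unicode code point needs in each encoding. The function names utf8_len, utf16_len and encoding_len_demo are made up for this illustration and are not part of any standard library.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8, or 0 if it is not encodable.
   (Hypothetical helper written for this example.) */
int utf8_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
    if (cp < 0x80)      return 1;                /* the ASCII range */
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;                                    /* beyond the Unicode range */
}

/* Bytes needed to encode one code point in UTF-16, or 0 if it is not encodable. */
int utf16_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000)   return 2;                /* one 16-bit unit */
    if (cp <= 0x10FFFF) return 4;                /* a surrogate pair */
    return 0;
}

/* A few spot checks: the same character can take a different number
   of bytes depending on the encoding. */
void encoding_len_demo(void)
{
    assert(utf8_len(0x41) == 1    && utf16_len(0x41) == 2);     /* 'A' */
    assert(utf8_len(0xE9) == 2    && utf16_len(0xE9) == 2);     /* e with acute accent */
    assert(utf8_len(0x20AC) == 3  && utf16_len(0x20AC) == 2);   /* euro sign */
    assert(utf8_len(0x1F600) == 4 && utf16_len(0x1F600) == 4);  /* an emoji */
}
```

Note that only the ASCII range ends up as a single byte, and only in UTF-8.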
If your job is to distinguish ASCII from non-ASCII then that is easy: pure ASCII only uses characters 0-127; so if it contains any byte that has a value that is 128-255 it is not pure ASCII.
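A minimal sketch of that check in C, assuming the filename is available as an ordinary NUL-terminated string (for example d_name from readdir()). The function name is_pure_ascii is made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if every byte of the NUL-terminated name is in 0-127.
   (Hypothetical helper written for this example.) */
bool is_pure_ascii(const char *name)
{
    assert(name != NULL);

    /* Read the bytes as unsigned char: plain char may be signed,
       in which case bytes 128-255 would show up as negative values. */
    for (const unsigned char *p = (const unsigned char *)name; *p != '\0'; p++) {
        if (*p > 127)
            return false;   /* found a non-ASCII byte */
    }
    return true;
}
```

With this, is_pure_ascii("report.txt") is true, while a name containing an accented letter in UTF-8 or ISO-8859-1 is rejected, because either encoding uses at least one byte above 127 for it.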
Hi spooon,
Thanks for your reply. Yes, I misunderstood. What I have to do is: if the filename contains a character that occupies two or more bytes, I have to ignore the filename; otherwise I have to process it.
A filename, at least on Unix, is just an array of 8-bit characters with nothing to say which encoding is used, so there is no definitive way to figure out whether a filename is meant to be read in one encoding or another.
1/ Filenames' encoding is set as an option for some filesystems in the /etc/fstab file.
2/ I can see good reasons for wanting to detect non-UTF-8 files (e.g. to convert them to UTF-8), and no: detecting non-ASCII is not enough (e.g. ISO-8859-1 is not UTF-8) unless you really only use ASCII (in which case UTF-8 files will be identical anyway).
I don't know of any 100%-reliable method for doing such detection. The best solution probably is to parse the file for UTF conformity.
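As a rough sketch of such a conformity check, assuming "UTF" here means UTF-8, the function below walks a byte buffer (a filename's bytes or a file's contents) and rejects anything that is not well-formed UTF-8. The name is_valid_utf8 is made up for this example; this is one possible way to write it, not a standard API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if s[0..len-1] is well-formed UTF-8.
   (Hypothetical helper written for this example.) */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t extra;   /* continuation bytes expected after the lead byte */
        uint32_t cp;    /* code point being decoded */
        uint32_t min;   /* smallest code point allowed for this length */

        if (b < 0x80) { i++; continue; }   /* plain ASCII byte */
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;                 /* stray continuation or invalid lead byte */

        if (len - i <= extra)
            return false;                  /* sequence is cut off at the end */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return false;              /* not a continuation byte (10xxxxxx) */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                       /* overlong encoding */
        if (cp > 0x10FFFF) return false;                  /* outside Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   /* surrogate */

        i += extra + 1;
    }
    return true;
}
```

This only answers "could these bytes be UTF-8?". Text in another 8-bit encoding can, by coincidence, also pass the check, which is why the detection cannot be 100% reliable.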