Old 07-03-2009, 03:53 PM   #1
jamescobban
LQ Newbie
 
Registered: Jul 2009
Posts: 10

Rep: Reputation: 0
Linux Filesystem Support for Unicode Filenames


Most commercial operating systems now support Unicode filenames so that users can give files names that are meaningful to them in their own language. However, I have sought in vain for similar support in Linux. Even ongoing enhancements to Linux filesystems, such as ext4, seem to be ignoring this market requirement. I don't understand this, as it does not seem to me that it would take a great deal of development effort to store filenames in the directory as UTF-8 strings rather than as locale-dependent single-byte strings. This is a particular issue when mounting a commercial filesystem, such as NTFS, which stores its filenames as UTF-16, on Linux. What do I code in my C++ program to open a file whose name happens to be Greek, or Russian, or Arabic, or Chinese, when the API definition only accepts const char * filenames?
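A minimal sketch of the usual answer (assuming the source file and the user's locale are both UTF-8; the Greek filename is purely illustrative): the const char * that the API accepts can simply carry UTF-8 bytes, because the kernel treats the name as an opaque byte string.

Code:
/* Sketch only: pass UTF-8 bytes through the ordinary const char * API. */
#include <stdio.h>

int main(void)
{
    /* These bytes are UTF-8 as long as this source file is saved as UTF-8. */
    const char *name = "αρχείο.txt";   /* illustrative Greek name, "file.txt" */

    FILE *f = fopen(name, "w");        /* plain const char * parameter */
    if (f == NULL) {
        perror(name);
        return 1;
    }
    fputs("hello\n", f);
    fclose(f);
    return 0;
}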
 
Old 07-03-2009, 05:33 PM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Uhh, I don't know what world you're living on, but most everything in Linux is UTF-8, including the filesystems. Some of the older terminals and programs can't handle it, but just about everything made in the last decade does.

I have no trouble using Japanese with UTF-8 encodings on my system. If yours can't, it's probably because you just don't have it configured properly for a UTF-8 environment.

See here for international text support in Linux.

BTW, as for filenames, even though you can use non-alphabetic characters, they're still not that easy to deal with. Console-based IME in particular is a bitch. So I generally avoid using anything but the standard western alphabet in my filenames. There's no trouble using other scripts in files and programs, however.
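To make "configured properly for a UTF-8 environment" concrete, here is a small sketch (assuming glibc) of how a program can ask which character encoding the current locale uses:

Code:
/* Sketch: query the character encoding of the user's locale. */
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    setlocale(LC_ALL, "");                          /* adopt the environment locale */
    printf("codeset: %s\n", nl_langinfo(CODESET));  /* prints e.g. "UTF-8" */
    return 0;
}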

Last edited by David the H.; 07-03-2009 at 05:37 PM.
 
Old 07-04-2009, 12:55 PM   #3
jamescobban
LQ Newbie
 
Registered: Jul 2009
Posts: 10

Original Poster
Rep: Reputation: 0
UTF-8 Filenames

Quote:
Originally Posted by David the H. View Post
Uhh, I don't know what world you're living on, but most everything in Linux is UTF-8, including the filesystems.

BTW, as for filenames, even though you can use non-alphabetic characters, they're still not that easy to deal with.
I thought I was being very specific when I referred to Unicode filenames in my title. There is of course no problem with support of internationalization itself in Linux.

Specifically, my concern is what I should code in a C or C++ program in order to access a file whose name contains characters which are not part of the 7-bit ASCII character set. You use the term "non-alphabetic" to refer to these non-English characters, but considering that "alpha" and "beta" are both characters which are not representable in 7-bit ASCII and are in fact the prototypes of "alphabetic" characters, your terminology is ethnocentric.

All of the C and C++ library interfaces specify that filenames are expressed as char *. In my opinion UTF-8 should be used only as the preferred external representation of Unicode character strings.
  1. Preferred because UTF-8 is the only completely unambiguous representation of Unicode, since it is unaffected by the byte-ordering differences between systems.
  2. External because it is awkward as an internal representation: it takes a variable number of bytes to represent each letter/glyph.
Therefore I do not believe that char * is a valid internal reference to Unicode strings, even though their external representation, for example in the directory structure of a filesystem, might be UTF-8.

In my opinion the only natural internal representation of Unicode is wchar_t *, or equivalents such as std::wstring or wxString. Therefore I would expect to be able to open a file, or inquire about file characteristics, on a file system which supports Unicode file names using a filename represented in wchar_t * or equivalent. Admittedly wxWidgets provides a wrapper around the standard API that permits passing wxString filenames, but that wrapper uses the locale specific translation to render them down to char * to satisfy the standard API.

The justification for why wchar_t * is not supported by the standard API appears to be that not all file systems support Unicode file names. I have a few problems with this:
  1. Traditional file systems, such as ext3 and FAT, support 8-bit characters in their directories, which are interpreted according to the Locale. So if a Greek, or a Russian, or an Arab gives a file a meaningful name in his local codepage, it will display as gibberish to anyone using a different code page. I don't find that desirable behavior.
  2. A concern is raised that if the API supported wchar_t *, a program might pass Unicode characters in a filename that cannot be represented in the directory of the target filesystem. However, the API already defines an errno value to be set if the application tries to do this, so why not allow it for filesystems that do support Unicode?
Now we get into a circular argument. Linux filesystems, such as ext4, have chosen not to support Unicode, apparently because the C API does not support Unicode filenames, while the C API does not support Unicode filenames because the filesystems don't support them.

I would appreciate an informed discussion of how to get out of this Catch-22.
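One practical way out, sketched here under stated assumptions (the helper name open_wide, the wide-string filename and the PATH_MAX buffer are illustrative, not part of any standard API): keep wchar_t internally and convert at the API boundary with wcstombs(), which yields UTF-8 whenever the process runs under a UTF-8 locale.

Code:
/* Sketch: convert a wide-character filename to the locale's multibyte
 * encoding (UTF-8 under a UTF-8 locale), then call the ordinary open(). */
#include <fcntl.h>
#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <wchar.h>

/* Hypothetical helper; not part of any standard API. */
static int open_wide(const wchar_t *wname, int flags, mode_t mode)
{
    char buf[PATH_MAX];
    size_t n = wcstombs(buf, wname, sizeof buf);   /* encode per LC_CTYPE */
    if (n == (size_t)-1 || n >= sizeof buf)
        return -1;                    /* unencodable character, or name too long */
    return open(buf, flags, mode);
}

int main(void)
{
    setlocale(LC_ALL, "");            /* must be a UTF-8 locale to get UTF-8 bytes */
    int fd = open_wide(L"файл.txt", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        fprintf(stderr, "open_wide failed\n");
        return 1;
    }
    close(fd);
    return 0;
}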
 
Old 07-04-2009, 02:03 PM   #4
Samotnik
Member
 
Registered: Jun 2006
Location: Belarus
Distribution: Debian GNU/Linux testing/unstable
Posts: 471

Rep: Reputation: 40
The internal representation of Unicode in GNU libc is provided by the wchar_t type, but for reading and writing such strings in UTF-8 there is a set of functions to convert wide characters to multibyte and back.
You should read the documentation before asking such questions.
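A sketch of the direction back (the directory "." and the buffer size are illustrative): filenames come out of readdir() as locale-encoded bytes and can be widened with mbstowcs().

Code:
/* Sketch: list a directory and decode each byte-string name into wchar_t. */
#include <dirent.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");               /* bytes are interpreted per the user's locale */

    DIR *d = opendir(".");
    if (d == NULL) {
        perror("opendir");
        return 1;
    }

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        wchar_t wname[256];              /* d_name is at most 255 bytes plus NUL */
        if (mbstowcs(wname, e->d_name, 256) == (size_t)-1) {
            fprintf(stderr, "name is not valid in this locale: %s\n", e->d_name);
            continue;
        }
        wprintf(L"%ls\n", wname);        /* wide output of the decoded name */
    }
    closedir(d);
    return 0;
}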

Last edited by Samotnik; 07-04-2009 at 02:06 PM.
 
Old 07-04-2009, 08:54 PM   #5
jamescobban
LQ Newbie
 
Registered: Jul 2009
Posts: 10

Original Poster
Rep: Reputation: 0
Unicode Filenames

Quote:
Originally Posted by Samotnik View Post
The internal representation of Unicode in GNU libc is provided by the wchar_t type, but for reading and writing such strings in UTF-8 there is a set of functions to convert wide characters to multibyte and back.
You should read the documentation before asking such questions.
I honestly do not understand your point. I have repeatedly stated that my concern is about how to represent filenames, not application data. There exist standard library conversions between wchar_t * and UTF-8 and other multi-byte representations, but none of those conversions are relevant if the file API does not permit me to pass Unicode file names, whether in wchar_t * format or as UTF-8.

If this is not an appropriate forum for this sort of discussion I would welcome a re-direct to a forum which does deal with architectural issues.
 
Old 07-05-2009, 01:07 AM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Truthfully, I'm not a programmer, so this is far beyond my knowledge area. But my real point when I posted was that Unicode IS supported, somehow. How could I be using it on my computer if there weren't SOME way for programs to read and write it to these filesystems?

All you need to do is find out how all these other programs are doing it. Why not check out the code for some of them and see what's being used already?
 
Old 07-05-2009, 05:45 AM   #7
Su-Shee
Member
 
Registered: Sep 2007
Location: Berlin
Distribution: Slackware
Posts: 510

Rep: Reputation: 53
With my preferred mixed locale settings using the UTF-8 encoding, I have had no problems whatsoever for years now handling non-Latin Unicode characters in _filenames_ on ext3 - neither Russian nor Japanese nor French nor my German umlauts.

I don't understand either what you mean by "the API doesn't permit passing Unicode filenames".

What are ls, mv, cp, grep, less, touch, vim, mplayer and many other applications that open or handle files with Unicode filenames actually doing, then?

I've just touch'ed a test file with a Unicode filename, cp'ed it, less'ed it, opened and saved it with vim, grep'ed it, copied it via NFS onto our fileserver and opened it locally with Firefox (which also has to open Unicode filenames somehow...). All these apps have to handle the filename, and all of them are written in either C or C++.

"mc", though, still needs a patch to handle Unicode filenames encoded in UTF-8, and there are several Unicode filename problems with Samba and Mac shares - at least that's what I've read.

Also useful: http://www.cl.cam.ac.uk/~mgk25/unicode.html
 
Old 07-05-2009, 06:25 AM   #8
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
I decided to look around a bit, and I found this page, which has this to say:

Quote:
As noted above, the Linux kernel doesn't care about character encodings. For common Linux filesystems (ext2, ext3, ReiserFS, and other filesystems typical for Unices), information that a particular filesystem uses one encoding or another is not stored as a part of that filesystem. Only locale-controlling environment variables tell software that particular bytes should be displayed as one or another character. Filesystems found on Microsoft Windows machines (NTFS and FAT) are different in that they store filenames on disk in some particular encoding. The kernel must translate this encoding to the system encoding, which will be UTF-8 in our case.
So unlike with Windows-based filesystems, the whole topic of encoding at the filesystem level is a moot point for native Linux filesystems. The translation occurs at a higher level.

BTW, as for my "terminology", it's not "ethnocentric" in any way. It's just ignorance of what the proper terms are. Please try to avoid reading too much into what a person says.

Last edited by David the H.; 07-05-2009 at 06:30 AM.
 
Old 07-05-2009, 12:30 PM   #9
jamescobban
LQ Newbie
 
Registered: Jul 2009
Posts: 10

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by Su-Shee View Post

I don't understand either what you mean by "the API doesn't permit passing Unicode filenames".

What are ls, mv, cp, grep, less, touch, vim, mplayer and many other applications that open or handle files with Unicode filenames actually doing, then?
The API specifies that filenames must be passed as char *. It does not specify what the char * actually points at; in fact, the API documentation says nothing about Unicode. Strictly speaking, Unicode requires at least 21 bits to represent all currently assigned code points. UTF-8 is a supported external representation of Unicode but has, as I pointed out, the deficiency as an internal representation that it requires a variable number of bytes to represent a single character/glyph. Furthermore, UTF-8 is not generally used to represent character strings within a GUI, although GUI toolkits provide transform functions between Unicode strings and both UTF-8 and the Locale code page. The Locale is something that you, the user, set, not the author of the file or the creator of the filesystem.

Just picking "ls" as an example, read through the man page and the info page carefully and you will see that they say nothing about filenames except that they are displayed. Oh, there is an option to ask for hex output of "non-graphic" characters, and a mention, again, that your setting for Locale will affect the sort order. As I read this documentation, the assumption is that filenames are encoded in a single-byte-per-character/glyph representation according to the Locale-specific code page. The sort order issue arises because languages other than English do not sort alphabetic strings in byte-code order; it has nothing to do with languages that require multiple bytes to represent individual characters.
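To make the collation point concrete, a brief sketch (the German strings and the de_DE.UTF-8 locale are illustrative, and that locale must actually be installed): byte-order comparison and locale-aware comparison can disagree on the same pair of names.

Code:
/* Sketch: byte order (strcmp) versus locale-aware collation (strcoll). */
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *a = "Äpfel";   /* German for "apples"; 'Ä' is two bytes in UTF-8 */
    const char *b = "Zebra";

    /* As raw UTF-8 bytes, 0xC3 (first byte of 'Ä') sorts after 'Z'. */
    printf("bytes say Äpfel > Zebra: %d\n", strcmp(a, b) > 0);

    /* Under a German locale, 'Ä' collates with 'A', so Äpfel comes first. */
    if (setlocale(LC_COLLATE, "de_DE.UTF-8") != NULL)
        printf("de_DE says Äpfel < Zebra: %d\n", strcoll(a, b) < 0);
    else
        puts("de_DE.UTF-8 locale not installed");

    return 0;
}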

My point remains that, because of this peculiar circular logic, Linux file systems neither enforce a particular interpretation of file names, as is done in commercial file systems such as NTFS, nor record the Locale used to create a file name. As far as they are concerned, the file name is a sequence of bytes, a small handful of which have special meanings. So if someone else creates a filename according to one particular Locale, when you try to open it the filename will be gibberish if you have not set the same Locale. In my opinion that is unacceptable behavior. I expect software to behave in a predictable manner. I should be permitted to pass Unicode strings, which have a universal definition, to the file API, and get the same result no matter what the Locale has been set to.

I simply find this frustrating because Unicode has been around more than 20 years, and yet its very existence, outside of GUIs, is ignored.
 
Old 07-05-2009, 12:44 PM   #10
jamescobban
LQ Newbie
 
Registered: Jul 2009
Posts: 10

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by David the H. View Post
I decided to look around a bit, and I found this page, which has this to say:

So unlike with Windows-based filesystems, the whole topic of encoding at the filesystem level is a moot point for native Linux filesystems. The translation occurs at a higher level.
Thank you for pointing that article out to me.

I am still not satisfied that pretending the problem does not exist is acceptable. Obviously Apple and Microsoft do not feel that it is acceptable. The failure of the Linux community to act on this issue will, in my opinion, be an obstacle to commercial acceptance.

With Linux, because the decision has apparently been made to leave the handling of Unicode up to the application layer, there must be agreement among all of the parties that a particular encoding is to be used for filenames. If there is any disagreement, the result is gibberish.

I appreciate that Linux is a libertarian operating system, and that I am not going to be coddled the way I was on Windows or OS X, but I should at least be entitled to expect the behavior of the operating system to be predictable. It may be Garbage In, Garbage Out, but I would like to get the same garbage out that was put in.
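On the "same garbage out" point, a small sketch (the file name bytes are deliberately chosen and purely illustrative) of the guarantee Linux filesystems do make: any byte sequence other than '/' and NUL is stored verbatim and comes back from readdir() unchanged.

Code:
/* Sketch: create a file whose name contains a byte that is NOT valid UTF-8
 * and verify that exactly the same bytes come back from the directory. */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    /* "test-" followed by a lone 0xE9 byte ('é' in Latin-1, invalid as UTF-8). */
    const char name[] = { 't', 'e', 's', 't', '-', (char)0xE9, '\0' };

    int fd = open(name, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    close(fd);

    DIR *d = opendir(".");
    if (d == NULL) {
        perror("opendir");
        return 1;
    }
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, name) == 0)
            puts("the directory returned exactly the bytes we stored");
    }
    closedir(d);
    unlink(name);   /* clean up the test file */
    return 0;
}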
 
Old 07-07-2009, 01:37 PM   #11
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Well, if you think it's wrong, start petitioning the developers of these filesystems, compilers, kernels, and whomever else to change it. Perhaps if you're convincing enough you'll be able to get future versions to include such things.

But I'll leave you with this thought. These guys aren't stupid. Some of the smartest and most famous names in the business have worked on this stuff. I'll bet you anything that they've already had long debates on this topic, and they probably have very good reasons for choosing to do it the way they did. I highly doubt it's the simple Catch-22 conundrum you theorize it to be.
 
  

