[SOLVED] What characters in Linux (or UNIX) filenames will cause problems?

rnturn · 08-22-2016, 07:35 PM

I'm working on a utility that will rename project files to use a standardized naming format and I ran into a problem with a particular character: ":".

The project file naming will have several fields, some of which must be filled in and others that are only used on occasion (version numbers, if necessary, etc.) so successive delimiters are going to be a common sight while navigating the project directories.

Code:

field1:field2::field4::field6.extension

Originally I lobbied for colons (':') for this as they made it fairly easy to see the individual fields. This became a problem when the project file was a tar archive and you tried extracting files from it: tar interpreted everything before the first ":" as a hostname. Oops! Forgot about that feature.

We're down to two characters that we don't think will cause problems when we manipulate these files with standard utilities: '%' and '+'.

Before we take the plunge, can anyone think of any utility that might balk or assume the file is something more than a plain-old file when it sees a '%' or a '+' in a file name?

TIA...

--
Rick

TenTenths · 08-23-2016, 03:00 AM

I wrote this a while back (December 2014 apparently) when I had to deal with files being generated in Linux but being used on Windows machines. You may find it useful.

https://centos.tips/fixing-troublesome-filenames/

hydrurga · 08-23-2016, 03:17 AM

Command processors like Bash may interpret % as a special symbol?

Why were '-' and '_' rejected? These are so common in filenames that most applications won't treat them as special.

TenTenths · 08-23-2016, 03:21 AM

Quote:

Originally Posted by hydrurga

Why were '-' and '_' rejected? These are so common in filenames that most applications won't treat them as special.

I've no issue with - and _ and indeed my script uses _ as a replacement for spaces (as I totally hate spaces in filenames!)

rnturn · 08-23-2016, 11:28 AM

Quote:

Originally Posted by hydrurga

Command processors like Bash may interpret % as a special symbol?

Why were '-' and '_' rejected? These are so common in filenames that most applications won't treat them as special.

Primarily because they ARE so common in filenames. We could use them as field delimiters -- and they were our first choices -- except that the files we're absorbing into the project contain TONS of those characters, which makes it impossible to determine whether a given character denotes the transition from one field to the next or is just a separator between the words someone included in the filename as a description. Some of the files come from Windows systems and the folks sending them to us think they're doing us a favor by replacing the spaces in the Windows filenames with underscores and things like "word - word" with "word_-_word". It's pretty crazy. (Getting them to change their file naming practices would be almost impossible.)

rnturn · 08-23-2016, 11:49 AM

Quote:

Originally Posted by TenTenths

I wrote this a while back (December 2014 apparently) when I had to deal with files being generated in Linux but being used on Windows machines. You may find it useful.

https://centos.tips/fixing-troublesome-filenames/

Interesting. I think we've all had to jump through this or a similar hoop at one time or another. I'm (and others on the team, frankly) not too keen on renaming the files too far from their original names.

File renaming isn't our real problem. The question is: if we break apart the filename and inject some project-related information into it, what character can we choose as a delimiter that won't cause some common utility to misinterpret the file name in some way, like "tar" did when it saw a colon in a tar archive filename?

So far we haven't encountered anyone wanting to add files to the project with '+' in the filename so I'm leaning in that direction. (Anyone who starts using '+' might find their files renamed: '+' -> '_plus_'.) I am concerned, though, that there had to be SOME reason the C++ folks tend to name files using 'cxx' instead of 'c++'. If that was a Windows thing, then maybe we're OK to adopt that.

grail · 08-23-2016, 12:33 PM

Well, I do not have a suggestion for an arbitrary symbol that may or may not impact any number of commands, but my first thought was whether it is really necessary to have quite so much detail in the name. Surely much of this information could be garnered by actually reading the file. Though at first this seems like an unnecessary rant, my suggestion would be to perhaps simply give the files more anatomical names or maybe just numbers, but then have a reference file that contains a mapping to give you all the extra guff you currently require. In this way you get what you want and can just as easily use any delimiter you like in the reference file ... just a thought

hydrurga · 08-23-2016, 01:53 PM

Quote:

Originally Posted by rnturn

Primarily because they ARE so common in filenames. We could use them as field delimiters -- and they were our first choices -- except that the files we're absorbing into the project contain TONS of those characters, which makes it impossible to determine whether a given character denotes the transition from one field to the next or is just a separator between the words someone included in the filename as a description. Some of the files come from Windows systems and the folks sending them to us think they're doing us a favor by replacing the spaces in the Windows filenames with underscores and things like "word - word" with "word_-_word". It's pretty crazy. (Getting them to change their file naming practices would be almost impossible.)

That makes sense. It might still be a safer bet though to use something like __ or -- as your delimiter and process incoming filenames to modify any part of a filename containing that sequence to a reversible value (i.e. one that you can re-process at the other end to obtain the original filename if that's necessary).
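A rough sketch of what such reversible processing could look like, assuming "__" is the delimiter and '%' serves as an escape character. The function names `encode_field` and `decode_field` and the `%25`/`%5F` codes are made up for illustration; nothing here is an existing tool:

```shell
# Hypothetical helpers: make a user-supplied name safe to embed as one
# field when "__" is the field delimiter, and undo that later.

# Escape '%' first so the escape codes themselves stay unambiguous, then
# escape every "__" pair, then a leading or trailing "_" so the field
# cannot merge with a neighbouring delimiter.
encode_field() {
    printf '%s\n' "$1" | sed -e 's/%/%25/g' \
                             -e 's/__/%5F%5F/g' \
                             -e 's/^_/%5F/' \
                             -e 's/_$/%5F/'
}

# Reverse the steps: restore underscores first, then literal '%'.
decode_field() {
    printf '%s\n' "$1" | sed -e 's/%5F/_/g' -e 's/%25/%/g'
}
```

For example, `encode_field 'word_-_word__draft'` prints `word_-_word%5F%5Fdraft`, and `decode_field` turns that back into the original name. Single underscores inside a name are left alone, so the common "word_-_word" style survives untouched.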

As regards +, see here: http://www.tldp.org/LDP/abs/html/special-chars.html

schneidz · 08-23-2016, 02:07 PM

Code:

[schneidz@hyper ~]$ scp hello:world.txt schneidz@mom:Documents
ssh: Could not resolve hostname hello: Name or service not known
[schneidz@hyper ~]$ scp ./hello:world.txt schneidz@mom:Documents
hello:world.txt                                                                100%   12     0.0KB/s   00:00

rnturn · 08-24-2016, 01:30 PM

Quote:

Originally Posted by schneidz

Code:

[schneidz@hyper ~]$ scp hello:world.txt schneidz@mom:Documents
ssh: Could not resolve hostname hello: Name or service not known
[schneidz@hyper ~]$ scp ./hello:world.txt schneidz@mom:Documents
hello:world.txt                                                                100%   12     0.0KB/s   00:00

So "scp" is another utility that would rule out using ':'. Now that I think of it, "cpio" would be another. Luckily, it would be a little unusual for us to have to use either of those utilities. But tar's interpreting the ':' as part of a hostname was already a show-stopper for using colons.

grail · 08-24-2016, 02:01 PM

I think schneidz's example is supposed to demonstrate that even with a colon in the name, if you use the relative or full path to the file then commands like scp can still do their jobs correctly. The second call, with './' in front, shows that the command has worked.
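If a script is building the command line, the './' trick can be applied automatically. A minimal sketch, assuming a hypothetical helper named `local_path` (not part of scp or tar). GNU tar should accept the same form, since it only treats an archive name as remote when the colon comes before any slash, and it also has a `--force-local` option for the same purpose:

```shell
# Hypothetical helper: turn a bare file name into an explicit path so that
# a ':' in it is not parsed as a "host:" prefix (and a leading '-' is not
# parsed as an option).
local_path() {
    case "$1" in
        /*|./*|../*) printf '%s\n' "$1" ;;    # already an explicit path
        *)           printf './%s\n' "$1" ;;  # bare name: anchor it to the current directory
    esac
}

# Usage (not run here):
#   scp "$(local_path 'hello:world.txt')" schneidz@mom:Documents
#   tar -xf "$(local_path 'field1:field2.tar')"
```

This only helps where a script controls the arguments; someone typing the command by hand still has to remember the prefix.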

rnturn · 08-24-2016, 02:09 PM

Quote:

Originally Posted by hydrurga

... a safer bet though to use something like __ or -- as your delimiter and process incoming filenames to modify any part of a filename containing that sequence to a reversible value (i.e. one that you can re-process at the other end to obtain the original filename if that's necessary).

As regards +, see here: http://www.tldp.org/LDP/abs/html/special-chars.html

I considered that. I'm not keen on adding the preprocessing needed to do that -- if only "cut" allowed multi-character delimiters. We could do it if needed, as it might be something we could compartmentalize in the functions we already source to deal with filenames. It could get tricky, as empty fields are allowed and we'd have to be certain that strings like "----" were properly replaced *everywhere* with whatever delimiter we choose to get us through the "cut" process:

Code:

NEWSTR=$( echo "${ORIGSTR}" | sed 's/--/:/g' )

At the moment, switching delimiters is a one-line change. Sort of like to keep it that way if at all possible.
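One way the delimiter could stay a one-line change even with a multi-character delimiter is sketched below. The names `DELIM`, `CUTCHAR` and `get_field` are hypothetical, and the sketch assumes ':' never occurs in a name (it has already been ruled out) and that incoming names never contain the delimiter themselves:

```shell
# One-line change point: the multi-character delimiter used in file names.
# It is used as a sed pattern, so keep it free of regex metacharacters.
DELIM='__'
# Single character handed to cut; ':' is assumed never to occur in names.
CUTCHAR=':'

# get_field N NAME: print the Nth delimiter-separated field of NAME.
get_field() {
    printf '%s\n' "$2" | sed "s/${DELIM}/${CUTCHAR}/g" | cut -d"${CUTCHAR}" -f"$1"
}
```

With the naming layout from the first post, `get_field 4 'field1__field2____field4____field6.extension'` prints `field4`, and the empty third and fifth fields come back as empty strings.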

Yeah, any incoming filenames would have to be screened to disallow the multi-character delimiters. I can already see cases where users might choose to use these, especially "--".

BTW: Thanks for the link to that page. I think I've seen that some time ago but never bookmarked it. I saw nothing on it suggesting that things would blow up if we were to use '+'. It hasn't been a problem with Bash. So far. It's those external utilities I worry about the most.

rnturn · 08-25-2016, 01:33 PM

Quote:

Originally Posted by grail

I think schneidz's example is supposed to demonstrate that even with a colon in the name, if you use the relative or full path to the file then commands like scp can still do their jobs correctly. The second call, with './' in front, shows that the command has worked.

And I'd have to take the phone call from each user who missed that tip and explain it to them.

So far employing double underscores ("__") and a simple "sed" command to translate that into a single character that keeps "cut" satisfied when breaking the filenames up into their components has been working. So far. I'm working on the interface into the project directories and anyone who tries using "__" in a file they want to use in the project area will receive a little message that the repeating underscores are not allowed and that we're renaming that for them. While we might have the occasional user who dons their beret and wants that artistic license to name files whatever they feel like at the moment, I don't expect too much flak from imposing that restriction.
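A minimal sketch of that intake check, assuming the rename simply collapses any run of underscores to a single one. The function name `sanitize_incoming` and the message wording are made up for illustration:

```shell
# Hypothetical intake check: collapse any run of underscores to a single
# '_' and tell the user (on stderr) when the name had to be changed.
sanitize_incoming() {
    clean=$(printf '%s\n' "$1" | sed 's/__*/_/g')
    if [ "$clean" != "$1" ]; then
        printf 'Repeated underscores are not allowed; renaming "%s" to "%s"\n' \
            "$1" "$clean" >&2
    fi
    printf '%s\n' "$clean"
}
```

The cleaned name goes to stdout so it can be captured with `$( ... )`, while the notice goes to stderr. A name such as `word_-_word.txt` passes through unchanged and produces no message.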

grail · 08-25-2016, 01:55 PM

Just a side note, but the addition of the use of cut is a bit pointless when you can just use sed to do all the splitting for you (or of course any one of the other fine tools that all do the same job)
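As an illustration of one of those other tools (awk here rather than sed), a multi-character delimiter can be split directly without translating it first. The function name `field_of` is hypothetical:

```shell
# field_of N NAME: print the Nth "__"-separated field of NAME.
# A multi-character -F value is treated by awk as a regular expression,
# and empty fields between consecutive delimiters are preserved.
field_of() {
    printf '%s\n' "$2" | awk -F'__' -v n="$1" '{ print $n }'
}
```

For instance, `field_of 4 'field1__field2____field4____field6.extension'` prints `field4`.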

rnturn · 08-25-2016, 10:29 PM

Quote:

Originally Posted by grail

Just a side note, but the addition of the use of cut is a bit pointless when you can just use sed to do all the splitting for you (or of course any one of the other fine tools that all do the same job)

"cut" is well understood by all the people (some fairly junior) who might have to get into the innards of any of the scripts that are managing these files. Some of the ways the various components of these filenames could be extracted and assigned to variables could surely be done by other tools ("awk" comes to mind, and "sed" as you mentioned) but I think keeping it simple is a good idea even if it is a little more verbose.

Have a good one...