LinuxQuestions.org
Old 08-22-2016, 07:35 PM   #1
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,800

Rep: Reputation: 550
What characters in Linux (or UNIX) filename will cause problems?


I'm working on a utility that will rename project files to use a standardized naming format and I ran into a problem with a particular character: ":".

The project file naming will have several fields, some of which must be filled in and others that are only used on occasion (version numbers, if necessary, etc.) so successive delimiters are going to be a common sight while navigating the project directories.

Code:
field1:field2::field4::field6.extension
Originally I lobbied for colons (':') for this as they made it fairly easy to see the individual fields. This became a problem when the project file was a tar archive and you tried extracting files from it: tar interpreted everything before the first ":" as a hostname. Oops! Forgot about that feature.

We're down to two characters that we don't think will cause problems when we manipulate these files with standard utilities: '%' and '+'.

Before we take the plunge, can anyone think of any utility that might balk or assume the file is something more than a plain-old file when it sees a '%' or a '+' in a file name?

TIA...

--
Rick

Last edited by rnturn; 08-22-2016 at 07:37 PM.
 
Old 08-23-2016, 03:00 AM   #2
TenTenths
Senior Member
 
Registered: Aug 2011
Location: Dublin
Distribution: Centos 5 / 6 / 7
Posts: 3,475

Rep: Reputation: 1553
I wrote this a while back (December 2014 apparently) when I had to deal with files being generated in linux but being used on Windows machines. You may find it useful.

https://centos.tips/fixing-troublesome-filenames/
 
Old 08-23-2016, 03:17 AM   #3
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Command processors like Bash may interpret % as a special symbol?

Why were '-' and '_' rejected? These are so common in filenames that most applications won't treat them as special.
 
Old 08-23-2016, 03:21 AM   #4
TenTenths
Senior Member
 
Registered: Aug 2011
Location: Dublin
Distribution: Centos 5 / 6 / 7
Posts: 3,475

Rep: Reputation: 1553
Quote:
Originally Posted by hydrurga View Post
Why were '-' and '_' rejected? These are so common in filenames that most applications won't treat them as special.
I've no issue with - and _ and indeed my script uses _ as a replacement for spaces (as I totally hate spaces in filenames!)
 
Old 08-23-2016, 11:28 AM   #5
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,800

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by hydrurga View Post
Command processors like Bash may interpret % as a special symbol?

Why were '-' and '_' rejected? These are so common in filenames that most applications won't treat them as special.
Primarily because they ARE so common in filenames. We could use them as field delimiters -- and they were our first choices -- except that the files we're absorbing into the project contain TONS of those characters, which makes it impossible to determine whether a character denotes the transition from one field to the next or is just a separator between the words someone included in the filename as a description. Some of the files come from Windows systems and the folks sending them to us think they're doing us a favor by replacing the spaces in the Windows filenames with underscores and things like "word - word" with "word_-_word". It's pretty crazy. (Getting them to change their file naming practices would be almost impossible.)
 
Old 08-23-2016, 11:49 AM   #6
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,800

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by TenTenths View Post
I wrote this a while back (December 2014 apparently) when I had to deal with files being generated in linux but being used on Windows machines. You may find it useful.

https://centos.tips/fixing-troublesome-filenames/
Interesting. I think we've all had to jump through this or a similar hoop at one time or another. I'm (and others on the team, frankly) not too keen on renaming the files too far from their original names.

File renaming isn't our real problem. The real question is: if we break apart the filename and inject some project-related information into it, what character can we choose as a delimiter that won't cause some common utility to misinterpret the file name in some way? Like "tar" did when it saw a colon in a tar archive filename.

So far we haven't encountered anyone wanting to add files to the project with '+' in the filename so I'm leaning in that direction. (Anyone who starts using '+' might find their files renamed: '+' -> '_plus_'.) I am concerned, though, that there had to be SOME reason the C++ folks tend to name files using 'cxx' instead of 'c++'. If that was a Windows thing, then maybe we're OK to adopt that.
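
For what it's worth, the '+' -> '_plus_' renaming could be sketched roughly as below. This is only an illustration: the function name `replace_plus` is made up here, and it assumes the incoming name contains no newline.

```shell
# Hypothetical helper (not from the actual project scripts): rewrite every '+'
# in an incoming file name as '_plus_' so '+' stays reserved for the delimiter.
replace_plus() {
    # In a basic regular expression '+' is an ordinary character,
    # so it needs no escaping in the sed pattern.
    printf '%s\n' "$1" | sed 's/+/_plus_/g'
}

replace_plus 'budget+forecast.xls'
```

Keeping the substitution in one small function means the replacement text can be changed in a single place if '_plus_' turns out to be a poor choice.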

Last edited by rnturn; 08-23-2016 at 11:51 AM. Reason: grammar-challenged this morning
 
Old 08-23-2016, 12:33 PM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
Well I do not have a suggestion to come up with an arbitrary symbol which may or may not impact any number of commands, but my first thought was whether or not it is really necessary to have quite so much detail in the name. Surely much of this information could be garnered by actually reading the file. Though at first this seems like an unnecessary rant, my suggestion would be to perhaps simply give the files more anatomical names or maybe just numbers, but then have a reference file that contains a mapping to give you all the extra guff you currently require. In this way you get what you want and can just as easily use any delimiter you like in the reference file ... just a thought
 
Old 08-23-2016, 01:53 PM   #8
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Quote:
Originally Posted by rnturn View Post
Primarily because they ARE so common in filenames. We could use them as field delimiters -- and they were our first choices -- except that the files we're absorbing into the project contain TONS of those characters, which makes it impossible to determine whether a character denotes the transition from one field to the next or is just a separator between the words someone included in the filename as a description. Some of the files come from Windows systems and the folks sending them to us think they're doing us a favor by replacing the spaces in the Windows filenames with underscores and things like "word - word" with "word_-_word". It's pretty crazy. (Getting them to change their file naming practices would be almost impossible.)
That makes sense. It might still be a safer bet though to use something like __ or -- as your delimiter and process incoming filenames to modify any part of a filename containing that sequence to a reversible value (i.e. one that you can re-process at the other end to obtain the original filename if that's necessary).
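
To make the "reversible value" idea concrete, here is one possible sketch. Everything in it is an assumption for illustration: the function names `escape_field` and `unescape_field` are invented, and the `%25` / `%5F` codes are simply borrowed from URL-style percent-encoding. Note that it puts '%' into the stored names, which may or may not be acceptable given the concern about '%' raised earlier.

```shell
# Hypothetical helpers: make a field safe to join with a "__" delimiter and
# recover the original text afterwards. Assumes names contain no newline.
escape_field() {
    # 1. protect literal '%'       2. break up every "__" inside the field
    # 3./4. hide a leading or trailing '_' so it cannot merge with a delimiter
    printf '%s\n' "$1" | sed -e 's/%/%25/g' -e 's/__/_%5F/g' \
                             -e 's/^_/%5F/' -e 's/_$/%5F/'
}

unescape_field() {
    # Undo the steps in reverse order: underscores first, then '%'.
    printf '%s\n' "$1" | sed -e 's/%5F/_/g' -e 's/%25/%/g'
}

escaped=$(escape_field 'draft__final_')
printf '%s\n' "$escaped"
unescape_field "$escaped"
```

After escaping, a field never contains "__" and never starts or ends with '_', so every "__" in the assembled name should be a genuine delimiter, including the doubled-up ones that mark empty fields.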

As regards +, see here: http://www.tldp.org/LDP/abs/html/special-chars.html
 
Old 08-23-2016, 02:07 PM   #9
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

Rep: Reputation: 918
Code:
[schneidz@hyper ~]$ scp hello:world.txt schneidz@mom:Documents
ssh: Could not resolve hostname hello: Name or service not known
[schneidz@hyper ~]$ scp ./hello:world.txt schneidz@mom:Documents
hello:world.txt                                                                100%   12     0.0KB/s   00:00
 
Old 08-24-2016, 01:30 PM   #10
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,800

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by schneidz View Post
Code:
[schneidz@hyper ~]$ scp hello:world.txt schneidz@mom:Documents
ssh: Could not resolve hostname hello: Name or service not known
[schneidz@hyper ~]$ scp ./hello:world.txt schneidz@mom:Documents
hello:world.txt                                                                100%   12     0.0KB/s   00:00
So "scp" is another utility that would rule out using ':'. Now that I think of it, "cpio" would be another. Luckily, it would be a little unusual for us to have to use either of those utilities. But tar's treating everything before the ':' as a hostname was already a show-stopper for using colons.
 
Old 08-24-2016, 02:01 PM   #11
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
I think schneidz's example is supposed to demonstrate that even with a colon in the name, if you use the relative or full path to the file then commands like scp can still do their jobs correctly. The second call, with './' in front, shows that the command worked.
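
If that workaround were wrapped up for scripts, it might look something like the sketch below. The name `as_local_path` is hypothetical. The idea is only that scp (and, as far as I know, GNU tar, which also offers a `--force-local` option) stops treating the text before ':' as a host once a slash appears ahead of the colon.

```shell
# Hypothetical helper: make sure a file argument has a slash before any colon
# so tools that parse host:path syntax treat it as a local file.
as_local_path() {
    case "$1" in
        /*|./*|../*) printf '%s\n' "$1" ;;    # already path-qualified, leave alone
        *)           printf './%s\n' "$1" ;;  # bare or relative name: prefix './'
    esac
}

as_local_path 'hello:world.txt'
# Usage, with the remote side taken from the example above:
#   scp "$(as_local_path 'hello:world.txt')" schneidz@mom:Documents
```

This does nothing for people typing the names by hand, of course; it only helps where the file name passes through a script first.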
 
Old 08-24-2016, 02:09 PM   #12
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,800

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by hydrurga View Post
... a safer bet though to use something like __ or -- as your delimiter and process incoming filenames to modify any part of a filename containing that sequence to a reversible value (i.e. one that you can re-process at the other end to obtain the original filename if that's necessary).

As regards +, see here: http://www.tldp.org/LDP/abs/html/special-chars.html
I considered that. I'm not keen on adding the preprocessing needed to do that -- if only "cut" allowed multi-character delimiters. We could do that if needed, as it might be something we could compartmentalize into the functions we already source to deal with filenames. It could get tricky as empty fields are allowed and we'd have to be certain that strings like "----" were properly replaced *everywhere* with whatever delimiter we choose to get us through the "cut" process:
Code:
NEWSTR=$( echo "${ORIGSTR}" | sed 's/--/:/g' )
At the moment, switching delimiters is a one-line change. Sort of like to keep it that way if at all possible.

Yeah, any incoming filenames would have to be screened to disallow the multi-character delimiters. I can already see cases where users might choose to use these, especially "--".

BTW: Thanks for the link to that page. I think I saw it some time ago but never bookmarked it. I saw nothing on it suggesting that things would blow up if we were to use '+'. It hasn't been a problem with Bash. So far. It's those external utilities I worry about the most.
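
Putting the "sed" translation and "cut" together, the field extraction might be sketched like this. The names `FIELD_SEP`, `CUT_SEP` and `get_field` are invented for the example, the sample file name is made up, and it assumes the single character handed to "cut" never occurs in the file names themselves.

```shell
FIELD_SEP='__'   # delimiter as it appears in the file names
CUT_SEP=':'      # single character handed to cut; assumed absent from the names

# Hypothetical helper: print field number $2 (1-based) of file name $1.
get_field() {
    printf '%s\n' "$1" \
        | sed "s/${FIELD_SEP}/${CUT_SEP}/g" \
        | cut -d "${CUT_SEP}" -f "$2"
}

get_field 'proj__report____v2.txt' 1   # first field
get_field 'proj__report____v2.txt' 3   # empty field: prints an empty line
get_field 'proj__report____v2.txt' 4   # last field, extension still attached
```

With the two separators held in variables, switching delimiters stays a one-line change. Stripping the extension from the last field is left out of this sketch.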
 
Old 08-25-2016, 01:33 PM   #13
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,800

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by grail View Post
I think schneidz's example is supposed to demonstrate that even with a colon in the name, if you use the relative or full path to the file then commands like scp can still do their jobs correctly. The second call, with './' in front, shows that the command worked.
And I'd have to take the phone call from each user who missed that tip and explain it to them.

So far employing double underscores ("__") and a simple "sed" command to translate that into a single character that keeps "cut" satisfied when breaking the filenames up into their components has been working. So far. I'm working on the interface into the project directories and anyone who tries using "__" in a file they want to use in the project area will receive a little message that the repeating underscores are not allowed and that we're renaming that for them. While we might have the occasional user who dons their beret and wants that artistic license to name files whatever they feel like at the moment, I don't expect too much flak from imposing that restriction.
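
The screening step described above could be sketched along these lines. The function name `screen_name` and the wording of the message are placeholders, not what the real interface uses, and collapsing runs of underscores to a single one is just one possible renaming policy.

```shell
# Hypothetical helper: reject "__" in an incoming file name by collapsing any
# run of underscores to a single '_', and tell the user about it on stderr.
screen_name() {
    case "$1" in
        *__*)
            printf 'Note: repeated underscores are not allowed; renaming "%s".\n' "$1" >&2
            # '__*' matches one or more underscores; each run becomes one '_'.
            printf '%s\n' "$1" | sed 's/__*/_/g'
            ;;
        *)
            printf '%s\n' "$1"   # nothing to fix, pass the name through
            ;;
    esac
}

screen_name 'my__notes___v1.txt'
```

Sending the notice to stderr keeps the cleaned name alone on stdout, so the function can be used inside a command substitution.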
 
Old 08-25-2016, 01:55 PM   #14
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
Just a side note, but the addition of the use of cut is a bit pointless when you can just use sed to do all the splitting for you (or of course any one of the other fine tools that all do the same job)
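
As an example of one of those other tools: awk accepts a multi-character field separator, so the "__" delimiter can be split on directly without translating it to a single character first. A minimal sketch, with the function name `get_field_awk` and the sample file name made up for illustration:

```shell
# Hypothetical helper: print field number $2 (1-based) of file name $1,
# splitting directly on the two-character delimiter "__".
get_field_awk() {
    # A multi-character -F value is treated as a regular expression;
    # "__" contains no special characters, so it matches literally.
    printf '%s\n' "$1" | awk -F '__' -v n="$2" '{ print $n }'
}

get_field_awk 'proj__report____v2.txt' 2
get_field_awk 'proj__report____v2.txt' 4
```

Empty fields are preserved here, since only awk's default whitespace separator merges consecutive delimiters.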
 
Old 08-25-2016, 10:29 PM   #15
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,800

Original Poster
Rep: Reputation: 550
Quote:
Originally Posted by grail View Post
Just a side note, but the addition of the use of cut is a bit pointless when you can just use sed to do all the splitting for you (or of course any one of the other fine tools that all do the same job)
"cut" is well understood by all the people (some fairly junior) who might have to get into the innards of any of the scripts that are managing these files. Some of the ways the various components of these filenames could be extracted and assigned to variables could surely be done by other tools ("awk" comes to mind, and "sed" as you mentioned) but I think keeping it simple is a good idea even if it is a little more verbose.

Have a good one...
 