Handling all file names safely in Bash
Posted 02-11-2012 at 12:09 PM by Nominal Animal
Updated 04-16-2012 at 02:41 AM by Nominal Animal (Fuller fixes.)
In Linux, each file or directory name (or more generally, pathname component) is just a string of bytes. It always ends with the C end-of-string mark, ASCII NUL: a zero. Value 47, ASCII /, is also reserved, for use as the separator between pathname components. Every other byte value, newline included, can appear in a name.
Bash can read ASCII NUL separated data using read -d "" variable. It will, however, remove leading and trailing characters that match IFS, and return false (nonzero status) if the input does not have a final NUL. This applies to all separators, not just NUL.
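Both effects are easy to see in a small experiment. This is only a minimal sketch: the helper name read_records and the sample data are made up for illustration. It reads NUL-separated records from standard input, once with the default IFS and once with IFS cleared, and keeps a final record that lacks its NUL.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read NUL-separated records from stdin into the
# global array 'records'. With keep_ifs=1 the default IFS stays in
# effect, so read trims leading and trailing whitespace from each record.
read_records() {
    local keep_ifs=$1 rec
    records=()
    while true; do
        rec=""
        if [ "$keep_ifs" = 1 ]; then
            read -r -d "" rec || [ -n "$rec" ] || break
        else
            IFS="" read -r -d "" rec || [ -n "$rec" ] || break
        fi
        records+=("$rec")
    done
}

# Two records: ' a ' ends with a NUL, 'b' has no final NUL.
read_records 1 < <(printf ' a \0b')
trimmed=("${records[@]}")
read_records 0 < <(printf ' a \0b')
exact=("${records[@]}")
printf '[%s]\n' "${trimmed[@]}" "${exact[@]}"
```

With the default IFS the first record comes back as "a"; with IFS cleared it keeps both spaces. In both runs the unterminated "b" is still collected, because the loop only stops when read fails and the variable is empty.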
Bash arrays and variables work perfectly well with all possible pathnames, as long as you quote the variables properly.
A major exception is related to character sets. Many C library functions treat invalid UTF-8 sequences as fatal errors. This means that you cannot reliably handle non-UTF-8 file names (or strings) in a UTF-8 locale. This is not really a problem; you only need to remember to set the POSIX locale, LANG=C LC_ALL=C. Also note that in UTF-8 locales, Bash counts characters, not bytes, when evaluating e.g. string length. In the POSIX locale, characters are just bytes.
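To illustrate the difference between counting characters and counting bytes, here is a small sketch. The function name byte_length is hypothetical; it relies on Bash re-reading the locale when LC_ALL is assigned, even as a local variable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the length of its argument in bytes,
# regardless of the locale the script runs in.
byte_length() {
    local LC_ALL=C LANG=C   # local, so the caller's locale is untouched
    printf '%s\n' "${#1}"
}

name=$'caf\xc3\xa9'         # "cafe" with an accented e, UTF-8: 5 bytes
bytes=$(byte_length "$name")
echo "length in current locale: ${#name}, bytes: $bytes"
```

In a UTF-8 locale the first number should be the character count, while byte_length reports 5 in any locale.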
Before we get to the point, there are a couple of details in Bash array handling I'd like to point out. The default array constructor is slow, but extremely reliable: arrayvariable=("${arrayvariable[@]}" "$newitem") adds newitem to the end of the array in arrayvariable, and also reindexes the array, starting indices from zero. If you say unset 'arrayvariable[3]' you will get a hole in the array (and the array length will of course decrease by one); you need to run arrayvariable=("${arrayvariable[@]}") to reindex the array after the deletion(s) to get rid of the hole.
If you know your array does not have holes, arrayvariable[${#arrayvariable[@]}]="$newitem" is much faster than the previous method (on my machine, 300 times faster). If the array has holes, it may overwrite an existing value, though.
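The hole problem can be shown with a short sketch. The array contents are made up; the point is only how the indices behave.

```shell
#!/usr/bin/env bash
arr=(zero one two three four)
unset 'arr[3]'                    # leaves a hole at index 3
count_with_hole=${#arr[@]}        # 4 elements, at indices 0 1 2 4
indices_with_hole="${!arr[*]}"

# Appending at index ${#arr[@]} (which is 4) overwrites "four".
arr[${#arr[@]}]="new"
after_overwrite="${arr[4]}"

# Reindex with the array constructor; then the fast append is safe again.
arr=("${arr[@]}")
arr[${#arr[@]}]="five"
indices_final="${!arr[*]}"
echo "$indices_with_hole / $after_overwrite / $indices_final"
```

As far as I know, Bash 3.1 and later also accept arr+=("$newitem"), which appends after the highest existing index and so is not affected by holes.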
Back to the issue at hand.
There are two main methods of getting pathnames into a Bash array: globbing, and using GNU find with ASCII NUL delimiters.
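Globbing is extremely simple: set the shell options you need, then assign the pattern to an array, for example files=(*.c) for the C sources or dirsonly=(*/) for the subdirectories of the current directory.
The shell options (set +f to enable filename globbing, set -B to enable brace expansion, shopt -s dotglob to include names that start with a dot, and shopt -s nullglob) only need to be set once, at the start of your script. The nullglob shell option is critical: it is unset by default, in which case a pattern that does not match any files evaluates to the pattern itself. With nullglob set, such a pattern expands to nothing, leaving the array empty.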
The syntax looks deceptively simple, but just remember: Bash takes the glob pattern, *.c (or multiple glob patterns like *.[ch] *.cpp) and expands it into the files array. Word splitting (IFS) is not performed in this case.
(Word splitting would be done if you emitted the names using some command. Here, there is no command, just a glob pattern Bash interprets. It is a very big difference, and should be remembered.)
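A quick way to convince yourself is to try it in a scratch directory. This is a sketch under the assumption that mktemp -d is available; all file names are made up.

```shell
#!/usr/bin/env bash
# Scratch directory with awkward names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 'plain.c' 'two words.c' $'new\nline.c' '.hidden.c'
mkdir 'sub dir'

shopt -s dotglob nullglob
files=(*.c)          # one array element per name, no word splitting
dirsonly=(*/)        # only directories match the trailing slash
nomatch=(*.nosuch)   # nullglob: expands to nothing, array stays empty

echo "${#files[@]} files, ${#dirsonly[@]} dirs, ${#nomatch[@]} unmatched"
cd / && rm -rf "$workdir"
```

Each name, including the one with an embedded newline, ends up as exactly one array element.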
Bash version 4 and later can do recursive globbing, too, after shopt -s globstar: files=(**/*.c) collects matching names at any depth, filesdirs=(**) collects everything, and tree=(**/) collects only the directories.
It is easiest to grasp if you consider ** to mean anything in this directory or its subdirectories. Thus, **/*.c means any file or directory ending with .c in the current directory or in any directory under it. Note that the results do not include the current directory itself, only its descendants.
Note that **.c will not look for files ending with .c in all directories. In current Bash versions, ** is only special when it forms a whole pathname component by itself; otherwise it acts like a single *, so **.c matches the same names as *.c.
Since only directories match a trailing slash, **/ matches all subdirectories and their subdirectories and so on, recursively.
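The three patterns can be compared side by side. This is a sketch assuming Bash 4 or later and a scratch directory from mktemp -d; the directory layout is made up.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p 'a/b c'
touch top.c a/mid.c 'a/b c/deep.c' a/readme.txt

shopt -s nullglob globstar
files=(**/*.c)   # three matches, including top.c in the current directory
tree=(**/)       # the two directories, each with a trailing slash
flat=(**.c)      # ** is not a whole component here, so this is like *.c

printf '%s\n' "${files[@]}"
cd / && rm -rf "$workdir"
```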
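To use the arrays thus generated, you must use proper quotes to keep whitespace intact, as in command "${files[@]}", which passes each array element to command as a separate argument.
Otherwise, Bash would take the array, expand it, then re-split the expanded words (and try to glob-expand the results). You don't want that, so do remember the double quotes.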
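The difference is easy to measure. In this sketch, count_args is a hypothetical stand-in for the real command; it only reports how many arguments it received.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for "command": prints its argument count.
count_args() { echo "$#"; }

files=('one.c' 'two words.c' 'three  spaces here.c')
quoted=$(count_args "${files[@]}")   # one argument per element
unquoted=$(count_args ${files[@]})   # elements re-split at whitespace
echo "quoted: $quoted, unquoted: $unquoted"
```

With the default IFS, the quoted form yields 3 arguments and the unquoted form 6.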
The second method is to use GNU find to list the desired files using ASCII NULs as separators. You can either use the -print0 option to print the full paths to the files, or -printf 'pattern\0' to fine-tune the output. The latter is especially useful if you want e.g. the modification timestamps of the files, too.
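Because the commands in a pipeline | run in subshells by default, and changes to Bash variables in a subshell do not propagate to the parent, you'll want to use input redirection from a process substitution instead: a while loop that ends with done < <( find . -type f -print0 ) runs in the current shell. A safe version sets the locale to POSIX first, reads the records in the loop, and restores the locale afterwards.
There is a difference between a value in the LANG and LC_ALL variables, even an empty one, and them being unset. To retain them in their exact state after the loop, save each one beforehand only if it is set (test -v LANG && OLD_LANG="$LANG" || unset OLD_LANG), and afterwards either restore the saved value or unset the variable (test -v OLD_LANG && LANG="$OLD_LANG" || unset LANG). Note that test -v needs Bash 4.2 or later.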
In the loop, we clear the file variable (file=""), so we can detect whether or not read did read a record. Then, we clear the IFS temporarily, so that Bash will not trim the file name, and read the record: IFS="" read -rd "" file || [ -n "$file" ] || break. If the record does not have a final ASCII NUL, read will return false, but file will not be empty. This stanza breaks out of the loop only when there is no more data to read.
We then append each file to the files array with files[${#files[@]}]="$file".
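Wrapped in a function, the loop can be tried against a scratch directory. This is a minimal sketch, not the only way to do it: the name collect_files is hypothetical, and the locale is set only for find here to keep the example short.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the global array 'files' with every regular
# file under directory $1, using the NUL-delimited read loop.
collect_files() {
    local file
    files=()
    while true; do
        file=""
        IFS="" read -r -d "" file || [ -n "$file" ] || break
        files[${#files[@]}]="$file"
    done < <(LC_ALL=C find "$1" -type f -print0)
}

workdir=$(mktemp -d)
touch "$workdir/plain" "$workdir/ leading space" "$workdir/"$'new\nline'
collect_files "$workdir"
echo "found ${#files[@]} files"
rm -rf "$workdir"
```

If you can rely on Bash 4.4 or later, readarray -d '' files < <(find ...) should give the same result without the explicit loop.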
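I use this GNU find-based approach very often. You can even prepend the file names with other useful data, and extract them using Bash string operations. For example, find . -type f -printf '%TY%Tm%Td %TT %s %p\0' emits the modification date, modification time, size in bytes, and path of each file, separated by spaces; the loop then peels the fields off one at a time with ${item%% *} (first word of item) and ${item#* } (item with first word removed).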
Note that although I used space as the separator in the find output, I could just as well have used /, |, or just about any other character.
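The field extraction can be checked on a literal string, without running find at all. The record below is made up, in the layout "date time size path".

```shell
#!/usr/bin/env bash
# A made-up record; the path contains spaces, including a double space.
item='20120211 12:09:33.0000000000 4096 ./dir with spaces/a  b.c'

date="${item%% *}"   # first word of item
rest="${item#* }"    # item with first word removed
time="${rest%% *}"   # second word of item
rest="${rest#* }"
size="${rest%% *}"   # third word of item
name="${rest#* }"    # everything else, inner spaces untouched
printf '%s|%s|%s|%s\n' "$date" "$time" "$size" "$name"
```

Only the first three separators are consumed, so the spaces inside the path survive intact.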
(Don't use ! as a separator, though. set -H enables history substitution with !, and is the default in interactive shells. When enabled, a ! even in a double-quoted string is special to Bash. Also, ? and * are globbing characters, and must be escaped using \ if used in a string manipulation operation. In general, ! \ ? $ * [ ] { } ( ) should be avoided in string operations, because they're difficult to remember to escape correctly.)
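To show why such separators are error-prone, here is a small sketch with * as the separator. The record is made up.

```shell
#!/usr/bin/env bash
item='date*time*name with * inside'

wrong="${item#**}"     # both * are wildcards: shortest match is empty
right="${item#*\*}"    # wildcard, then a literal *: removes "date*"
first="${item%%\**}"   # literal *, then wildcard: keeps only "date"
printf '%s\n' "$wrong" "$right" "$first"
```

The unescaped version silently does nothing, which is exactly the kind of mistake that is hard to spot later.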
Finally, GNU awk (gawk) can also process ASCII NUL-delimited data, but this is a GNU extension, not supported by all awks. Just set RS="\0", and optionally FS="\0" too if you don't need field splitting. You'll want to set the locale explicitly to POSIX (using LANG=C LC_ALL=C awk ...) or it may abort if it encounters a non-UTF-8 sequence in a UTF-8 locale.
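A minimal sketch of the gawk side, assuming gawk is installed; the helper name record_lengths is hypothetical. It prints the byte length of each NUL-delimited record on standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the length in bytes of each NUL-delimited
# record read from stdin, one per line. Requires GNU awk.
record_lengths() {
    LANG=C LC_ALL=C gawk 'BEGIN { RS = "\0" } { print length($0) }'
}

if command -v gawk >/dev/null 2>&1; then
    # Two records; the second one contains a newline.
    lengths=$(printf 'a b\0two\nlines\0' | record_lengths)
    echo "$lengths"
fi
```

The embedded newline is counted as part of the second record, because only NUL ends a record here.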
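Code (appending with the array constructor, which also reindexes):
arrayvariable=("${arrayvariable[@]}" "$newitem")
Code (appending at the next index; only safe if the array has no holes):
arrayvariable[${#arrayvariable[@]}]="$newitem"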
Code (globbing into arrays):
set +f              # Enable filename globbing
set -B              # Enable brace expansion
shopt -s dotglob    # Include names that start with a dot
shopt -s nullglob   # No match expands to nothing (zero words)
files=(*.c)
dirsonly=(*/)
Code (recursive globbing, Bash 4 and later):
set +f              # Enable filename globbing
set -B              # Enable brace expansion
shopt -s dotglob    # Include names that start with a dot
shopt -s nullglob   # No match expands to nothing (zero words)
shopt -s globstar   # ** does recursive globbing
files=(**/*.c)
filesdirs=(**)
tree=(**/)
Code (passing the array elements to a command):
command "${files[@]}"
Code (reading find output into an array):
# Make sure we don't abort on non-UTF-8 byte sequences.
test -v LANG && OLD_LANG="$LANG" || unset OLD_LANG ; LANG=C
test -v LC_ALL && OLD_LC_ALL="$LC_ALL" || unset OLD_LC_ALL ; LC_ALL=C
# Clear the files array
files=()
while [ 1 ]; do
    file=""
    IFS="" read -rd "" file || [ -n "$file" ] || break
    files[${#files[@]}]="$file"
done < <( find . -type f -print0 )
#
# Do something with the file paths in the files array
#
# Revert to original locale.
test -v OLD_LANG && LANG="$OLD_LANG" || unset LANG
test -v OLD_LC_ALL && LC_ALL="$OLD_LC_ALL" || unset LC_ALL
Code (reading names, dates, and sizes from find output):
# Make sure we don't abort on non-UTF-8 byte sequences.
test -v LANG && OLD_LANG="$LANG" || unset OLD_LANG ; LANG=C
test -v LC_ALL && OLD_LC_ALL="$LC_ALL" || unset OLD_LC_ALL ; LC_ALL=C
# count is the number of files,
# names contains the file names (indices 0 to count-1)
# dates contains the modification dates,
# sizes the file sizes in bytes.
count=0
names=()
dates=()
sizes=()
while [ 1 ]; do
    item=""
    IFS="" read -rd "" item || [ -n "$item" ] || break
    time="${item#* }"    # item with first word removed
    date="${item%% *}"   # first word of item
    size="${time#* }"    # time with first word removed
    time="${time%% *}"   # first word of time (second word of item)
    name="${size#* }"    # size with first word removed
    size="${size%% *}"   # first word of size (third word of item)
    # date=YYYYMMDD
    # time=HH:MM:SS or HH:MM:SS.fractions
    # size=file size in bytes
    # To skip adding this file, use 'continue'
    # Add this to the arrays
    names[count]="$name"
    dates[count]="$date"
    sizes[count]="$size"
    count=$((count + 1))
done < <( find . -type f -printf '%TY%Tm%Td %TT %s %p\0' )
#
# Do something with the names, dates, and sizes arrays,
# indices 0 (first file) to count-1 (last file).
#
# Revert to original locale.
test -v OLD_LANG && LANG="$OLD_LANG" || unset LANG
test -v OLD_LC_ALL && LC_ALL="$OLD_LC_ALL" || unset LC_ALL