LinuxQuestions.org
Handling all file names safely in Bash

Posted 02-11-2012 at 12:09 PM by Nominal Animal
Updated 04-16-2012 at 02:41 AM by Nominal Animal (Fuller fixes.)
Tags bash, filename, nul

In Linux, each file or directory name (or more generally, pathname component) is just a string of bytes. It always ends with the C end-of-string mark, ASCII NUL: a zero byte. Value 47, ASCII /, is also reserved: it is the separator between pathname components. A name can therefore contain any other byte, including spaces and newlines.

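Bash can read ASCII NUL separated data using read -r -d "" variable. (The -r option stops read from treating backslashes as escape characters.) It will, however, remove leading and trailing characters that match IFS unless IFS is empty, and return false (nonzero status) if the input does not have a final NUL, even though the variable still receives the unterminated record. This applies to all separators, not just NUL.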
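A minimal sketch of the behaviour described above, assuming Bash 3.1 or later. The sample records and the variable names records, rec and trimmed are made up for illustration. It feeds read two NUL-terminated records and one unterminated record with IFS cleared, then reads one record with the default IFS:

```bash
#!/usr/bin/env bash
# Hypothetical input: two NUL-terminated records and a final one without a NUL.
records=()
while IFS= read -r -d '' rec || [ -n "$rec" ]; do
    # IFS is empty for this read, so surrounding whitespace is kept.
    records+=("$rec")
done < <(printf '  padded  \0second\0unterminated')

# With the default IFS (space, tab, newline), read trims the whitespace.
IFS=$' \t\n' read -r -d '' trimmed < <(printf '  padded  \0')

printf 'records read: %d\n' "${#records[@]}"
printf '<%s>\n' "${records[@]}" "$trimmed"
```

The || [ -n "$rec" ] part is what keeps the unterminated last record from being dropped.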

Bash arrays and variables work perfectly well with all possible pathnames, as long as you quote the variables properly.

A major exception is related to character sets. Many C library functions treat invalid UTF-8 sequences as fatal errors. This means that you cannot reliably handle non-UTF-8 file names (or strings) in a UTF-8 locale. This is not really a problem; you only need to remember to set the POSIX locale, LANG=C LC_ALL=C. Also note that in UTF-8 locales, Bash counts characters, not bytes, when evaluating e.g. string length. In the POSIX locale, characters are just bytes.
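To illustrate the characters-versus-bytes point, here is a small sketch. The helper name byte_length and the sample string are assumptions for this example only. It measures a string's length in bytes by switching to the POSIX locale inside a function:

```bash
#!/usr/bin/env bash
# "café" encoded as UTF-8: 5 bytes, 4 characters.
s=$'caf\xc3\xa9'

# Hypothetical helper: print the length of $1 in bytes.
byte_length() {
    # LC_ALL is local to the function, and the function is called in a
    # command substitution (a subshell), so the caller's locale is untouched.
    local LC_ALL=C
    printf '%s\n' "${#1}"
}

bytes=$(byte_length "$s")
printf 'bytes: %s, length in the current locale: %s\n' "$bytes" "${#s}"
```

In a UTF-8 locale the second number should be 4; in the POSIX locale both numbers are the same.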

Before we get to the point, there are a couple of details in Bash array handling I'd like to point out. The default array constructor is slow, but extremely reliable:
Code:
arrayvariable=("${arrayvariable[@]}" "$newitem")
adds newitem to the end of the array in arrayvariable, but also reindexes the array, starting indices from zero. If you run unset 'arrayvariable[3]' you will leave a hole in the array (while the array length will of course decrease by one); you need to run arrayvariable=("${arrayvariable[@]}") after the deletion(s) to reindex the array and get rid of the hole. If you know your array does not have holes,
Code:
arrayvariable[${#arrayvariable[@]}]="$newitem"
is much faster than the previous method (on my machine, 300 times faster). If the array has holes, though, it may overwrite an existing value.
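The hole problem is easy to reproduce. The sketch below uses a made-up four-element array. It also shows the += append operator, which is not used elsewhere in this post: it is available in Bash 3.1 and later, and appends after the highest existing index, so it does not overwrite anything when the array has holes.

```bash
#!/usr/bin/env bash
arr=(a b c d)
unset 'arr[1]'                  # indices are now 0 2 3: a hole at index 1
arr[${#arr[@]}]="X"             # ${#arr[@]} is 3, so this overwrites "d"
after_overwrite=("${arr[@]}")   # a c X

arr=(a b c d)
unset 'arr[1]'
arr+=("X")                      # appends at index 4, after the highest index
after_append=("${arr[@]}")      # a c d X
indices="${!arr[*]}"            # 0 2 3 4 (the hole is still there)

printf '%s\n' "${after_overwrite[*]}" "${after_append[*]}" "$indices"
```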

Back to the issue at hand.

There are two main methods of getting pathnames into a Bash array: globbing, and using GNU find with ASCII NUL delimiters.

Globbing is extremely simple:
Code:
set +f                  # Enable filename globbing
set -B                  # Enable brace expansion
shopt -s dotglob        # Include names that start with a dot
shopt -s nullglob       # A pattern with no matches expands to nothing

files=(*.c)
dirsonly=(*/)
These shell options only need to be set once, at the start of your script. The nullglob shell option is critical: it is unset by default, so a pattern that does not match any files is left as the literal pattern itself.
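Here is a small sketch of what nullglob and dotglob change, run in a scratch directory created with mktemp. The file names and array names are made up for the example:

```bash
#!/usr/bin/env bash
# Hypothetical scratch directory with three .c files, one of them hidden.
tmpdir=$(mktemp -d) || exit 1
touch "$tmpdir/a b.c" "$tmpdir/plain.c" "$tmpdir/.hidden.c"
cd "$tmpdir" || exit 1

shopt -u nullglob dotglob
nomatch_default=(*.nonexistent)    # one element: the literal pattern
visible=(*.c)                      # two elements: .hidden.c is skipped

shopt -s nullglob dotglob
nomatch_nullglob=(*.nonexistent)   # zero elements
files=(*.c)                        # three elements, "a b.c" kept whole

cd - >/dev/null && rm -rf "$tmpdir"
printf '%d %d %d %d\n' "${#nomatch_default[@]}" "${#visible[@]}" \
    "${#nomatch_nullglob[@]}" "${#files[@]}"
```

Without nullglob, a loop over the array would run once with the nonsense name *.nonexistent.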

The syntax looks deceptively simple, but just remember: Bash takes the glob pattern, *.c (or multiple glob patterns like *.[ch] *.cpp) and expands it into the files array. Word splitting (IFS) is not performed in this case.

(Word splitting would be done if you emitted the names using some command, for example through an unquoted command substitution such as $(ls). Here, there is no command, just a glob pattern Bash interprets. It is a very big difference, and should be remembered.)
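The difference can be seen with a name that contains a space. This is only a sketch; the scratch directory and the two file names are made up:

```bash
#!/usr/bin/env bash
tmpdir=$(mktemp -d) || exit 1
touch "$tmpdir/two words.txt" "$tmpdir/single.txt"
cd "$tmpdir" || exit 1

# Glob directly into the array: one element per file.
globbed=(*.txt)

# Emit the same names with a command and capture them unquoted:
# the output is word-split, so "two words.txt" becomes two elements.
resplit=( $(printf '%s\n' *.txt) )

cd - >/dev/null && rm -rf "$tmpdir"
printf 'globbed: %d, resplit: %d\n' "${#globbed[@]}" "${#resplit[@]}"
```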

Bash version 4 and later can do recursive globbing, too:
Code:
set +f                  # Enable filename globbing
set -B                  # Enable brace expansion
shopt -s dotglob        # Include names that start with a dot
shopt -s nullglob       # A pattern with no matches expands to nothing
shopt -s globstar       # ** does recursive globbing

files=(**/*.c)
filesdirs=(**)
tree=(**/)
It is easiest to grasp if you consider ** to mean anything in this directory or its subdirectories, at any depth. Thus, **/*.c means any file or directory ending with .c in the current directory or in any directory under it. Note that the results never include the current directory itself, only the entries found within it.

Note that **.c will not look for files ending with .c in all directories. The ** is only special when it forms a whole pathname component on its own; in **.c it behaves like a single *, so the pattern matches only names ending with .c in the current directory.

Since only directories match a trailing slash, **/ matches all subdirectories and their subdirectories and so on, recursively.
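The following sketch runs these patterns against a tiny made-up tree, so you can check what each one picks up. It assumes Bash 4 or later; the directory layout and array names are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical tree:  a.c  sub/b.c  sub/notes.txt  sub/deep/c.c
tmpdir=$(mktemp -d) || exit 1
mkdir -p "$tmpdir/sub/deep"
touch "$tmpdir/a.c" "$tmpdir/sub/b.c" "$tmpdir/sub/notes.txt" "$tmpdir/sub/deep/c.c"
cd "$tmpdir" || exit 1

shopt -s nullglob globstar
recursive=(**/*.c)   # .c files at every depth, including a.c
dirs=(**/)           # directories only: sub/ and sub/deep/
flat=(**.c)          # no recursion here: only a.c

cd - >/dev/null && rm -rf "$tmpdir"
printf '%s\n' "recursive: ${recursive[*]}" "dirs: ${dirs[*]}" "flat: ${flat[*]}"
```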

To use the arrays thus generated, you must use proper quotes to keep whitespace intact:
Code:
command "${files[@]}"
Otherwise, Bash would take the array, expand it, then re-split the expanded elements into words (and glob-expand the results). You don't want that, so do remember the double quotes.
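A quick way to see the effect is to count the arguments a command actually receives. In this sketch, count_args is a made-up helper and the two file names are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print how many arguments it was given.
count_args() { printf '%s\n' "$#"; }

files=("one file.c" "two.c")

quoted=$(count_args "${files[@]}")   # each element stays one argument
unquoted=$(count_args ${files[@]})   # "one file.c" is split in two

printf 'quoted: %s, unquoted: %s\n' "$quoted" "$unquoted"
```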

The second method is to use GNU find to list the desired files using ASCII NULs as separators. You can either use the -print0 option to print the full paths to the files, or -printf 'pattern\0' to fine-tune the output. The latter is especially useful if you want e.g. the modification timestamps of the files, too.

Because Bash by default runs each command of a pipeline in a subshell, and changes to Bash variables in a subshell do not propagate to the parent, you'll want to use input redirection from a process substitution instead. Here is a safe example:
Code:
# Make sure we don't abort on non-UTF-8 byte sequences.
test -v LANG   && OLD_LANG="$LANG"     || unset OLD_LANG   ; LANG=C
test -v LC_ALL && OLD_LC_ALL="$LC_ALL" || unset OLD_LC_ALL ; LC_ALL=C

# Clear the files array
files=()

while true; do
    file=""
    IFS="" read -rd "" file || [ -n "$file" ] || break

    files[${#files[@]}]="$file"
done < <( find . -type f -print0 )

#
# Do something with the file paths in the files array
#

# Revert to original locale.
test -v OLD_LANG   && LANG="$OLD_LANG"     || unset LANG
test -v OLD_LC_ALL && LC_ALL="$OLD_LC_ALL" || unset LC_ALL
There is a difference between the LANG and LC_ALL variables having a value, even an empty one, and them being unset. The test -v lines at the start and at the end (test -v needs Bash 4.2 or later) are the way to restore both variables to their exact original state after the loop.
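The unset-versus-empty distinction can be checked directly. This sketch uses a throwaway variable, demo_var, rather than the real locale variables, and the ${name+word} expansion, which also works in Bash versions older than 4.2:

```bash
#!/usr/bin/env bash
# ${demo_var+set} expands to "set" if demo_var exists (even when empty),
# and to nothing if it is unset.
unset demo_var
state_when_unset=${demo_var+set}

demo_var=""
state_when_empty=${demo_var+set}

printf 'unset: <%s>, empty: <%s>\n' "$state_when_unset" "$state_when_empty"
```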

In the loop, we first clear the file variable, so we can detect whether or not read did read a record. Then we read the record with IFS set to empty for that one command only, so that Bash will not trim the file name. If the last record does not have a final ASCII NUL, read will return false, but file will not be empty. The loop therefore breaks only when there is no more data to read.

We add each file variable to the files array.
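To see why the loop reads from a redirection rather than from a pipe, compare the two forms. This is a sketch with made-up counter names; the lastpipe option (Bash 4.2 and later) is switched off explicitly so that the default behaviour is shown:

```bash
#!/usr/bin/env bash
# Make the default explicit; the option does not exist before Bash 4.2.
shopt -u lastpipe 2>/dev/null || true

piped=0
printf 'a\0b\0' | while IFS= read -r -d '' item; do
    piped=$((piped + 1))        # runs in a subshell: the change is lost
done

redirected=0
while IFS= read -r -d '' item; do
    redirected=$((redirected + 1))   # runs in the current shell
done < <(printf 'a\0b\0')

printf 'piped: %d, redirected: %d\n' "$piped" "$redirected"
```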

I use this GNU find-based approach very often. You can even prepend other useful data to the file names, and extract it using Bash string operations. For example:
Code:
# Make sure we don't abort on non-UTF-8 byte sequences.
test -v LANG   && OLD_LANG="$LANG"     || unset OLD_LANG   ; LANG=C
test -v LC_ALL && OLD_LC_ALL="$LC_ALL" || unset OLD_LC_ALL ; LC_ALL=C

# count is the number of files,
# names contains the file names (indices 0 to count-1)
# dates contains the modification dates,
# sizes the file sizes in bytes.
count=0
names=()
dates=()
sizes=()

while true; do
    item=""
    IFS="" read -rd "" item || [ -n "$item" ] || break

    time="${item#* }"    # item with first word removed
    date="${item%% *}"   # first word of item
    size="${time#* }"    # time with first word removed
    time="${time%% *}"   # first word of time (second word of item)
    name="${size#* }"    # size with first word removed
    size="${size%% *}"   # first word of size (third word of item)

    # date=YYYYMMDD
    # time=HH:MM:SS or HH:MM:SS.fractions
    # size=file size in bytes

    # To skip adding this file, use 'continue'

    # Add this to the arrays
    names[count]="$name"
    dates[count]="$date"
    sizes[count]="$size"
    count=$((count + 1))
done < <( find . -type f -printf '%TY%Tm%Td %TT %s %p\0' )

#
# Do something with the names, dates, and sizes arrays,
# indices 0 (first file) to count-1 (last file).
#

# Revert to original locale.
test -v OLD_LANG   && LANG="$OLD_LANG"     || unset LANG
test -v OLD_LC_ALL && LC_ALL="$OLD_LC_ALL" || unset LC_ALL
Note that although I used space as the separator in the find output, I could just as well have used /, |, or just about any other character.
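The string operations are easier to follow on a single record. The sketch below splits one made-up record of the same shape as the find output above; the sample values and the variable name rest are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical record: date, time, size, then a path containing spaces.
item='20120211 12:09:00.0000000000 1234 ./dir/my file name.txt'

date="${item%% *}"   # first word: everything before the first space
rest="${item#* }"    # the record with the first word removed
time="${rest%% *}"   # second word
rest="${rest#* }"
size="${rest%% *}"   # third word
name="${rest#* }"    # everything else, spaces included

printf '%s\n' "$date" "$time" "$size" "$name"
```

Only the first three spaces act as separators, so spaces inside the path survive.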

Don't use ! as a separator, though. set -H enables history substitution with !, and is the default in interactive shells. When enabled, a ! is special to Bash even in a double-quoted string. Also, ? and * are globbing characters, and must be escaped using \ (or quoted) if used in a string manipulation operation. In general, ! \ ? $ * [ ] { } ( ) should be avoided in string operations, because they're difficult to remember to escape correctly.
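If you do end up with a globbing character as the separator, this sketch shows what the escaping looks like. The sample string and variable names are made up:

```bash
#!/usr/bin/env bash
s='20120211*1234*my*file'

first=${s%%\**}            # \* is a literal star: keeps "20120211"
rest=${s#*\*}              # removes up to and including the first star

sep='*'
rest_quoted=${s#*"$sep"}   # a quoted variable in a pattern is taken literally

unescaped=${s#**}          # both stars are wildcards: nothing is removed

printf '%s\n' "$first" "$rest" "$rest_quoted" "$unescaped"
```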

Finally, GNU awk (gawk) can also process ASCII NUL-delimited data, but this is a GNU extension, and not supported by awks in general. Just set RS="\0", and optionally FS="\0" too if you don't need field splitting. You'll want to set the locale explicitly to POSIX (using LANG=C LC_ALL=C awk ...) or it may abort if it encounters a non-UTF-8 sequence in a UTF-8 locale.
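As a rough sketch of the gawk approach, the following counts NUL-delimited records, one of which contains a newline. It assumes gawk is installed and falls back to a message otherwise; the sample input and the variable name record_count are made up:

```bash
#!/usr/bin/env bash
if command -v gawk >/dev/null 2>&1; then
    # Two records; the second one contains a newline.
    record_count=$(printf 'one name\0two\nlines\0' |
        LANG=C LC_ALL=C gawk 'BEGIN { RS = "\0"; FS = "\0" } END { print NR }')
else
    record_count="gawk not installed"
fi

printf 'records: %s\n' "$record_count"
```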