Quote:
Originally Posted by David the H.
I'd like to hear if anyone else can confirm the behavior I'm getting (I'm using the debian-supplied 4.2.20(1)-release version).
I happened to have a Scientific Linux 6.2 virtual machine running, with
Bash-4.1.2(1)-release (x86_64-redhat-linux-gnu).
First, I use a helper function to create the test trees:
Code:
# mkdirs DEPTH DIR(S)...
#
function mkdirs() {
    local depth=$(( $1 - 1 ))
    local dir=""
    shift 1
    [ "$depth" -ge 0 ] || return 0
    for dir in "$@" ; do
        (
            mkdir "$dir" || exit $?
            cd "$dir" || exit $?
            mkdirs "$depth" "$@" || exit $?
        ) || return 1
    done
    return 0
}
It creates the directories recursively, to the desired depth.
Using
Code:
mkdirs 7 dir-one dir-two dir-three dir-four dir-five
you create a 97655-directory tree, with five entries at each level in each subtree. It creates each directory separately, so it takes a few minutes to run. (Note: you don't actually need the tree on disk to run the later tests.)
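As a sanity check of that count: the tree holds 5^1 + 5^2 + ... + 5^7 directories, which a short loop can verify:
Code:
```shell
# Sum 5^k for k = 1..7; should match the 97655 directories above
total=0
n=1
for (( k = 1; k <= 7; k++ )); do
    n=$(( n * 5 ))          # n is now 5^k
    total=$(( total + n ))
done
echo "$total"               # prints 97655
```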
Recursive globbing,
Code:
shopt -s globstar
dirlist=(**/)
has no problems at all. It takes just a couple of seconds. You can use it with any builtins without a hitch, for example:
Code:
printf '%s' "${dirlist[@]}" | wc -c
5800784
which takes only a couple of seconds, too. So there seems to be nothing wrong with recursive globbing itself, as long as you stick to builtins and use the entire list at once.
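Here is a small self-contained illustration of the same globstar idiom on a throwaway tree (the a/b/x/y names are just for this demo, not part of the test above):
Code:
```shell
tmp=$(mktemp -d)            # throwaway tree for the demo
cd "$tmp"
mkdir -p a/x a/y b/x b/y
shopt -s globstar
dirlist=(**/)               # every directory, recursively, each with a trailing /
echo "${#dirlist[@]}"       # prints 6: a/ a/x/ a/y/ b/ b/x/ b/y/
cd / && rm -rf "$tmp"
```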
If you take the above as given, then you can continue with just the
dirlist array, which you can synthesize in just a few seconds using
Code:
dirlist=()
list="dir-one dir-two dir-three dir-four dir-five"
for D1 in $list ; do
    for D2 in $list ; do
        for D3 in $list ; do
            for D4 in $list ; do
                for D5 in $list ; do
                    for D6 in $list ; do
                        for D7 in $list ; do
                            dirlist[${#dirlist[@]}]="$D1/$D2/$D3/$D4/$D5/$D6/$D7"
                        done
                        dirlist[${#dirlist[@]}]="$D1/$D2/$D3/$D4/$D5/$D6"
                    done
                    dirlist[${#dirlist[@]}]="$D1/$D2/$D3/$D4/$D5"
                done
                dirlist[${#dirlist[@]}]="$D1/$D2/$D3/$D4"
            done
            dirlist[${#dirlist[@]}]="$D1/$D2/$D3"
        done
        dirlist[${#dirlist[@]}]="$D1/$D2"
    done
    dirlist[${#dirlist[@]}]="$D1"
done
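If you'd rather not write the nested loops out by hand, roughly the same list can be built breadth-first, one depth level at a time. This is just a sketch of an alternative (the entry order differs from the hand-written loops above), not what the original post used:
Code:
```shell
names=(dir-one dir-two dir-three dir-four dir-five)
dirlist=()
prev=("")                                # "" stands for the tree root
for (( depth = 1; depth <= 7; depth++ )); do
    next=()
    for p in "${prev[@]}"; do
        for n in "${names[@]}"; do
            next+=( "${p}${p:+/}${n}" )  # "parent/name", or just "name" at depth 1
        done
    done
    dirlist+=( "${next[@]}" )
    prev=( "${next[@]}" )
done
echo "${#dirlist[@]}"                    # prints 97655
```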
Now, if you use the dirlist in any way, say to count the total length of the directory names (the trivial equivalent of the above):
Code:
function totallen() {
    local total=0
    while [ $# -gt 0 ]; do
        local string="$1"
        shift 1
        total=$(( total + ${#string} ))
    done
    echo $total
}
then expect to sit and wait. I ran it on a somewhat smaller dirlist with 82030 directories in it:
Code:
time totallen "${dirlist[@]}"
4913284
real 17m24.167s
user 17m12.304s
sys 0m1.106s
Yup, that is seventeen minutes. It boils down to about 80 entries per second, just for summing the parameter string lengths.
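Much of that time goes into repeatedly shifting and re-measuring the argument list. As an aside (the totallen_fast name is mine, not from the post), the same sum can be computed in one step: with an empty IFS, "$*" joins the arguments by plain concatenation, so the length of the joined string is the total:
Code:
```shell
function totallen_fast() {
    local IFS=''        # "$*" joins args with the first char of IFS;
    local joined="$*"   # empty IFS means plain concatenation
    echo "${#joined}"
}

totallen_fast abc de f   # prints 6
```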
On Bash-4.1.2(1)-release (x86_64-redhat-linux-gnu), slicing is slow but bearable. Echoing the second set of five names (i.e. the sixth to tenth entries in the array; remember, indices start at zero in Bash):
Code:
time echo "${dirlist[@]:5:5}"
dir-one/dir-one/dir-one/dir-one/dir-one/dir-one dir-one/dir-one/dir-one/dir-one/dir-one/dir-two/dir-one dir-one/dir-one/dir-one/dir-one/dir-one/dir-two/dir-two dir-one/dir-one/dir-one/dir-one/dir-one/dir-two/dir-three dir-one/dir-one/dir-one/dir-one/dir-one/dir-two/dir-four
real 0m0.357s
user 0m0.357s
sys 0m0.001s
Not too bad, except that if you do any serious work with arrays in Bash, it will be slow as treacle.
However, on Bash-4.2.10(1)-release x86_64-pc-linux-gnu, it takes ages:
Code:
time echo "${dirlist[@]:5:5}"
dir-one/dir-one/dir-one/dir-one/dir-one/dir-one dir-one/dir-one/dir-one/dir-one/dir-one/dir-two/dir-one dir-one/dir-one/dir-one/dir-one/dir-one/dir-two/dir-two dir-one/dir-one/dir-one/dir-one/dir-one/dir-two/dir-three dir-one/dir-one/dir-one/dir-one/dir-one/dir-two/dir-four
real 1m38.657s
user 1m38.598s
sys 0m0.064s
Either Red Hat has applied a patch to Bash that greatly improves array slicing speed, or there is a severe regression between Bash-4.1.2 and Bash-4.2.10.
While I didn't check for memory leaks, I think the above indicates that the real issue is the slow string handling, and the insanely slow array handling, in Bash. If you do array slicing or access in a loop, it will look like the loop has frozen, simply because it runs so slowly.
On Bash-4.2.10(1)-release x86_64-pc-linux-gnu, don't use large arrays, or simply referencing an array member can take a significant fraction of a minute!
Edited to add: A for loop is not too bad:
Code:
len=0
for dir in "${dirlist[@]}" ; do
    len=$(( len + ${#dir} ))
done
echo $len
executes in just a few seconds too, so apparently for loops don't suffer that much. Hey, this group-slicing method -- for Bash-internal xargs-like processing, for example -- seems to work, too:
Code:
perslice=5
slice=()
for dir in "${dirlist[@]}" ; do
    if [ ${#slice[@]} -ge $perslice ]; then
        # Do something with "${slice[@]}"
        slice=()
    fi
    slice[${#slice[@]}]="$dir"
done
if [ ${#slice[@]} -gt 0 ]; then
    # Do something with leftovers "${slice[@]}"
fi
which takes only a few seconds, not counting whatever work you actually do on each slice. So there does seem to be a workaround here for Bash's string/array weaknesses.
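To make that skeleton concrete, here is a sketch of a reusable helper (the batched name and callback convention are mine, not from the post); it flushes each batch as soon as it fills, then flushes the leftovers at the end:
Code:
```shell
# batched N COMMAND ARGS... : call COMMAND with at most N args at a time
batched() {
    local per="$1"; shift
    local cmd="$1"; shift
    local -a slice=()
    local item
    for item in "$@" ; do
        slice[${#slice[@]}]="$item"
        if [ ${#slice[@]} -ge "$per" ]; then
            "$cmd" "${slice[@]}"
            slice=()
        fi
    done
    if [ ${#slice[@]} -gt 0 ]; then
        "$cmd" "${slice[@]}"    # leftovers
    fi
}

batched 3 echo a b c d e f g    # prints "a b c", "d e f", "g" on separate lines
```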