[SOLVED] REALLY Slow Shell Script

Crowey · 03-01-2010, 07:35 PM

Gidday, sorry for ignorance, but I'm a truly ordinary shell scripter (and unfortunately I know nothing about any higher-level language), but I have an important script to run to count the number of files in our client directories.

Problem is that while we've not got a lot of clients (circa 100), they might have a LOT (> 1000s) of files.

And my script is REALLY slow:

# Set my main variables
MYDATE=$(date '+%Y%m%d')
MYTMP01=/tmp/comms01.txt
MYTMP02=/tmp/comms02.txt
MYLOGDIR=/Data/Software/LOGS/Daisy/
MYLOGFILE=Comms_Assets_"$MYDATE".csv

function clientsearch ()
{
find /Data/WIP/Comms/*/ -maxdepth 1 -type d -iname "_Assets"
}

function assetssearch ()
{
find . -type f -iname "*" -exec ls -1 {} \;
}

clientsearch | while read MYDUMMY

do
cd "$MYDUMMY"
assetssearch > $MYTMP01
grep -Evi '(.hsresource|.ds_store|.hsancillary|.hsicon|.hsxmap|thumbs.db)' $MYTMP01 > $MYTMP02
MYCOUNT=`\cat $MYTMP02 | wc -l`
echo $PWD,$MYCOUNT >>$MYLOGDIR$MYLOGFILE
done

Any advice as to how I might improve the efficiency of this script would be greatly appreciated.

PS How slow? Well, I started a cron job at 0005 yesterday, and as of 0930 today it's only a little over halfway through all our clients.

Cheers
Crowey

tuxdev · 03-01-2010, 08:28 PM

Code:

MYDATE=$(date '+%Y%m%d')
MYTMP01=/tmp/comms01.txt
MYTMP02=/tmp/comms02.txt
MYLOGDIR=/Data/Software/LOGS/Daisy/
MYLOGFILE=Comms_Assets_"$MYDATE".csv

I generally use lowercase var names and never, ever prefix "MY" all over the place. Of *course* it's yours.

Code:

function clientsearch ()
{
find /Data/WIP/Comms/*/ -maxdepth 1 -type d -iname "_Assets"
}

function assetssearch ()
{
find . -type f -iname "*" -exec ls -1 {} \;
}

Just put these lines where you actually use them. The -exec in assetssearch starts a separate ls process for every single file, which is a ton of useless processes, and -iname "*" has no effect because it matches every name.
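To see that per-file overhead on a small scale, here is a rough sketch. The temp directory and file names are made up for the demo, and the `sh -c 'echo run'` command is only a stand-in that lets us count how often find starts a process.

```shell
#!/usr/bin/env bash
# Throwaway directory with three files (hypothetical demo data).
demo=$(mktemp -d)
touch "$demo/a.txt" "$demo/b.txt" "$demo/c.txt"

# "\;" starts the command once per file: three runs, three lines of output.
per_file=$(find "$demo" -type f -exec sh -c 'echo run' sh {} \; | wc -l)

# "+" puts as many file names as fit onto one command line: a single run here.
batched=$(find "$demo" -type f -exec sh -c 'echo run' sh {} + | wc -l)

echo "per-file runs: $per_file, batched runs: $batched"
rm -rf "$demo"
```

With thousands of files per client, the per-file form means thousands of process starts, which is most likely where the time goes.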

Code:

clientsearch | while read MYDUMMY

do
...
done

This does not handle newlines in filenames correctly. Anyway, considering what "clientsearch" does, it's simpler to do

Code:

for dir in /Data/WIP/Comms/*/_Assets/ ; do
   ...
done
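If the directory list really does have to come from find, a newline-safe read loop could look something like the sketch below. It assumes bash plus a find that supports -print0 (GNU and BSD find both do); the directory names are invented for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical demo tree: one client directory has a newline in its name.
demo=$(mktemp -d)
mkdir -p "$demo/client one/_Assets" "$demo/client"$'\n'"two/_Assets"

dirs=()
# -print0 ends each path with a NUL byte; read -d '' splits on NUL, not newline.
while IFS= read -r -d '' dir; do
    dirs+=("$dir")
done < <(find "$demo" -mindepth 2 -maxdepth 2 -type d -iname "_Assets" -print0)

echo "found ${#dirs[@]} directories"
rm -rf "$demo"
```

The glob in the for loop gets the same safety for free, which is one more reason to prefer it here.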

Code:

cd "$MYDUMMY"
assetssearch > $MYTMP01
grep -Evi '(.hsresource|.ds_store|.hsancillary|.hsicon|.hsxmap|thumbs.db)' $MYTMP01 > $MYTMP02
MYCOUNT=`\cat $MYTMP02 | wc -l`
echo $PWD,$MYCOUNT >>$MYLOGDIR$MYLOGFILE

You're doing a lot of acrobatics with temp files that isn't even necessary, find handles most of it:

Code:

tally="$(find "$dir" -name ".hsresource" -o -name ".ds_store" -o -name ".hsancillary" -o -name ".hsicon" -o -name ".hsxmap" -o -name "thumbs.db" -type f -exec printf "x" ";")"
echo "$PWD,${#tally}" >> "$log"

http://mywiki.wooledge.org/BashGuide
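The counting trick in isolation, as a small sketch on made-up files: every file find visits prints exactly one "x", and ${#tally} is the length of the resulting string.

```shell
#!/usr/bin/env bash
# Hypothetical demo data; the third file name contains a newline on purpose.
demo=$(mktemp -d)
touch "$demo/one" "$demo/two" "$demo/name with"$'\n'"newline"

# Each regular file contributes one "x", whatever its name looks like.
tally=$(find "$demo" -type f -exec printf 'x' \;)
echo "files: ${#tally}"

# Counting output lines instead is thrown off by the embedded newline.
lines=$(find "$demo" -type f | wc -l)
echo "lines: $lines"
rm -rf "$demo"
```

The string length gives 3 here, while the line count reports 4 for the same three files.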

Crowey · 03-01-2010, 08:57 PM

Quote:

Originally Posted by tuxdev

Code:

cd "$MYDUMMY"
assetssearch > $MYTMP01
grep -Evi '(.hsresource|.ds_store|.hsancillary|.hsicon|.hsxmap|thumbs.db)' $MYTMP01 > $MYTMP02
MYCOUNT=`\cat $MYTMP02 | wc -l`
echo $PWD,$MYCOUNT >>$MYLOGDIR$MYLOGFILE

You're doing a lot of acrobatics with temp files that isn't even necessary, find handles most of it:

Code:

tally="$(find "$dir" -name ".hsresource" -o -name ".ds_store" -o -name ".hsancillary" -o -name ".hsicon" -o -name ".hsxmap" -o -name "thumbs.db" -type f -exec printf "x" ";")"
echo "$PWD,${#tally}" >> "$log"

http://mywiki.wooledge.org/BashGuide

Mate, awesome stuff, thank you.

Are you saying (and yes I admit to being a bit dense and a slow-learner with this stuff), that my script can be cut down to this:

Code:

for dir in /Data/WIP/Comms/*/_Assets/ ; do
    tally="$(find "$dir" -name "*" -type f -exec printf "x" ";")"
    echo "$PWD,${#tally}"
done

I have two problems with that (and I fully admit that I've probably read/interpreted you wrong) ... firstly, all those types I was originally calling with grep were exceptions - I didn't want to count any file, or any file in a directory, whose name started with .hsresource (a directory), .ds_store, .hsancillary, .hsicon, .hsxmap or thumbs.db (all files, I think).

And secondly, the $PWD is doing nothing, and I was originally using that so I knew what client had the file/asset count - ignore, I replaced $PWD with $dir and that fixed that!

Just the above to fix ...

But, boy, if what I've interpreted was right - it's SO much faster than my stuff! So if you've any tips to incorporate the above points, then that would be greatly appreciated.

KenJackson · 03-01-2010, 09:01 PM

Quote:

Originally Posted by tuxdev

Code:

tally="$(find "$dir" -name ".hsresource" -o -name ".ds_store" -o -name ".hsancillary" -o -name ".hsicon" -o -name ".hsxmap" -o -name "thumbs.db" -type f -exec printf "x" ";")"
echo "$PWD,${#tally}" >> "$log"

Ah! That's clever, gathering up 'x's and counting them with ${#..}. But you're still using find.

I was wondering about this.

Code:

COUNT="$(ls *.hsresource *.ds_store *.hsancillary *.hsicon *.hsxmap thumbs.db|wc -l)"

Of course we are both assuming that there are no mixed case filenames. The original script used the -i switch on grep to match case insensitivity.

KenJackson · 03-01-2010, 09:07 PM

Quote:

Originally Posted by Crowey

... firstly, all those types I was originally calling with grep were exceptions - I didn't want to count any file, or any file in a directory, whose name started with .hsresource (a directory), .ds_store, .hsancillary, .hsicon, .hsxmap or thumbs.db (all files, I think).

Oops. You're right. -v on grep inverts the match, so it removes the lines that match its arguments.

So the little chunk I just did would be:

Code:

COUNT="$(ls |grep -Evi '.hsresource|.ds_store|.hsancillary|.hsicon|.hsxmap|thumbs.db'|wc -l)"

Crowey · 03-01-2010, 09:09 PM

Quote:

Originally Posted by KenJackson

Of course we are both assuming that there are no mixed case filenames. The original script used the -i switch on grep to match case insensitivity.

Yes, and case insensitivity is important in our infrastructure. And I think you both missed the -v switch with grep too - doesn't that mean to ignore lines with those directories/files in them?

Also, there are many directories under the <client>/_Assets directory - but the updated fix handled that fine, I don't think ls (on its own) would.

But I'm grateful to you both for contributing!

Cheers
Crowey

tuxdev · 03-01-2010, 10:32 PM

ah, then try this instead:

Code:

tally="$(find "$dir" ! \( -name "*.hsresource" -o -name "*.ds_store" -o -name "*.hsancillary" -o -name "*.hsicon" -o -name "*.hsxmap" -o -name "thumbs.db" \) -type f -exec printf "x" \;)"

You can use "shopt -s nocaseglob" for case-insensitive glob patterns (like the one in the for)
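A small sketch of what nocaseglob changes, assuming bash and a case-sensitive filesystem. The client directories are invented, and I kept a wildcard in the `_asset*` component so the shell definitely has to compare names against the directory listing.

```shell
#!/usr/bin/env bash
# Hypothetical clients whose assets directory is spelled three different ways.
demo=$(mktemp -d)
mkdir -p "$demo/acme/_Assets" "$demo/bravo/_ASSETS" "$demo/carol/_assets"

shopt -u nocaseglob
strict=("$demo"/*/_asset*/)     # only the all-lowercase spelling matches

shopt -s nocaseglob
relaxed=("$demo"/*/_asset*/)    # now all three spellings match

shopt -u nocaseglob
echo "case-sensitive: ${#strict[@]}, case-insensitive: ${#relaxed[@]}"
rm -rf "$demo"
```

Note that this only affects the shell's own glob expansion; inside find, case-insensitive name tests are a separate matter (-iname instead of -name).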

http://mywiki.wooledge.org/UsingFind
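The escaped parentheses matter because find's implicit "and" binds tighter than -o. A quick sketch on made-up files shows the difference (-print is used here instead of the printf trick just to keep the output readable):

```shell
#!/usr/bin/env bash
# Hypothetical demo data: two real assets and two files we want to ignore.
demo=$(mktemp -d)
touch "$demo/photo1.jpg" "$demo/photo2.jpg" "$demo/thumbs.db" "$demo/icon.hsicon"

# Without parentheses this reads as
#   (-name "*.hsicon") OR (-name "thumbs.db" AND -type f AND -print)
# so only thumbs.db reaches -print.
ungrouped=$(find "$demo" -name "*.hsicon" -o -name "thumbs.db" -type f -print | wc -l)

# Grouped and negated: every regular file that matches neither name.
kept=$(find "$demo" ! \( -name "*.hsicon" -o -name "thumbs.db" \) -type f -print | wc -l)

echo "ungrouped: $ungrouped, grouped and negated: $kept"
rm -rf "$demo"
```

The first form prints one path (thumbs.db itself), the second prints the two photos.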

gnashley · 03-02-2010, 02:44 AM

If you must use find, then use xargs instead of the '-exec' option to find. It seems that the internal '-exec' would be faster, but my projects show that using xargs is *way* faster.
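A minimal sketch of the find-plus-xargs pattern, assuming a find and xargs that support -print0 and -0 (GNU and BSD versions do; they are not in POSIX). The files are made up, and `sh -c 'echo "$#"'` is just a stand-in command that reports how many arguments one invocation received.

```shell
#!/usr/bin/env bash
# Hypothetical demo data, including a name with a space in it.
demo=$(mktemp -d)
touch "$demo/a" "$demo/b" "$demo/c with space"

# xargs collects the NUL-separated paths and hands them to one sh process;
# "$#" is the number of arguments that process was given.
args=$(find "$demo" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "arguments in the single batch: $args"
rm -rf "$demo"
```

The speedup comes from batching: one process handles many files, rather than one process per file as with `-exec ... \;`.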

H_TeXMeX_H · 03-02-2010, 02:59 AM

I don't quite understand what you want to do, or why a single find command can't do it. That said, 'find' usually takes a while to run because it has to walk every file. To speed that up I would use a command that keeps an index, something like slocate; that would probably be the easiest way, especially if these clients only add or change a few files once in a while. Another option is a single run of 'find' whose output you then parse for the info you need, which should be faster than running find multiple times.
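The single-run idea could be sketched roughly like this: one find over every client, with awk grouping the paths by client name. The directory layout is invented for the demo, and because awk counts lines, the sketch assumes no file name contains a newline.

```shell
#!/usr/bin/env bash
# Hypothetical demo tree with two clients.
demo=$(mktemp -d)
mkdir -p "$demo/acme/_Assets/sub" "$demo/bravo/_Assets"
touch "$demo/acme/_Assets/a.psd" "$demo/acme/_Assets/sub/b.psd" "$demo/bravo/_Assets/c.psd"

cd "$demo" || exit 1
# Paths look like ./acme/_Assets/a.psd, so field 2 (split on "/") is the client.
report=$(find ./*/_Assets -type f |
    awk -F/ '{ n[$2]++ } END { for (c in n) print c "," n[c] }' |
    sort)
echo "$report"

cd - >/dev/null || exit 1
rm -rf "$demo"
```

This prints one "client,count" line per client from a single traversal.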

Crowey · 03-02-2010, 07:02 PM

Thank you all, but especially TuxDev & KenJackson. This is what I've ended up with, and it seems to work REALLY well (it's super fast!)

Code:

# Set my main variables
date=$(date '+%Y%m%d')
logdir=/Data/Software/LOGS/
logfile=Company_Assets_"$date".csv

for dir in /Data/WIP/_*/*/_Assets/ /Data/WIP/_Press/_Templates\ and\ Styles/ /Data/Govt/*/ ; do
    tally="$(find "$dir" \( ! -regex '.*/\..*' \) -type f ! -iname "thumbs.db" -exec printf "x" ";")"
    echo "$dir,${#tally}" >>$logdir$logfile
done

I'm still not sure exactly how, or why, it works - but work it does.

My original miserable attempt took nearly two days, but this version literally ran in a fraction under two minutes!

So, again, thank you very much for all your help.

Cheers
Crowey
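For anyone wondering how the find line in that final script works, here is one reading of it as an annotated sketch. The directory tree is made up, the find expression is the one from the final script, and -regex is assumed to be available (it is in GNU and BSD find).

```shell
#!/usr/bin/env bash
# Hypothetical client tree with two real assets and some files to ignore.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir -p client/_Assets/.hsresource client/_Assets/art
touch client/_Assets/brief.doc client/_Assets/art/logo.eps \
      client/_Assets/art/Thumbs.db client/_Assets/.DS_Store \
      client/_Assets/.hsresource/cache

# ! -regex '.*/\..*'   rejects any path containing "/." somewhere, i.e. hidden
#                      files and everything below a hidden directory
#                      (find still walks into those directories, it just
#                      doesn't count what it sees there)
# -type f              keeps regular files only
# ! -iname "thumbs.db" drops thumbs.db in any mix of upper and lower case
# -exec printf "x" ";" prints one "x" per surviving file
tally=$(find client/_Assets/ \( ! -regex '.*/\..*' \) -type f ! -iname "thumbs.db" -exec printf "x" ";")

echo "client/_Assets/,${#tally}"
cd - >/dev/null || exit 1
rm -rf "$demo"
```

Only brief.doc and art/logo.eps survive all three tests, so the tally string is two characters long.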