LinuxQuestions.org


hashbang#! 11-17-2010 07:19 AM

[bash] find | grep optimization
 
My recent optimization question seemed to have generated some interest.

So I thought I'd post my findings on another case for discussion. (I hope this is appropriate on this forum.)

Scenario
A folder contains about 280,000 HTML files, all residing in monthly subfolders and following the same naming convention. The IDs are numeric and of varying length.
Code:

data/<yyyy>/<mm>/event_<id>.html
I need to build a file containing just the files' IDs, excluding one particular monthly folder.

First attempt time: 4.6s
Code:

find data/ -path data/1975/01/ -prune -o \( -type f -printf '%f\n' \) | \
grep -E -o -e '[0-9]*'


I then went back to basics, adding the options from the command above one at a time to see where the time was spent.

find data -type f only takes an amazing 0.22s.
find data/ -path data/1975/01/ -prune -type f: 0.45s
find data/ -path data/1975/01/ -prune -o \( -type f -printf '%f\n' \): 0.6s
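
For reference, a minimal way to time each variant (a sketch; in bash the time keyword covers the whole pipeline, not just the first command):
Code:

# traversal only
time find data -type f > /dev/null
# traversal plus the grep post-processing
time find data -type f | grep -E -o '[0-9]+' > /dev/null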

Clearly, the post-processing with grep used most of the time.
Optimization 1 time: 3.1s
Code:

find ... | grep -E -o -e '[0-9]+'

Optimization 2 time: 0.7s
Code:

find ... | cut -d '_' -f 2 | cut -d '.' -f 1
This made -printf unnecessary; dropping it brought the time down to 0.61s.
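
To illustrate why -printf is no longer needed, the double cut works on a full path just as well (made-up path for illustration):
Code:

# splitting on '_' leaves '<id>.html' in field 2; splitting that on '.' leaves the id
printf 'data/1976/03/event_12345.html\n' | cut -d '_' -f 2 | cut -d '.' -f 1
# -> 12345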

Optimization 3 time: 0.51s
Code:

find data -type f | grep -F -v "data/1975/01/" | cut ... | cut ...

Unfortunately, the IDs vary in length; otherwise I could have done it with a single cut -c.
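
(Just for illustration: had every ID been, say, six digits, a single cut -c on the filename would have done it:)
Code:

# 'event_' is 6 characters wide, so a fixed 6-digit id would sit in columns 7-12
printf 'event_123456.html\n' | cut -c 7-12    # -> 123456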



And just so we have a question here: how come cut is so efficient?

tuxdev 11-18-2010 06:15 PM

cut is efficient because it's not very powerful. It can't do nearly the same sort of stuff grep is capable of.

Actually, I personally wouldn't use cut, but rather bash's built-in functionality that does the same sort of thing.
Code:

while read file ; do
  id="${file%.html}"
  id="${id##*_}"
  echo "$id"
done <(find ... | grep ...)

No clue how fast it is, though I suspect it isn't that different from your final version.

hashbang#! 11-19-2010 03:59 AM

Quote:

Originally Posted by tuxdev (Post 4163686)
Actually, I personally wouldn't use cut, but rather bash's built-in functionality that does the same sort of thing. [...]

(still need redirection in addition to process substitution)
Code:

while read file ; do
  id="${file%.html}"
  id="${id##*_}"
  echo "$id"
done < <(find ... | grep ...)

Quote:

Originally Posted by tuxdev (Post 4163686)
No clue how fast it is, though I suspect it isn't that different from your final version.

time: 27s (compared to find | grep | cut | cut @ 0.6s)
I know the double cut looks a bit of a kludge ...

catkin 11-19-2010 04:40 AM

27s or 0.27s?

hashbang#! 11-19-2010 04:47 AM

Quote:

Originally Posted by catkin (Post 4164154)
27s or 0.27s?

27s! Otherwise I would have congratulated tuxdev!

catkin 11-19-2010 05:06 AM

My gast is flabbered; I had expected tuxdev's solution to be a little quicker, not almost an order of magnitude slower! Looking for what causes such poor performance, how does this variant perform:
Code:

while read file ; do
  id="${file%.html}"
  id="${id##*_}"
  echo "$id"
done <<< "$(find ... | grep ...)"


hashbang#! 11-19-2010 05:51 AM

Quote:

Originally Posted by catkin (Post 4164182)
how does this variant perform:
Code:

while read file ; do
    [...]
done <<< "$(find ... | grep ...)"


31s

I have played around with bash loops in similar situations and found them to be slower than the external tools' own built-in recursion/looping.

grail 11-19-2010 08:42 AM

What about taking the grep out and doing something like:
Code:

while IFS='_' read -r _ line
do
    echo ${line%.html}
done < <(find data/ -path data/1975/01/ -prune -o -type f -regex '.*[0-9]+.html' -printf "%f\n")


hashbang#! 11-19-2010 10:45 AM

Quote:

Originally Posted by grail (Post 4164383)
What about taking the grep out and doing something like:
Code:

while IFS='_' read -r _ line
do
    echo ${line%.html}
done < <(find data/ -path data/1975/01/ -prune -o -type f -regex '.*[0-9]+.html' -printf "%f\n")


The find on its own takes 3.86s. With the while loop it takes 27s.

If you look at my original post, I took prune and printf out to save time.

find -regex is slower than find | egrep (see post #1, Optimization 1). Also, bear in mind that grep -o already produced the desired output (the ID only).

ntubski 11-19-2010 10:56 AM

Quote:

Originally Posted by catkin (Post 4164182)
My gast is flabbered; I had expected tuxdev's solution to be a little quicker, not almost an order of magnitude slower! Looking for what causes such poor performance, how does this variant perform:

Come on guys, it's the while loop that's slow. Except for small operations, using external programs is always going to be faster than doing it in bash.

I was surprised that the -prune wasn't faster, but I think it's because there's only 1 directory to prune, so it doesn't save much. And since -path takes glob patterns, it's slower than grep -F, which takes fixed strings.

catkin 11-19-2010 01:04 PM

Quote:

Originally Posted by ntubski (Post 4164520)
Come on guys, it's the while loop that's slow. Except for small operations, using external programs is always going to be faster than doing it in bash.

Is it the while loop or the read? If it's the read, is the underlying cause that bash doesn't buffer stdin (which would be a strange design choice)?

ntubski 11-19-2010 02:06 PM

Quote:

Originally Posted by catkin (Post 4164632)
Is it the while loop or the read? If it's the read, is the underlying cause that bash doesn't buffer stdin (which would be a strange design choice)?

Well looking at read.def I see
Code:

unbuffered_read = (nchars > 0) || (delim != '\n') || input_is_pipe;
So it appears that read actually isn't buffering. I'm not certain; the code deals with a lot of different options, so it's hard to follow.
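
One way to see this from outside (a sketch, assuming strace is available) is to trace read() calls while the builtin reads from a pipe; it shows a long run of single-byte reads:
Code:

# read pulls one byte at a time from a pipe, so it never consumes past the first newline
printf 'event_12345.html\n' | strace -e trace=read -o /tmp/read.trace bash -c 'read line'
grep 'read(0' /tmp/read.trace | head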

However, I don't think it matters either way, here is a while loop without read:
Code:

#!/bin/bash
COUNT=40000 # maybe increase this if your computer is faster than mine.

echo use bash loop
declare -i i=0
time while ((i<COUNT)) ; do ((i++)) ; done

echo ========
echo use seq
time seq $COUNT >/dev/null

For me the while loop takes around 1.3 seconds and the seq takes 0.1 seconds. bash is over 10 times as slow, and that's after I took out the echo statement.

catkin 11-19-2010 09:39 PM

Those are very educational figures, ntubski, and show the importance of measurement instead of blindly accepting received wisdom. For years I have accepted the plausible hypothesis, perhaps once true, that the fork+exec system calls to run an external program are slow relative to in-shell actions.

On my system the output from your script was:
Code:

use bash loop

real    0m0.426s
user    0m0.405s
sys    0m0.010s
========
use seq

real    0m0.071s
user    0m0.041s
sys    0m0.004s

I modified it to also test an equivalent for ((i=0; i<COUNT; i++)) loop and found it ~10% faster than the while loop. Then I added a call to /bin/true in the loop and found the /bin/true calls took ~0.0001 s each, which is only ~100 times longer than each bare loop iteration.
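
Roughly along these lines (a sketch, not the exact script):
Code:

COUNT=40000

echo use bash for loop
time for ((i=0; i<COUNT; i++)); do : ; done

echo ========
echo call /bin/true each iteration    # 40000 fork+execs -- takes a few seconds
time for ((i=0; i<COUNT; i++)); do /bin/true ; done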

Presumably there is some caching and hashing which means the first call to /bin/true could have taken significantly longer than 0.0001 s.

ntubski 11-20-2010 11:41 AM

Quote:

Originally Posted by catkin (Post 4164978)
For years I have accepted the plausible hypothesis, perhaps once true, that the fork+exec system calls to run an external program are slow relative to in-shell actions.

Well they are, but in the example here there are O(n) in-shell actions but O(1) external calls, so for large n the fork+exec overhead is negligible.
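
Concretely (a sketch reusing the commands from this thread): the pipeline forks a fixed handful of processes however many files there are, while the read loop does bash-level work once per file:
Code:

# O(1) external calls: 4 processes total, each streaming every line in fast C code
find data -type f | grep -F -v "data/1975/01/" | cut -d '_' -f 2 | cut -d '.' -f 1

# O(n) in-shell actions: one read + two expansions + one echo per file, all interpreted by bash
find data -type f | grep -F -v "data/1975/01/" | while read -r file; do
    id="${file%.html}"
    echo "${id##*_}"
done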

