[bash] find | grep optimization
My recent optimization question seemed to have generated some interest.
So I thought I'd post my findings on another case for discussion. (I hope this is appropriate on this forum.)

Scenario
A folder contains about 280,000 html files, all residing in monthly subfolders and following the same naming convention. IDs are numeric and of varying length.
Code:
data/<yyyy>/<mm>/event_<id>.html

First attempt, time: 4.6s
Code:
find data/ -path data/1975/01/ -prune -o \( -type f -printf '%f\n' \) | \

Optimization 1, time: 3.1s
Code:
find ... | grep -E -o -e '[0-9]+'

Optimization 2, time: 0.7s
Code:
find ... | cut -d '_' -f 2 | cut -d '.' -f 1

Optimization 3, time: 0.51s
Code:
find data -type f | grep -F -v "data/1975/01/" | cut ... | cut ...

Unfortunately, the IDs vary in length; otherwise I could have done it with a single cut -c. And just so we have a question here: how come cut is so efficient?
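(For readers following along: the two truncated cut stages above are presumably the same ones shown under Optimization 2. If so, the full Optimization 3 pipeline would read something like the sketch below; the exact command the OP ran may have differed.)
Code:
# assumed reconstruction of Optimization 3: skip the 1975/01 folder with a
# fixed-string grep, then strip the "event_" prefix and ".html" suffix
find data -type f | grep -F -v "data/1975/01/" | cut -d '_' -f 2 | cut -d '.' -f 1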
cut is efficient because it's not very powerful. It can't do nearly the same sort of stuff grep is capable of.
Actually, I personally wouldn't use cut, but rather bash's built-in functionality that does the same sort of thing.
Code:
while read file ; do
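(The rest of that loop was lost above; a minimal sketch of the kind of parameter-expansion loop being suggested, assuming the same event_<id>.html naming from post #1, might be:)
Code:
# hypothetical reconstruction: extract the numeric id with bash expansions
# instead of cut, reading filenames from find via process substitution
while read -r file ; do
    file=${file##*_}      # drop everything up to the last "_"
    echo "${file%.html}"  # drop the ".html" suffix
done < <(find data -type f -printf '%f\n')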
Quote:
(still need redirection in addition to process substitution)
Quote:
I know the double cut looks a bit of a cludge ...
27s or 0.27s?
My gast is flabbered; I had expected tuxdev's solution to be a little quicker, not almost an order of magnitude slower! To track down what causes such poor performance, how does this variant perform:
Code:
while read file ; do
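(That code block was also truncated in the archive. One plausible shape for such a variant, assuming the goal is to test whether the read builtin itself or the per-line work is the bottleneck, is a loop that reads every line but does nothing with it:)
Code:
# hypothetical variant: read each filename but do no per-line work,
# so any remaining cost is attributable to the read builtin and the loop
while read -r file ; do
    :
done < <(find data -type f -printf '%f\n')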
Quote:
I have played around with bash loops in similar situations and found them to be slower than any built-in recursion/looping.
What about taking the grep out and doing something like:
Code:
while IFS='_' read -r _ line
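(Again the rest of the loop is missing above; presumably the idea is to let read do the splitting on '_' so neither grep nor cut is needed, roughly like this sketch:)
Code:
# hypothetical completion: read splits each filename on "_", the part after
# the underscore lands in $line, and the ".html" suffix is stripped in bash
while IFS='_' read -r _ line ; do
    echo "${line%.html}"
done < <(find data -type f -printf '%f\n')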
Quote:
If you look at my original post, I took prune and printf out to save time. find -regex is slower than find | egrep (see post #1, Optimization 1). Also, bear in mind that grep -o already produced the desired output (the ID only).
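(For comparison, a find -regex version of the same filter, which is what I understand was found to be slower, would look something like the following with GNU find; the exact expression that was tried isn't shown in the thread:)
Code:
# hypothetical find -regex equivalent (GNU find): match event_<id>.html
# anywhere under data/ except 1975/01, printing the whole path
find data -regextype posix-extended -regex '.*/event_[0-9]+\.html' ! -path 'data/1975/01/*'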
Quote:
I was surprised that the -prune wasn't faster, but I think it's because there's only 1 directory to prune, so it doesn't save much. And since -path takes glob patterns, it's slower than grep -F, which takes fixed strings.
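(In the spirit of measuring rather than guessing, the two filtering strategies can be timed directly; a rough sketch, assuming the directory layout from post #1:)
Code:
# prune inside find (glob matching on every path) ...
time find data -path 'data/1975/01*' -prune -o -type f -print > /dev/null
# ... versus filtering afterwards with a fixed-string grep
time find data -type f | grep -F -v "data/1975/01/" > /dev/null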
Quote:
Code:
unbuffered_read = (nchars > 0) || (delim != '\n') || input_is_pipe;

However, I don't think it matters either way; here is a while loop without read:
Code:
#!/bin/bash
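(The script itself was truncated in the archive. Judging from the output quoted below, it compares an in-shell loop against repeated fork+exec of an external command; a rough guess at its shape, with made-up iteration counts and labels, might be:)
Code:
#!/bin/bash
# hypothetical reconstruction of the benchmark: compare a pure bash loop
# with a loop that fork+execs the external /bin/true on every iteration
n=10000
echo "use bash loop"
time for ((i = 0; i < n; i++)); do : ; done
echo "use external /bin/true"
time for ((i = 0; i < n; i++)); do /bin/true ; done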
Those are very educational figures, ntubski, and show the importance of measurement instead of blindly accepting received wisdom. For years I have accepted the plausible hypothesis, perhaps once true, that the fork+exec system calls to run an external program are slow relative to in-shell actions.
On my system the output from your script was:
Code:
use bash loop

Presumably there is some caching and hashing, which means the first call to /bin/true could have taken significantly longer than 0.0001 s.