[SOLVED] Using a pipe inside 'find ... -exec' ... how does it work, and why is it so slow?

bartonski · 06-26-2010, 01:30 PM

celembor started a thread here asking how to get disk usage for all files named '*.mp3' on his file system.

druuna and I collectively came up with this:

Code:

find / -type f -iname '*.mp3' -print0 | xargs -0 du -ach

This will do a reasonable job of finding all files named '*.mp3'... but what happens if a user decides to be sneaky and renames the files to avoid detection? To find these, we'll need to use the 'file' command.

'file' consults a database of identifying byte patterns, also known as 'magic numbers', and searches for these patterns within a file of unknown type. It returns a text string which should tell us what type of file it is. For example:

Code:

$ file 203.mp3 
203.mp3: Audio file with ID3 version 2.2.0, contains: MPEG ADTS, layer III, v1,  64 kbps, 44.1 kHz, Monaural

I think that any file for which file returns the pattern 'MPEG ... layer III' will be an mp3 file.

I'll test this:

Code:

$ file 203.mp3 | grep 'MPEG.*layer III' > /dev/null && echo 'this is an mp3 file!'
this is an mp3 file!

Ok... so now that I have a test that will return 'true' when a file is an mp3 file, I would like to use this within a find command.

The 'exec' test will execute a command on a given file. It is naively used to execute a command on files which match the preceding tests within a find. For example, let's say that I want to copy all mp3 files newer than './This American Life/203.mp3' to an mp3 player that I have mounted at '/media/mp3-player'.

I could run this command:

Code:

find . -type f  -newer './This American Life/203.mp3' -iname '*.mp3' -exec cp {} /media/mp3-player \;

This is not the best solution, however, because it spawns a new process to copy each file.

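Better to use xargs instead, so that many files are handed to each cp. Note that 'xargs -I {}' would still run one cp per file, so this version drops it and uses '-t', a GNU cp option that names the target directory first:

Code:

find . -type f  -newer './This American Life/203.mp3' -iname '*.mp3' -print0 | xargs -0 cp -t /media/mp3-player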
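To check how many processes each form actually starts, here is a small sketch that counts invocations in a scratch directory. The directory and file names are made up for illustration, and sh -c 'echo run' merely stands in for cp. Each invocation prints one line, so counting lines counts processes.

```shell
#!/bin/sh
# Scratch directory with three dummy files (hypothetical names).
tmp=$(mktemp -d)
touch "$tmp/a b.mp3" "$tmp/c.mp3" "$tmp/d.mp3"

# -exec ... \; starts the command once per file.
per_file=$(find "$tmp" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')
# xargs -I {} also starts the command once per input item.
with_I=$(find "$tmp" -type f -print0 | xargs -0 -I {} sh -c 'echo run' sh {} | wc -l | tr -d ' ')
# Plain xargs appends as many names as fit to a single command line.
batched=$(find "$tmp" -type f -print0 | xargs -0 sh -c 'echo run' sh | wc -l | tr -d ' ')
# -exec ... {} + batches in the same way without needing xargs.
plus=$(find "$tmp" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

printf 'exec-semicolon=%s xargs-I=%s xargs=%s exec-plus=%s\n' \
    "$per_file" "$with_I" "$batched" "$plus"
rm -rf "$tmp"
```

With only three files the batched forms need a single invocation; on a large tree they would need a handful rather than one per file.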

So why use the 'exec' test? ... use it as an actual test within find.

Let's say that you've broken up with your girlfriend 'Brunhilde', and you want to delete all text files containing her name, without the pain of having to go back and read them all...

Code:

find . -type f -exec grep -qi 'brunhilde' {} \; -print0 | xargs -0 rm -f

(don't do this if you want to keep your collection of Wagner lyrics intact... as a matter of fact, don't do this at all).
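A non-destructive way to see '-exec' acting as a test is to print the matches instead of deleting them. This is only a sketch: the scratch directory and the file names in it are invented for the example.

```shell
#!/bin/sh
# Scratch directory with one matching and one non-matching file (hypothetical names).
tmp=$(mktemp -d)
printf 'Dear Brunhilde,\n' > "$tmp/letter one.txt"
printf 'milk, eggs\n' > "$tmp/shopping.txt"

# -exec ... \; is true only when grep exits 0, so -print fires just for matching files.
matches=$(find "$tmp" -type f -exec grep -qi 'brunhilde' {} \; -print)
printf '%s\n' "$matches"
rm -rf "$tmp"
```

Only the file that contains the name is listed, which is exactly the set that the 'rm' version would have removed.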

Ok... so, getting back around to the original question...

I want to use file {} | grep 'MPEG.*layer III' as my test ... it will return true, so it should work within find ... -exec...

Unfortunately this fails horribly:

Code:

find . -type f -exec file {} | grep 'MPEG.*layer III' \; -print0 | xargs du -ach

because the shell interprets the pipe before find ever runs, which breaks the find command into two separate commands.
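Put differently, the shell cuts the line at each '|', so find only receives '-exec file {}' and the terminating ';' ends up as an argument to grep. A small sketch (the scratch directory is hypothetical) showing that find rejects an -exec that never gets its terminator:

```shell
#!/bin/sh
tmp=$(mktemp -d)
touch "$tmp/x.mp3"

# This is all that find sees from the left-hand side of the first pipe.
# The expression is rejected while it is being parsed, so 'file' is never run.
if find "$tmp" -type f -exec file {} 2>/dev/null; then
    status=0
else
    status=$?
fi
echo "find exit status: $status"
rm -rf "$tmp"
```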

so I went googling, and found this

I tried it out...

Code:

find . -type f -exec sh -c 'file $1 | grep "MPEG ADTS, layer III" > /dev/null' {} {} \; -print0 | xargs -0 du

It's horribly slow. As a matter of fact, it's so slow that when I started to write this post, I was writing about how it was hanging, and trying to figure out what was wrong with it. I was running it across my Music directory, which is a 1.9 G directory containing 507 files, with 3 mp3 files in it.

First question: why does it work? I can see that I'm calling 'sh -c' which is executing 'file $1 | grep "MPEG ADTS, layer III" > /dev/null' with the argument '{}', which is how find -exec expresses the matching files... but what is the second '{}' for?

Second, how to increase performance?

I ran the following to figure out whether 'file' was causing the bottleneck

Code:

$ find . -type f -print0 | xargs -0 file
./SALT - Seminars About Long Term Thinking/podcast-2010-06-16-moses.mp3:                                                                        Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1,  64 kbps, 44.1 kHz, Monaural

...

./debra_music/Tesla - Paradise (Acoustic).m4a:                                                                                                  ISO Media, MPEG v4 system, version 2
./MN0035426.gif:                                                                                                                                GIF image data, version 89a, 300 x 400
./3807510_01.jpg:                                                                                                                               JPEG image data, JFIF standard 1.01
./This American Life/203.mp3:                                                                                                                   Audio file with ID3 version 2.2.0, contains: MPEG ADTS, layer III, v1,  64 kbps, 44.1 kHz, Monaural

real    0m0.386s
user    0m0.288s
sys    0m0.036s

real time is under half a second, so this isn't causing the problem.

I think that sh -c 'file $1 | grep "MPEG ADTS, layer III" > /dev/null' {} is going to spawn 3 processes per file: one for 'sh -c', one for 'file $1' and one for the grep... is the overhead for spawning 1500 processes going to be that big?

I waited over 25 minutes (1500 seconds), and it still hadn't finished... that would mean over a second per spawned process, which doesn't sound right at all. I also figured that maybe something might be handed to 'du' that might make it run forever (I'm expecting that it would only be handed files, but maybe I made a mistake, and it's running du across the file system multiple times)... so I piped the xargs to 'echo', a second time around, and that ran slow...

I started chopping things out...

Code:

time find . -type f -exec sh -c 'echo $1 > /dev/null' {} {} \; -print0 | xargs -0 echo

runs in under a second.

Code:

time find . -type f -exec file {} \;

runs in a second and a half

but

Code:

time find . -type f -exec sh -c 'file $1' {} {} \; -print0 | xargs -0 echo

hangs on me... I don't get it.

crts · 06-26-2010, 07:17 PM

Hi,

that's an interesting problem. After some research I found the following in the man page:

Quote:

sh -c [some options...] command_string [command_name [argument ...]]
...
If command line arguments besides the options have been specified, then the shell treats the first argument as the
name of a file from which to read commands (a shell script), and the remaining arguments are set as the positional
parameters of the shell ($1, $2, etc). Otherwise, the shell reads commands from its standard input.
...
-c Read commands from the command_string operand instead of from the standard input. Special parameter 0 will be set from the command_name operand and the positional parameters ($1, $2, etc.) set from the remaining argument operands.
...

This sounds more complicated than it actually is. In fact, it is plain simple. A little example will clarify this (hopefully):
Suppose you have a traditional script, like

Code:

#!/bin/sh
echo $1
echo $0

Now call this script with

Code:

./script positional_param_one

This is equivalent to

Code:

sh -c 'echo $1; echo $0' script positional_param_one

So the first {} is your "script name". You can change it to anything you want, like

Code:

find . -type f -exec sh -c 'file $1 | grep "MPEG ADTS, layer III" > /dev/null' whatever {} \; -print0 | xargs -0 du

The second {} is the important one.
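A quick way to convince yourself, as a minimal sketch ('whatever' and the file name are just placeholders):

```shell
#!/bin/sh
# The first word after the command string becomes $0, the next one becomes $1.
out=$(sh -c 'printf "%s|%s\n" "$0" "$1"' whatever 'This American Life/203.mp3')
echo "$out"
```

The part before the '|' is the "script name" and the part after it is the first positional parameter.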

As for the slowness, as you already stated the exec option will spawn a new process (shell) for every result. Now I assume (I'm really not sure about this) that spawning a shell might take some more overhead than spawning a normal command. On top of that you execute a pipe with sh which produces another subshell. Even without the pipe I assume there would be at least two processes spawned - one for sh and another one for the command executed by sh. And I guess that determining the filetype by examining its structure - as done by file - takes its toll, too. Again, this is only an assumption by me. If after conducting some further tests you come to another conclusion I'd appreciate it if you shared your insights.

bartonski · 06-26-2010, 08:28 PM

Thanks for the explanation of the positional parameters... that made perfect sense.

Quote:

Originally Posted by crts

As for the slowness, as you already stated the exec option will spawn a new process (shell) for every result. Now I assume (I'm really not sure about this) that spawning a shell might take some more overhead than spawning a normal command. On top of that you execute a pipe with sh which produces another subshell. Even without the pipe I assume there would be at least two processes spawned - one for sh and another one for the command executed by sh. And I guess that determining the filetype by examining its structure - as done by file - takes its toll, too. Again, this is only an assumption by me. If after conducting some further tests you come to another conclusion I'd appreciate it if you shared your insights.

I think that it's some weird interaction between 'find -exec', 'file' and 'sh -c'. This hangs:

Code:

find . -type f -exec sh -c 'file $1' {} {} \; -print0 | xargs -0 echo

I've removed the pipes and re-direction, so I've cut down the number of processes by half (I think). Furthermore, I let this run for an hour and a half on the same 508 files. I don't think that this is just a matter of some overhead.

If I remove the wrapper of 'sh -c' around the call to file,

Code:

time find . -type f -exec file {} \;

This runs in under a second and a half.

Going the other direction by replacing the call to 'file' with 'echo'

Code:

time find . -type f -exec sh -c 'echo $1 > /dev/null' {} {} \; -print0 | xargs -0 echo

runs in under a second... so it's neither the call to 'file' nor the use of 'sh -c' alone that causes the hang. I'm not sure what to test next...

bartonski · 06-26-2010, 08:41 PM

Just to prove Finagle's law, this works just fine...

Code:

$ sh -c 'file $1 | grep -q "MPEG.*layer III"' {} 203.mp3 && echo "TRUE"
TRUE

crts · 06-26-2010, 09:27 PM

Hi,
try the following

Code:

time find . -type f -exec sh -c 'file "$1"' {} {} \; -print0 | xargs -0 echo

As it appears, the problem was 'file $1' having trouble coping with whitespace in filenames. After throwing tons of errors it came to a point where it just hung, doing nothing and leaving the shell completely unresponsive. Since you were operating on mp3 files, I bet you had whitespace characters in there, too.
But I do not know why it hung after throwing hundreds of errors. Maybe a bug in file? It should be able to at least cope with input that it cannot process in some way that won't render the system useless.
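To illustrate what the unquoted $1 does, here is a minimal sketch that uses printf in place of file (the file name is taken from the listing earlier in the thread; the rest is made up for the demo). An unquoted expansion is split on whitespace, so one name turns into several arguments. I am not certain this explains the hang, but a name like this one also yields a bare '-' argument, which file treats as a request to read standard input, so it may simply have been sitting there waiting for input.

```shell
#!/bin/sh
name='./debra_music/Tesla - Paradise (Acoustic).m4a'

# Unquoted: the shell splits $1 on whitespace, so printf receives four separate arguments.
unquoted=$(sh -c 'printf "[%s]" $1' sh "$name")
# Quoted: the name arrives as a single argument.
quoted=$(sh -c 'printf "[%s]" "$1"' sh "$name")

printf '%s\n%s\n' "$unquoted" "$quoted"
```

Each pair of brackets in the output marks one argument as the command would have received it.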

[EDIT]
Right now I am still running the command. The output is 'chunkwise', i.e. it hangs for a couple of seconds and then outputs a couple of screens. I think this has something to do with the buffer from xargs. While investigating this problem I encountered an error like:
'xargs: arguments line too long'
or something like that. Don't remember exactly. I'll let you know how it went when it finishes.

[UPDATE]
Ok, it finished now without any errors. Approximately 30,000 files were processed in under 15 minutes.
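That is still one shell per file. A possible way to cut the process count, sketched under the assumption that your find supports '-exec ... {} +' (it is specified by POSIX), is to hand many names to a single sh and loop inside it. It still runs file and grep once per name, but only one shell per batch. The scratch directory and its file names are made up; on a real tree you would start from '.' and pipe the NUL-separated output straight to 'xargs -0 du -ch' instead of through tr.

```shell
#!/bin/sh
# Scratch directory holding only text files, one with spaces and a '-' in its name.
tmp=$(mktemp -d)
printf 'just text\n' > "$tmp/Tesla - Paradise.txt"
printf 'more text\n' > "$tmp/notes.txt"

# One sh handles a whole batch of names; "$f" is quoted so spaces and '-' stay intact.
mp3s=$(find "$tmp" -type f -exec sh -c '
    for f do
        if file "$f" | grep -q "MPEG.*layer III"; then
            printf "%s\0" "$f"
        fi
    done' sh {} + | tr '\0' '\n')

# No real mp3 files exist in the scratch directory, so nothing should be listed.
printf 'matches: %s\n' "${mp3s:-none}"
rm -rf "$tmp"
```

The 'sh' after the command string fills $0, as in the positional parameter explanation above, so every file name lands in the loop.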

bartonski · 06-27-2010, 01:38 AM