celembor started a thread here asking how to get disk usage for all files named '*.mp3' on his file system. druuna and I collectively came up with this:
Code:
find / -type f -iname '*.mp3' -print0 | xargs -0 du -ach
This will do a reasonable job of finding all files named '*.mp3'... but what happens if a user decides to be sneaky, and renames the files to avoid detection? To catch those, we'll need to use the 'file' command.
'file' contains a database of identifying byte patterns, also known as 'magic numbers', and searches for these patterns within a file of unknown type. It returns a text string which should tell us what type of file it is. For example:
Code:
$ file 203.mp3
203.mp3: Audio file with ID3 version 2.2.0, contains: MPEG ADTS, layer III, v1, 64 kbps, 44.1 kHz, Monaural
I think that any file for which file returns the pattern 'MPEG ... layer III' will be an mp3 file.
I'll test this:
Code:
$ file 203.mp3 | grep 'MPEG.*layer III' > /dev/null && echo 'this is an mp3 file!'
this is an mp3 file!
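Side note: depending on your version of 'file', the --mime-type flag may be a less fragile thing to match than the free-text description. I'd expect mp3s to come back as 'audio/mpeg', but that's an assumption -- check your own files first. A sketch:

```shell
# Hypothetical variant of the test above, matching the MIME type
# instead of the description string. Assumes this build of file(1)
# reports mp3s as "audio/mpeg" -- verify against real files first.
is_mp3() {
    file --mime-type -b "$1" | grep -q '^audio/mpeg'
}

# Sanity check on a non-mp3: a plain text file should not match.
printf 'just some text\n' > /tmp/not-an-mp3.txt
is_mp3 /tmp/not-an-mp3.txt || echo 'not an mp3 file'
# prints: not an mp3 file
```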
Ok... so now that I have a test that will return 'true' when a file is an mp3 file, I would like to use this within a find command.
The 'exec' test will execute a command on a given file. It's most often used naively, as an action: run a command on each file which matches the preceding tests within a find. For example, let's say that I want to copy all mp3 files newer than './This American Life/203.mp3' to an mp3 player that I have mounted at '/media/mp3-player'.
I could run this command:
Code:
find . -type f -newer './This American Life/203.mp3' -iname '*.mp3' -exec cp {} /media/mp3-player \;
This is not the best solution, however, because it spawns a new 'cp' process for every single file.
Better to use xargs instead:
Code:
find . -type f -newer './This American Life/203.mp3' -iname '*.mp3' -print0 | xargs -0 -I {} cp {} /media/mp3-player
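One wrinkle: with -I {}, xargs still runs one 'cp' per input file, so the process-count win over -exec mostly evaporates. If you have GNU cp, the -t flag names the destination directory up front, which lets xargs batch many source files into each cp invocation. A sketch against throwaway paths (/tmp/music and /tmp/mp3-player stand in for the real directories):

```shell
# Demo setup: an old reference file plus two newer mp3s.
mkdir -p /tmp/music /tmp/mp3-player
touch -t 202001010000 /tmp/music/ref.mp3
touch /tmp/music/new1.mp3 /tmp/music/new2.mp3

# GNU cp's -t takes the target directory first, so xargs can append
# a whole batch of source files to a single cp invocation instead of
# running one cp per file.
find /tmp/music -type f -newer /tmp/music/ref.mp3 -iname '*.mp3' -print0 |
    xargs -0 cp -t /tmp/mp3-player
```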
So why do I keep calling 'exec' a test? Because you can use it as an actual test within find.
Let's say that you've broken up with your girlfriend 'Brunhilde', and you want to delete all text files containing her name, without the pain of having to go back and read them all...
Code:
find . -type f -exec grep -qi 'brunhilde' {} \; -print0 | xargs -0 rm -f
(don't do this if you want to keep your collection of Wagner lyrics intact... as a matter of fact, don't do this at all).
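(A safer way to play with the same pattern: since -exec works as a test, pair it with a harmless -print first, and only swap in the deletion once the list looks right. A throwaway demo:)

```shell
# Demo: two files, only one of which mentions the name we're hunting.
mkdir -p /tmp/letters
printf 'Dear Brunhilde, it is over.\n' > /tmp/letters/breakup.txt
printf 'eggs, milk, bread\n'           > /tmp/letters/shopping.txt

# grep -qi exits 0 on a match, so -exec acts as a test here and
# -print only fires for matching files. Swap -print for the rm
# pipeline once the preview looks right.
find /tmp/letters -type f -exec grep -qi 'brunhilde' {} \; -print
# prints: /tmp/letters/breakup.txt
```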
Ok... so, getting back around to the original question...
I want to use "file {} | grep 'MPEG.*layer III'" as my test... it will return true for mp3 files, so it should work within find ... -exec ...
Unfortunately this fails horribly:
Code:
find . -type f -exec file {} | grep 'MPEG.*layer III' \; -print0 | xargs -0 du -ach
because the shell sees the pipe first, and breaks the find command into two separate commands.
So I went googling, and found this. I tried it out...
Code:
find . -type f -exec sh -c 'file $1 | grep "MPEG ADTS, layer III" > /dev/null' {} {} \; -print0 | xargs -0 du
It's horribly slow. As a matter of fact, it's so slow that when I started to write this post, I was writing about how it was hanging, and trying to figure out what was wrong with it. I was running it across my Music directory, which is a 1.9 G directory containing 507 files, 3 of them mp3s.
First question: why does it work? I can see that I'm calling 'sh -c' which is executing 'file $1 | grep "MPEG ADTS, layer III" > /dev/null' with the argument '{}', which is how find -exec expresses the matching files... but what is the second '{}' for?
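(One experiment that seems relevant here: watch how 'sh -c' assigns its trailing arguments -- the first one lands in $0, not $1:)

```shell
# The first argument after the inline script becomes $0 (the shell's
# idea of its own name); only the second becomes $1. With a single
# {}, $1 would be empty -- which looks like why {} is passed twice.
sh -c 'echo "0=$0 1=$1"' first second
# prints: 0=first 1=second
```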
Second, how to increase performance?
I ran the following to figure out whether 'file' was causing the bottleneck:
Code:
$ time find . -type f -print0 | xargs -0 file
./SALT - Seminars About Long Term Thinking/podcast-2010-06-16-moses.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 64 kbps, 44.1 kHz, Monaural
...
./debra_music/Tesla - Paradise (Acoustic).m4a: ISO Media, MPEG v4 system, version 2
./MN0035426.gif: GIF image data, version 89a, 300 x 400
./3807510_01.jpg: JPEG image data, JFIF standard 1.01
./This American Life/203.mp3: Audio file with ID3 version 2.2.0, contains: MPEG ADTS, layer III, v1, 64 kbps, 44.1 kHz, Monaural
real 0m0.386s
user 0m0.288s
sys 0m0.036s
Real time is under half a second, so 'file' itself isn't causing the problem.
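That suggests a different tack I haven't fully tested: run 'file' once across everything and filter its output, instead of one shell per file. It falls apart on filenames containing ':' or newlines, so it's only a sketch:

```shell
# Throwaway demo tree: one text file and one binary blob.
mkdir -p /tmp/scan
printf 'plain text here\n' > /tmp/scan/notes.txt
head -c 64 /dev/zero > /tmp/scan/blob.bin

# One file(1) invocation for the whole tree, then keep only the
# names whose description matches. cut assumes no ':' in filenames.
# For the mp3 case the pattern would be 'MPEG.*layer III', with the
# surviving names piped on to du.
find /tmp/scan -type f -print0 |
    xargs -0 file |
    grep 'ASCII text' |
    cut -d: -f1
# prints: /tmp/scan/notes.txt
```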
I think that sh -c 'file $1 | grep "MPEG ADTS, layer III" > /dev/null' {} is going to spawn 3 processes per file: one for 'sh -c', one for 'file $1', and one for the grep... but is the overhead of spawning ~1500 processes really going to be that big?
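A rough sanity check on that guess -- spawn about the same number of throwaway shells and see how long it takes:

```shell
# ~1500 fork+execs, the same order of magnitude as 3 processes
# x 507 files. On any modern box this should take seconds, not
# minutes, so process-spawning overhead alone can't explain a hang.
start=$(date +%s)
for i in $(seq 1500); do sh -c : ; done
end=$(date +%s)
echo "1500 shells in $((end - start)) seconds"
```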
I waited over 25 minutes (1500 seconds), and it still hadn't finished... that would mean over a second per spawned process, which doesn't sound right at all. I also figured that maybe something was being handed to 'du' that made it run forever (I expected it would only be handed files, but maybe I made a mistake, and it's running du across the file system multiple times)... so the second time around I piped the xargs output to 'echo' instead, and that ran just as slow...
I started chopping things out...
Code:
time find . -type f -exec sh -c 'echo $1 > /dev/null' {} {} \; -print0 | xargs -0 echo
runs in under a second.
Code:
time find . -type f -exec file {} \;
runs in a second and a half
but
Code:
time find . -type f -exec sh -c 'file $1' {} {} \; -print0 | xargs -0 echo
hangs on me... I don't get it.