Old 06-26-2010, 01:30 PM   #1
bartonski
Member
 
Registered: Jul 2006
Location: Louisville, KY
Distribution: Fedora 12, Slackware, Debian, Ubuntu Karmic, FreeBSD 7.1
Posts: 443
Blog Entries: 1

Rep: Reputation: 47
Using a pipe inside 'find ... -exec' ... how does it work, and why is it so slow?


celembor started a thread here asking how to get disk usage for all files named '*.mp3' on his file system.

druuna and I collectively came up with this:

Code:
find / -type f -iname '*.mp3' -print0 | xargs -0 du -ach
This will do a reasonable job of finding all files named '*.mp3'... but what happens if a user decides to be sneaky and renames the files to avoid detection? To find those, we'll need to use the 'file' command.

'file' consults a database of identifying byte patterns, also known as 'magic numbers', and searches for these patterns within a file of unknown type. It returns a text string which should tell us what type of file it is. For example:
Code:
$ file 203.mp3 
203.mp3: Audio file with ID3 version 2.2.0, contains: MPEG ADTS, layer III, v1,  64 kbps, 44.1 kHz, Monaural
I think that any file for which file returns the pattern 'MPEG ... layer III' will be an mp3 file.

I'll test this:

Code:
$ file 203.mp3 | grep 'MPEG.*layer III' > /dev/null && echo 'this is an mp3 file!'
this is an mp3 file!
Ok... so now that I have a test that will return 'true' when a file is an mp3 file, I would like to use this within a find command.

The '-exec' test executes a command on a given file. It is most often used simply to run a command on the files which match the preceding tests within a find. For example, let's say that I want to copy all mp3 files newer than './This American Life/203.mp3' to an mp3 player that I have mounted at '/media/mp3-player'.

I could run this command:

Code:
find . -type f  -newer './This American Life/203.mp3' -iname '*.mp3' -exec cp {} /media/mp3-player \;
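This is not the best solution, however, because it spawns a new process to copy each file.

Better to use xargs, which packs many file names into each cp invocation. Note that 'xargs -I {}' would still run one cp per file; with GNU cp, the '-t' option names the target directory first, so the file names can simply be appended in bulk:

Code:
find . -type f  -newer './This American Life/203.mp3' -iname '*.mp3' -print0 | xargs -0 cp -t /media/mp3-player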
So why use the 'exec' test at all? ... Use it as an actual test within find: it is true when the command it runs exits with status 0.

Let's say that you've broken up with your girlfriend 'Brunhilde', and you want to delete all text files containing her name, without the pain of having to go back and read them all...

Code:
find . -type f -exec grep -qi 'brunhilde' {} \; -print0 | xargs -0 rm -f

(don't do this if you want to keep your collection of Wagner lyrics intact... as a matter of fact, don't do this at all).
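To see '-exec' acting as a test without deleting anything, here is a minimal sketch run against a scratch directory. The directory and file names are made up for illustration; the point is only that '-print' fires for the file where grep exits with status 0.

```shell
# Scratch directory with two small files; only one mentions the name.
demo=$(mktemp -d)
printf 'Dear Brunhilde,\n' > "$demo/letter one.txt"
printf 'Buy milk\n' > "$demo/shopping.txt"

# -exec ... \; is evaluated as a test: -print runs only when grep succeeds.
matches=$(find "$demo" -type f -exec grep -qi 'brunhilde' {} \; -print)
echo "$matches"

# Clean up the scratch directory.
rm -rf "$demo"
```

Swapping '-print' for '-print0 | xargs -0 rm -f' gives the destructive version, so it is worth checking the '-print' output first.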

Ok... so, getting back around to the original question...

I want to use file {} | grep 'MPEG.*layer III' as my test ... it will return true, so it should work within find ... -exec...

Unfortunately this fails horribly:

Code:
find . -type f -exec file {} | grep 'MPEG.*layer III' \; -print0 | xargs du -ach
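because the shell splits the command line at the pipe before find ever runs. find is left with an '-exec' that has no terminating ';', and grep is handed ';' and '-print0' as its own arguments.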

so I went googling, and found this

I tried it out...

Code:
find . -type f -exec sh -c 'file $1 | grep "MPEG ADTS, layer III" > /dev/null' {} {} \; -print0 | xargs -0 du
It's horribly slow. As a matter of fact, it's so slow that when I started to write this post, I was writing about how it was hanging, and trying to figure out what was wrong with it. I was running it across my Music directory, which is a 1.9 G directory containing 507 files, with 3 mp3 files in it.

First question: why does it work? I can see that I'm calling 'sh -c' which is executing 'file $1 | grep "MPEG ADTS, layer III" > /dev/null' with the argument '{}', which is how find -exec expresses the matching files... but what is the second '{}' for?

Second, how to increase performance?

I ran the following to figure out whether 'file' was causing the bottleneck
Code:
$ time find . -type f -print0 | xargs -0 file
./SALT - Seminars About Long Term Thinking/podcast-2010-06-16-moses.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1,  64 kbps, 44.1 kHz, Monaural

...

./debra_music/Tesla - Paradise (Acoustic).m4a: ISO Media, MPEG v4 system, version 2
./MN0035426.gif: GIF image data, version 89a, 300 x 400
./3807510_01.jpg: JPEG image data, JFIF standard 1.01
./This American Life/203.mp3: Audio file with ID3 version 2.2.0, contains: MPEG ADTS, layer III, v1,  64 kbps, 44.1 kHz, Monaural

real    0m0.386s
user    0m0.288s
sys    0m0.036s
Real time is under half a second, so 'file' by itself isn't causing the problem.

I think that sh -c 'file $1 | grep "MPEG ADTS, layer III" > /dev/null' {} is going to spawn 3 processes per file: one for 'sh -c', one for 'file $1' and one for the grep... is the overhead for spawning 1500 processes going to be that big?
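The number of shells that find starts can be counted directly. The following is a small sketch on a scratch directory with three empty files (the names are arbitrary): each shell prints one line, so the line count is the number of 'sh' processes. It compares the '\;' form with the '{} +' form, which hands many file names to a single command. The extra 'sh' after the quoted string is just a placeholder that fills $0.

```shell
# Scratch directory with three empty files.
demo=$(mktemp -d)
touch "$demo/a" "$demo/b" "$demo/c"

# One shell per file: the command is run once for every match.
per_file=$(find "$demo" -type f -exec sh -c 'echo started' sh {} \; | wc -l | tr -d ' ')

# One shell per batch: all three names fit into a single invocation.
batched=$(find "$demo" -type f -exec sh -c 'echo started' sh {} + | wc -l | tr -d ' ')

echo "per_file=$per_file batched=$batched"
rm -rf "$demo"
```

With only a few hundred files, either form should finish quickly, which fits the observation that the 'echo' variant runs in under a second.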

I waited over 25 minutes (1500 seconds), and it still hadn't finished... that would mean over a second per spawned process, which doesn't sound right at all. I also figured that maybe something might be handed to 'du' that might make it run forever (I'm expecting that it would only be handed files, but maybe I made a mistake, and it's running du across the file system multiple times)... so I piped the xargs to 'echo', a second time around, and that ran slow...

I started chopping things out...
Code:
time find . -type f -exec sh -c 'echo $1 > /dev/null' {} {} \; -print0 | xargs -0 echo
runs in under a second.

Code:
time find . -type f -exec file {} \;
runs in a second and a half

but
Code:
time find . -type f -exec sh -c 'file $1' {} {} \; -print0 | xargs -0 echo
hangs on me... I don't get it.
 
Old 06-26-2010, 07:17 PM   #2
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,604

Rep: Reputation: 446
Hi,

that's an interesting problem. After some research I found the following in the man page:
Quote:
sh -c [some options...] command_string [command_name [argument ...]]
...
If command line arguments besides the options have been specified, then the shell treats the first argument as the
name of a file from which to read commands (a shell script), and the remaining arguments are set as the positional
parameters of the shell ($1, $2, etc). Otherwise, the shell reads commands from its standard input.
...
-c Read commands from the command_string operand instead of from the standard input. Special parameter 0 will be set from the command_name operand and the positional parameters ($1, $2, etc.) set from the remaining argument operands.
...
This sounds more complicated than it actually is. In fact, it is plain simple. A little example will clarify this (hopefully):
Suppose you have a traditional script, like
Code:
#!/bin/sh
echo $1
echo $0
Now call this script with
Code:
./script positional_param_one
This is equivalent to
Code:
sh -c 'echo $1; echo $0' script positional_param_one
So the first {} is your "script name". You can change it to anything you want, like
Code:
find . -type f -exec sh -c 'file $1 | grep "MPEG ADTS, layer III" > /dev/null' whatever {} \; -print0 | xargs -0 du
The second {} is the important one.
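The same mapping can be checked outside of find. In this sketch, 'find-sh' is an arbitrary label chosen for illustration and 'my file.mp3' is a made-up file name; the first operand after the command string ends up in $0 and the next one in $1. The second command shows why the pipe works here: it sits inside the quoted string, so the inner shell runs it.

```shell
# "find-sh" is a hypothetical label that lands in $0; the file name lands in $1.
out=$(sh -c 'printf "%s|%s\n" "$0" "$1"' find-sh 'my file.mp3')
echo "$out"

# The pipe is part of the quoted command string, so the inner sh sets it up.
upper=$(sh -c 'printf "%s\n" "$1" | tr "[:lower:]" "[:upper:]"' find-sh 'my file.mp3')
echo "$upper"
```

Using a fixed word such as 'sh' or 'find-sh' for $0 keeps error messages from the inner shell readable, since they are prefixed with $0.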



As for the slowness, as you already stated the exec option will spawn a new process (shell) for every result. Now I assume (I'm really not sure about this) that spawning a shell might take some more overhead than spawning a normal command. On top of that you execute a pipe with sh which produces another subshell. Even without the pipe I assume there would be at least two processes spawned - one for sh and another one for the command executed by sh. And I guess that determining the filetype by examining its structure - as done by file - takes its toll, too. Again, this is only an assumption by me. If after conducting some further tests you come to another conclusion I'd appreciate it if you shared your insights.
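If the per-file shell does turn out to matter, one way to cut it down is to let a single shell loop over a whole batch of names. This is only a sketch under a few assumptions: 'list_mp3s' is a hypothetical helper name, the 'MPEG.*layer III' pattern is the one from the first post, and 'file -b' (brief output, without the file name) is assumed to be available as it is in the common file(1) implementation.

```shell
# Hypothetical helper: print, NUL-terminated, every file under "$1" that
# file(1) describes as MPEG layer III audio.
list_mp3s() {
    find "$1" -type f -exec sh -c '
        for f do
            # -b omits the file name, so grep only sees the description
            file -b -- "$f" | grep -q "MPEG.*layer III" && printf "%s\0" "$f"
        done
    ' sh {} +
}
```

It could then be used as 'list_mp3s . | xargs -0 du -ch'. 'file' and 'grep' are still started once per file, but 'sh' is started only once per batch.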
 
Old 06-26-2010, 08:28 PM   #3
bartonski
Thanks for the explanation of the positional parameters... that made perfect sense.

Quote:
Originally Posted by crts View Post
As for the slowness, as you already stated the exec option will spawn a new process (shell) for every result. Now I assume (I'm really not sure about this) that spawning a shell might take some more overhead than spawning a normal command. On top of that you execute a pipe with sh which produces another subshell. Even without the pipe I assume there would be at least two processes spawned - one for sh and another one for the command executed by sh. And I guess that determining the filetype by examining its structure - as done by file - takes its toll, too. Again, this is only an assumption by me. If after conducting some further tests you come to another conclusion I'd appreciate it if you shared your insights.
I think that it's some weird interaction between 'find -exec', 'file' and 'sh -c'. This hangs:

Code:
find . -type f -exec sh -c 'file $1' {} {} \; -print0 | xargs -0 echo
I've removed the pipes and re-direction, so I've cut down the number of processes by half (I think). Furthermore, I let this run for an hour and a half on the same 508 files. I don't think that this is just a matter of some overhead.

If I remove the wrapper of 'sh -c' around the call to file,

Code:
time find . -type f -exec file {} \;
This runs in under a second and a half.

Going the other direction by replacing the call to 'file' with 'echo'

Code:
time find . -type f -exec sh -c 'echo $1 > /dev/null' {} {} \; -print0 | xargs -0 echo
runs in under a second... so it's neither the call to 'file' nor the use of 'sh -c' alone that causes the hang. I'm not sure what to test next...

Last edited by bartonski; 06-26-2010 at 08:31 PM.
 
Old 06-26-2010, 08:41 PM   #4
bartonski
Just to prove Finagle's law, this works just fine...

Code:
$ sh -c 'file $1 | grep -q "MPEG.*layer III"' {} 203.mp3 && echo "TRUE"
TRUE
 
Old 06-26-2010, 09:27 PM   #5
crts
that nasty whitespace ...

Hi,
try the following
Code:
time find . -type f -exec sh -c 'file "$1"' {} {} \; -print0 | xargs -0 echo
As it appears, the problem was that the unquoted 'file $1' could not cope with whitespace in filenames. After throwing tons of errors it came to a point where it just hung, leaving the shell completely unresponsive. Since you were operating on mp3 files I bet you had whitespace characters in there, too.
But I do not know why it hung after throwing hundreds of errors. Maybe a bug in file? It should at least be able to cope with input that it cannot process in some way that won't render the system useless.
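The effect of the missing quotes can be shown without 'file' at all. In this sketch the file name is made up, modelled on the 'SALT - Seminars ...' entry in the listing from the first post, and 'count_args' is a hypothetical helper that just reports how many arguments it received.

```shell
# Hypothetical helper: report the number of arguments received.
count_args() { echo "$#"; }

# Made-up name containing " - ", like several names in the listing above.
name='./SALT - Seminars/podcast.mp3'

unquoted=$(count_args $name)     # split on whitespace into several words
quoted=$(count_args "$name")     # passed through as a single argument
echo "unquoted=$unquoted quoted=$quoted"
```

One of the words produced by the unquoted form is a lone '-'. As far as I know, file(1) treats a '-' argument as "read standard input", so a plausible explanation for the hang (not confirmed in this thread) is that 'file' was simply waiting for input from the terminal.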

[EDIT]
Right now I am still running the command. The output is 'chunkwise', i.e. it hangs for a couple of seconds and then outputs a couple of screens. I think this has something to do with the buffer from xargs. While investigating this problem I encountered an error like:
'xargs: arguments line too long'
or something like that. Don't remember exactly. I'll let you know how it went when it finishes.

[UPDATE]
Ok, it finished now without any errors. Approximately 30,000 files were processed in under 15 minutes.

Last edited by crts; 06-26-2010 at 11:23 PM. Reason: update
 
Old 06-27-2010, 01:38 AM   #6
bartonski
DoH!
 
  

