[SOLVED] Shell Script - Performance issue while using find command
First thing, I don't understand the use of "!" in the find command; is it really needed there?
Second, the degraded performance could be because of the file sizes (i.e. "catalina.*", "*~" and "*.bz2") and the number of filters you're using. Every filter in the pipeline does work and takes time to process the stream of data coming into it.
So it's better to break the command into three parts, one per name pattern, and run three find commands instead of using -name three times, as follows:
find . -name "*.bz2" -type f -print > /tmp/filelist
find . -name "*~" -type f -print >> /tmp/filelist
find . -name "catalina.*" -type f -print >> /tmp/filelist
for file in $(cat /tmp/filelist); do
    grep xxx "$file" | grep "xx\|xx\|xx\|xx" | grep "status" >> /tmp/status.txt
done
1. He wanted BETTER performance; that would be SLOWER. Also, incorrect - see the next point.
2. The ! negates the next condition. He wants to SKIP those files whose names match the following -name pattern.
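For example, a minimal illustration (the pattern here is just a placeholder, not the poster's actual one):
# Print regular files EXCEPT those whose names match "*.log";
# the ! negates the -name test that immediately follows it.
find . -type f ! -name "*.log" -print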
To the problem.
1. I suspect that finding the list of files is fairly quick. You can save some time by loading that list into a variable and running the greps against the files in the list, but that will save you only a very little bit of time, at the cost of using more memory. XARGS is a fine tool, but it is only efficient when a command can be run against a whole batch of objects at one go (see the sketch below). I am not certain that you are accomplishing that here.
2. As you suspect by your statement of the problem, those multiple greps are likely to be your bottleneck. Arriving at a solution that ran filters in the background against each file and assembled the results only after all of the child processes were done would make better use of the multithreading and multitasking features of the shell and kernel and might improve performance significantly. (Again, using more memory and disk for temp storage of results is the tradeoff.)
I would have to know more about the environment and objective to craft an efficient solution, but that hint might be enough to allow you to make progress. Best of luck!
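To illustrate what "at one go" means, here is a minimal sketch (the name pattern and search string are placeholders, not taken from the original script):
# xargs packs many file names into each grep invocation, so the cost of
# starting a grep process is paid once per batch rather than once per file.
find . -type f -name "catalina.*" -print0 | xargs -0 grep "status" >> /tmp/status.txt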
The real speed would come if you could run the grep filter set against every file in the list at the same time. You would have to invent a plan for storing the temp results for each file in a safe location, merging the results after all of the filters had completed, and cleaning up the workspace after it was all done.
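A minimal sketch of that idea (the patterns, the file list, and the temp-file layout are placeholders I made up, not details from the original script):
#!/bin/bash
# Run the filter chain against each file as a background job, writing each
# result to its own temp file, then merge everything once all jobs finish.
tmpdir=$(mktemp -d)
i=0
while read -r file; do
    i=$((i + 1))
    grep xxx "$file" | grep "xx\|xx\|xx\|xx" | grep "status" > "$tmpdir/$i.out" &
done < /tmp/filelist
wait                                   # block until every background chain is done
cat "$tmpdir"/*.out >> /tmp/status.txt # merge the per-file results
rm -rf "$tmpdir"                       # clean up the workspace
In practice you would also want to cap the number of concurrent jobs (for example with xargs -P) so a long file list does not fork thousands of processes at once.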
XARGS is a fine tool, but only efficient when a command can be run efficiently against a list of objects at one go. I am not certain that you are accomplishing that here.
I'm pretty sure it is being accomplished here. I doubt that using a shell variable will speed anything up. Using fgrep (or grep -F) could be helpful, though if reading the files off the disk is the bottleneck, it won't help much. The middle grep can also be converted:
grep "xx\|xx\|xx\|xx"
# is equivalent to
grep -F -e xx -e xx -e xx -e xx
Using parallel processes would be the greatest advantage, but you have hit on a key point: every pass pipes the data through multiple grep processes and filters. If that could be optimized to a single call and filter, I would expect a significant improvement.
Combining the two techniques might be the best answer, but I am not sure the degree of improvement would reward the complexity. Were it my project, I would design several solutions and run some test trials to determine which best served my purpose.
The 4 sub-processes and that whole chain of grep processes could be replaced by a single awk (or perl, or any of several other scripting languages) process that reads all the input files once and processes each line according to the search string it matches. Without a bit more detail about what is really being done in that chain and some sample strings it's hard to elaborate further.
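As a rough sketch of that idea (the patterns are placeholders standing in for the real search strings, which weren't posted):
# A single awk process reads every file in the list once and prints only the
# lines that match all three patterns the grep chain was testing separately.
# (Assumes the file names in /tmp/filelist contain no spaces.)
xargs awk '/xxx/ && /xx|xx|xx|xx/ && /status/' < /tmp/filelist >> /tmp/status.txt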