LinuxQuestions.org

Shell Script - Performance issue while using find command (https://www.linuxquestions.org/questions/linux-newbie-8/shell-script-performance-issue-while-using-find-command-4175434327/)

nanthagopal 10-27-2012 04:39 AM

Shell Script - Performance issue while using find command
 
Hi,


I have created a shell script for an automation process.


It has 4 sub-processes. For each sub-process I want to grep for some strings, and finally print each sub-process's results.

I have used a find | xargs grep command for the grepping; the code is as follows:


Quote:

find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" | xargs grep xxx | grep "xx\|xx\|xx\|xx" | grep "status"

It takes 7-8 minutes to grep through all the records in the path.


Note:

For each sub-process, I have to grep the full records, so it takes approximately 7*4 = 28 minutes to finish the entire process.

So, how can I improve the performance?

Please suggest.

Regards,
Nanthagopal A

shivaa 10-27-2012 08:06 AM

First, I don't understand the use of "!" in the find command; is it really needed there?
Second, the degraded performance could be because of the file sizes (i.e. "catalina.*", "*~" and "*.bz2") and the number of filters you're using. Every filter in the pipeline takes time to process the stream of data coming into it.
So it's better to break the command into 3 parts, one per name pattern, and run 3 find commands instead of using -name 3 times, as follows:
Code:

#!/bin/bash
# Build the file list first, then grep each file in turn.
find . -name "*.bz2" -type f -print > /tmp/filelist
find . -name "*~" -type f -print >> /tmp/filelist
find . -name "catalina.*" -type f -print >> /tmp/filelist
fileslist=$(cat /tmp/filelist)
for file in $fileslist
do
    grep xxx "$file" | grep "xx\|xx\|xx\|xx" | grep "status" >> /tmp/status.txt
done


wpeckham 10-27-2012 08:45 AM

well, performance, eh...
 
Shivaa:
1. He wanted BETTER performance; that approach would be SLOWER. It is also incorrect - see the next point.
2. The ! negates the next condition. He wants to SKIP those files whose names match the following -name pattern.

To the problem.
1. I suspect that finding the list of files is fairly quick. You can save some time by loading that list into a variable and running the greps against the files in the list, but that will save you only a very little time, at the cost of more memory. xargs is a fine tool, but it is only efficient when a command can be run against a whole list of objects in one go, and I am not certain that you are accomplishing that here.

2. As you suspect by your statement of the problem, those multiple greps are likely to be your bottleneck. Arriving at a solution that ran filters in the background against each file and assembled the results only after all of the child processes were done would make better use of the multithreading and multitasking features of the shell and kernel and might improve performance significantly. (Again, using more memory and disk for temp storage of results is the tradeoff.)

I would have to know more about the environment and objective to craft an efficient solution, but that hint might be enough to allow you to make progress. Best of luck!

----added later
so instead of
Code:

find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" | xargs grep xxx | grep "xx\|xx\|xx\|xx" | grep "status"
perhaps something like
Code:

LIST=`find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*"`
fgrep 'xxx' $LIST | grep "xx\|xx\|xx\|xx" | fgrep 'status'

might be a bit faster.

The real speed would come if you can run a grep filter set against each file in the list at the same time. You would have to invent a plan for storing the temp results for each file in a safe location, merging the results after all of the filters had completed, and cleaning up the workspace after it was all done.
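
Something along these lines (an untested sketch, assuming bash for the process substitution, and keeping the placeholder xxx/xx patterns from the original command) would do that; a real version would also want to cap how many background jobs run at once:
Code:

#!/bin/bash
# Sketch: run the grep filter chain against each file as a background job,
# keep one temp result per file, then merge and clean up.
tmpdir=$(mktemp -d) || exit 1
i=0
while IFS= read -r f
do
    i=$((i + 1))
    grep xxx "$f" | grep "xx\|xx\|xx\|xx" | grep "status" > "$tmpdir/$i" &
done < <(find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*")
wait                                    # let all the background jobs finish
cat "$tmpdir"/* > /tmp/status.txt 2>/dev/null   # merge the per-file results
rm -rf "$tmpdir"                        # clean up the workspace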

ntubski 10-27-2012 11:14 AM

Quote:

Originally Posted by wpeckham (Post 4816140)
XARGS is a fine tool, but only efficient when a command can be run efficiently against a list of objects at one go. I am nor certain that you are accomplishing that here.

I'm pretty sure it is being accomplished here. I doubt using a shell variable will speed anything up. Using fgrep (or grep -F) could be helpful, although if reading the files off the disk is the bottleneck, it won't help much. The middle grep can also be converted:
Code:

grep "xx\|xx\|xx\|xx"
# is equivalent to
grep -F -e xx -e xx -e xx -e xx
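
Put together, the original pipeline might then look something like this (still with the placeholder patterns; -F just tells grep to treat the patterns as fixed strings instead of regular expressions):
Code:

find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" |
    xargs grep -F xxx | grep -F -e xx -e xx -e xx -e xx | grep -F status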


Reuti 10-27-2012 12:05 PM

Each grep will be a process on its own and these fight for the cores in the machine. I’m tempted to replace the three greps with one regular expression using
Code:

$ find … | xargs grep -E …
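
For instance, if the three pieces always occur in that order on a matching line (an assumption; the original chain matches them in any order), the whole pipeline could collapse to a single pattern:
Code:

$ find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" | xargs grep -E 'xxx.*(xx|xx|xx|xx).*status'

If the order is not fixed, a single awk pass (see below) is easier to write than one regular expression.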

wpeckham 10-28-2012 12:41 PM

Agreed
 
Using parallel processes would give the greatest advantage, but you have hit on a key point: every invocation pipes the data through multiple grep processes and filters. If that could be reduced to a single call and filter, I would expect a significant improvement.

Combining the two techniques might be the best answer, but I am not sure the degree of improvement would reward the complexity. Were it my project, I would design several solutions and run some test trials to determine which would best serve my purpose.
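
As a rough illustration of combining them (assuming GNU xargs, which has a -P option for parallel invocations, and a single combined pattern as above), something like this might be worth benchmarking; note that the output order is not guaranteed:
Code:

find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" -print0 |
    xargs -0 -P 4 -n 100 grep -E 'xxx.*(xx|xx|xx|xx).*status' > /tmp/status.txt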

rknichols 10-28-2012 02:49 PM

The 4 sub-processes and that whole chain of grep processes could be replaced by a single awk (or perl, or any of several other scripting languages) process that read all the input files once and processed each line according to the search string it matched. Without a bit more detail about what is really being done in that chain and some sample strings it's hard to elaborate further.
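
As a rough sketch of that idea (the xx1..xx4 patterns and the output file names here are hypothetical stand-ins, since the real strings were not posted), a single awk pass over the same file list could route each matching line to its sub-process result file:
Code:

# Remove any old result files before a fresh run, since awk appends below.
find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" -print0 |
xargs -0 awk '
    /xxx/ && /status/ {
        if      ($0 ~ /xx1/) print >> "/tmp/subprocess1.txt"
        else if ($0 ~ /xx2/) print >> "/tmp/subprocess2.txt"
        else if ($0 ~ /xx3/) print >> "/tmp/subprocess3.txt"
        else if ($0 ~ /xx4/) print >> "/tmp/subprocess4.txt"
    }'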

