Old 10-27-2012, 05:39 AM   #1
nanthagopal
LQ Newbie
 
Registered: Aug 2012
Posts: 3

Rep: Reputation: Disabled
Shell script - performance issue while using the find command


Hi,

I have created a shell script for an automation process.

It runs 4 sub-processes. For each sub-process I want to grep for some strings and finally print that sub-process's results.

I have used a find | xargs grep pipeline for the search; the code is as follows:

Code:
find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" | xargs grep xxx | grep "xx\|xx\|xx\|xx" | grep "status"
It takes 7-8 minutes to grep through all the records in the path.

Note:

For each sub-process I have to grep the full set of records, so the whole run takes approximately 7*4 = 28 minutes.

How can I improve the performance?

Please suggest.

Regards,
Nanthagopal A
 
Old 10-27-2012, 09:06 AM   #2
shivaa
Senior Member
 
Registered: Jul 2012
Location: Grenoble, Fr.
Distribution: Sun Solaris, RHEL, Ubuntu, Debian 6.0
Posts: 1,800
Blog Entries: 4

Rep: Reputation: 286
First, I don't understand the use of "!" in the find command; is it really needed there?
Second, the degraded performance could be due to the sizes of the files matching "catalina.*", "*~" and "*.bz2", and the number of filters you're using. Every filter in the pipeline takes time to process the stream of data coming into it.
So it's better to break the command into 3 parts, one find command per name pattern, instead of using -name 3 times, as follows:
Code:
#!/bin/bash
# Build the file list first, then read it (the original read it before writing it)
find . -name "*.bz2" -type f -print > /tmp/filelist
find . -name "*~" -type f -print >> /tmp/filelist
find . -name "catalina.*" -type f -print >> /tmp/filelist
fileslist=$(cat /tmp/filelist)
for file in $fileslist
do
    grep xxx "$file" | grep "xx\|xx\|xx\|xx" | grep "status" >> /tmp/status.txt
done

Last edited by shivaa; 10-27-2012 at 09:12 AM.
 
Old 10-27-2012, 09:45 AM   #3
wpeckham
Senior Member
 
Registered: Apr 2010
Location: USA
Distribution: Debian, Ubuntu, Fedora, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, Vsido, tinycore, Q4OS
Posts: 1,648

Rep: Reputation: 568
well, performance, eh...

Shivaa:
1. He wanted BETTER performance; that would be SLOWER. It is also incorrect - see the next point.
2. The ! negates the next condition. He wants to SKIP the files whose names match the following -name pattern.

To the problem.
1. I suspect that finding the list of files is fairly quick. You can save some time by loading that list into a variable and running the greps against the files in that list. That will save only a very little time, by using more memory. xargs is a fine tool, but it is only efficient when a command can be run against a whole list of objects in one go, and I am not certain that is being accomplished here.

2. As your statement of the problem suggests, those multiple greps are likely to be your bottleneck. A solution that ran the filters in the background against each file and assembled the results only after all of the child processes were done would make better use of the multithreading and multitasking features of the shell and kernel, and might improve performance significantly. (Again, more memory and disk for temporary storage of results is the tradeoff.)

I would have to know more about the environment and objective to craft an efficient solution, but that hint might be enough to allow you to make progress. Best of luck!

----added later
so instead of
Code:
find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" | xargs grep xxx | grep "xx\|xx\|xx\|xx" | grep "status"
perhaps something like
Code:
LIST=`find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*"`
fgrep 'xxx' $LIST | grep "xx\|xx\|xx\|xx" | fgrep 'status'
might be a bit faster.

The real speed would come if you can run a grep filter set against each file in the list at the same time. You would have to invent a plan for storing the temp results for each file in a safe location, merging the results after all of the filters had completed, and cleaning up the workspace when it was all done.
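A rough sketch of that plan in bash (untested; the patterns are the placeholders from the original post, and the temp file names are made up):
Code:
#!/bin/bash
# Filter each file in its own background job, collect per-file results
# in a temp workspace, merge them once every job has finished.
tmpdir=$(mktemp -d)
find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" > "$tmpdir/filelist"
i=0
while read -r f; do
    i=$((i + 1))
    grep 'xxx' "$f" | grep "xx\|xx\|xx\|xx" | grep 'status' \
        > "$tmpdir/result.$i" &
done < "$tmpdir/filelist"
wait                                      # let every background filter finish
cat "$tmpdir"/result.* > /tmp/status.txt  # merge the per-file results
rm -r "$tmpdir"                           # clean up the workspace
One job per file can swamp the box if there are many files, so in practice you would cap the number of concurrent jobs.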

Last edited by wpeckham; 10-27-2012 at 09:59 AM. Reason: addendum
 
Old 10-27-2012, 12:14 PM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Arch
Posts: 3,013

Rep: Reputation: 1225
Quote:
Originally Posted by wpeckham View Post
xargs is a fine tool, but it is only efficient when a command can be run against a whole list of objects in one go, and I am not certain that is being accomplished here.
I'm pretty sure it is being accomplished here. I doubt using a shell variable will speed anything up. Using fgrep (or grep -F) could be helpful, though if reading the files off the disk is the bottleneck it won't help much. The middle grep can also be converted:
Code:
grep "xx\|xx\|xx\|xx" 
# is equivalent to
grep -F -e xx -e xx -e xx -e xx
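Applied to the whole pipeline, that gives something like this (still with the thread's placeholder strings):
Code:
find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" \
    | xargs grep -F xxx | grep -F -e xx -e xx -e xx -e xx | grep -F status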
 
Old 10-27-2012, 01:05 PM   #5
Reuti
Senior Member
 
Registered: Dec 2004
Location: Marburg, Germany
Distribution: openSUSE 13.1
Posts: 1,326

Rep: Reputation: 253
Each grep will be a process of its own, and they all fight for the cores in the machine. I'm tempted to replace the three greps with one regular expression using
Code:
$ find … | xargs grep -E …
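For the posted pipeline that might look like the following; note the combined pattern assumes the three strings always appear in this left-to-right order on a matching line (the xx strings are placeholders), otherwise awk is the cleaner route:
Code:
find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" \
    | xargs grep -E 'xxx.*(xx|xx|xx|xx).*status'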

Last edited by Reuti; 10-27-2012 at 01:07 PM. Reason: Formatting
 
Old 10-28-2012, 01:41 PM   #6
wpeckham
Senior Member
 
Registered: Apr 2010
Location: USA
Distribution: Debian, Ubuntu, Fedora, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, Vsido, tinycore, Q4OS
Posts: 1,648

Rep: Reputation: 568
Agreed

Using parallel processes would give the greatest advantage, but you have hit on a key point: every run pipes the data through multiple grep processes and filters. If that could be optimized to a single call and filter, I would expect a significant improvement.

Combining the two techniques might be the best answer, but I am not sure the degree of improvement would reward the complexity. Were it my project, I would design several solutions and run some test trials to determine which would best serve my purpose.
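For what it's worth, one way to combine the two, assuming GNU xargs (for -P) and the ordering caveat from the previous post:
Code:
# Run up to 4 grep processes in parallel, each scanning its share of the
# files in one pass; output lines from different files may interleave.
find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" -print0 \
    | xargs -0 -P 4 grep -E 'xxx.*(xx|xx|xx|xx).*status' > /tmp/status.txt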
 
Old 10-28-2012, 03:49 PM   #7
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: CentOS
Posts: 2,957

Rep: Reputation: 1267
The 4 sub-processes and that whole chain of grep processes could be replaced by a single awk (or perl, or any of several other scripting languages) process that read all the input files once and processed each line according to the search string it matched. Without a bit more detail about what is really being done in that chain, and some sample strings, it's hard to elaborate further.
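As a rough sketch of that idea, with made-up distinct placeholders xx1 through xx4 for the four sub-process strings and hypothetical output file names:
Code:
# One pass over all the files: each line matching the common strings is
# routed to the output file of whichever sub-process pattern it contains.
# Assumes xargs starts a single awk process for the whole file list.
find . -type f ! -name "*.bz2" ! -name "*~" -name "catalina.*" -print0 \
    | xargs -0 awk '
        /xxx/ && /status/ {
            if      (/xx1/) print > "/tmp/subproc1.txt"
            else if (/xx2/) print > "/tmp/subproc2.txt"
            else if (/xx3/) print > "/tmp/subproc3.txt"
            else if (/xx4/) print > "/tmp/subproc4.txt"
        }'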

Last edited by rknichols; 10-28-2012 at 03:50 PM.
 
  

