LinuxQuestions.org
Old 02-12-2009, 07:46 AM   #1
int0x80
Member
 
Scaling grep


Each day I generate about 70GB of data total across ~30 different text files (/home/int0x80/data/*.txt). Then these text files need to be searched for ~750 patterns (all_patterns.txt). My first approach was to try a simple grep -f:

Code:
grep -h -f all_patterns.txt /home/int0x80/data/*.txt >> matches_$(date +%m%d%Y).txt
This was not workable: it took more than half a day to grep for just one pattern. After reading http://www.fam.tuwien.ac.at/~schaman..._pattern_files, I tried this approach with a small tweak:

Code:
#!/bin/bash
split -l 20 all_patterns.txt all_patterns.txt.split.
for chunk in all_patterns.txt.split.*; do
    grep -h -f "$chunk" /home/int0x80/data/*.txt >> matches_$(date +%m%d%Y).txt &
done
This has been running for 18 hours and is still going. Does anyone have any experience with, or recommendations for, dealing with data sets this large?

My box has a single quad-core Xeon E5420 and 3GB of RAM.

TIA
 
Old 02-12-2009, 08:27 AM   #2
rweaver
Senior Member
 
Quote:
Originally Posted by int0x80
Each day I generate about 70GB ...<SNIP>... TIA
Can you give us some sample data from your patterns file? Is it comma delimited, one per line, etc? How complex are the patterns you're searching for?

You may be using the wrong tool, in all honesty. Perl, sed, and awk may be better suited to this, although no matter what, parsing 70GB of raw text 750 times (51.26 *terabytes* of data read) is going to take a while. A better solution might be to store the data in a database or put it on a RAM drive to decrease read times.

One solution that might work using Perl is to read, say, X GB chunks into memory, perform all 750 searches in one go on that chunk, then drop it out of memory and read the next chunk in.
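
Something like this C++ sketch shows that chunked-scan idea (the structure would be the same in Perl). The chunk size, file name, and the way matches are reported are illustrative assumptions only; it also assumes patterns never span a line break, so the trailing partial line of each chunk is carried into the next read instead of being searched twice.

Code:
// Sketch: read the data in large chunks and run every pattern over the
// in-memory chunk, instead of re-reading the file once per pattern.
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream pat_in("all_patterns.txt");
    std::vector<std::string> patterns;
    for (std::string p; std::getline(pat_in, p); )
        if (!p.empty()) patterns.push_back(p);

    std::ifstream data("data.txt");                   // hypothetical input file
    const std::size_t kChunk = 64 * 1024 * 1024;      // 64 MB per read (assumption)
    std::string buf(kChunk, '\0'), carry;             // carry = unfinished last line

    while (data.read(&buf[0], kChunk) || data.gcount() > 0) {
        std::string chunk = carry + buf.substr(0, (std::size_t)data.gcount());
        std::size_t nl = chunk.rfind('\n');
        if (nl == std::string::npos) { carry = chunk; continue; }
        std::string block = chunk.substr(0, nl + 1);  // whole lines only
        carry = chunk.substr(nl + 1);                 // searched with the next chunk
        for (const auto& p : patterns)                // all 750 searches on one chunk
            for (std::size_t pos = block.find(p); pos != std::string::npos;
                 pos = block.find(p, pos + 1))
                std::cout << "match: " << p << '\n';
    }
    for (const auto& p : patterns)                    // whatever followed the last newline
        if (carry.find(p) != std::string::npos)
            std::cout << "match: " << p << '\n';
}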

You could also most likely simplify your existing script by doing something like:

Code:
while IFS= read -r pattern; do
  grep -h -e "$pattern" /home/int0x80/data/*.txt >> matches_$(date +%m%d%Y).txt
done < pattern_file.txt
(and yes, I know this is doing the patterns one at a time vs chunks of patterns)

Also, I notice you're backgrounding each of the greps, which in the case of your script means you've got roughly 38 backgrounded copies of grep (one per 20-line chunk of the ~750 patterns) running at once, all clobbering the I/O of your drive. The number of processors and the amount of RAM aren't going to help you if you're causing a bottleneck on the drives.

Another minor thing that might help is changing to ext4 or XFS, since they tend to allocate files for sequential reads better than ext2/3 (notably faster). Mirroring (RAID 1) would probably also be of some help if you're not already doing it.

Last edited by rweaver; 02-12-2009 at 08:50 AM.
 
Old 02-12-2009, 08:37 AM   #3
int0x80
Member
 
Original Poster
Thanks for the reply. The patterns are one per line, as required by grep's -f option. They are not complex; all but a few are just simple text that could be found with strstr or something comparable.

The RAM disk is an interesting idea, though some of the files are larger than the amount of RAM in the system. What would you suggest doing with Perl/sed/awk that might be faster than grep?
 
Old 02-12-2009, 08:55 AM   #4
johnsfine
LQ Guru
 
I haven't looked at the source code of grep to see how it manages multiple patterns, but the idea that you could speed it up by dividing the pattern set into chunks is very counterintuitive. If correct, it implies grep has a very inefficient way of dealing with multiple patterns.

It may be worthwhile to just write a new C++ program to do the search. There may be no general purpose tools that can get decent performance against that much data.

Do the patterns themselves include regular expression features or are they simple text? Also how long are they?

If they are simple text, that greatly simplifies the task of writing a new program and increases the odds that such a program could significantly outperform grep. Also, if the patterns are fairly long, a program that takes advantage of that could do the 750 parallel searches much faster. Even if they aren't long, simply knowing the typical pattern length lets you make design choices for the parallel search that would beat a general-purpose tool, which can't assume anything about pattern length.

I expect you also know some things about character frequencies that could speed things up a lot, if you choose an algorithm that can exploit them. For example, compare the distribution of the 750 characters that are the first character of each pattern, the 750 characters that are the last character of each pattern, and the characters of the 70GB input set.

Quote:
Originally Posted by rweaver
no matter what, parsing 70GB of raw text 750 times (51.26 *terabytes* of data read) is going to take a while.
If the tool you're using does anything like that, then it is not an effective tool for doing a parallel search for 750 patterns.

The sane approach is to do some kind of preprocessing of the pattern set to build a control table that lets the parallel search get by with far fewer than 750 comparisons per input character.
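
As a minimal illustration of what such a control table buys you (just a sketch, not a specific proposal for this data set): bucket the patterns by their first byte, so each input position is only compared against the handful of patterns that could possibly start there, rather than all 750. The function names are invented for the example, and it handles plain case-sensitive strings only.

Code:
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Control table (sketch): first byte -> patterns that begin with that byte.
std::array<std::vector<const std::string*>, 256>
build_first_byte_table(const std::vector<std::string>& patterns) {
    std::array<std::vector<const std::string*>, 256> table;
    for (const auto& p : patterns)
        if (!p.empty())
            table[(unsigned char)p[0]].push_back(&p);
    return table;
}

// Count pattern occurrences in one block of text; each position is checked
// only against the (usually short) list of patterns sharing its first byte.
std::size_t scan(const std::string& text,
                 const std::array<std::vector<const std::string*>, 256>& table) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
        for (const std::string* p : table[(unsigned char)text[i]])
            if (text.compare(i, p->size(), *p) == 0)
                ++hits;
    return hits;
}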

Last edited by johnsfine; 02-12-2009 at 09:05 AM.
 
Old 02-12-2009, 09:30 AM   #5
int0x80
Member
 
Original Poster
Thanks for the ideas. Less than 1% of the patterns require a regular expression; in fact, all but a couple of the patterns are simple ASCII strings less than 20 characters long.

WRT splitting the patterns, the link in the OP demonstrates empirical data that grep performs optimally with pattern sets of size 20 when using the -f flag.

So just to make sure I understand correctly, you are advocating writing a compiled program to essentially read lines from the text files and search them using strstr or a comparable function. Is this accurate? I don't know anything about parallel computing, but will begin with Google :]

Edit: Just now saw the addendum on your post regarding frequency analysis. So essentially it might be possible to write a leaner/more efficient strstr/strcmp function knowing the character distribution and patterns, eh?

Last edited by int0x80; 02-12-2009 at 09:33 AM.
 
Old 02-12-2009, 09:38 AM   #6
int0x80
Member
 
Original Poster
Quote:
Originally Posted by rweaver
Also, I notice you're backgrounding each of the greps, which in the case of your script means you've got roughly 38 backgrounded copies of grep (one per 20-line chunk of the ~750 patterns) running at once, all clobbering the I/O of your drive. The number of processors and the amount of RAM aren't going to help you if you're causing a bottleneck on the drives.

Another minor thing that might help is changing to ext4 or XFS, since they tend to allocate files for sequential reads better than ext2/3 (notably faster). Mirroring (RAID 1) would probably also be of some help if you're not already doing it.
Good points. I hadn't considered the filesystem; perhaps a reformat is in order.
 
Old 02-12-2009, 10:08 AM   #7
johnsfine
LQ Guru
 
Quote:
Originally Posted by int0x80
Thanks for the ideas. Less than 1% of the patterns require a regular expression; in fact, all but a couple of the patterns are simple ASCII strings less than 20 characters long.
Unfortunately, most of the regular-expression considerations are coding-complexity issues for which 1% might be just as bad as 100%. But maybe you have some more restrictive information about the structure of those regular expressions that will keep things relatively simple.

Quote:
So just to make sure I understand correctly, you are advocating writing a compiled program to essentially read lines from the text files and search them using strstr or a comparable function.
Certainly not strstr or anything similar. There are many approaches. The most general would be similar to the Boyer-Moore-Horspool algorithm (which simplifies Boyer-Moore by leaving out a step that matters for the theoretical worst case but rarely helps on real data).

1) The preprocessor needs to select an anchor point in each of the 750 strings. There are a few considerations in selecting an anchor point:
A) The characters that can occur at the anchor point of a given string should have low total probability in the full data. For example, in ordinary text, if you could select either a pattern position that is a lowercase "e" or one that is a case-insensitive "z" (meaning it might be Z or z), I expect the total probability of z and Z in the target text to be less than the probability of e, so the "z" position is the better anchor.
B) If there is enough structural similarity between the 750 patterns to make this possible, you should select anchor points such that the set of characters that can occur one position before any anchor point, across all 750 patterns, is smaller than the full set of characters that occur in the target text. If possible, extend that to N characters before the anchor.
C) You need to be able to compare backwards from the anchor point if it isn't the first character and forwards if it isn't the last. Regular expressions may make that tricky. Otherwise it isn't a consideration.

2) You make a 256-entry table pointing to 256 lists of pointers to patterns, such that each pattern is listed under every possible value of its anchor position (that case-insensitive z would be listed under both 'z' and 'Z').

3) If you achieved any of (B) above, you make a skip table, such that any character that can occur 1 before the anchor of some pattern gets a 1, any that can't occur 1 before but can occur 2 before gets a 2, any that can't occur 1 or 2 before but can occur 3 before gets a 3, etc.

4) The major operation of the search is to consider one input position at a time, look its character up in table (2), and test the before-anchor and after-anchor parts of each pattern in the list found there. (Don't test the anchor position itself, because you already know it matches.)

5) After that, you look the character up in table (3) and advance your input pointer that far.
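
Here is a rough C++ sketch of those steps, with one big simplification: every pattern is anchored at the same offset (the last byte of its first m characters, where m is the shortest pattern length), rather than picking a per-pattern anchor by character frequency as in (A) and (B). It handles fixed, case-sensitive strings only; the class and callback names are invented for the example, and it assumes a non-empty pattern list with no empty patterns.

Code:
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Multi-pattern search built from the steps above: a 256-entry anchor table
// (step 2), a skip table (step 3), verification of the candidates listed
// under the anchor byte (step 4), and a skip-table advance (step 5).
struct MultiSearch {
    std::vector<std::string> pats;
    std::size_t m = 0;                                 // shortest pattern length
    std::array<std::vector<std::size_t>, 256> anchor;  // anchor byte -> pattern indices
    std::array<std::size_t, 256> skip{};               // how far the scan may advance

    explicit MultiSearch(std::vector<std::string> patterns) : pats(std::move(patterns)) {
        m = pats.front().size();
        for (const auto& p : pats) m = std::min(m, p.size());
        skip.fill(m);                                  // byte never seen before an anchor: jump m
        for (std::size_t i = 0; i < pats.size(); ++i) {
            const std::string& p = pats[i];
            for (std::size_t j = 0; j + 1 < m; ++j)    // bytes 1..m-1 positions before the anchor
                skip[(unsigned char)p[j]] =
                    std::min(skip[(unsigned char)p[j]], m - 1 - j);
            anchor[(unsigned char)p[m - 1]].push_back(i);
        }
    }

    // Calls on_match(start_offset, pattern_index) for every occurrence in text.
    template <class F>
    void scan(const std::string& text, F&& on_match) const {
        if (text.size() < m) return;
        for (std::size_t i = m - 1; i < text.size(); i += skip[(unsigned char)text[i]]) {
            for (std::size_t idx : anchor[(unsigned char)text[i]]) {
                const std::string& p = pats[idx];
                std::size_t start = i - (m - 1);       // where this pattern would begin
                if (start + p.size() <= text.size() &&
                    text.compare(start, p.size(), p) == 0)
                    on_match(start, idx);
            }
        }
    }
};

To feed it 70GB, you would read each file in large blocks that overlap by (longest pattern length - 1) bytes, or split on line breaks as in the chunked-read sketch earlier, so matches crossing a block boundary aren't lost.
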

Whether this method is any good depends on how well you can hit goals A and B when preprocessing the patterns. Of course you don't want to pre-read the whole input just to determine the character frequencies. I'm assuming you already know quite a bit about the character frequencies from what you know about the source of that data.

If you want to post a bunch of your typical pattern strings, I may be able to make some of the above more understandable with examples, or maybe I'll say an entirely different approach is better.

Quote:
read lines from the text files
If lines are a fundamental unit in this process, that may also simplify things. Can a pattern include or match the newline character so it spans lines? Or can a match only occur within a line?

My main description above doesn't give you a good way to make use of your four cores. But if there are any frequent "hard boundaries" (that no pattern can span) in the input, such as line breaks, it becomes easy to make good use of four cores. Once the preprocessing is done, each thread can:
1) Lock out other threads
2) Read several lines of input
3) Unlock
4) Search within the several lines just read.
If searching is slower than reading, the lock won't cost much. If searching is faster than reading, multiple cores couldn't help anyway, nor could a better search algorithm.
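
A minimal sketch of that read-under-lock scheme (the batch size, file name, and the search_block() stub are placeholders for illustration):

Code:
#include <fstream>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex io_mutex;

void search_block(const std::vector<std::string>& lines) {
    (void)lines;   // placeholder: run the multi-pattern search over this batch
}

void worker(std::ifstream& in) {
    for (;;) {
        std::vector<std::string> batch;
        {
            std::lock_guard<std::mutex> lock(io_mutex);            // 1) lock out other threads
            std::string line;
            while (batch.size() < 4096 && std::getline(in, line))  // 2) read several lines
                batch.push_back(std::move(line));
        }                                                          // 3) unlock
        if (batch.empty()) return;                                 // no input left
        search_block(batch);                                       // 4) search what was just read
    }
}

int main() {
    std::ifstream in("data.txt");            // hypothetical input file
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)              // one worker per core
        threads.emplace_back(worker, std::ref(in));
    for (auto& t : threads) t.join();
}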

For an input size this big, it may be better to use a pair of characters as the anchor rather than a single character. That means your two tables each have 65,536 entries instead of 256, which may have some bad cache consequences, but all the probability considerations get so much better with pairs of characters that it is likely worth it.

The effectiveness of (B) should go up enough with pairs of characters that you need the anchor point on the shortest pattern to be at the end in order to avoid loss of effectiveness in (B).

Last edited by johnsfine; 02-12-2009 at 10:52 AM.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Trying to understand pipes - Can't pipe output from tail -f to grep then grep again lostjohnny Linux - Newbie 15 03-12-2009 10:31 PM
how to grep multiple filters with grep LinuxLover Linux - Enterprise 1 10-18-2007 07:12 AM
grep output on stdout and grep output to file don't match xnomad Linux - General 3 01-13-2007 04:56 AM
bash script with grep and sed: sed getting filenames from grep odysseus.lost Programming 1 07-17-2006 11:36 AM
ps -ef|grep -v root|grep apache<<result maelstrombob Linux - Newbie 1 09-24-2003 11:38 AM

LinuxQuestions.org > Forums > Linux Forums > Linux - General

All times are GMT -5. The time now is 01:24 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration