Quote:
Originally Posted by int0x80
Thanks for the ideas. Less than 1% of the patterns require a regular expression, in fact all but a couple patterns are simple ASCII strings less than 20 characters in length.
Unfortunately, most of the regular expression considerations are coding complexity issues, for which 1% might be just as bad as 100%. But maybe you have some more restrictive info on the structure of the regular expressions that will keep things relatively simple.
Quote:
So just to make sure I understand correctly, you are advocating writing a compiled program to essentially read lines from the text files and search them using strstr or a comparable function.
Certainly not strstr or anything similar. There are many approaches. The most general would be similar to the Boyer-Moore-Horspool algorithm (which just simplifies Boyer-Moore by leaving out a step that is theoretically important for worst case, but not usually helpful in real data).
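For reference, plain Horspool for a single pattern looks roughly like this (a minimal, untuned sketch; the multi-pattern scheme described next generalizes the same skip idea):

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: returns the index of the first match of pat
   in text, or -1 if there is none. The skip table says, for each
   byte, how far the search window may shift when that byte is the
   last byte of the current window. */
long horspool_search(const char *text, long n, const char *pat, long m)
{
    long skip[256];
    if (m <= 0 || n < m)
        return m == 0 ? 0 : -1;
    for (int i = 0; i < 256; i++)
        skip[i] = m;                      /* byte not in pattern: shift whole length */
    for (long i = 0; i < m - 1; i++)
        skip[(unsigned char)pat[i]] = m - 1 - i;
    long pos = 0;
    while (pos <= n - m) {
        if (memcmp(text + pos, pat, (size_t)m) == 0)
            return pos;
        pos += skip[(unsigned char)text[pos + m - 1]];
    }
    return -1;
}
```

The point to carry over is that most input bytes cause a multi-character shift, so most of the input is never examined at all.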
1) The preprocessor needs to select an anchor point in each of the 750 strings. There are a few considerations in selecting an anchor point:
A) The characters that can occur at the anchor point of a given string should have low total probability in the full data. For example, in ordinary text, if you could select either a pattern position holding a lowercase "e" or one holding a case-insensitive "z" (meaning it might be Z or z), prefer the "z", since I expect the total probability of z and Z in the target text to be less than the total probability of e.
B) If there is enough structural similarity among the 750 patterns to make this possible, you should select anchor points such that the set of characters that can occur one position before the anchor point, across all 750 patterns, is a proper subset of the full set of characters in the target data. If possible, extend that to N positions before the anchor.
C) You need to be able to compare backwards from the anchor point if it isn't the first character and forwards if it isn't the last. Regular expressions may make that tricky. Otherwise it isn't a consideration.
2) You make a 256-entry table pointing to 256 lists of pointers to patterns, such that each pattern is pointed to by every possible value of its anchor position (that case-insensitive z would appear in both the 'z' and 'Z' lists).
3) If you achieved any of (B) above, you make a skip table, such that any character that can occur 1 before the anchor of any pattern has a 1, any that can't occur 1 before but can occur 2 before has a 2, any that can't occur 1 or 2 before but can occur 3 before has a 3, etc.
4) The main search operation is to consider one input position at a time, look its character up in table (2), and test the before-anchor and after-anchor parts of each pattern in the list found there. (Don't test the anchor position itself, because you already know it matches.)
5) After that, you look the character up in table (3) and advance your input pointer that far.
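Putting steps 2–5 together, here is a minimal sketch in C. It assumes plain non-empty ASCII string patterns (no regex), fixes the anchor at each pattern's last byte rather than choosing it per (A)/(B), and uses fixed-size arrays for brevity; the names (`Pattern`, `Matcher`, `matcher_init`, `matcher_search`) are mine, not a standard API:

```c
#include <stddef.h>
#include <string.h>

#define MAXPAT 750   /* enough for the 750 patterns discussed above */

typedef struct { const char *s; size_t len; } Pattern;

typedef struct {
    int     list[256][MAXPAT]; /* step 2: per anchor byte, pattern indices */
    int     count[256];
    size_t  skip[256];         /* step 3: safe forward shift per input byte */
    Pattern pat[MAXPAT];
    int     npat;
    size_t  minlen;
} Matcher;

/* Preprocessing (steps 2-3), anchoring each pattern at its last byte. */
void matcher_init(Matcher *m, const Pattern *pats, int n)
{
    m->npat = n;
    m->minlen = pats[0].len;
    for (int i = 1; i < n; i++)
        if (pats[i].len < m->minlen) m->minlen = pats[i].len;
    memset(m->count, 0, sizeof m->count);
    for (int c = 0; c < 256; c++)
        m->skip[c] = m->minlen;    /* byte seen nowhere near an anchor */
    for (int i = 0; i < n; i++) {
        m->pat[i] = pats[i];
        unsigned char a = (unsigned char)pats[i].s[pats[i].len - 1];
        m->list[a][m->count[a]++] = i;
        /* a byte d positions before some anchor caps the shift at d */
        for (size_t d = 1; d < pats[i].len && d < m->minlen; d++) {
            unsigned char c = (unsigned char)pats[i].s[pats[i].len - 1 - d];
            if (m->skip[c] > d) m->skip[c] = d;
        }
    }
}

/* Search (steps 4-5): returns the number of pattern occurrences. */
long matcher_search(const Matcher *m, const char *text, size_t n)
{
    long hits = 0;
    size_t pos = m->minlen - 1;  /* earliest position an anchor can occupy */
    while (pos < n) {
        unsigned char c = (unsigned char)text[pos];
        for (int k = 0; k < m->count[c]; k++) {  /* step 4: verify candidates */
            const Pattern *p = &m->pat[m->list[c][k]];
            if (pos + 1 >= p->len &&
                memcmp(text + pos + 1 - p->len, p->s, p->len) == 0)
                hits++;
        }
        pos += m->skip[c];       /* step 5: advance by the skip table */
    }
    return hits;
}
```

Note that no shift can exceed the shortest pattern's length, because a position that far ahead could be the anchor of the shortest pattern no matter what the current byte is; that is why anchor placement on the shortest pattern matters.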
Whether this method is any good depends on how well you can hit goals (A) and (B) in preprocessing the patterns. Of course you don't want to pre-read the whole input data to determine the character frequencies. I'm assuming you know quite a bit about the character frequencies from knowledge about the source of that data.
If you want to post a bunch of your typical pattern strings, I may be able to make some of the above more understandable with examples, or maybe I'll say an entirely different approach is better.
Quote:
read lines from the text files
If lines are a fundamental unit in this process, that may also simplify things. Can a pattern include or match the newline character so it spans lines? Or can a match only occur within a line?
My main description above doesn't give you a good way to make use of your four cores. But if the input has frequent "hard boundaries" that no pattern can span, such as line breaks, it becomes easy to make good use of four cores: once the preprocessing is done, each thread can:
1) Lock out other threads
2) Read several lines of input
3) Unlock
4) Search within the several lines just read.
If searching is slower than reading, the lock won't cost much. If searching is faster than reading, multiple cores couldn't help anyway, nor could a better search algorithm.
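That lock/read/unlock/search loop might look like the following, assuming POSIX threads and line-oriented input. `strstr` stands in for the real matcher purely to keep the sketch short (as I said above, you wouldn't use it for the actual search), and the names (`worker`, `parallel_count`, `BATCH`) are mine:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BATCH   64     /* lines read per lock acquisition */
#define MAXLINE 4096

static pthread_mutex_t read_lock = PTHREAD_MUTEX_INITIALIZER;
static FILE *input;
static const char *needle;

/* Each worker: lock, read a batch of lines, unlock, then search the
   batch without holding the lock. Its match count is returned via arg. */
static void *worker(void *arg)
{
    long *hits = arg;
    char lines[BATCH][MAXLINE];
    for (;;) {
        int n = 0;
        pthread_mutex_lock(&read_lock);            /* 1) lock out other threads */
        while (n < BATCH && fgets(lines[n], MAXLINE, input))
            n++;                                   /* 2) read several lines */
        pthread_mutex_unlock(&read_lock);          /* 3) unlock */
        if (n == 0)
            break;                                 /* EOF: no more work */
        for (int i = 0; i < n; i++)                /* 4) search the batch */
            if (strstr(lines[i], needle))          /* placeholder matcher */
                (*hits)++;
    }
    return NULL;
}

/* Runs nthreads workers (at most 16) over f; counts matching lines. */
long parallel_count(FILE *f, const char *pat, int nthreads)
{
    pthread_t tid[16];
    long hits[16] = {0};
    input = f;
    needle = pat;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, &hits[i]);
    long total = 0;
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        total += hits[i];
    }
    return total;
}
```

Per-thread hit counters avoid any locking on the result; only the `fgets` calls are serialized, which is exactly the "lock won't cost much" case when searching dominates.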
For an input size this big, it may be better to use a pair of characters as the anchor, rather than a single character. That means your two tables are each 65K entries instead of 256 entries, which may have some bad cache consequences. But all the probability considerations get so much better with pairs of characters that it is likely worth it.
The effectiveness of (B) should go up so much with pairs of characters that the shortest pattern becomes the limiting factor: you need its anchor point to be at its end, or the maximum skip distance shrinks and (B) loses effectiveness.
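For the pair-of-characters variant, the only mechanical change is that the lookup key is the two bytes ending at the candidate position, so both tables grow from 256 to 65536 entries. A small sketch of the index computation (the names are mine):

```c
#include <stdint.h>

/* With a two-byte anchor, the table key is the byte pair ending at the
   candidate position, so the anchor and skip tables each need 2^16
   entries instead of 256. */
enum { PAIR_TABLE_SIZE = 1 << 16 };

static inline unsigned pair_index(const unsigned char *p)
{
    return ((unsigned)p[0] << 8) | p[1];
}
```

Two 65536-entry tables of pointers or small integers are a few hundred KB, which is the cache concern mentioned above; whether the far better skip statistics pay for the cache misses is worth measuring on real data.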