grep command

millgates · 04-18-2013, 09:56 AM

Quote:

Originally Posted by ntubski

When you give grep a list of regexps, it checks each one against every line, so the runtime is O(Pn), where P is the number of patterns and n is the number of lines to search. This is much faster with -F, because grep then knows it has just plain strings and uses a much faster algorithm that is O(P+n). However, since we want to find occurrences only at the beginning of lines, we can't use -F in this case.

Here is an awk program which combines all the keywords into a single regexp so that the search should be O(P+n):

Code:

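#!/usr/bin/awk -f

# First file (keywords): for each character position i, collect the
# set of characters seen at that position in any keyword.
NR == FNR {
    for (i = 1; i <= length($0); i++) {
        char = substr($0, i, 1);
        if (!index(charsets[i], char))
            charsets[i] = charsets[i] char;
    }
}

# Turn a character set into a bracket expression, backslash-escaping
# the characters that are special inside brackets.
function regexp_range(charset,    i, c, reg_range) {
    for (i = 1; i <= length(charset); i++) {
        c = substr(charset, i, 1);
        if (index("\\]-^", c))
            reg_range = reg_range "\\" c;
        else
            reg_range = reg_range c;
    }
    return "[" reg_range "]";
}

# On the first line of the second file, build the anchored regexp
# once: one bracket expression per character position.
NR != FNR && !kw_regexp {
    kw_regexp = "^";
    for (i = 1; i in charsets; i++) {
        kw_regexp = kw_regexp regexp_range(charsets[i])
    }
    # print kw_regexp ; exit
}

# Second file: count the matched prefix of every matching line.
NR != FNR && match($0, kw_regexp) {
    kw[substr($0, RSTART, RLENGTH)]++;
}

# Print each matched string with its count.
END {
    for(w in kw) {print w, kw[w];}
}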

I'm not sure I fully understand your code, but I think your logic is flawed. If, for example, the keywords searched for are "ab" and "cde", it will create the regex ^[ac][bd][e], which will match "abe", "ade" and "cbe", which we don't want, while not matching "ab", which is in our keyword list.
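To illustrate, here is a quick sketch that feeds a few test lines to that regex. The shell wrapper and the check_lines name are only made up for this illustration; the regex itself is the one the script would build for "ab" and "cde".

```shell
#!/bin/sh
# Regexp that the per-position character-set approach produces for
# the keywords "ab" and "cde".
kw_regexp='^[ac][bd][e]'

# Print each candidate line followed by whether the regexp matches it.
# check_lines is a hypothetical helper name used only in this sketch.
check_lines() {
    printf '%s\n' abe ade cbe ab cde |
    awk -v re="$kw_regexp" '{ print $0, ($0 ~ re ? "match" : "no match") }'
}

check_lines
```

"abe", "ade" and "cbe" are reported as matches although none of them starts with a keyword, and "ab" is reported as no match because the regex always requires three characters.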

ntubski · 04-18-2013, 10:52 AM

Quote:

Originally Posted by millgates

I'm not sure I fully understand your code, but I think your logic is flawed. If, for example, the keywords searched for are "ab" and "cde", it will create the regex ^[ac][bd][e], which will match "abe", "ade" and "cbe", which we don't want, while not matching "ab", which is in our keyword list.

You're right. Perhaps in a language with compiled regexps, the keywords could instead be combined into keyword1|keyword2|keyword3|... and the regexp engine would compile that into something efficient. awk would be forced to recompile it every time, so it wouldn't work out there.
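For what it's worth, here is a rough sketch of the alternation idea in awk, just to show that the matching comes out right for the "ab"/"cde" example. I haven't timed it, so it says nothing about whether it is actually any faster. The count_keywords wrapper, the escape helper and the temp file names are all made up for this sketch.

```shell
#!/bin/sh
# count_keywords KEYWORD_FILE DATA_FILE   (hypothetical helper)
# Counts how often each keyword occurs at the start of a line in
# DATA_FILE. Assumes KEYWORD_FILE has one literal keyword per line
# and is not empty.
count_keywords() {
    awk '
    # Backslash-escape regexp metacharacters so that every keyword
    # is matched literally.
    function escape(s,    i, c, out) {
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            if (index("\\.[]^$*+?(){}|", c))
                out = out "\\" c
            else
                out = out c
        }
        return out
    }
    # First file: join all keywords into one alternation.
    NR == FNR {
        if ($0 != "")
            alt = alt (alt == "" ? "" : "|") escape($0)
        next
    }
    # Second file: build the anchored regexp once ...
    !kw_regexp { kw_regexp = "^(" alt ")" }
    # ... and count the keyword found at the start of each line.
    match($0, kw_regexp) { kw[substr($0, RSTART, RLENGTH)]++ }
    END { for (w in kw) print w, kw[w] }
    ' "$1" "$2"
}

# Demo with the keywords and test lines from the example above.
tmp=$(mktemp -d)
printf '%s\n' ab cde > "$tmp/keywords"
printf '%s\n' abe ade cbe ab cde > "$tmp/lines"
count_keywords "$tmp/keywords" "$tmp/lines"
rm -r "$tmp"
```

With that input it counts "ab" twice (for "abe" and "ab") and "cde" once, and ignores "ade" and "cbe". The order of the output lines is whatever awk's for-in loop happens to give.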