LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Using Grep with Pattern File and PCRE (https://www.linuxquestions.org/questions/linux-newbie-8/using-grep-with-pattern-file-and-pcre-834495/)

bigbot 09-25-2010 07:10 PM

Using Grep with Pattern File and PCRE
 
I would like to write a newline-delimited rules file using PCREs for use with the grep command. Grep has the -f option to read search patterns from a file, and the -P option to match using PCREs. However, these two options do not work together. The -f option only seems to work with fixed-string rules.

A friend previously helped me get around this limitation somehow, but I can't remember how he did it. I also would like the ability to add comments at the end of each rule in the file.

Example Rules File-
Code:

'^John.*Sally$'                              # 09/22/10 - Per Steve Johnson
'Jack\.Ripper[0-9]{1,3}$'                    # 06/15/09 - Remove on 07/01/09

Here is what I've tried so far-
Code:

cat data.file | grep -P -f rules.file        # Doesn't work
cat data.file | grep -P 'rule1|rule2|rule3'  # Works but I want to pull rules from a file and be able to add comments at the end of the lines

Thanks for any help!

fuubar2003 09-25-2010 08:25 PM

I'm not getting what you're trying to do.

One thing (and I know this is not addressing your question), you can lose the 'cat' part before the pipe and just run:

grep -P -f rules.file data.file


Per the grep man page, -P is experimental....

kingzog 09-26-2010 02:02 AM

Quote:

Originally Posted by fuubar2003 (Post 4109041)
I'm not getting what you're trying to do.

One thing (and I know this is not addressing your question), you can lose the 'cat' part before the pipe and just run:

grep -P -f rules.file data.file


Per the grep man page, -P is experimental....

I believe what the original poster is trying to do is have one call to grep work for multiple patterns. So if I had a file with two lines in it, one "Hello" and the other "World", I could do something akin to "grep -f HelloWorld.txt *" and have every line containing EITHER Hello or World show up in the results.

As far as I know, plain grep -f already does that much: each line of the pattern file is a separate pattern, and a line matches if any of them match. The snag in this thread is combining -f with -P, which this version of grep won't do. It is also possible with awk and perl scripting, but that's probably more time consuming than the poster wants.


One alternative, since you're using regular expressions, might be to wrap the whole thing in a series of OR groups. So for "Hello" and "World" you could use "(Hello|World)".
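
Something like this, for example (untested, and HelloWorld.txt is just the file from the example above):
Code:

printf 'Hello\nWorld\n' > HelloWorld.txt     # one pattern per line
grep -E -f HelloWorld.txt *                  # matches lines containing either Hello or World
grep -E '(Hello|World)' *                    # the same match written as a single alternation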

fuubar2003 09-26-2010 06:07 AM

The '-e' switch allows for multiple expressions, so you can do 'grep -e <item> -e <item>' over and over again. You can also chain multiple greps separated by pipes, but that is so lame.
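
For example (untested; rule1 etc. are just placeholders, and whether -e can be combined with -P depends on your grep build):
Code:

grep -e 'rule1' -e 'rule2' -e 'rule3' data.file    # a line matches if any of the three patterns match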

I used awk to find any lines whose first field is IMAGE and print column 6, and any lines whose first field is FRAG and print column 9:
awk '{ if($1=="IMAGE") print $6; if($1=="FRAG") print $9}'


Not sure if any of this is helpful.

theNbomr 09-26-2010 02:04 PM

You could iterate over all patterns in your file, and run grep with individual patterns.
Code:

while IFS= read -r pattern; do
    # remove the trailing comment (everything from the first '#')
    regex=${pattern%%#*}
    # trim the whitespace left in front of the comment
    regex=${regex%"${regex##*[![:space:]]}"}
    # skip lines that are now empty, quote the pattern, and use -P for PCRE
    [ -n "$regex" ] && grep -P "$regex" data.file
done < pattern.file

--- rod.

chrism01 09-26-2010 06:46 PM

You could look at egrep (the same as grep -E, extended regular expressions) for more advanced options.
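
For example (rule1 etc. are just placeholders again):
Code:

egrep 'rule1|rule2|rule3' data.file    # equivalent to: grep -E 'rule1|rule2|rule3' data.file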

bigbot 09-27-2010 12:35 AM

Thank you for the responses, and that's an interesting solution, theNbomr. Unfortunately my Linux computer is not working so well right now, so I am forced to use the Windows machine to post. However, I did get in touch with my friend and he reminded me how we did this before. This isn't going to be 100% correct, but I will edit it later when I am able to test it.

Code:

grep -P '`cat rules.file | sed -r 's/ *\#.*//' | tr '\n' '\|' | sed -r 's/\|$//'`' data.file

So basically we cat the rules file out, then strip any spaces before the #, the # itself, and everything after it. Next the newlines in rules.file are replaced with pipes. Finally the trailing pipe (left over from the file's last newline, which would otherwise give grep an empty alternation) is removed.

Grep *should* interpret this command as:

Code:

grep -P 'rule1|rule2|rule3' data.file

Phew! When running this yesterday it seemed to work pretty well. The only thing I couldn't get to work was a "grep -vP '^$'" stage to remove all blank lines. I put that in right after the cat statement, but kept getting a weird variable error when the whole thing was run. I know that works on the command line by itself, so I'm not sure what the problem is.
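
Roughly, the stages of that substitution look like this (untested here, assuming a rules.file shaped like the example in my first post):
Code:

sed 's/ *#.*//' rules.file                                       # strip the trailing "# comment" from each line
sed 's/ *#.*//' rules.file | tr '\n' '|'                         # glue the patterns together with '|'
sed 's/ *#.*//' rules.file | tr '\n' '|' | sed 's/|$//'          # drop the final '|' left by the last newline
sed 's/ *#.*//; /^$/d' rules.file | tr '\n' '|' | sed 's/|$//'   # same thing, with blank lines dropped too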

grail 09-27-2010 01:17 AM

So ultimately it appears you could replace all of:
Code:

grep -P '`cat rules.file | sed -r 's/ *\#.*//' | tr '\n' '\|' | sed -r 's/\|$//'`' data.file
with something like (untested):
Code:

grep -P "$(awk -F '[ \t]*#' '$1 != "" {printf "%s%s", s, $1; s="|"}' rules.file)" data.file
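
You can check what the substitution expands to before handing it to grep, e.g. (again untested, using the rules.file from the first post):
Code:

awk -F '[ \t]*#' '$1 != "" {printf "%s%s", s, $1; s="|"}' rules.file; echo    # print the combined pattern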

ghostdog74 09-27-2010 01:18 AM

Quote:

Originally Posted by theNbomr (Post 4109593)
You could iterate over all patterns in your file, and run grep with individual patterns.
Code:

while IFS= read -r pattern; do
    # remove the trailing comment (everything from the first '#')
    regex=${pattern%%#*}
    # trim the whitespace left in front of the comment
    regex=${regex%"${regex##*[![:space:]]}"}
    # skip lines that are now empty, quote the pattern, and use -P for PCRE
    [ -n "$regex" ] && grep -P "$regex" data.file
done < pattern.file

--- rod.


A better way is to concatenate the regex patterns first and pass the combined pattern to grep in a single call, instead of calling grep once for every pattern.
Code:

grep -E "$(sed -e 's/[ \t]*#.*//' -e '/^$/d' rules | tr '\n' '|' | sed 's/|$//')" file

bigbot 09-30-2010 03:09 AM

Thank you for the responses and the awesome awk solution! Here is what ended up working (with the awk command as well):
Code:

grep -P "`cat rules.file | grep -vP '^$' | sed -r 's/\s*\#.*//' | tr '\n' '\|' | sed -r 's/\|$//'`" data.file
For some reason the double quotes were needed around the entire grep statement. Something to do with how bash interpreted the command. Single quotes would not work.

Also added the grep command to remove blank lines.
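
For anyone who finds this later, the same thing can be written with $( ) instead of backticks, which avoids the quoting headache (untested, assumes GNU sed for \s, and strips comments before dropping blank lines so a comment-only line can't leave an empty alternation):
Code:

pattern="$(sed -r 's/\s*#.*//' rules.file | grep -v '^$' | tr '\n' '|' | sed 's/|$//')"
grep -P "$pattern" data.file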

grail 09-30-2010 03:52 AM

Well, I am glad you have a solution, although I am not sure why you need so many calls to all the different apps. Calling cat is definitely not required.

bigbot 10-01-2010 09:33 AM

I agree and I'm just not familiar with awk yet. I will use your example to see if I can use that instead.

