LinuxQuestions.org


xiawinter 12-27-2007 09:51 PM

how to grep 5k patterns at a time?
 
Hi all,

I need to find out whether each of 5000 patterns occurs in a document and, if so, how many times each pattern occurs.

I put the patterns in a file named pattern.txt, and I grep using the following command:

grep -FEio -f pattern.txt document.txt

It failed and reported: "grep: conflicting matchers specified"


My questions are:
1. How can I grep for so many patterns at once (not looping over them one by one)?
2. How can I find out how many times each pattern appears in the document?

Any ideas on this will be greatly appreciated. Thanks in advance.

--
Samuel

jschiwal 12-27-2007 10:27 PM

I think the error message means that you have egrep-style patterns in your pattern file.
I use the -f option to check an inventory list from a number of video devices against a list of spots that should have been deleted. You have the right idea, but you need to either use the -E option (or egrep) or modify your pattern file. I'm not certain how the --count option behaves with a pattern file. It looks like --count reports the number of lines containing a match, not the number of matches, so a line with two matches would be counted once. I also don't know whether --count and -f <pattern-file> work together.

Look at what the output looks like with and without --count. Maybe with the -o option you can use another filter like awk. Or maybe using awk would be a better solution.

Also make sure you didn't prepare or edit the pattern.txt file in Windows. If you did, use the dos2unix program to convert it.

Another possible thing that could trip you up is if you use the wrong text encoding scheme in pattern.txt.

Something like:
grep -f pattern.txt temp -o | sort -u
This can give you a unique, sorted list of matching strings. You could use the output as a pattern list in an awk program or in a loop. Some of the items listed may have matched the same pattern, however, so you may not be able to get away with using something like 'egrep -c "$PATTERN_ITEM" document.txt' in a loop, either in a script or using counters in awk.

One thing to look at could be to use "grep" to produce a list of matching items and use this list to construct a script that you finally use. Of course, you may end up using a perl script instead if you can't find a general utility.
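
A minimal sketch of the "grep -o, then count" idea, assuming GNU grep. The command in the question passes both -F (fixed strings) and -E (extended regex), which select different matchers; the grep of that era rejected the combination with "conflicting matchers specified", so this sketch keeps only -E. The function name count_matches is made up for illustration. Note that it counts distinct matched strings, not patterns: two strings matched by the same pattern are reported separately.

```shell
#!/bin/sh
# count_matches PATTERN_FILE DOCUMENT   (hypothetical helper name)
# Prints each distinct matched string, lower-cased, with its number of occurrences.
count_matches() {
    # -E: extended regex only (no -F); -i: ignore case;
    # -o: print every match on its own line; -f: read patterns from a file
    grep -E -i -o -f "$1" "$2" |
        tr '[:upper:]' '[:lower:]' |   # fold case so "Blue" and "blue" group together
        sort |
        uniq -c
}
```

It would be called as: count_matches pattern.txt document.txt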

makyo 12-27-2007 11:11 PM

Hi.

Here is a prototype showing one way -- you can modify as you need:
Code:

#!/usr/bin/env sh

# @(#) s3      Demonstrate parsing and counting alphabetic strings.

set -o nounset
echo

debug=":"
debug="echo"

## Use local command version for the commands in this demonstration.

echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version bash grep sort uniq

echo
FILE=data1
PATTERN=pattern1
echo " Input file $FILE, pattern file $PATTERN:"
cat "$FILE"
echo
cat "$PATTERN"

# separate, select, order, count

echo
echo " Results from tr, grep, sort, uniq:"
tr -c '[:alpha:]' '\n' <"$FILE" |
tr '[:upper:]' '[:lower:]' |
grep -i -f "$PATTERN" |
sort |
uniq -c

exit 0

Producing:
Code:

% ./s3

(Versions displayed with local utility "version")
GNU bash 2.05b.0
grep (GNU grep) 2.5.1
sort (coreutils) 5.2.1
uniq (coreutils) 5.2.1

 Input file data1, pattern file pattern1:
I like blue.
Red are some roses, but also many are white.
Navy blue is nice, but not light blue.
I am wearing an orange shirt.
My face is red from shoveling so much white snow.
The US flag is red, white, and blue.

black
blue
orange
purple
red
white
yellow

 Results from tr, grep, sort, uniq:
      4 blue
      1 orange
      3 red
      3 white

See man / info pages for details ... cheers, makyo

PS I think this thread would be better if placed in Programming.

jschiwal 12-28-2007 02:11 AM

How would your script handle the egrep pattern:
(cats and dogs|mice)

xiawinter 12-28-2007 04:54 AM

Thanks very much, Makyo.

Frankly speaking, I can't follow your code very well yet; I'd like to read through it tonight. But your results are exactly what I need.

Thanks again, and I hope it works for me.

makyo 12-28-2007 06:17 AM

Hi, jschiwal.
Quote:

Originally Posted by jschiwal (Post 3003602)
How would your script handle the egrep pattern:
(cats and dogs|mice)

The simple grep would not handle alternation (at the least); the script would need egrep for that.

I made an assumption that, with the 5000 "patterns" mentioned, each was probably a simple string.

If the OP desires something more complicated, we'll need to see samples of the input and pattern files, and perhaps think of other approaches, such as a perl script as you mentioned.

Reviewing the script, I'd probably change the grep to fgrep (or add -F) to omit doing anything with regular expressions, just to avoid complications, and likely to save some memory.

I'm guessing that you asked to help the OP understand the limitations of the script. That was why I called it a prototype, and suggested that it was really a starting point for the solution. If not, then I missed the point of your question -- sorry, not enough caffeine yet this morning ... cheers, makyo
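
A sketch of the fgrep variant mentioned above, assuming every entry in the pattern file is a plain word. The function name count_words is only for illustration. Besides -F, it adds -x, which is not in the original script: with one word per line, -x makes a pattern match only when it equals the whole word, so "red" is not counted inside "redder".

```shell
#!/bin/sh
# count_words WORD_LIST DOCUMENT   (hypothetical helper name)
# Counts how often each word from WORD_LIST occurs as a whole word in DOCUMENT.
count_words() {
    tr -c '[:alpha:]' '\n' < "$2" |     # put every alphabetic run on its own line
        tr '[:upper:]' '[:lower:]' |    # fold case
        grep -F -x -f "$1" |            # -F: literal strings, -x: match whole lines only
        sort |
        uniq -c
}
```

The word list is assumed to be lower case and free of empty lines; an empty line in it would match the blank lines that tr produces.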

jschiwal 12-28-2007 08:35 PM

Maybe your assumption is correct. I was thinking of regex patterns such as you might use to detect spam. The OP's original error message seems to indicate to me that an extended regex was used as one of the entries without using egrep. If the OP's pattern file contains only literals, then your suggestion of using '-F' instead is a good one.

Breaking the document into lines of single words is something I didn't think about. That is what allows using the core text utilities such as sort, uniq and grep to get the job done.

When using sed to edit or translate a group of documents (from one document type to another, for example), one has to do a lot of tweaking to cover special cases, like words split across two lines, a phrase split across two lines, punctuation, etc. The OP's task may entail tweaking as well. For example, is it for generating an index? Should book and books be counted separately? Is the purpose to generate word frequency statistics to help design spam filters?


The interactive style
Code:

echo
FILE=data1
PATTERN=pattern1
echo " Input file $FILE, pattern file $PATTERN:"

is something I would use arguments for instead.

---

One follow-up after rescanning your first example.
Code:

#!/usr/bin/env sh
I realize that this makes a script portable, and keeps the Solaris ksh users happy, but doesn't using /usr/bin/env introduce a potential security problem if a script is run as root (if called from an suid script or program) by using the user's environment? On Linux SUID scripts aren't allowed, but on Solaris they might be. I'm not criticizing your script here. This is commonly used for portability. I wonder if it shouldn't be.

xiawinter 12-28-2007 08:47 PM

Thanks, Jschiwal and makyo.

My real purpose for this question is to use grep to get the output lines as fast as possible. I agree with you both on using other tools like awk to process grep's output, especially for answering the second question.

I can accept the number of lines instead of the number of matches, that is, ignoring a second match in the same line.

My real case is a little more complex, as I'm processing Chinese documents that contain Chinese characters as well as English letters, digits, symbols and smiley icons :).

But I can simplify it and give you some hints about the real problem.

Patterns (5K in total):
Code:

nokia.{0,15}N7[1-9][0-9]{1,2}
nokia.{0,15}N?6[1-9][0-9]{3}
(moto|motorola).{0,10}E[2|6]i?

An ID is assigned to each pattern, and in the end each pattern should be replaced by its pattern ID. The number of matches per ID needs to be reported (the number of matching lines is also acceptable).

The documents for these patterns, as mentioned above, are Chinese documents. Character classes like [:alpha:] cannot be used for Chinese, and there are no tabs or spaces between Chinese words unless word segmentation has been applied, which is too complex for my question here.
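
One possible way to get a count per pattern ID, sketched under assumptions that are not in the post: the IDs and patterns are stored in one file as "ID<TAB>regex" per line, and counting matching lines (not individual matches) is acceptable. The function name count_by_id is made up. It runs one grep per pattern, so it is the slow one-by-one loop and only shows the bookkeeping. Separately, inside a bracket expression "|" is an ordinary character, so E[2|6] also matches "E|"; [26] may be what was meant.

```shell
#!/bin/sh
# count_by_id ID_PATTERN_FILE DOCUMENT   (hypothetical helper name)
# ID_PATTERN_FILE is assumed to hold "ID<TAB>regex" on each line.
# Prints "ID<TAB>number-of-matching-lines" for every pattern that matched.
count_by_id() {
    tab=$(printf '\t')
    while IFS=$tab read -r id regex; do
        [ -n "$id" ] || continue                 # skip empty lines
        # -c counts matching lines; "|| true" because grep exits 1 on zero matches
        n=$(grep -E -i -c -e "$regex" "$2" || true)
        if [ "${n:-0}" -gt 0 ]; then
            printf '%s\t%s\n' "$id" "$n"
        fi
    done < "$1"
}
```

It would be called as: count_by_id id_patterns.tsv document.txt (both file names are placeholders).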

A sample text (from google): http://mobsmania.blogspot.com/2007/1...and-k750i.html

I hope I have made myself understood this time.

Thanks for your kind help.
--
Samuel

makyo 12-28-2007 09:42 PM

Hi, jschiwal
Quote:

Originally Posted by jschiwal (Post 3004324)
Maybe your assumption is correct ...

The OP has answered some of our questions.

Quote:

The interactive style
Code:

echo
FILE=data1
PATTERN=pattern1
echo " Input file $FILE, pattern file $PATTERN:"

is something I would use arguments for instead.
If you have seen other scripts that I have posted, I usually use:
Code:

FILE=${1-data1}
but the current situation would require 2 arguments, and things would get complicated if we consider possibly omitting the first argument, so I would likely use getopts here, and in this case that would add far too much complexity, detracting from our real question of the approach.
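
For reference, a getopts version could look roughly like the sketch below. The option letters -f and -p and the function name parse_args are choices made for this illustration, not something from the posted script; the defaults data1 and pattern1 are the ones used there.

```shell
#!/bin/sh
# parse_args [-f FILE] [-p PATTERN_FILE]   (hypothetical helper name and option letters)
# Sets FILE and PATTERN, falling back to the defaults from the demonstration script.
parse_args() {
    FILE=data1
    PATTERN=pattern1
    OPTIND=1                      # reset so the function can be called more than once
    while getopts f:p: opt; do
        case $opt in
            f) FILE=$OPTARG ;;
            p) PATTERN=$OPTARG ;;
            *) echo "usage: [-f file] [-p pattern-file]" >&2; return 2 ;;
        esac
    done
}

# Demonstration: override only the pattern file, keep the default input file.
parse_args -p pattern2
echo " Input file $FILE, pattern file $PATTERN:"
```

Either option can be omitted independently, which is the case that plain positional arguments handle awkwardly.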
Quote:

One follow-up after rescanning your first example.
Code:

#!/usr/bin/env sh
I realize that this makes a script portable, and keeps the Solaris ksh users happy, but doesn't using /usr/bin/env introduce a potential security problem if a script is run as root (if called from an suid script or program) by using the user's environment? On Linux SUID scripts aren't allowed, but on Solaris they might be. I'm not criticizing your script here. This is commonly used for portability. I wonder if it shouldn't be.
Yes, you are quite correct. Albing et al., in O'Reilly's bash Cookbook, page 321, note the possible security problem, but they consider it a minor one. I post in a number of different places, so I came down on the side of portability for scripts that I post. For my personal scripts, I usually use a shebang line like:
Code:

#!/bin/bash -
to avoid some forms of spoofing as they describe on page 283. For the portability aspects of this construct, I use the venerable perl script fixit to find and replace processor path names.

I will consider placing a comment in my posting template to warn of the possible security implications of the env construct.

Thank you for your comments; it always helps to have another pair of eyes on the code ... cheers, (your neighbor) makyo

jschiwal 12-29-2007 02:18 AM

Given that the patterns in the pattern file are regular expressions, you want to use -E rather than -F (keeping -f to name the pattern file). It seems that you want to isolate the non-Chinese characters and then count their appearances. You could replace non-alphanumeric characters with spaces as a start. This will break up the rest into words. For example:
Code:

sed 's/[^[:alnum:]]/ /g' document.txt | tr -s ' ' | tr ' ' '\n' | sort -f | uniq -ic
The "sort -f" command will sort the words while ignoring the case of the letters. This will group identical words on adjacent lines. The "uniq -ic" command will eliminate duplicates and count how many times each word occurs.
If you want to isolate individual patterns, you could put a grep filter in between somewhere to filter words based on your pattern list.

I'm not sure what you meant by not being able to use [[:alpha:]] classes. I used it here partly out of habit, and partly because I took you to mean that the Chinese characters don't match them. I thought that, since I was negating, it wouldn't matter. Maybe [^a-zA-Z0-9] would work for you.

P.S. In editing, certain characters were replaced by html tags & xml alias patterns.
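
A sketch of the [^a-zA-Z0-9] variant, assuming a UTF-8 document and that forcing the C locale for sed is acceptable. In the C locale sed works byte by byte, so every byte of a Chinese character falls outside a-zA-Z0-9 and is turned into a space. The function name ascii_word_counts is made up, and the sketch lower-cases the words instead of using "sort -f | uniq -ic", which is a small change from the pipeline above.

```shell
#!/bin/sh
# ascii_word_counts DOCUMENT   (hypothetical helper name)
# Counts the ASCII alphanumeric "words" in a document that also contains other text.
ascii_word_counts() {
    LC_ALL=C sed 's/[^a-zA-Z0-9]/ /g' "$1" |   # every other byte becomes a space
        tr -s ' ' '\n' |                       # one word per line, runs squeezed
        sed '/^$/d' |                          # drop a possible leading empty line
        tr 'A-Z' 'a-z' |                       # fold case
        sort |
        uniq -c
}
```

A grep -E -f pattern.txt stage could be added before sort to keep only words that match the pattern list, but patterns that span several words, such as nokia.{0,15}N7..., would no longer match once the text is split this way.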


All times are GMT -5.