LinuxQuestions.org
Old 12-27-2007, 08:51 PM   #1
xiawinter
LQ Newbie
 
Registered: Aug 2007
Posts: 27

Rep: Reputation: 15
how to grep 5k patterns at a time?


Hi all,

I need to find out whether any of 5000 patterns occur in a document and, if so, how many times each pattern occurs.

I put the patterns in a file named pattern.txt and ran grep with the following command:

grep -FEio -f pattern.txt document.txt

It failed and reported: "grep: conflicting matchers specified"


My questions are:
1. How can I grep for this many patterns at once (not looping over them one by one)?
2. How can I find out how many times each pattern appears in the document?

Any ideas on this will be greatly appreciated. Thanks in advance.

--
Samuel
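
A note on the error itself: the command above passes both -F (fixed strings) and -E (extended regular expressions). In the GNU grep releases of that period (a later post in this thread shows 2.5.1), naming two different matchers is what produces "conflicting matchers specified"; newer releases may behave differently. A minimal sketch of choosing one matcher at a time, using made-up sample files rather than the real pattern.txt and document.txt:

```shell
#!/bin/sh
# Hypothetical sample data; the real files are not shown in the thread.
work=$(mktemp -d)
printf '%s\n' 'nokia' 'moto' > "$work/pattern.txt"
printf '%s\n' 'Nokia N73 and MOTO E2' 'nothing here' > "$work/document.txt"

# Literal strings: use -F only (no -E).
fixed=$(grep -F -i -o -f "$work/pattern.txt" "$work/document.txt")

# Extended regular expressions: use -E only (no -F).
printf '%s\n' 'nokia.{0,15}N7[0-9]' > "$work/regex.txt"
extended=$(grep -E -i -o -f "$work/regex.txt" "$work/document.txt")

printf 'fixed:\n%s\nextended:\n%s\n' "$fixed" "$extended"
rm -r "$work"
```

With literal patterns -F is the natural choice; with regular expressions like the ones posted later in the thread, -E is needed instead.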
 
Old 12-27-2007, 09:27 PM   #2
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
I think the error message means that you have egrep-style patterns in your pattern file.
I use the -f option myself to check an inventory list from a number of video devices against a list of spots that should have been deleted. You have the right idea, but you either need to use the -E option (or egrep), or modify your pattern file. I'm not certain how the --count option works together with a -f <pattern-file> argument. It looks like --count reports the number of lines containing a match, not the number of matches, so a line with two matches would be counted once.
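
The difference between counting lines and counting matches can be checked on a small example. This is only a sketch with invented sample files; it assumes a grep that supports -o (GNU grep does):

```shell
#!/bin/sh
work=$(mktemp -d)
# Hypothetical pattern list and document.
printf '%s\n' 'red' 'blue' > "$work/pattern.txt"
printf '%s\n' 'red, white, and blue' 'light blue' 'green' > "$work/document.txt"

# -c counts matching LINES: a line with two matches counts once.
lines=$(grep -F -c -f "$work/pattern.txt" "$work/document.txt")

# -o prints each match on its own line, so counting those lines counts MATCHES.
matches=$(grep -F -o -f "$work/pattern.txt" "$work/document.txt" | wc -l)
matches=$((matches))   # drop any padding some wc implementations add

echo "matching lines: $lines, matches: $matches"
rm -r "$work"
```

Here the first line contains two matches, so the two numbers differ (2 lines, 3 matches).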

Look at what the output looks like with and without --count. Maybe with the -o option you can use another filter like awk. Or maybe using awk would be a better solution.

Also make sure you didn't prepare or edit the pattern.txt file in Windows. If you did, use the dos2unix program to convert it.
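
One way to check for Windows line endings without dos2unix is to look for carriage returns directly. A small sketch, where the file contents are made up for the demonstration:

```shell
#!/bin/sh
work=$(mktemp -d)
# Hypothetical pattern file saved with Windows (CRLF) line endings.
printf 'nokia\r\nmoto\r\n' > "$work/pattern.txt"

cr=$(printf '\r')
# Count the lines that carry a carriage return.
crlf_lines=$(grep -c "$cr" "$work/pattern.txt" || true)

# Strip the carriage returns into a clean copy.
tr -d '\r' < "$work/pattern.txt" > "$work/pattern.unix.txt"
clean_lines=$(grep -c "$cr" "$work/pattern.unix.txt" || true)

echo "lines with CR before: $crlf_lines, after: $clean_lines"
rm -r "$work"
```

A stray carriage return at the end of each pattern matters because grep treats it as part of the pattern, so nothing matches unless the document has the same line endings.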

Another possible thing that could trip you up is if you use the wrong text encoding scheme in pattern.txt.

Something like:
grep -o -f pattern.txt temp | sort -u
can give you a unique, sorted list of matching strings. You could use the output as a pattern list in an awk program, or in a loop. Some of the items listed may have matched the same pattern, however, so you may not be able to get away with using something like 'egrep -c "$PATTERN_ITEM" document.txt' in a loop, whether in a script or using counters in awk.

One thing to look at could be to use "grep" to produce a list of matching items and use this list to construct a script that you finally use. Of course, you may end up using a perl script instead if you can't find a general utility.
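
As a rough sketch of the "grep -o, then another filter" idea: the matched strings can be sorted and counted directly. Note that this counts matched strings, not patterns, so two different strings matched by the same regular expression would still be listed separately. The sample files are invented:

```shell
#!/bin/sh
work=$(mktemp -d)
# Hypothetical pattern list and document.
printf '%s\n' 'red' 'blue' 'white' > "$work/pattern.txt"
printf '%s\n' 'Red roses, white snow' 'Navy blue, light blue' > "$work/document.txt"

# -o emits one match per line; lower-casing merges "Red" and "red";
# sort groups equal strings so that uniq -c can count each group.
counts=$(grep -F -i -o -f "$work/pattern.txt" "$work/document.txt" |
  tr '[:upper:]' '[:lower:]' |
  sort |
  uniq -c |
  awk '{ print $2, $1 }')

printf '%s\n' "$counts"
rm -r "$work"
```

For this sample the result is "blue 2", "red 1" and "white 1", one per line.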

Last edited by jschiwal; 12-27-2007 at 09:53 PM.
 
Old 12-27-2007, 10:11 PM   #3
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi.

Here is a prototype showing one way -- you can modify as you need:
Code:
#!/usr/bin/env sh

# @(#) s3       Demonstrate parsing and counting alphabetic strings.

set -o nounset
echo

debug=":"
debug="echo"

## Use local command version for the commands in this demonstration.

echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version bash grep sort uniq

echo
FILE=data1
PATTERN=pattern1
echo " Input file $FILE, pattern file $PATTERN:"
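cat "$FILE"
echo
cat "$PATTERN"

# separate, select, order, count:
#   1. put each alphabetic word on its own line
#   2. fold everything to lower case
#   3. keep only the words that match a pattern
#   4. sort so that equal words are adjacent, then count them

echo
echo " Results from tr, grep, sort, uniq:"
tr -c '[:alpha:]' '\n' <"$FILE" |
tr '[:upper:]' '[:lower:]' |
grep -i -f "$PATTERN" |
sort |
uniq -c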

exit 0
Producing:
Code:
% ./s3

(Versions displayed with local utility "version")
GNU bash 2.05b.0
grep (GNU grep) 2.5.1
sort (coreutils) 5.2.1
uniq (coreutils) 5.2.1

 Input file data1, pattern file pattern1:
I like blue.
Red are some roses, but also many are white.
Navy blue is nice, but not light blue.
I am wearing an orange shirt.
My face is red from shoveling so much white snow.
The US flag is red, white, and blue.

black
blue
orange
purple
red
white
yellow

 Results from tr, grep, sort, uniq:
      4 blue
      1 orange
      3 red
      3 white
See man / info pages for details ... cheers, makyo

PS I think this thread would be better if placed in Programming.

Last edited by makyo; 12-27-2007 at 10:18 PM.
 
Old 12-28-2007, 01:11 AM   #4
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
How would your script handle the egrep pattern:
(cats and dogs|mice)
 
Old 12-28-2007, 03:54 AM   #5
xiawinter
LQ Newbie
 
Registered: Aug 2007
Posts: 27

Original Poster
Rep: Reputation: 15
Thanks very much, Makyo.

Frankly speaking, I can't follow your code very well yet; I'd like to read it tonight. But your results are exactly what I need.

Thanks again, and I hope it can work for me.
 
Old 12-28-2007, 05:17 AM   #6
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi, jschiwal.
Quote:
Originally Posted by jschiwal
How would your script handle the egrep pattern:
(cats and dogs|mice)
The simple grep would not handle alternation (at the least); the script would need egrep for that.

I made an assumption that, with the 5000 "patterns" mentioned, each was probably a simple string.

If the OP desires something more complicated, we'll need to see samples of the input and pattern files, and perhaps think of other approaches, such as a perl script as you mentioned.

Reviewing the script, I'd probably change the grep to fgrep (or add -F) to omit doing anything with regular expressions, just to avoid complications, and likely to save some memory.

I'm guessing that you asked to help the OP understand the limitations of the script. That was why I called it a prototype, and suggested that it was really a starting point for the solution. If not, then I missed the point of your question -- sorry, not enough caffeine yet this morning ... cheers, makyo
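
To illustrate the -F idea: because the prototype puts one word per line, -F can be combined with -x so that only words equal to a pattern are kept. Without -x, a word that merely contains a pattern is counted too. This is a sketch with invented data, not a change to the script above:

```shell
#!/bin/sh
work=$(mktemp -d)
# Hypothetical pattern list and input.
printf '%s\n' 'red' 'blue' > "$work/pattern1"
printf '%s\n' 'I am bored of red.' 'Blue and blueish.' > "$work/data1"

# One lower-case word per line, as in the prototype.
words() {
  tr -c '[:alpha:]' '\n' < "$work/data1" | tr '[:upper:]' '[:lower:]'
}

# Substring matching: "bored" and "blueish" are kept because they contain a pattern.
loose=$(words | grep -F -f "$work/pattern1" | sort | uniq -c | awk '{ print $2, $1 }')

# -x keeps only the lines that equal a pattern exactly.
exact=$(words | grep -F -x -f "$work/pattern1" | sort | uniq -c | awk '{ print $2, $1 }')

printf 'loose:\n%s\nexact:\n%s\n' "$loose" "$exact"
rm -r "$work"
```

Whether substring or whole-word matching is wanted depends on the task, so this is a choice to make rather than a fix.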
 
Old 12-28-2007, 07:35 PM   #7
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
Maybe your assumption is correct. I was thinking of regex patterns such as you might use to detect spam. The OP's original error message seems to indicate to me that an extended regex was used as one of the entries without using egrep. If the OP's pattern file contains only literals, then your suggestion of using '-F' instead is a good one.

Breaking the document into lines of single words is something I didn't think about. That is what allows using the core text utilities such as sort, uniq and grep to get the job done.

When using sed to edit or translate a group of documents (from one document type to another, for example), one has to do a lot of tweaking to cover special cases, such as a word split across two lines, a phrase split across two lines, punctuation, etc. The OP's task may entail tweaking as well. For example, is it for generating an index? Should book and books be counted separately? Is the purpose to generate word frequency statistics to help design spam filters?


The interactive style
Code:
echo
FILE=data1
PATTERN=pattern1
echo " Input file $FILE, pattern file $PATTERN:"
is something I would use arguments for instead.

---

One follow-up after rescanning your first example.
Code:
#!/usr/bin/env sh
I realize that this makes a script portable and keeps the Solaris ksh users happy, but doesn't using /usr/bin/env introduce a potential security problem if a script is run as root (if called from an suid script or program) by using the user's environment? On Linux, SUID scripts aren't allowed, but on Solaris they might be. I'm not criticizing your script here. This is commonly used for portability. I wonder if it shouldn't be.

Last edited by jschiwal; 12-28-2007 at 07:56 PM.
 
Old 12-28-2007, 07:47 PM   #8
xiawinter
LQ Newbie
 
Registered: Aug 2007
Posts: 27

Original Poster
Rep: Reputation: 15
Thanks, Jschiwal and makyo.

My real purpose in asking is to use grep to get the output lines as fast as possible. I agree with you both on using other tools such as awk to process the output from grep, especially for answering the second question.

I can accept the number of lines instead of the number of matches, that is, ignoring a second match on the same line.

My real case is a little complex, as what I'm processing are Chinese documents containing Chinese characters as well as English letters, digits, symbols, and smiley icons.

But I can put it simply and give you some hints about the real problem.

Patterns (5K in total):
Code:
nokia.{0,15}N7[1-9][0-9]{1,2}
nokia.{0,15}N?6[1-9][0-9]{3}
(moto|motorola).{0,10}E[2|6]i?
An ID is assigned to each pattern, and each match should finally be replaced by its pattern ID. The number of matches for each ID needs to be reported (the number of lines is also acceptable).

The documents for these patterns, as mentioned above, are Chinese documents. Character classes like [:alpha:] cannot be used for Chinese, and there are no tabs or spaces between Chinese words unless word segmentation has been applied, which is rather complex for my question here.
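
The ID scheme described above can be sketched with a plain loop. This is the simple, slow way (grep runs twice per pattern, so 5000 patterns means many passes over the document), not a single-pass solution. The IDs P001 and P002, the tab-separated table layout and the sample text are all invented for the sketch; the two patterns are copied from the list above. Note that inside brackets "|" is a literal character, so [2|6] also matches "|"; [26] is probably what was meant.

```shell
#!/bin/sh
work=$(mktemp -d)
# Hypothetical ID<TAB>pattern table.
printf '%s\t%s\n' \
  'P001' 'nokia.{0,15}N7[1-9][0-9]{1,2}' \
  'P002' '(moto|motorola).{0,10}E[2|6]i?' > "$work/patterns.tsv"
# Hypothetical ASCII-only sample document.
printf '%s\n' 'Nokia phone N710 is old' \
  'moto E6 and Motorola E2i' \
  'nothing here' > "$work/document.txt"

tab=$(printf '\t')
# For each ID, report the number of matches (-o | wc -l) and of matching lines (-c).
report=$(while IFS="$tab" read -r id pattern; do
  matches=$(grep -E -i -o -e "$pattern" "$work/document.txt" | wc -l)
  lines=$(grep -E -i -c -e "$pattern" "$work/document.txt" || true)
  echo "$id $((matches)) $lines"
done < "$work/patterns.tsv")

printf '%s\n' "$report"
rm -r "$work"
```

For the sample data this prints "P001 1 1" and "P002 2 1": the second pattern matches twice on a single line.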

A sample text (from google): http://mobsmania.blogspot.com/2007/1...and-k750i.html

I hope I have made myself understood this time.

Thanks for your kind help.
--
Samuel

Last edited by xiawinter; 12-28-2007 at 08:00 PM.
 
Old 12-28-2007, 08:42 PM   #9
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi, jschiwal
Quote:
Originally Posted by jschiwal
Maybe your assumption is correct ...
The OP has answered some of our questions.

Quote:
The interactive style
Code:
echo
FILE=data1
PATTERN=pattern1
echo " Input file $FILE, pattern file $PATTERN:"
is something I would use arguments for instead.
If you have seen other scripts that I have posted, I usually use:
Code:
FILE=${1-data1}
but the current situation would require two arguments, and things would get complicated if we consider possibly omitting the first one. I would likely use getopts here, but in this case that would add far too much complexity and detract from our real question, which is the approach.
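
For readers who want to see roughly what the getopts variant would involve, here is a minimal sketch. The option letters -f and -p are an assumption made for this example, and the "set --" line stands in for real command-line arguments:

```shell
#!/bin/sh
# Defaults match the names used in the demonstration script.
FILE=data1
PATTERN=pattern1

set -- -p mypatterns    # stand-in for the real command line
OPTIND=1
while getopts 'f:p:' opt; do
  case $opt in
    f) FILE=$OPTARG ;;       # hypothetical option: input file
    p) PATTERN=$OPTARG ;;    # hypothetical option: pattern file
    *) echo "usage: s3 [-f file] [-p patternfile]" >&2; exit 2 ;;
  esac
done
shift $((OPTIND - 1))

echo " Input file $FILE, pattern file $PATTERN:"
```

With options, either value can be omitted independently, which is the case that positional defaults such as ${1-data1} handle poorly.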
Quote:
One follow-up after rescanning your first example.
Code:
#!/usr/bin/env sh
I realize that this makes a script portable and keeps the Solaris ksh users happy, but doesn't using /usr/bin/env introduce a potential security problem if a script is run as root (if called from an suid script or program) by using the user's environment? On Linux, SUID scripts aren't allowed, but on Solaris they might be. I'm not criticizing your script here. This is commonly used for portability. I wonder if it shouldn't be.
Yes, you are quite correct. Albing, et al in O'Reilly's bash Cookbook, page 321, make note of the possible security problem, but they consider it a minor one. I post in a number of different places, so I came down on the side of portability for scripts that I post. For my personal scripts, I usually use a shebang line like:
Code:
#!/bin/bash -
to avoid some forms of spoofing as they describe on page 283. For the portability aspects of this construct, I use the venerable perl script fixit to find and replace processor path names.

I will consider placing a comment in my posting template to warn of the possible security implications of the env construct.

Thank you for your comments; it always helps to have another pair of eyes on the code ... cheers, (your neighbor) makyo
 
Old 12-29-2007, 01:18 AM   #10
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
Given that the patterns in the pattern file are regular expressions, you want to use -E together with -f rather than -F. It seems that you want to isolate the non-Chinese characters and then count their appearances. You could replace non-alphanumeric characters with spaces as a start, which breaks the rest up into words. Example:
Code:
 sed 's/[^[:alnum:]]/ /g' document.txt | tr -s ' ' | tr ' ' '\n' | sort -f | uniq -ic
The "sort -f" command sorts the words while ignoring the case of the letters, which puts identical words on adjacent lines. The "uniq -ic" command then collapses the duplicates, ignoring case, and counts how many times each word occurs.
If you want to isolate individual patterns, you could put a grep filter somewhere in the pipeline to select words based on your pattern list.

I'm not sure what you meant by not being able to use [[:alpha:]] classes. I used a character class here partly out of habit, and partly because I took you to mean that the Chinese characters don't match them. Since I was negating the class, I thought that wouldn't matter. Maybe [^a-zA-Z0-9] would work for you.

Last edited by jschiwal; 12-29-2007 at 01:28 AM.
 
Tags
egrep, grep