I think the error message means that you have egrep-style patterns in your pattern file.
I use the -f option myself to check the inventory lists of a number of video devices against a list of spots that should have been deleted. You have the right idea, but you need to either use the -E option (or egrep) or modify your pattern file. Note that --count reports the number of lines containing a match, not the number of matches, so a line with two matches is counted once. It can be combined with -f <pattern-file>, but it then prints one total per input file for all patterns together, not a count per pattern.
Look at what the output looks like with and without --count. Maybe with the -o option you can use another filter like awk. Or maybe using awk would be a better solution.
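To make the difference between counting lines and counting matches concrete, here is a small sketch. The file names and contents are invented for the illustration, and -o is a GNU/BSD grep extension rather than POSIX, so treat it as an assumption about your grep.

```shell
#!/bin/sh
# Illustrative data only: file names and contents are made up for this sketch.
tmp=$(mktemp -d)
printf '%s\n' 'red' 'blue' > "$tmp/pattern.txt"
printf '%s\n' 'red and blue on one line' 'nothing here' 'blue again' > "$tmp/doc.txt"

# -c counts matching LINES: the first line counts once despite two matches.
line_count=$(grep -c -f "$tmp/pattern.txt" "$tmp/doc.txt" || true)

# -o prints every match on its own line, so wc -l counts MATCHES.
match_count=$(grep -o -f "$tmp/pattern.txt" "$tmp/doc.txt" | wc -l | tr -d ' ')

echo "lines: $line_count  matches: $match_count"
rm -rf "$tmp"
```

With this data the line count is 2 and the match count is 3.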
Also make sure you didn't prepare or edit the pattern.txt file in Windows. If you did, use the dos2unix program to convert it.
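A quick way to see why Windows line endings matter, as a sketch with invented file contents. It uses tr -d '\r' as a stand-in for dos2unix, in case dos2unix is not installed.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Simulate a pattern file saved on Windows: every line ends in CR LF.
printf 'red\r\nblue\r\n' > "$tmp/pattern.txt"
printf '%s\n' 'a red rose' 'a blue sky' > "$tmp/doc.txt"

# The trailing CR is part of each pattern, so nothing in the document matches.
before=$(grep -c -f "$tmp/pattern.txt" "$tmp/doc.txt" || true)

# Strip the carriage returns (dos2unix does the same job in place).
tr -d '\r' < "$tmp/pattern.txt" > "$tmp/pattern.unix"
after=$(grep -c -f "$tmp/pattern.unix" "$tmp/doc.txt" || true)

echo "before: $before  after: $after"
rm -rf "$tmp"
```

The count goes from 0 to 2 once the carriage returns are gone.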
Another possible thing that could trip you up is if you use the wrong text encoding scheme in pattern.txt.
Something like:
grep -o -f pattern.txt temp | sort -u
This can give you a unique, sorted list of matching strings. You could use the output as a pattern list in an awk program, or in a loop. Some of the items listed may have matched the same pattern, however, so you may not be able to get away with using something like 'egrep -c "$PATTERN_ITEM" document.txt' in a loop, either in a script or using counters in awk.
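If the patterns are plain strings, swapping sort -u for sort | uniq -c gives a count per matched string in one pass. This is only a sketch with made-up data; it assumes the matched strings contain no spaces, since the awk step prints only the count and the first word.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Invented sample data; "temp" mirrors the file name used in the command above.
printf '%s\n' 'blue' 'red' 'white' > "$tmp/pattern.txt"
printf '%s\n' 'red, white, and blue' 'light blue' 'red face' > "$tmp/temp"

# -o emits one matched string per line; uniq -c then counts the duplicates.
# awk only normalizes the spacing to "string count".
counts=$(grep -o -f "$tmp/pattern.txt" "$tmp/temp" | sort | uniq -c | awk '{print $2, $1}')

printf '%s\n' "$counts"
rm -rf "$tmp"
```

Note that this counts matched strings, not patterns; with regular expressions in the pattern file, several different strings may belong to one pattern.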
One thing to look at could be to use "grep" to produce a list of matching items and use this list to construct a script that you finally use. Of course, you may end up using a perl script instead if you can't find a general utility.
Here is a prototype showing one way -- you can modify as you need:
Code:
#!/usr/bin/env sh
# @(#) s3 Demonstrate parsing and counting alphabetic strings.
set -o nounset
echo
debug=":"
debug="echo"
## Use local command version for the commands in this demonstration.
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version bash grep sort uniq
echo
FILE=data1
PATTERN=pattern1
echo " Input file $FILE, pattern file $PATTERN:"
cat $FILE
echo
cat $PATTERN
# separate, select, order, count
echo
echo " Results from tr, grep, sort, uniq:"
tr -c '[:alpha:]' '\n' <$FILE |
tr '[:upper:]' '[:lower:]' |
grep -i -f $PATTERN |
sort |
uniq -c
exit 0
Producing:
Code:
% ./s3
(Versions displayed with local utility "version")
GNU bash 2.05b.0
grep (GNU grep) 2.5.1
sort (coreutils) 5.2.1
uniq (coreutils) 5.2.1
Input file data1, pattern file pattern1:
I like blue.
Red are some roses, but also many are white.
Navy blue is nice, but not light blue.
I am wearing an orange shirt.
My face is red from shoveling so much white snow.
The US flag is red, white, and blue.

black
blue
orange
purple
red
white
yellow

Results from tr, grep, sort, uniq:
4 blue
1 orange
3 red
3 white
See man / info pages for details ... cheers, makyo
PS I think this thread would be better if placed in Programming.
How would your script handle the egrep pattern:
(cats and dogs|mice)
Plain grep would not handle alternation, at the least; the script would need egrep (or grep -E) for that.
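Here is a small sketch of how the same pattern behaves under the three matching modes. The sample lines are invented; the third one contains the pattern as literal text so that the basic and fixed-string modes have something to find.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' \
    'it rained cats and dogs' \
    'three blind mice' \
    'literal (cats and dogs|mice) text' > "$tmp/doc.txt"
pat='(cats and dogs|mice)'

# Basic regex: ( | ) are ordinary characters, so only the literal line matches.
basic=$(grep -c -e "$pat" "$tmp/doc.txt" || true)

# Extended regex: the pattern is an alternation, so all three lines match.
extended=$(grep -c -E -e "$pat" "$tmp/doc.txt" || true)

# Fixed string: no regex at all, again only the literal line matches.
fixed=$(grep -c -F -e "$pat" "$tmp/doc.txt" || true)

echo "basic=$basic extended=$extended fixed=$fixed"
rm -rf "$tmp"
```

The matching line counts come out as 1, 3 and 1 respectively.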
I made an assumption that, with the 5000 "patterns" mentioned, each was probably a simple string.
If the OP desires something more complicated, we'll need to see samples of the input and pattern files, and perhaps think of other approaches, such as a perl script as you mentioned.
Reviewing the script, I'd probably change the grep to fgrep (or add -F) to omit doing anything with regular expressions, just to avoid complications, and likely to save some memory.
I'm guessing that you asked to help the OP understand the limitations of the script. That was why I called it a prototype, and suggested that it was really a starting point for the solution. If not, then I missed the point of your question -- sorry, not enough caffeine yet this morning ... cheers, makyo
Maybe your assumption is correct. I was thinking of regex patterns such as you might use to detect spam. The OP's original error message seems to indicate to me that an extended regex was used as one of the entries without using egrep. If the OP's pattern strings contain only literals, then your suggestion of using '-F' instead is a good one.
Breaking the document into lines of single words is something I didn't think about. That is what allows using the core text utilities such as sort, uniq and grep to get the job done.
When using sed to edit or translate a group of documents (from one document type to another, for example), one has to do a lot of tweaking to cover special cases, like words split across two lines, a phrase split across two lines, punctuation, etc. The OP's task may entail tweaking as well. For example, is it for generating an index? Should book and books be counted separately? Is the purpose to generate word frequency statistics to help design spam filters?
One follow-up after rescanning your first example.
Code:
#!/usr/bin/env sh
I realize that this makes a script portable, and keeps the Solaris ksh users happy, but doesn't using /usr/bin/env introduce a potential security problem if a script is run as root (if called from an suid script or program), because it relies on the user's environment? On Linux SUID scripts aren't allowed, but on Solaris they might be. I'm not criticizing your script here. This is commonly used for portability. I wonder if it shouldn't be.
My real purpose for this question is to use grep to get the output lines as fast as possible. I agree with you both on using other tools such as awk to process grep's output, especially for answering the second question.
I can accept the number of lines instead of the number of matches; that is, two matches on one line can be counted once.
My real case is a little more complex, as I am processing Chinese documents that contain Chinese characters as well as English letters, digits, symbols, and smiley icons.
But I can put it simply and give you some hints about the real problem.
An ID is assigned to each pattern, and each pattern should finally be replaced by its pattern ID. The number of matches for each ID (the number of lines is also acceptable) needs to be reported.
The documents for these patterns, as mentioned above, are Chinese documents. Character classes like [:alpha:] cannot be used for Chinese, and there are no tabs or spaces between Chinese words unless word segmentation has been applied, which is too complex a topic for my question here.
If you have seen other scripts that I have posted, I usually use:
Code:
FILE=${1-data1}
but the current situation would require 2 arguments, and things would get complicated if we consider possibly omitting the first argument, so I would likely use getopts here, and in this case that would add far too much complexity, detracting from our real question of the approach.
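For what it's worth, a getopts version does not have to be long. This is only a sketch; the option letters -f and -p and the function name parse_args are my own choices, not anything from the posted script.

```shell
#!/bin/sh
# Hypothetical option letters: -f for the data file, -p for the pattern file.
parse_args() {
    FILE=data1            # defaults taken from the prototype script
    PATTERN=pattern1
    OPTIND=1              # reset so the function can be called more than once
    while getopts 'f:p:' opt "$@"; do
        case $opt in
            f) FILE=$OPTARG ;;
            p) PATTERN=$OPTARG ;;
            *) echo "usage: s3 [-f file] [-p patternfile]" >&2; return 2 ;;
        esac
    done
}

# Omitting -f keeps the default data file; only the pattern file changes.
parse_args -p colors.txt
echo "FILE=$FILE PATTERN=$PATTERN"
```

Either argument can then be omitted independently, which is the case that positional defaults like ${1-data1} handle poorly.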
Quote:
One follow-up after rescanning your first example.
Code:
#!/usr/bin/env sh
I realize that this makes a script portable, and keeps the Solaris ksh users happy, but doesn't using /usr/bin/env introduce a potential security problem if a script is run as root (if called from an suid script or program), because it relies on the user's environment? On Linux SUID scripts aren't allowed, but on Solaris they might be. I'm not criticizing your script here. This is commonly used for portability. I wonder if it shouldn't be.
Yes, you are quite correct. Albing, et al in O'Reilly's bash Cookbook, page 321, make note of the possible security problem, but they consider it a minor one. I post in a number of different places, so I came down on the side of portability for scripts that I post. For my personal scripts, I usually use a shebang line like:
Code:
#!/bin/bash -
to avoid some forms of spoofing as they describe on page 283. For the portability aspects of this construct, I use the venerable perl script fixit to find and replace processor path names.
I will consider placing a comment in my posting template to warn of the possible security implications of the env construct.
Thank you for your comments; it always helps to have another pair of eyes on the code ... cheers, (your neighbor) makyo
Given that the patterns in the pattern file are regular expressions, you want to use -f on its own rather than together with -F, which would treat every pattern as a fixed string. It seems that you want to isolate the non-Chinese characters and then count their appearances. You could replace non-alphanumeric characters with spaces as a start. This will break up the rest into words. Example:
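A pipeline along these lines, as a sketch with an invented two-line sample. It runs the tools under LC_ALL=C on the assumption that only ASCII letters and digits should count as alphanumeric, so every byte of the Chinese text is turned into a separator.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' '你好 Linux 用户, linux rocks!' 'GREP 和 grep' > "$tmp/doc.txt"

# 1. Replace everything that is not an ASCII letter or digit with a space.
# 2. Turn runs of spaces into single newlines, one word per line.
# 3. Drop the empty line left by leading separators.
# 4. Sort ignoring case, then count case-insensitive duplicates.
words=$(LC_ALL=C tr -c '[:alnum:]' ' ' < "$tmp/doc.txt" |
    LC_ALL=C tr -s ' ' '\n' |
    sed '/^$/d' |
    LC_ALL=C sort -f |
    LC_ALL=C uniq -ic |
    awk '{print $1, $2}')

printf '%s\n' "$words"
rm -rf "$tmp"
```

For this sample the counts are 2 for grep, 2 for linux and 1 for rocks, ignoring case.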
The "sort -f" command will sort the words while ignoring the case of the letters. This will group identical words on adjacent lines. The "uniq -ic" command will eliminate duplicates and count how many times each word occurs.
If you want to isolate individual patterns, you could put a grep filter somewhere in between to filter words based on your pattern list.
I'm not sure what you meant by not being able to use [[:alpha:]] classes. I used it here partly out of habit, and partly because I took you to mean that the Chinese characters don't match them. I thought that since I was negating, that wouldn't matter. Maybe [^a-zA-Z0-9] would work for you. P.S. In editing, certain characters were replaced by HTML tags and XML alias patterns.