grep help

aggressivebloodcell · 06-26-2007, 05:57 PM

Hey all,

I need some help using grep. How would I do a word count for specific words in a file? Let's say I am searching for apples, bananas and grapes in a text file and need to output the frequency of each word. I don't want to grep for those words one by one.

Much Appreciated,

abc

jschiwal · 06-26-2007, 06:15 PM

I think it would be easier to use sed or tr to replace whitespace with newlines; sort the output; use grep -f wordlist to keep only the words you want; and use uniq -c to count the occurrences of each word.

#break up text into word list (using tr) |
#filter out unwanted words (using grep -f wordlist) |
#sort |
#count the words (using uniq -c)

You may need to use sed somewhere in the pipeline to assemble patterns like "them-\nselves" -> "themselves". It depends on the format of the text file. You may see patterns like "my-\n\tself" or "my-\n self" that need to be fixed as well. If you use "tr" to replace returns and all whitespace with single spaces, you could pipe the output through sed to remove " -" patterns; then run through 'tr' again to change all of the spaces to returns.
Also, in the word list, you will want to remove punctuation such as periods, so that you don't have separate entries for "book" and "book." for example.

Examine the output of each stage of the pipeline to make sure that it is what you expect. A lot of the tweaking is adjusting the options.
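
The steps above can be sketched roughly as follows. This is only one possible reading of the description: the file names sample.txt and wordlist.txt and the function name count_words are made up for illustration, case folding is an added assumption, and words hyphenated across line breaks are not rejoined.

```shell
# Hypothetical sample input and word list; replace with your own files.
printf 'I like apples. Apples and grapes,\nbut not bananas or more grapes.\n' > sample.txt
printf 'apples\nbananas\ngrapes\n' > wordlist.txt

count_words() {
    # $1 = text file, $2 = file with one wanted word per line
    tr -cs '[:alpha:]' '\n' < "$1" |   # one word per line, punctuation dropped
        tr '[:upper:]' '[:lower:]' |   # fold case so "Apples" counts as "apples"
        grep -Fx -f "$2" |             # keep only lines that exactly equal a listed word
        sort |
        uniq -c                        # count each distinct word
}

count_words sample.txt wordlist.txt
```

Here grep -F treats the list entries as fixed strings and -x requires the whole line to match, so "pineapples" would not be counted as "apples".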

macemoneta · 06-26-2007, 06:16 PM

Here's one way:

Code:

grep -o "apples\|bananas\|grapes" somefile.txt | sort | uniq -c
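
A possible variant of the one-liner above, in case whole-word and case-insensitive matching are wanted. The file name fruit.txt and the function name count_fruit are hypothetical, and -o is a GNU/BSD grep option rather than a POSIX one, so treat this as a sketch under those assumptions.

```shell
# Hypothetical sample file for illustration.
printf 'apples grapes pineapples\nGrapes and bananas\n' > fruit.txt

count_fruit() {
    # -o: print each match on its own line
    # -w: whole words only, so "pineapples" is not counted as "apples"
    # -i: ignore case
    # -E: extended regex, so | needs no backslash
    grep -owiE 'apples|bananas|grapes' "$1" |
        tr '[:upper:]' '[:lower:]' |   # merge "Grapes" and "grapes" before counting
        sort |
        uniq -c
}

count_fruit fruit.txt
```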

pixellany · 06-26-2007, 06:18 PM

grep -o personnel tmpfile|wc -w
finds the word "personnel" and counts the occurrences. If you want to find several different words in one pass, I think you have to make a small script.

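(Shell sketch; replace the word list and file name with your own)
for i in apples bananas grapes; do
    count=$(grep -o "$i" somefile.txt | wc -l)   # number of matches for this word
    echo "$i $count"
done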

man grep for more on how grep works

AwesomeMachine · 06-26-2007, 07:21 PM

tr ' ' '\n' < file.txt | sort | uniq -c

Tinkster · 06-26-2007, 07:34 PM

And, as (almost) always, an awk version:

Code:

 awk 'BEGIN{RS="[ \t\n]+"} /word1/ || /word2/ || /word3/ {word[$1]++} END{for (i in word){print i" : "word[i]}}' file

Cheers, Tink
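
For comparison, a sketch of an awk approach that matches whole fields exactly instead of using regular expressions. The file name fruit2.txt and the variable name words are assumptions made up for this example, and punctuation attached to a word is not stripped.

```shell
# Hypothetical sample file for illustration.
printf 'apples grapes pineapples\ngrapes and bananas\n' > fruit2.txt

awk -v words='apples bananas grapes' '
BEGIN {
    # Build a lookup table of the wanted words.
    n = split(words, w, " ")
    for (i = 1; i <= n; i++) want[w[i]] = 1
}
{
    # Count every field that is exactly one of the wanted words.
    for (f = 1; f <= NF; f++) if ($f in want) count[$f]++
}
END {
    # Print in the order the words were given; words never seen show 0.
    for (i = 1; i <= n; i++) print w[i] " : " (count[w[i]] + 0)
}' fruit2.txt > counts.txt

cat counts.txt
```

Because the comparison is an exact lookup, "pineapples" is not counted as "apples".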

aggressivebloodcell · 06-27-2007, 04:43 PM

Thanks all.. you guys are very fast at replying.

-abc