LinuxQuestions.org


George2 11-27-2007 02:29 AM

word count issue
 
Hello everyone,


What is the command to count the occurrences of the word *FOO* in a given file (e.g. goo.txt)?


thanks in advance,
George

matthewg42 11-27-2007 02:47 AM

You have to combine some commands. If you can split the file into one word per line, then you can pipe that into grep -c.

The tr command can be used to do that splitting. You could also use awk, perl or sed or some other tools, but tr is probably the smallest program to invoke, so it's a good choice. In this example we split on spaces or tabs. It is simple to add other word splitting characters if you wish.

Code:

tr ' \t' '\n\n' input_file |grep -c '^foo$'

George2 11-27-2007 03:44 AM

Thanks matthewg42,


Why does the following command split on space, tab, or \n?

tr ' \t' '\n\n'


regards,
George

radoulov 11-27-2007 03:56 AM

Assuming GNU grep and no embedded newlines in FOO:

Code:

grep -o '\bFOO\b' goo.txt|wc -l

grep -c counts matching lines, not matches, so it would count FOO FOO on the same line only once
(that's why I use -o with wc -l).
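To see the difference on a small sample, here is a rough sketch. The file name sample.txt and its contents are made up for illustration, and GNU grep is assumed for -o and \b:

```shell
# Hypothetical sample file, created only for this illustration.
printf 'FOO FOO\nbar FOO baz\nFOOFOO\n' > sample.txt

# -c counts matching LINES: only the first two lines contain the word FOO.
lines=$(grep -c '\bFOO\b' sample.txt)

# -o prints each match on its own line, so wc -l counts MATCHES.
matches=$(grep -o '\bFOO\b' sample.txt | wc -l)

echo "matching lines: $lines, matches: $matches"
rm -f sample.txt
```

With GNU grep this should report 2 matching lines and 3 matches, because FOOFOO contains no word boundary around either FOO.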

George2 11-27-2007 05:31 AM

Thanks radoulov,


Why did you add \b before and after FOO?


regards,
George

radoulov 11-27-2007 05:49 AM

Quote:

Originally Posted by George2 (Post 2972074)
Thanks radoulov,


Why did you add \b before and after FOO?

\b matches a word boundary.
Consider this (print is the ksh/zsh builtin; in other shells printf '%s\n' does the same job):

Code:

$ print 'FOO,FOO
FOOFOO
"FOO"
(FOO)
xFOOx'|grep FOO   
FOO,FOO
FOOFOO
"FOO"
(FOO)
xFOOx
$ print 'FOO,FOO
FOOFOO
"FOO"
(FOO)
xFOOx'|grep  '^FOO$' 
$
$ print 'FOO,FOO
FOOFOO
"FOO"
(FOO)
xFOOx'|grep '\bFOO\b'
FOO,FOO
"FOO"
(FOO)

For more information, look up word boundaries in regular expressions.

Edit: GNU grep has the -w option with the same meaning.
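A quick way to check that claim, sketched with the same sample lines as above. The variable names are only illustrative, and GNU grep is assumed for \b, -w and -o:

```shell
# The sample lines from this post, held in a variable so they can be reused.
sample='FOO,FOO
FOOFOO
"FOO"
(FOO)
xFOOx'

# Select lines with \b...\b and with -w, then compare the two results.
with_b=$(printf '%s\n' "$sample" | grep '\bFOO\b')
with_w=$(printf '%s\n' "$sample" | grep -w 'FOO')
[ "$with_b" = "$with_w" ] && echo "same lines selected"

# -o and -w together: one output line per whole-word match, counted by wc -l.
count=$(printf '%s\n' "$sample" | grep -ow 'FOO' | wc -l)
echo "occurrences: $count"
```

On this sample both forms select the same three lines, and the count is 4 because FOO,FOO contributes two matches.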

matthewg42 11-27-2007 06:11 AM

Quote:

Originally Posted by George2 (Post 2971980)
Thanks matthewg42,
Why does the following command split on space, tab, or \n?

tr ' \t' '\n\n'


regards,
George

tr reads two lists of characters, then goes through all the input and replaces every instance of a character in the first list with the corresponding character in the second list.

Consider these examples:
Code:

% echo "This is my input string" | tr 'itp' 'IT_'
tr replaces all instances of 'i' with 'I', all instances of 't' with 'T' and all instances of 'p' with '_'. Thus the output is:
Code:

ThIs Is my In_uT sTrIng
You can have as many characters as you like in the two parameters. You can also use ranges of characters, as long as the two ranges are the same length:
Code:

% echo "This is my input string" | tr '[a-j]' '[0-9]'
T78s 8s my 8nput str8n6
% echo "This is my input string" | tr 'abcdefghik' 'xxxxxxxxxx'
Txxs xs my xnput strxnx
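As a side note, POSIX tr does not need the square brackets around a range; inside the lists they are treated as ordinary characters that map to themselves. A small sketch (the variable names are only for illustration):

```shell
# a-z and A-Z are both 26 characters long, so the ranges line up one to one.
upper=$(echo "This is my input string" | tr 'a-z' 'A-Z')
echo "$upper"

# The same letter-to-digit mapping as above, written without brackets.
digits=$(echo "This is my input string" | tr 'a-j' '0-9')
echo "$digits"
```

The second command should give the same T78s 8s my 8nput str8n6 as the bracketed version.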

A space is represented simply with a space character. Tabs are represented with '\t', new lines with '\n'. Hence the behaviour of the original tr command:
Code:

% echo "This is my input string" | tr ' \t' '\n\n'
This
is
my
input
string

Once the output is one word per line like that, you can use grep -c to count all lines which match the pattern. Since you want to match the whole word, you can add ^ and $ around the pattern. Alternatively you could use the -x option to grep to match the whole line, so a slightly different version of the originally suggested command is like this:
Code:

tr ' \t' '\n\n' < input_file |grep -cx 'foo'

I made a small mistake in the original command, thinking tr takes an optional third parameter naming an input file. This is not the case: tr only reads standard input, so this version uses input redirection with the < operator.
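Here is a rough end-to-end sketch of that pipeline, including the extra splitting characters mentioned earlier. The contents of input_file are made up for illustration, and treating commas and full stops as separators is just one possible choice:

```shell
# Hypothetical input file, created only for this illustration.
printf 'foo bar\tfoo\nfoo, food foo\n' > input_file

# Split on spaces and tabs only: "foo," and "food" are not exact matches.
plain=$(tr ' \t' '\n\n' < input_file | grep -cx 'foo')

# Also split on commas and full stops; the second list has one \n per character in the first.
punct=$(tr ' \t,.' '\n\n\n\n' < input_file | grep -cx 'foo')

echo "spaces/tabs only: $plain, with punctuation: $punct"
rm -f input_file
```

On this sample the first count is 3 and the second is 4, since "foo," only becomes an exact match once the comma is turned into a newline. Note that grep -cx 'foo' is case sensitive, so counting FOO needs the pattern in upper case.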


All times are GMT -5.