LinuxQuestions.org - word count issue

Hello everyone,

What command counts the occurrences of the word *FOO* in a given file (e.g. goo.txt)?

thanks in advance,
George

You have to combine some commands. If you can split the file into one word per line, then you can pipe that into grep -c.

The tr command can be used to do that splitting. You could also use awk, perl or sed or some other tools, but tr is probably the smallest program to invoke, so it's a good choice. In this example we split on spaces or tabs. It is simple to add other word splitting characters if you wish.

Code:

tr ' \t' '\n\n' input_file |grep -c '^foo$'

Thanks matthewg42,

Quote:

Originally Posted by matthewg42 (Post 2971944)

Code:

tr ' \t' '\n\n' input_file |grep -c '^foo$'

Why does the following command split on space, tab or \n?

tr ' \t' '\n\n'

regards,
George

Assuming GNU grep and no embedded newlines in FOO:

Code:

grep -o '\bFOO\b' goo.txt|wc -l

grep -c counts matching lines, not matches, so FOO FOO on the same line
would be counted only once. That's why I use -o (one match per output line) and wc -l.
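
To see the difference between counting lines and counting matches, here is a small sketch. It assumes GNU grep (for -o and \b); the sample text and the variable names are made up for illustration and are not from the thread.

```shell
# Build a small sample: FOO appears three times, but on only two lines.
sample=$(mktemp)
printf '%s\n' 'FOO and FOO again' 'no match here' 'one more FOO' > "$sample"

# grep -c counts the lines that contain at least one match.
lines_with_foo=$(grep -c '\bFOO\b' "$sample")

# grep -o prints every match on its own line; wc -l counts those lines.
total_foo=$(grep -o '\bFOO\b' "$sample" | wc -l)

echo "lines: $lines_with_foo, occurrences: $total_foo"
rm -f "$sample"
```

With GNU grep this should report 2 matching lines but 3 occurrences.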

Thanks radoulov,

Why do you add \b before and after FOO?

regards,
George

Quote:

Originally Posted by George2 (Post 2972074)

Thanks radoulov,

Why do you add \b before and after FOO?

\b matches a word boundary.
Consider this:

Code:

$ print 'FOO,FOO
FOOFOO
"FOO"
(FOO)
xFOOx'|grep FOO
FOO,FOO
FOOFOO
"FOO"
(FOO)
xFOOx

$ print 'FOO,FOO
FOOFOO
"FOO"
(FOO)
xFOOx'|grep '^FOO$'
$

$ print 'FOO,FOO
FOOFOO
"FOO"
(FOO)
xFOOx'|grep '\bFOO\b'
FOO,FOO
"FOO"
(FOO)

For more info check word boundaries

Edit: GNU grep has the -w option with the same meaning.
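
As a quick check that -w and \b agree when counting, the sketch below runs both on the same five lines used above. It assumes GNU grep; the variable names are only illustrative.

```shell
# Same five test lines as in the example above.
data='FOO,FOO
FOOFOO
"FOO"
(FOO)
xFOOx'

# Count whole-word matches using explicit \b word boundaries.
with_b=$(printf '%s\n' "$data" | grep -o '\bFOO\b' | wc -l)

# Count whole-word matches using the -w option instead.
with_w=$(printf '%s\n' "$data" | grep -ow 'FOO' | wc -l)

echo "with \\b: $with_b, with -w: $with_w"
```

Both counts should come out as 4: two from FOO,FOO and one each from "FOO" and (FOO).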

Quote:

Originally Posted by George2 (Post 2971980)

Thanks matthewg42,
Why does the following command split on space, tab or \n?

tr ' \t' '\n\n'

regards,
George

tr reads two lists of characters, then goes through all the input and replaces each character that appears in the first list with the character at the same position in the second list.

Consider these examples:

Code:

% echo "This is my input string" | tr 'itp' 'IT_'

tr replaces all instances of 'i' with 'I', all instances of 't' with 'T' and all instances of 'p' with '_'. Thus the output is:

Code:

ThIs Is my In_uT sTrIng

You can have as many characters as you like in the two parameters. You can also use ranges of characters, as long as both ranges have the same length:

Code:

% echo "This is my input string" | tr '[a-j]' '[0-9]'
T78s 8s my 8nput str8n6

% echo "This is my input string" | tr 'abcdefghik' 'xxxxxxxxxx'
Txxs xs my xnput strxnx

A space is represented simply with a space character. Tabs are represented with '\t', new lines with '\n'. Hence the behaviour of the original tr command:

Code:

% echo "This is my input string" | tr ' \t' '\n\n'
This
is
my
input
string

Once the output is one word per line like that, you can use grep -c to count the lines that match the pattern. Since you want to match the whole word, you can anchor the pattern with ^ and $. Alternatively, you could use grep's -x option to match the whole line, so a slightly different version of the originally suggested command looks like this:

Code:

tr ' \t' '\n\n' < input_file |grep -cx 'foo'

I made a small mistake in the original command, thinking tr takes an optional third parameter naming an input file. It does not (tr only reads standard input), so here I use input redirection with the < operator.
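
The command above splits only on spaces and tabs, so a word followed by a comma or wrapped in quotes would be missed. One possible way to add other splitting characters is sketched below. It uses tr's -c (complement) and -s (squeeze) options to turn every run of non-alphanumeric characters into a single newline. The function name count_word is hypothetical, and the use of grep -F to treat the word as a fixed string is my own assumption about what is wanted.

```shell
# count_word is a hypothetical helper, not something from the thread.
# Usage: count_word WORD FILE
count_word() {
    # -c: operate on the complement of the set (everything NOT alphanumeric)
    # -s: squeeze each run of replaced characters into one newline
    # grep -c counts lines, -x requires a whole-line match,
    # -F treats WORD as a literal string rather than a regex.
    tr -cs '[:alnum:]' '\n' < "$2" | grep -cxF -- "$1"
}
```

For example, count_word foo input_file prints how many times foo occurs as a whole word. Note that grep exits with a non-zero status when the count is 0, although it still prints 0.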