LinuxQuestions.org


vincix 11-21-2016 08:24 AM

awk script - print number of words
 
I have the following script (from lynda's bash course) in a file called awk_script:
Quote:

{for (i=1;i<=NF;i++)
words[$i]++}
END{printf("is=%d,ls=%d,the=%d,with=%d\n", words["is"],words["ls"],words["the"],words["with"])}
The point of the script is to count the occurrences of the respective words.

And I run the following:
Code:

man ls | col -b | awk -f awk_script
And I get some results.

My question is, how are the strings actually turned into digits? I understand "i" is used as a counter for "words", and that the words are taken one at a time so that the variable "words" is filled with these values. But it seems that this array variable already contains digits, and not strings, or am I wrong? And if so, how does it recognise a variable such as words["ls"]? How does it know to associate that with the number of occurrences?

pan64 11-21-2016 08:45 AM

$1 is the first word of the line, $2 is the second, $3 is the third.... $NF is the last.
words[] is what is called an associative array; it will look like:
words["is"], words["ls"] and so on and all of them are counters which will be incremented by words[$i]++.

vincix 11-21-2016 09:11 AM

Quote:

Originally Posted by pan64 (Post 5632948)
$1 is the first word of the line, $2 is the second, $3 is the third.... $NF is the last.
words[] is what is called an associative array; it will look like:
words["is"], words["ls"] and so on and all of them are counters which will be incremented by words[$i]++.


How come all of them are counters? Is that something specific to arrays?

pan64 11-21-2016 09:29 AM

No, it is not specific to arrays. The elements are incremented (that is what ++ does), so each one serves as a counter of how often its index occurs, and the indices of the array are the words that appeared in the input text.
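A minimal sketch of that behaviour, with no input at all. The array name `count` and the shell variable `out` are made up for this illustration. As far as I know, an awk array element that is referenced for the first time starts out empty, which counts as 0 in arithmetic, so nothing has to be initialised before the first ++.

```shell
out=$(awk 'BEGIN {
    count["ls"]++      # element did not exist yet: starts at 0, becomes 1
    count["ls"]++      # same key again: 1 becomes 2
    count["the"]++     # a different key is a separate, independent counter
    # "with" was never incremented, so %d prints it as 0
    printf("ls=%d,the=%d,with=%d\n", count["ls"], count["the"], count["with"])
}')
printf '%s\n' "$out"
```

This prints ls=2,the=1,with=0, which has the same shape as the output of the script in the first post.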

grail 11-21-2016 11:10 AM

Instead of running code on such a large amount of data, use a small subset and check the output for yourself :)

As an example, from your own post:
Code:

echo 'My question is, how are the strings actually turned into digits? I understand "i" is used as a counter for "words".' | ./vincix.awk
is=1,ls=0,the=1,with=0


vincix 11-21-2016 01:39 PM

I simply don't understand. I've been trying to understand this script for some time; I didn't come to this forum right away. But there's something more complicated in here which might seem obvious to someone more knowledgeable, but not to me.

@grail the code is actually very small. I understand the printf part perfectly. What I really don't understand is how the words[$i] works exactly, that's why I was looking for someone who might break it apart for me.

I know you're trying to be helpful, but for instance what pan64 said I had already understood. I know that's an increment, I know that i is incremented until it reaches NF (by the way, the original script was <NF, not <=NF, which I found strange, because I think it's missing a word if it's at the end of the line, but that's secondary). But it's probably the way awk interprets the results that I don't understand. I'm not sure, really.

So I'd like a more didactic and explicit explanation, if anyone's willing to do that.
So, i is the counter, that's obvious. But if i is the counter, shouldn't it be a number? And yet, there's words[$i] and later on we're talking about words["string"]. Do you know what I mean?:) How is this translation from string to number actually being made?

Someone said that it's a little bit more similar to object-oriented programming, in a way. Does that make sense?

astrogeek 11-21-2016 02:27 PM

I think I see where your confusion originates...

Code:

for (i=1;i<=NF;i++)
    words[$i]++

'i' is the loop counter, which is incremented from 1 to NF. (You appear to be correct about <=NF by the way.)

So you expect the usage words[$i] to be a numeric index, like words[17] when i=17. But that is not correct.

Remember: In awk expressions, the $ operator always references fields of the current input record. When awk sees the $ operator it expects it to be followed by an expression that yields a number, so it evaluates whatever follows (here the variable i) and uses the numeric value as the field number. Non-numeric string values evaluate numerically to zero.

So suppose the 17th field to be the word "with", and i=17, then the expression words[$i] becomes words[$17] which evaluates to words["with"], the value of which is then incremented.

The result is an associative array named words, the indexes of which are the input words and the values of which are the accumulated counts per word.

Hope that helps!
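To see the difference between i and $i directly, here is a small sketch that prints both for every field of a one-line input. The sample sentence and the shell variable `out` are only assumptions for the demonstration.

```shell
out=$(echo 'ls lists the files' | awk '{
    for (i = 1; i <= NF; i++)
        printf("i=%d  $i=%s\n", i, $i)   # i is the field number, $i is the text of that field
}')
printf '%s\n' "$out"
```

The loop runs four times: i goes from 1 to 4 while $i is "ls", "lists", "the" and "files" in turn, and those strings are what end up as the array indices in words[$i]. With i<NF instead of i<=NF the last field, "files", would be skipped.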

astrogeek 11-21-2016 03:04 PM

Moved: This thread is more suitable in Programming and has been moved accordingly to help your thread/question get the exposure it deserves.

vincix 11-21-2016 03:09 PM

Great explanation! That's exactly what I wanted :) So that was the whole idea - the way in which awk processes the script and the meaning of $. That's one of the things I thought I understood, but only now do I understand it. I thought $i was simply invoking the variable i. I didn't consider it in the awk context. So that's an essential distinction :)

But even if $ references fields, awk still processes the whole line before going to the next one, right? I mean, it works like sed from this point of view, or does it not?

Thanks! I'm happy someone actually understood what I was trying to say :)

(thanks for moving the thread. I didn't even know "Programming" actually existed :) )

astrogeek 11-21-2016 03:40 PM

You are welcome!

Quote:

Originally Posted by vincix (Post 5633140)
But even if $ references fields, awk still processes the whole line before going to the next one, right? I mean, it works like sed from this point of view, or does it not?

The awk programming model has three major parts, two of which are optional. It looks something like this:

Code:

BEGIN{ /*This block, if present, is executed once before the input is processed... */ }
_______________________________________________________________________________________
Main loop, may include multiple blocks and is processed once per line of input
----------- (line 1)
----------- (line 2)
----------- ...
----------- (line n)
_______________________________________________________________________________________
END{ /*This block, if present, is executed once after all input has been processed... */ }

In your script there is no BEGIN block, the for loop constitutes the main loop, and the END block prints the final result. So to answer your question, the main loop processes once per line similar to sed, yes.
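The three parts can be watched in action with a runnable sketch like the following. The two-line input and the shell variable `out` are invented for the example; NR and NF are awk's built-in record and field counts.

```shell
out=$(printf 'one two\nthree\n' | awk '
BEGIN { print "start" }                         # runs once, before any input is read
      { print "line " NR " has " NF " words" }  # runs once per input line
END   { print "end after " NR " lines" }        # runs once, after the last line
')
printf '%s\n' "$out"
```

It prints "start", then one message per input line, then "end after 2 lines", which matches the order in the diagram above.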

MadeInGermany 11-21-2016 04:46 PM

At the END there is the challenge of printing the whole hashed (text-addressed) array, where the stored values are numbers.
The usual way is to loop over all the (text-)keys:
Code:

END { for (key in words) { printf "%s=%d\n",key,words[key] } }
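Put together with the counting loop from the first post, that END block gives a complete word counter. This is only a sketch on a made-up sentence; the shell variable `out` is an assumption, and the pipe through sort is there because awk does not promise any particular order for "for (key in words)".

```shell
out=$(echo 'the cat and the dog and the bird' | awk '
    { for (i = 1; i <= NF; i++) words[$i]++ }                  # count every field of every line
END { for (key in words) printf "%s=%d\n", key, words[key] }   # one line per distinct word
' | sort)
printf '%s\n' "$out"
```

After sorting, the result lists and=2, bird=1, cat=1, dog=1 and the=3, one per line.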

danielbmartin 11-21-2016 07:51 PM

It is possible to dispense with the for loop and the issue of $1, $2, $3, etc.

With this InFile ...
Code:

Once upon a midnight dreary, while I pondered weak and weary,
Over many a quaint and curious volume of forgotten lore,
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
''Tis some visitor,' I muttered, 'tapping at my chamber door -
Only this, and nothing more.'

... this awk ...
Code:

awk 'BEGIN {RS="[[:space:][:punct:]]"}  # regex RS (gawk): split on whitespace or punctuation
      {a[$0]++}
    END{print  "rapping",a["rapping"],
              "\ntapping",a["tapping"],
              "\nchamber",a["chamber"]}'  \
"$InFile" >"$OutFile"

... produced this OutFile ...
Code:

rapping 2
tapping 2
chamber 2

Daniel B. Martin

ntubski 11-22-2016 09:10 AM

Quote:

Originally Posted by vincix (Post 5633140)
But even if $ references fields, awk still processes the whole line before going to the next one, right? I mean, it works like sed from this point of view, or does it not?

Awk works one record at a time. Records are lines by default, but you can change that by changing RS, the record separator (as in danielbmartin's solution).

https://www.gnu.org/software/gawk/ma...e/Records.html
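A small sketch of what changing RS does, using a single-character separator so that it should behave the same in any awk. The sample input and the shell variable `out` are assumptions for the illustration.

```shell
out=$(printf 'a b;c d e;f' | awk '
BEGIN { RS = ";" }                              # records now end at ";" instead of newline
      { print NR ": " $0 " (" NF " fields)" }   # $0 is the whole record, NF its field count
')
printf '%s\n' "$out"
```

Each semicolon-separated chunk is treated as one record, so the main block runs three times here even though the input contains no newline at all.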

danielbmartin 11-22-2016 08:34 PM

The solution offered in post #12 is unsatisfactory for words containing embedded punctuation such as won't or don't. I attempted to correct this shortcoming but cannot figure out the syntax.

With this InFile ...
Code:

John likes chicken, prefers turkey, and won't eat ham.
Luke likes chicken McNuggets.

... this awk ...
Code:

awk 'BEGIN {RS="[[:space:]]"}
      {gsub(/^[[:punct:]]|[[:punct:]]$/,""); a[$0]++}
    END{print    "chicken", a["chicken"],
                "\nturkey", a["turkey"],
                "\nham",    a["ham"],
                "\nwon''t",  a["won''t"]}'  \
$InFile >$OutFile

... produced this OutFile ...
Code:

chicken 2
turkey 1
ham 1
wont

Please advise.

Daniel B. Martin

grail 11-22-2016 09:41 PM

@Daniel - try using '\B' in gsub instead of anchors
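Independently of the regex, the empty count next to "wont" in the output above looks like a shell quoting effect: inside a single-quoted awk program, '' closes the quote and immediately reopens it, so awk never sees the apostrophe and looks up a["wont"]. Two common ways around that, shown here as a sketch on one made-up input line (the names `word` and `out` are assumptions), are to pass the word in with -v or to write the apostrophe as the octal escape \047.

```shell
out=$(printf "John won't eat ham.\n" | awk -v word="won't" '
    { for (i = 1; i <= NF; i++) a[$i]++ }
END { print word, a[word]                  # key handed over from the shell with -v
      print "won\047t", a["won\047t"] }    # \047 is the octal escape for the apostrophe
')
printf '%s\n' "$out"
```

Both print statements report the word with a count of 1, because the apostrophe now reaches awk intact.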


All times are GMT -5.