LinuxQuestions.org
Old 11-21-2016, 08:24 AM   #1
vincix
Senior Member
 
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240

Rep: Reputation: 103
awk script - print number of words


I have the following script (from lynda's bash course) in a file called awk_script:
Quote:
{for (i=1;i<=NF;i++)
words[$i]++}
END{printf("is=%d,ls=%d,the=%d,with=%d\n", words["is"],words["ls"],words["the"],words["with"])}
The point of the script is to count the occurrences of the respective words.

And I run the following:
Code:
man ls | col -b | awk -f awk_script
And I get some results.

My question is, how are the strings actually turned into digits? I understand "i" is used as a counter for "words". And then the words are taken one at a time, so that the variable "words" is filled with these values. But it seems that this array variable already contains digits, and not strings, or am I wrong? And if so, how does it recognise a variable such as words["ls"]? How does it know to associate that with the number of occurrences?
 
Old 11-21-2016, 08:45 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,843

Rep: Reputation: 7309
$1 is the first word of the line, $2 is the second, $3 is the third, ... and $NF is the last.
words[] is what is called an associative array. Its elements look like
words["is"], words["ls"] and so on, and each of them is a counter that is incremented by words[$i]++.
 
Old 11-21-2016, 09:11 AM   #3
vincix
Senior Member
 
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240

Original Poster
Rep: Reputation: 103
Quote:
Originally Posted by pan64 View Post
$1 is the first word of the line, $2 is the second, $3 is the third, ... and $NF is the last.
words[] is what is called an associative array. Its elements look like
words["is"], words["ls"] and so on, and each of them is a counter that is incremented by words[$i]++.

How come all of them are counters? Is that something specific to arrays?
 
Old 11-21-2016, 09:29 AM   #4
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,843

Rep: Reputation: 7309
No, it is not specific to arrays. The elements are incremented (that is what ++ does), so they serve as counters of how often each index of the array occurs, and the indices are the words that appeared in the input text.
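
A small sketch of why no initialisation is needed, as far as standard awk behaviour goes: referencing an array element that does not exist yet creates it with an empty value, which counts as 0 in arithmetic, so the first ++ turns it into 1. The helper name count_demo and the sample words are made up for this illustration and are not part of the original script.

```shell
# count_demo is a hypothetical helper name used only for this sketch.
count_demo() {
    awk '{
        for (i = 1; i <= NF; i++)
            words[$i]++        # element starts out empty (0 in arithmetic), then becomes 1, 2, ...
    }
    END {
        # "pear" never appeared in the input, so %d prints it as 0
        printf("apple=%d,kiwi=%d,pear=%d\n", words["apple"], words["kiwi"], words["pear"])
    }'
}

printf 'apple kiwi apple\nkiwi apple\n' | count_demo
# prints: apple=3,kiwi=2,pear=0
```

The same mechanism explains the ls=0 style results in the original script: a word that never occurs is simply an element that was never incremented.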
 
Old 11-21-2016, 11:10 AM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
Instead of running the code on such a large amount of data, use a small subset and check the output for yourself.

As an example, from your own post:
Code:
echo 'My question is, how are the strings actually turned into digits? I understand "i" is used as a counter for "words".' | ./vincix.awk 
is=1,ls=0,the=1,with=0
 
Old 11-21-2016, 01:39 PM   #6
vincix
Senior Member
 
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240

Original Poster
Rep: Reputation: 103
I simply don't understand. I've been trying to understand this script for some time; I didn't come to this forum right away. But there's something more complicated in here which might seem obvious to someone more knowledgeable, but not to me.

@grail the code is actually very small. I understand the printf part perfectly. What I really don't understand is how the words[$i] works exactly, that's why I was looking for someone who might break it apart for me.

I know you're trying to be helpful, but for instance what pan64 said I had already understood. I know that's an increment, I know that i is incremented until it reaches NF (by the way, the original script was <NF, not <=NF, which I found strange, because I think it's missing a word if it's at the end of the line, but that's secondary). But it's probably the way awk interprets the results that I don't understand. I'm not sure, really.

So I'd like a more didactic and explicit explanation, if anyone's willing to do that.
So, i is the counter, that's obvious. But if i is the counter, shouldn't it be a number? And yet, there's words[$i] and later on we're talking about words["string"]. Do you know what I mean? How is this translation from string to number actually being made?

Someone said that it's a little bit more similar to object-oriented programming, in a way. Does that make sense?

Last edited by vincix; 11-21-2016 at 01:41 PM.
 
Old 11-21-2016, 02:27 PM   #7
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,264
Blog Entries: 24

Rep: Reputation: 4194
I think I see where your confusion originates...

Code:
for (i=1;i<=NF;i++)
    words[$i]++
'i' is the loop counter, which is incremented from 1 to NF. (You appear to be correct about <=NF by the way.)

So you expect the usage words[$i] to be a numeric index, like words[17] when i=17. But that is not correct.

Remember: In awk expressions, the $ operator always references fields of the current input record. When awk sees the $ operator, it expects it to be followed by an expression that yields a field number; here that expression is the variable i, whose numeric value selects the field. Non-numeric string values evaluate numerically to zero.

So suppose the 17th field is the word "with" and i=17. Then the expression words[$i] becomes words[$17], which evaluates to words["with"], the value of which is then incremented.

The result is an associative array named words, the indexes of which are the input words and the values of which are the accumulated counts per word.

Hope that helps!
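
To see the difference between i and $i directly, a minimal sketch like the following prints both on every pass through the loop. The helper name show_fields and the sample sentence are assumptions made up for this illustration.

```shell
# show_fields is a hypothetical helper name used only for this sketch.
show_fields() {
    awk '{
        for (i = 1; i <= NF; i++)
            printf("i=%d  $i=%s\n", i, $i)   # i is the field number, $i is the text of that field
    }'
}

echo 'ls lists files' | show_fields
# prints:
# i=1  $i=ls
# i=2  $i=lists
# i=3  $i=files
```

With i<NF instead of i<=NF the loop would stop before the last field, so "files" would never be counted.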
 
4 members found this post helpful.
Old 11-21-2016, 03:04 PM   #8
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,264
Blog Entries: 24

Rep: Reputation: 4194
Moved: This thread is more suitable in Programming and has been moved accordingly to help your thread/question get the exposure it deserves.
 
Old 11-21-2016, 03:09 PM   #9
vincix
Senior Member
 
Registered: Feb 2011
Distribution: Ubuntu, Centos
Posts: 1,240

Original Poster
Rep: Reputation: 103
Great explanation! That's exactly what I wanted. So that was the whole idea: the way in which awk processes the script and the meaning of $. That's one of the things I thought I understood, but only now do I understand it. I thought $i was simply invoking the variable i. I didn't consider it in the awk context. So that's an essential distinction.

But even if $ references fields, awk still processes the whole line before going to the next one, right? I mean, it works like sed from this point of view, or does it not?

Thanks! I'm happy someone actually understood what I was trying to say

(Thanks for moving the thread. I didn't even know "Programming" actually existed.)

Last edited by vincix; 11-21-2016 at 03:19 PM.
 
Old 11-21-2016, 03:40 PM   #10
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=15, FreeBSD_12{.0|.1}
Posts: 6,264
Blog Entries: 24

Rep: Reputation: 4194
You are welcome!

Quote:
Originally Posted by vincix View Post
But even if $ references fields, awk still processes the whole line before going to the next one, right? I mean, it works like sed from this point of view, or does it not?
The awk programming model has three major parts, two of which are optional. It looks something like this:

Code:
BEGIN{ /*This block, if present, is executed once before the input is processed... */ }
_______________________________________________________________________________________
Main loop, may include multiple blocks and is processed once per line of input
----------- (line 1)
----------- (line 2)
----------- ...
----------- (line n)
_______________________________________________________________________________________
END{ /*This block, if present, is executed once after all input has been processed... */ }
In your script there is no BEGIN block; the block containing the for loop is the body of the main loop, and the END block prints the final result. So to answer your question: yes, the main loop runs once per line, similar to sed.
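
The three parts can be watched in action with a tiny sketch like this one. The helper name three_parts, the variable n and the two input lines are made up for the illustration.

```shell
# three_parts is a hypothetical helper name used only for this sketch.
three_parts() {
    awk 'BEGIN { print "start" }                 # runs once, before any input is read
               { n++; print "line " NR ": " $0 } # runs once per input line
         END   { print "end, " n " lines seen" } # runs once, after the last line
    '
}

printf 'alpha\nbeta\n' | three_parts
# prints:
# start
# line 1: alpha
# line 2: beta
# end, 2 lines seen
```

The counter n keeps its value between lines, which is the same reason the words array in the original script can accumulate counts across the whole input before END prints them.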
 
2 members found this post helpful.
Old 11-21-2016, 04:46 PM   #11
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,792

Rep: Reputation: 1201
In the END block, the challenge is to print the whole hashed (text-addressed) array, whose stored values are numbers.
The usual way is to loop over all the (text) keys:
Code:
END { for (key in words) { printf "%s=%d\n",key,words[key] } }
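
One thing to keep in mind with this loop: awk does not promise any particular order for "for (key in words)", so a common approach is to sort the output afterwards. The sketch below assumes a helper name dump_counts and a sample sentence that are made up for the illustration.

```shell
# dump_counts is a hypothetical helper name used only for this sketch.
dump_counts() {
    awk '{ for (i = 1; i <= NF; i++) words[$i]++ }
         END { for (key in words) printf "%s=%d\n", key, words[key] }' |
    LC_ALL=C sort    # the for-in loop visits keys in no particular order, so sort the lines
}

echo 'the cat and the hat' | dump_counts
# prints:
# and=1
# cat=1
# hat=1
# the=2
```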
 
1 members found this post helpful.
Old 11-21-2016, 07:51 PM   #12
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
It is possible to dispense with the for loop and the issue of $1, $2, $3, etc.

With this InFile ...
Code:
Once upon a midnight dreary, while I pondered weak and weary,
Over many a quaint and curious volume of forgotten lore,
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
''Tis some visitor,' I muttered, 'tapping at my chamber door -
Only this, and nothing more.'
... this awk ...
Code:
awk 'BEGIN {RS="[[:space:][:punct:]]"}
       {a[$0]++}
     END{print   "rapping",a["rapping"],
               "\ntapping",a["tapping"],
               "\nchamber",a["chamber"]}'  \
"$InFile" >"$OutFile"
... produced this OutFile ...
Code:
rapping 2 
tapping 2 
chamber 2
Daniel B. Martin

Last edited by danielbmartin; 11-21-2016 at 10:14 PM. Reason: Tighten code, slightly
 
2 members found this post helpful.
Old 11-22-2016, 09:10 AM   #13
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,781

Rep: Reputation: 2081
Quote:
Originally Posted by vincix View Post
But even if $ references fields, awk still processes the whole line before going to the next one, right? I mean, it works like sed from this point of view, or does it not?
Awk works one record at a time. Records are lines by default, but you can change that by changing RS, the record separator (as in danielbmartin's solution).

https://www.gnu.org/software/gawk/ma...e/Records.html
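
A minimal sketch of the effect of RS, using a single-character separator so that it should behave the same in any awk (as far as I know, a regular-expression RS like the one in post #12 is an extension that gawk and mawk provide, rather than something POSIX specifies). The helper name show_records and the sample input are made up for the illustration.

```shell
# show_records is a hypothetical helper name used only for this sketch.
show_records() {
    awk 'BEGIN { RS = "," }    # records now end at commas instead of newlines
         { print NR ": " $0 " (" NF " fields)" }'
}

printf 'one two,three four,five' | show_records
# prints:
# 1: one two (2 fields)
# 2: three four (2 fields)
# 3: five (1 fields)
```

NR counts records, not lines, so the single input line above is seen as three records.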
 
1 members found this post helpful.
Old 11-22-2016, 08:34 PM   #14
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
The solution offered in post #12 is unsatisfactory for words containing embedded punctuation such as won't or don't. I attempted to correct this shortcoming but cannot figure out the syntax.

With this InFile ...
Code:
John likes chicken, prefers turkey, and won't eat ham.
Luke likes chicken McNuggets.
... this awk ...
Code:
awk 'BEGIN {RS="[[:space:]]"}
       {gsub(/^[[:punct:]]|[[:punct:]]$/,""); a[$0]++}
     END{print    "chicken", a["chicken"],
                 "\nturkey", a["turkey"],
                 "\nham",    a["ham"],
                 "\nwon''t",  a["won''t"]}'  \
$InFile >$OutFile
... produced this OutFile ...
Code:
chicken 2 
turkey 1 
ham 1 
wont
Please advise.

Daniel B. Martin
 
Old 11-22-2016, 09:41 PM   #15
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
@Daniel - try using '\B' in gsub instead of anchors
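
Separately from the regex, the empty count in post #14 looks like a shell quoting effect: inside a single-quoted awk program, '' closes and immediately reopens the quotes, so awk receives "wont" rather than "won't". One way around it, sketched here under my own assumptions, is to write the apostrophe as the octal escape \047 inside the awk string. The helper name count_wont is made up, and the punctuation list in gsub is a simplified stand-in for [[:punct:]].

```shell
# count_wont is a hypothetical helper name used only for this sketch.
count_wont() {
    awk '{
        for (i = 1; i <= NF; i++) {
            w = $i
            gsub(/^[.,;:!?]+|[.,;:!?]+$/, "", w)   # trim punctuation at the ends only
            a[w]++
        }
    }
    END { print "won\047t", a["won\047t"] }'       # \047 is the octal code for an apostrophe
}

echo "John likes chicken, prefers turkey, and won't eat ham." | count_wont
# prints: won't 1
```

Because only the ends of each word are trimmed, the apostrophe inside won't survives and the word is counted under its full spelling.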
 
  


All times are GMT -5.