 LinuxQuestions.org [SOLVED] awk distinct count

09-04-2010, 08:00 PM   #1
HuJo
LQ Newbie

Registered: Sep 2010
Posts: 5

Rep:
awk distinct count

So, I've learned a great deal from the responses to my previous posting, particularly about arrays in awk (mighty thanks to all who responded). However, I'm now stuck on something that I think needs a nested loop, although I'm sure there's probably a much simpler solution.

Let's say I have a file like the following:

Joe Wolfhound
Joe Wolfhound
Joe Beagle
Mary Pug
Mary Dalmation
Joe Chihuahua
Mary Boxer
Jane Husky
Jane Husky
Joe Bulldog
Jane Ridgeback
Mary Malamute
Mary Boxer
Joe Chow
Paul Doberman
Paul Doberman
Paul Bernese

How do I find the number of breeds each person owns? I'm able to determine the number of unique breeds:

{ breed[$2] += 1 } END { for (i in breed) print i, breed[i] }

as well as the number of dogs per owner, but determining the number of distinct breeds per owner is giving me problems.

Thanks all in advance!
09-04-2010, 09:02 PM   #2
crts
Senior Member

Registered: Jan 2010
Posts: 1,608

Rep:
Hi,

this worked for me
Code:
```awk '
BEGIN {
    breed[""]=0;
    owner[""]=0;
}
{
    if (owner[$1 $2] == 0) {
        owner[$1 $2]++;
        breed[$1]+=1;
    }
}
END {
    for (i in breed) {
        if (i != "") {
            print breed[i],i;
        }
    }
}' infile```

Last edited by crts; 09-04-2010 at 09:07 PM. Reason: switched variable names to make more sense

1 members found this post helpful.
09-04-2010, 09:41 PM   #3
ghostdog74
Senior Member

Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep:
Code:
```awk '{a=$1;$1=""}(!(breeds[a]~$0 )){ breeds[a]=breeds[a]"|"$0} END{for(i in breeds){m=split(breeds[i],b,"|"); print i,breeds[i],m-1 }}' file```
And if you have Python,
Code:
```# Python 2 syntax; under Python 3 use breeds.items() and print(k, len(set(v)))
from collections import defaultdict
breeds=defaultdict(list)
with open("file") as f:
    for line in f:
        name,breed=line.rstrip().split(" ",1)
        breeds[name].append(breed)
for k,v in breeds.iteritems():
    print k,len(set(v))```
output
Code:
```$ ./python.py
Jane 2
Paul 2
Joe 5
Mary 4```

Last edited by ghostdog74; 09-04-2010 at 10:43 PM.

1 members found this post helpful.
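One caveat on the awk version above: `breeds[a] ~ $0` is a regular-expression match, so a breed whose name is contained in one already stored (say "Bulldog" after "French Bulldog") would be skipped. A minimal sketch that tests exact membership with an array key instead; the function name `count_distinct` and the sample lines are made up for illustration:

```shell
# count_distinct: hypothetical helper; reads "owner breed..." lines on stdin
count_distinct() {
    awk '{
        owner = $1
        breed = $0
        sub(/^[^ ]+ +/, "", breed)          # everything after the first field
        if (!((owner, breed) in seen)) {    # exact key lookup, no regex involved
            seen[owner, breed] = 1
            n[owner]++
        }
    }
    END { for (o in n) print o, n[o] }' | sort
}

printf 'Joe French Bulldog\nJoe Bulldog\nJoe Bulldog\n' | count_distinct
```

For this sample it should report `Joe 2`, since "French Bulldog" and "Bulldog" are kept apart and the repeated "Bulldog" line is ignored.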
09-04-2010, 09:53 PM   #4
joec@home
Member

Registered: Sep 2009
Location: Galveston Tx
Posts: 291

Rep:
I have been playing with the concept of embedding LISP-style commands in bash. LISP is an early artificial intelligence programming language. The concept is to write scripts that write their own scripts.

So let's start by declaring the variables so they are defined as integers with the value of zero. Your list is in a file called "list".
Code:
```cat list |awk '{print $1 $2 "='0'" }'
JoeWolfhound=0
JoeWolfhound=0
JoeBeagle=0
MaryPug=0
MaryDalmation=0
JoeChihuahua=0
MaryBoxer=0
JaneHusky=0
JaneHusky=0
JoeBulldog=0
JaneRidgeback=0
MaryMalamute=0
MaryBoxer=0
JoeChow=0
PaulDoberman=0
PaulDoberman=0
PaulBernese=0```
Now, for a problem this complex we cannot simply pipe the commands to bash, because the variables are local and as soon as each bash session ends the variables will be lost. So we need to build a global variable array, starting with element number 1. The echo " " at the end is needed for one last line break.
Code:
`for i in $(cat list) ; do COMMAND[1]=$( cat list |awk '{print $1 $2 "='0'" }' ; echo " " ) ; done`
The cool thing is that instead of writing all this code, we are generating the code by examining the data. As such we can preview the proposed script that is being developed.

Now we go through and add the functions together.
Code:
```cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }'
JoeWolfhound=$(($JoeWolfhound+1))
JoeWolfhound=$(($JoeWolfhound+1))
JoeBeagle=$(($JoeBeagle+1))
MaryPug=$(($MaryPug+1))
MaryDalmation=$(($MaryDalmation+1))
JoeChihuahua=$(($JoeChihuahua+1))
MaryBoxer=$(($MaryBoxer+1))
JaneHusky=$(($JaneHusky+1))
JaneHusky=$(($JaneHusky+1))
JoeBulldog=$(($JoeBulldog+1))
JaneRidgeback=$(($JaneRidgeback+1))
MaryMalamute=$(($MaryMalamute+1))
MaryBoxer=$(($MaryBoxer+1))
JoeChow=$(($JoeChow+1))
PaulDoberman=$(($PaulDoberman+1))
PaulDoberman=$(($PaulDoberman+1))
PaulBernese=$(($PaulBernese+1))```
Now we will put this into array element number 2.
Code:
`for i in $(cat list) ; do COMMAND[2]=$( cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }' ; echo " " ) ; done`
Now we need a unique listing of the variables that have been built, together with the commands to display the results.
Code:
```cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}'
echo JoeWolfhound $JoeWolfhound
echo JoeBeagle $JoeBeagle
echo MaryPug $MaryPug
echo MaryDalmation $MaryDalmation
echo JoeChihuahua $JoeChihuahua
echo MaryBoxer $MaryBoxer
echo JaneHusky $JaneHusky
echo JoeBulldog $JoeBulldog
echo JaneRidgeback $JaneRidgeback
echo MaryMalamute $MaryMalamute
echo MaryBoxer $MaryBoxer
echo JoeChow $JoeChow
echo PaulDoberman $PaulDoberman
echo PaulBernese $PaulBernese```
Again we will put this into the array, as element number 3.
Code:
`for i in $(cat list) ; do COMMAND[3]=$( cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}' ; echo " " ) ; done`
So now, if we check the entire array, we have a self-built script.
Code:
```echo "${COMMAND[@]}"
JoeWolfhound=0
JoeWolfhound=0
JoeBeagle=0
MaryPug=0
MaryDalmation=0
JoeChihuahua=0
MaryBoxer=0
JaneHusky=0
JaneHusky=0
JoeBulldog=0
JaneRidgeback=0
MaryMalamute=0
MaryBoxer=0
JoeChow=0
PaulDoberman=0
PaulDoberman=0
PaulBernese=0
JoeWolfhound=$(($JoeWolfhound+1))
JoeWolfhound=$(($JoeWolfhound+1))
JoeBeagle=$(($JoeBeagle+1))
MaryPug=$(($MaryPug+1))
MaryDalmation=$(($MaryDalmation+1))
JoeChihuahua=$(($JoeChihuahua+1))
MaryBoxer=$(($MaryBoxer+1))
JaneHusky=$(($JaneHusky+1))
JaneHusky=$(($JaneHusky+1))
JoeBulldog=$(($JoeBulldog+1))
JaneRidgeback=$(($JaneRidgeback+1))
MaryMalamute=$(($MaryMalamute+1))
MaryBoxer=$(($MaryBoxer+1))
JoeChow=$(($JoeChow+1))
PaulDoberman=$(($PaulDoberman+1))
PaulDoberman=$(($PaulDoberman+1))
PaulBernese=$(($PaulBernese+1))
echo JoeWolfhound $JoeWolfhound
echo JoeBeagle $JoeBeagle
echo MaryPug $MaryPug
echo MaryDalmation $MaryDalmation
echo JoeChihuahua $JoeChihuahua
echo MaryBoxer $MaryBoxer
echo JaneHusky $JaneHusky
echo JoeBulldog $JoeBulldog
echo JaneRidgeback $JaneRidgeback
echo MaryMalamute $MaryMalamute
echo MaryBoxer $MaryBoxer
echo JoeChow $JoeChow
echo PaulDoberman $PaulDoberman
echo PaulBernese $PaulBernese```
Now we simply pipe the array into bash. We will add the unset command to flush out the variables and start over.
Code:
```unset COMMAND
for i in $(cat list) ; do COMMAND[1]=$( cat list |awk '{print $1 $2 "='0'" }' ; echo " " ) ; done
for i in $(cat list) ; do COMMAND[2]=$( cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }' ; echo " " ) ; done
for i in $(cat list) ; do COMMAND[3]=$( cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}' ; echo " " ) ; done
echo "${COMMAND[@]}"| bash | sort
JaneHusky 2
JaneRidgeback 1
JoeBeagle 1
JoeBulldog 1
JoeChihuahua 1
JoeChow 1
JoeWolfhound 2
MaryBoxer 2
MaryBoxer 2
MaryDalmation 1
MaryMalamute 1
MaryPug 1
PaulBernese 1
PaulDoberman 2```
OK, so it would take a bit more work to print "Joe Wolfhound" instead of "JoeWolfhound", but you get the idea. Why write code to do the work, when you can write code to figure out the work for you?

Last edited by joec@home; 09-04-2010 at 09:56 PM.
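The duplicate `MaryBoxer 2` line in the output above comes from `uniq`, which only collapses adjacent duplicates; the two "Mary Boxer" lines are not next to each other in the file. A minimal sketch of the same per-pair tally using `sort | uniq -c`, reading the list on standard input (the function name `pair_counts` is hypothetical):

```shell
# pair_counts: hypothetical helper; reads "owner breed" lines on stdin
pair_counts() {
    # glue owner and breed together as in the post, then count each pair once
    awk '{ print $1 $2 }' | sort | uniq -c | awk '{ print $2, $1 }'
}

printf 'Mary Boxer\nJoe Chow\nMary Boxer\n' | pair_counts
```

With the thread's file this would be run as `pair_counts < list`. Like the generated script, it counts dogs per owner/breed pair, not distinct breeds per owner.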
09-04-2010, 09:58 PM   #5
syg00
LQ Veteran

Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 15,533

Rep:
Better hope there aren't any (blank separated) names - say "German Shepherd"
(referencing the awk offerings)

Last edited by syg00; 09-04-2010 at 09:59 PM. Reason: reference
09-04-2010, 10:09 PM   #6
HuJo
LQ Newbie

Registered: Sep 2010
Posts: 5

Original Poster
Rep:
Wow

You guys are fast... these are greatly helpful!

So I'm looking at the response provided by crts (you rock, BTW) to try and understand what is happening and I'm wrapped around the axle on this:
Code:
` if (owner[$1 $2] == 0) {`
is this a multidimensional array? (with no comma?) I'm new to awk (which should be obvious) and don't quite understand how this works.

Thanks again!
09-04-2010, 10:27 PM   #7
crts
Senior Member

Registered: Jan 2010
Posts: 1,608

Rep:
Quote:
 Originally Posted by HuJo You guys are fast... these are greatly helpful! So I'm looking at the response provided by crts (you rock, BTW) to try and understand what is happening and I'm wrapped around the axle on this: Code: ` if (owner[\$1 \$2] == 0) {` is this a multidimensional array? (with no comma?) I'm new to awk (which should be obvious) and don't quite understand how this works. Thanks again!
Hi,

this is an ordinary one-dimensional array. Writing $1 $2 with no comma concatenates the two fields into a single string (nothing is put between them), and that string is used as the index. So when the first line
Joe Wolfhound

is processed then it will create
JoeWolfhound

as index. You can access the array's value at this index with
owner["JoeWolfhound"]

Now if this element does not exist yet, it is created implicitly with an empty value, which compares equal to 0. In this case we increment the value and hence mark this combination as counted. The next time we see Joe and his wolfhound the comparison evaluates to false and it does not get counted again. The counting itself happens in the array breed.
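To illustrate the "with no comma" part of the question: with bare concatenation, two different field pairs can end up as the same key, whereas a comma subscript such as `owner[$1, $2]` joins the parts with awk's built-in `SUBSEP` character. A small sketch, with a made-up helper name (`key_demo`) and input chosen to force a collision:

```shell
# key_demo: hypothetical helper name; reads two-field lines on stdin
key_demo() {
    awk '{
        plain[$1 $2]++     # key is the bare concatenation of the two fields
        multi[$1, $2]++    # key is $1 SUBSEP $2, the comma form
    }
    END {
        np = 0; for (k in plain) np++
        nm = 0; for (k in multi) nm++
        print "concatenated keys:", np
        print "comma (SUBSEP) keys:", nm
    }'
}

# Two different records that collide once the fields are glued together
printf 'Jo eBeagle\nJoe Beagle\n' | key_demo
```

Here the concatenated form sees one key ("JoeBeagle" twice) while the comma form keeps two. For real names like the ones in this thread the collision is unlikely, so either form works.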

Hope this clears things up a bit.

Last edited by crts; 09-04-2010 at 10:39 PM.

1 members found this post helpful.
09-04-2010, 10:30 PM   #8
joec@home
Member

Registered: Sep 2009
Location: Galveston Tx
Posts: 291

Rep:
Quote:
 Originally Posted by syg00 Better hope there aren't any (blank separated) names - say "German Shepherd" (referencing the awk offerings)
Oh, we could go on with all the different problems that can happen with that one. On something I had actually taken time to develop, I had to write a subroutine that checked for CaSeSeNsItIvE names so that the variables were not duplicated in different cases.

09-04-2010, 10:32 PM   #9
crts
Senior Member

Registered: Jan 2010
Posts: 1,608

Rep:
Quote:
 Originally Posted by syg00 Better hope there aren't any (blank separated) names - say "German Shepherd" (referencing the awk offerings)
Ok,

this also copes with "German Shepherds"
Code:
```awk '
{
    # $0 is the whole line, so breed names containing blanks stay intact
    if (owner[$0] == 0) {
        owner[$0]++;
        breed[$1]+=1;
    }
}
END {
    for (i in breed) {
        if (i != "") {
            print breed[i],i;
        }
    }
}' infile```
Tested with
Code:
```Joe Wolfhound
Joe Wolfhound
Joe Beagle
Mary Pug
Mary Dalmation
Joe Chihuahua
Mary Boxer
Jane Husky
Jane Husky
Joe Bulldog
Jane Ridgeback
Mary Malamute
Mary Boxer
Joe Chow
Paul Doberman
Paul Doberman
Paul Bernese
Joe German Shepard
Joe German Doberman```
I think I made the last breed up. Just needed some sample data. Now Joe has 7 breeds.

Last edited by crts; 09-04-2010 at 10:35 PM. Reason: array initialisation removed

1 members found this post helpful.
09-04-2010, 10:54 PM   #10
HuJo
LQ Newbie

Registered: Sep 2010
Posts: 5

Original Poster
Rep:
Awesome. That makes total sense to me now. I hadn't seen that trick in any of the on-line awk tutorials yet. Thanks again!

Just for the record (not that it makes a difference):
- I'm working mostly with csv files so the space isn't an issue (I just made up the example input file to make the question easier), but it's good to know anyway.
- Dobermans are actually a breed of German origin I believe... nice coincidence!
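For comma-separated input the same idea carries over by setting the field separator. A sketch assuming simple CSV with the owner in column 1 and the breed in column 2, no header row, and no quoted fields containing commas; the function name and sample rows are invented for illustration:

```shell
# distinct_per_owner_csv: hypothetical helper; reads "owner,breed" rows on stdin
distinct_per_owner_csv() {
    awk -F, '
        !seen[$1, $2]++ { breeds[$1]++ }     # count each owner/breed pair once
        END { for (o in breeds) print o "," breeds[o] }
    ' | sort
}

printf 'Joe,Wolfhound\nJoe,Wolfhound\nJoe,German Shepherd\nMary,Pug\n' | distinct_per_owner_csv
```

For these four rows it should print `Joe,2` and `Mary,1`. Because the split is on commas, a breed with a blank in it is just part of the second field.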
09-04-2010, 11:46 PM   #11
grail
LQ Guru

Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,493

Rep:
Well here is one that should satisfy even syg00
Code:
`awk '!a[$0]++{b[$1]++}END{for(x in b) print x,"has",b[x],"breeds"}' infile`

1 members found this post helpful.
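The one-liner above packs the whole technique into a pattern: `!a[$0]++` is true only the first time a given line is seen. An expanded sketch of the same logic with longer, made-up variable names, piped through `sort` because the order of `for (x in b)` is unspecified:

```shell
# breeds_per_owner: hypothetical helper; reads "owner breed..." lines on stdin
breeds_per_owner() {
    awk '{
        if (seen[$0] == 0) {     # first time this exact owner/breed line appears
            count[$1]++          # so it is a new breed for this owner
        }
        seen[$0]++               # remember the line for later duplicates
    }
    END {
        for (owner in count)
            print owner, "has", count[owner], "breeds"
    }' | sort
}

printf 'Joe Wolfhound\nJoe Wolfhound\nJoe Beagle\nMary Pug\n' | breeds_per_owner
```

For this sample it should print `Joe has 2 breeds` and `Mary has 1 breeds`.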
