LinuxQuestions.org
09-04-2010, 09:00 PM   #1
HuJo
LQ Newbie
 
Registered: Sep 2010
Posts: 5

Rep: Reputation: 0
awk distinct count


So, I've learned a great deal from the responses to my previous posting, particularly about arrays in awk (mighty thanks to all who responded). However, I'm now stuck on something I think can be solved with a nested loop, although I'm sure there is probably a much simpler solution.

Let's say I have a file like the following:

Joe Wolfhound
Joe Wolfhound
Joe Beagle
Mary Pug
Mary Dalmation
Joe Chihuahua
Mary Boxer
Jane Husky
Jane Husky
Joe Bulldog
Jane Ridgeback
Mary Malamute
Mary Boxer
Joe Chow
Paul Doberman
Paul Doberman
Paul Bernese

How do I find the number of breeds each person owns? I'm able to determine the number of unique breeds:
{ breed[$2] += 1 } END { for (i in breed) print i, breed[i] }

as well as the number of dogs per owner, but determining the number of distinct breeds per owner is giving me problems.

Thanks all in advance!
 
09-04-2010, 10:02 PM   #2
crts
Senior Member

Registered: Jan 2010
Posts: 1,606

Rep: Reputation: 448
Hi,

this worked for me
Code:
awk '
BEGIN {
    breed[""]=0; owner[""]=0;
}
{
    if (owner[$1 $2] == 0) {
       owner[$1 $2]++;
       breed[$1]+=1;
    }
}
END {
    for (i in breed) {
       if (i != "") {
          print breed[i],i;
       }
   }
}' infile

Last edited by crts; 09-04-2010 at 10:07 PM. Reason: switched variable names to make more sense
 
1 member found this post helpful.
09-04-2010, 10:41 PM   #3
ghostdog74
Senior Member

Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Code:
awk '{a=$1;$1=""}(!(breeds[a]~$0 )){
 breeds[a]=breeds[a]"|"$0}
END{for(i in breeds){m=split(breeds[i],b,"|"); print i,breeds[i],m-1 }}' file
And if you have Python,

Code:
from collections import defaultdict
breeds=defaultdict(list)
with open("file") as f:
    for line in f:
        name,breed=line.rstrip().split(" ",1)
        breeds[name].append(breed)
for k,v in breeds.items():
    print(k,len(set(v)))
output (Python 3.7+, where dicts keep insertion order)
Code:
$ ./python.py
Joe 5
Mary 4
Jane 2
Paul 2

Last edited by ghostdog74; 09-04-2010 at 11:43 PM.
 
1 member found this post helpful.
09-04-2010, 10:53 PM   #4
joec@home
Member
 
Registered: Sep 2009
Location: Galveston Tx
Posts: 291

Rep: Reputation: 70
I have been playing with the concept of embedding LISP-style commands in bash. LISP is an early artificial intelligence programming language. The concept is to write scripts that write their own scripts. So let's start by declaring the variables so they are defined as integers with the value of zero. Your list is in a file called "list".

Code:
cat list |awk '{print $1 $2 "='0'" }' 
JoeWolfhound=0
JoeWolfhound=0
JoeBeagle=0
MaryPug=0
MaryDalmation=0
JoeChihuahua=0
MaryBoxer=0
JaneHusky=0
JaneHusky=0
JoeBulldog=0
JaneRidgeback=0
MaryMalamute=0
MaryBoxer=0
JoeChow=0
PaulDoberman=0
PaulDoberman=0
PaulBernese=0
Now for a problem this complex we cannot simply pipe the commands to bash, because the variables are local and will be lost as soon as each bash session ends. So we need to build a global array, starting with element number 1. The echo " " at the end is needed so that one last newline survives the command substitution.

Code:
for i in $(cat list) ; do COMMAND[1]=$( cat list |awk '{print $1 $2 "='0'" }' ; echo " " ) ; done
The cool thing is that instead of writing all this code, we are generating the code by examining the data. As such we can preview the proposed script as it is being developed. Now we go through and add the increments.

Code:
cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }' 
JoeWolfhound=$(($JoeWolfhound+1))
JoeWolfhound=$(($JoeWolfhound+1))
JoeBeagle=$(($JoeBeagle+1))
MaryPug=$(($MaryPug+1))
MaryDalmation=$(($MaryDalmation+1))
JoeChihuahua=$(($JoeChihuahua+1))
MaryBoxer=$(($MaryBoxer+1))
JaneHusky=$(($JaneHusky+1))
JaneHusky=$(($JaneHusky+1))
JoeBulldog=$(($JoeBulldog+1))
JaneRidgeback=$(($JaneRidgeback+1))
MaryMalamute=$(($MaryMalamute+1))
MaryBoxer=$(($MaryBoxer+1))
JoeChow=$(($JoeChow+1))
PaulDoberman=$(($PaulDoberman+1))
PaulDoberman=$(($PaulDoberman+1))
PaulBernese=$(($PaulBernese+1))
Now we will put this into array element number 2.

Code:
for i in $(cat list) ; do COMMAND[2]=$( cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }' ; echo " " ) ; done
Now we need a unique listing of the variables that have been built with the commands to display the results.

Code:
cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}'
echo JoeWolfhound $JoeWolfhound
echo JoeBeagle $JoeBeagle
echo MaryPug $MaryPug
echo MaryDalmation $MaryDalmation
echo JoeChihuahua $JoeChihuahua
echo MaryBoxer $MaryBoxer
echo JaneHusky $JaneHusky
echo JoeBulldog $JoeBulldog
echo JaneRidgeback $JaneRidgeback
echo MaryMalamute $MaryMalamute
echo MaryBoxer $MaryBoxer
echo JoeChow $JoeChow
echo PaulDoberman $PaulDoberman
echo PaulBernese $PaulBernese
Again we will put this into the array, as element number 3.

Code:
for i in $(cat list) ; do COMMAND[3]=$( cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}' ; echo " " ) ; done
So now we just check the entire array, and we have a self-built script.

Code:
echo "${COMMAND[@]}"
JoeWolfhound=0
JoeWolfhound=0
JoeBeagle=0
MaryPug=0
MaryDalmation=0
JoeChihuahua=0
MaryBoxer=0
JaneHusky=0
JaneHusky=0
JoeBulldog=0
JaneRidgeback=0
MaryMalamute=0
MaryBoxer=0
JoeChow=0
PaulDoberman=0
PaulDoberman=0
PaulBernese=0
  JoeWolfhound=$(($JoeWolfhound+1))
JoeWolfhound=$(($JoeWolfhound+1))
JoeBeagle=$(($JoeBeagle+1))
MaryPug=$(($MaryPug+1))
MaryDalmation=$(($MaryDalmation+1))
JoeChihuahua=$(($JoeChihuahua+1))
MaryBoxer=$(($MaryBoxer+1))
JaneHusky=$(($JaneHusky+1))
JaneHusky=$(($JaneHusky+1))
JoeBulldog=$(($JoeBulldog+1))
JaneRidgeback=$(($JaneRidgeback+1))
MaryMalamute=$(($MaryMalamute+1))
MaryBoxer=$(($MaryBoxer+1))
JoeChow=$(($JoeChow+1))
PaulDoberman=$(($PaulDoberman+1))
PaulDoberman=$(($PaulDoberman+1))
PaulBernese=$(($PaulBernese+1))
  echo JoeWolfhound $JoeWolfhound
echo JoeBeagle $JoeBeagle
echo MaryPug $MaryPug
echo MaryDalmation $MaryDalmation
echo JoeChihuahua $JoeChihuahua
echo MaryBoxer $MaryBoxer
echo JaneHusky $JaneHusky
echo JoeBulldog $JoeBulldog
echo JaneRidgeback $JaneRidgeback
echo MaryMalamute $MaryMalamute
echo MaryBoxer $MaryBoxer
echo JoeChow $JoeChow
echo PaulDoberman $PaulDoberman
echo PaulBernese $PaulBernese
Now we simply pipe the array into bash. We will add the unset command to flush out the variables and start over.

Code:
unset COMMAND
for i in $(cat list) ; do COMMAND[1]=$( cat list |awk '{print $1 $2 "='0'" }' ; echo " " ) ; done
for i in $(cat list) ; do COMMAND[2]=$( cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }' ; echo " " ) ; done
for i in $(cat list) ; do COMMAND[3]=$( cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}' ; echo " " ) ; done
echo "${COMMAND[@]}"| bash | sort
JaneHusky 2
JaneRidgeback 1
JoeBeagle 1
JoeBulldog 1
JoeChihuahua 1
JoeChow 1
JoeWolfhound 2
MaryBoxer 2
MaryBoxer 2
MaryDalmation 1
MaryMalamute 1
MaryPug 1
PaulBernese 1
PaulDoberman 2
OK, so it would take a bit more work to print "Joe Wolfhound" instead of "JoeWolfhound", but you get the idea. Why write code to do the work when you can write code to figure out the work for you?

Last edited by joec@home; 09-04-2010 at 10:56 PM.
 
09-04-2010, 10:58 PM   #5
syg00
LQ Veteran

Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,840

Rep: Reputation: 1823
Better hope there aren't any (blank separated) names - say "German Shepherd"
(referencing the awk offerings)

Last edited by syg00; 09-04-2010 at 10:59 PM. Reason: reference
 
09-04-2010, 11:09 PM   #6
HuJo
LQ Newbie
 
Registered: Sep 2010
Posts: 5

Original Poster
Rep: Reputation: 0
Wow

You guys are fast... these are greatly helpful!
So I'm looking at the response provided by crts (you rock, BTW) to try and understand what is happening and I'm wrapped around the axle on this:
Code:
    if (owner[$1 $2] == 0) {
is this a multidimensional array? (with no comma?) I'm new to awk (which should be obvious) and don't quite understand how this works.

Thanks again!
 
09-04-2010, 11:27 PM   #7
crts
Senior Member

Registered: Jan 2010
Posts: 1,606

Rep: Reputation: 448
Quote:
Originally Posted by HuJo
You guys are fast... these are greatly helpful!
So I'm looking at the response provided by crts (you rock, BTW) to try and understand what is happening and I'm wrapped around the axle on this:
Code:
    if (owner[$1 $2] == 0) {
is this a multidimensional array? (with no comma?) I'm new to awk (which should be obvious) and don't quite understand how this works.

Thanks again!
Hi,

this is not a multidimensional array; $1 $2 is plain string concatenation. It creates an array element whose index is whatever is in $1 followed by whatever is in $2. So when the first line
Joe Wolfhound

is processed, the index is
JoeWolfhound

You can access the array's value at this index with
owner["JoeWolfhound"]

If this element does not exist yet, awk creates it on first reference with an empty value that compares equal to 0. In that case we increment the value and hence mark this combination as counted. The next time we see Joe and his wolfhound the comparison evaluates to false and the pair does not get counted again. The counting itself happens in the array breed.
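
For anyone who wants to poke at this, here is a small sketch of the same "mark it as seen, then count" idea. It is a variation, not the code posted above: it assumes exactly two fields per line, uses the comma form (which joins the fields with SUBSEP so that "Jo eWolf" and "Joe Wolf" stay distinct), and tests existence with "in". The names distinct_pairs, seen and count are made up for illustration.

```shell
#!/bin/sh
# distinct_pairs is a hypothetical helper name, not something from the thread.
# It reads "owner breed" lines on stdin and prints "owner distinct_breed_count".
distinct_pairs() {
    awk '
    {
        # ($1,$2) builds the key $1 SUBSEP $2, so the two fields cannot run together.
        # "in" only tests whether the element exists; it does not create it.
        if (!(($1,$2) in seen)) {
            seen[$1,$2] = 1     # mark this owner/breed pair as counted
            count[$1]++         # one more distinct breed for this owner
        }
    }
    END { for (name in count) print name, count[name] }'
}

# The order of "for (name in count)" is unspecified, hence the sort.
printf '%s\n' 'Joe Wolfhound' 'Joe Wolfhound' 'Joe Beagle' 'Mary Pug' | distinct_pairs | sort
# prints:
# Joe 2
# Mary 1
```

The plain concatenation in owner[$1 $2] behaves the same way on the sample data; the comma form only matters when two different pairs could glue together into the same string.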

Hope this clears things up a bit.

Last edited by crts; 09-04-2010 at 11:39 PM.
 
1 member found this post helpful.
09-04-2010, 11:30 PM   #8
joec@home
Member
 
Registered: Sep 2009
Location: Galveston Tx
Posts: 291

Rep: Reputation: 70
Quote:
Originally Posted by syg00
Better hope there aren't any (blank separated) names - say "German Shepherd"
(referencing the awk offerings)
Oh, we could go on about all the different problems that can happen with that one. On something I actually took the time to develop, I had to write a subroutine that checked for CaSeSeNsItIvE names so that the variables were not duplicated in different cases.
 
09-04-2010, 11:32 PM   #9
crts
Senior Member

Registered: Jan 2010
Posts: 1,606

Rep: Reputation: 448
Quote:
Originally Posted by syg00
Better hope there aren't any (blank separated) names - say "German Shepherd"
(referencing the awk offerings)
Ok,

this also copes with "German Shepherds"
Code:
awk '
{
    if (owner[$0] == 0) {
       owner[$0]++;
       breed[$1]+=1;
    }
}
END {
    for (i in breed) {
       if (i != "") {
          print breed[i],i;
       }
   }
}' infile
Tested with
Code:
Joe Wolfhound
Joe Wolfhound
Joe Beagle
Mary Pug
Mary Dalmation
Joe Chihuahua
Mary Boxer
Jane Husky
Jane Husky
Joe Bulldog
Jane Ridgeback
Mary Malamute
Mary Boxer
Joe Chow
Paul Doberman
Paul Doberman
Paul Bernese
Joe German Shepard
Joe German Doberman
I think I made the last breed up. I just needed some sample data. Now Joe has 7 breeds.

Last edited by crts; 09-04-2010 at 11:35 PM. Reason: array initialisation removed
 
1 member found this post helpful.
09-04-2010, 11:54 PM   #10
HuJo
LQ Newbie
 
Registered: Sep 2010
Posts: 5

Original Poster
Rep: Reputation: 0
Awesome. That makes total sense to me now. I hadn't seen that trick in any of the on-line awk tutorials yet. Thanks again!

Just for the record (not that it makes a difference):
- I'm working mostly with csv files so the space isn't an issue (I just made up the example input file to make the question easier), but it's good to know anyway.
- Dobermans are actually a breed of German origin I believe... nice coincidence!
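
Since CSV input came up, here is a rough sketch of how the same deduplication might look with a comma as the field separator. It assumes a very simple layout that is not confirmed anywhere in the thread: owner in column 1, breed in column 2, no header row, and no quoted fields containing commas. The name count_breeds_csv is made up for illustration.

```shell
#!/bin/sh
# count_breeds_csv is a hypothetical helper; the column layout is an assumption.
# Reads "owner,breed" lines on stdin and prints "owner,distinct_breed_count".
count_breeds_csv() {
    awk -F, '
    # The key is owner, separator, breed; the block runs only the first time a pair appears.
    !seen[$1 FS $2]++ { breeds[$1]++ }
    END { for (owner in breeds) print owner "," breeds[owner] }'
}

# A breed name containing a space is fine because the comma is the only separator.
printf '%s\n' 'Joe,Wolfhound' 'Joe,Wolfhound' 'Joe,German Shepherd' 'Mary,Pug' | count_breeds_csv | sort
# prints:
# Joe,2
# Mary,1
```

Real-world CSV with quoted commas would need a proper CSV parser rather than -F, alone.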
 
09-05-2010, 12:46 AM   #11
grail
LQ Guru

Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,253

Rep: Reputation: 2686
Well here is one that should satisfy even syg00
Code:
awk '!a[$0]++{b[$1]++}END{for(x in b) print x,"has",b[x],"breeds"}' infile
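
For readers new to the !a[$0]++ idiom, the one-liner above can be unrolled roughly as follows. This is only a sketch of the same logic with longer names (count_distinct, seen and per_owner are made up for illustration), wrapped in a shell function so it can read either a file or standard input.

```shell
#!/bin/sh
# count_distinct is a hypothetical wrapper name, not part of the original post.
count_distinct() {
    awk '
    {
        # seen[$0]++ yields the old value and then increments it:
        # 0 the first time a whole line appears, non-zero on every repeat.
        if (seen[$0]++ == 0)
            per_owner[$1]++     # first sighting of this line: one more breed for the owner
    }
    END {
        for (owner in per_owner)
            print owner, "has", per_owner[owner], "breeds"
    }' "$@"
}

# Keying on the whole line ($0) means multi-word breeds are handled too.
printf '%s\n' 'Joe German Shepherd' 'Joe German Shepherd' 'Joe German Doberman' 'Mary Pug' | count_distinct | sort
# prints:
# Joe has 2 breeds
# Mary has 1 breeds
```

Calling count_distinct infile would process a file instead of standard input.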
 
1 member found this post helpful.

All times are GMT -5.