LinuxQuestions.org


HuJo 09-04-2010 08:00 PM

awk distinct count
 
So, I've learned a great deal from the responses to my previous posting, particularly about arrays in awk (mighty thanks to all who responded). However, I'm now stuck trying to develop something that I think needs a nested loop, although I'm sure there's probably a much simpler solution.

Let's say I have a file like the following:

Joe Wolfhound
Joe Wolfhound
Joe Beagle
Mary Pug
Mary Dalmation
Joe Chihuahua
Mary Boxer
Jane Husky
Jane Husky
Joe Bulldog
Jane Ridgeback
Mary Malamute
Mary Boxer
Joe Chow
Paul Doberman
Paul Doberman
Paul Bernese

How do I find the number of breeds each person owns? I'm able to determine the number of unique breeds:
{ breed[$2] += 1 } END { for (i in breed) print i, breed[i] }

as well as the number of dogs per owner, but determining the number of distinct breeds per owner is giving me problems.

Thanks all in advance!

crts 09-04-2010 09:02 PM

Hi,

this worked for me
Code:

awk '
BEGIN {
    breed[""]=0; owner[""]=0;
}
{
    if (owner[$1 $2] == 0) {
      owner[$1 $2]++;
      breed[$1]+=1;
    }
}
END {
    for (i in breed) {
      if (i != "") {
          print breed[i],i;
      }
    }
}' infile


ghostdog74 09-04-2010 09:41 PM

Code:

awk '{a=$1;$1=""}(!(breeds[a]~$0 )){
 breeds[a]=breeds[a]"|"$0}
END{for(i in breeds){m=split(breeds[i],b,"|"); print i,breeds[i],m-1 }}' file

And if you have Python,

Code:

from collections import defaultdict

# Map each owner to the list of breeds seen for that owner.
breeds = defaultdict(list)
with open("file") as f:
    for line in f:
        name, breed = line.rstrip().split(" ", 1)
        breeds[name].append(breed)
# Python 3: items() replaces iteritems(), and print is a function.
for k, v in breeds.items():
    print(k, len(set(v)))

output
Code:

$ ./python.py
Joe 5
Mary 4
Jane 2
Paul 2


joec@home 09-04-2010 09:53 PM

I have been playing with the concept of embedding LISP-style commands in bash. LISP is an early artificial intelligence programming language. The concept is to write scripts that write their own scripts. So let's start by declaring the variables so they are defined as integers with the value of zero. Your list is in a file called "list".

Code:

cat list |awk '{print $1 $2 "='0'" }'
JoeWolfhound=0
JoeWolfhound=0
JoeBeagle=0
MaryPug=0
MaryDalmation=0
JoeChihuahua=0
MaryBoxer=0
JaneHusky=0
JaneHusky=0
JoeBulldog=0
JaneRidgeback=0
MaryMalamute=0
MaryBoxer=0
JoeChow=0
PaulDoberman=0
PaulDoberman=0
PaulBernese=0

Now, for a problem this complex we cannot simply pipe each command to bash on its own, because the variables are local to that bash session and are lost as soon as it ends. So we collect the generated code in an array, COMMAND, starting with element 1. The echo " " at the end is needed because command substitution strips trailing newlines; it leaves one last line break at the end of the element.

Code:

COMMAND[1]=$( cat list |awk '{print $1 $2 "='0'" }' ; echo " " )

The cool thing is that instead of writing all this code, we are generating the code by examining the data. As such, we can preview the proposed script as it is being developed. Next we generate the statements that increment each variable.

Code:

cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }'
JoeWolfhound=$(($JoeWolfhound+1))
JoeWolfhound=$(($JoeWolfhound+1))
JoeBeagle=$(($JoeBeagle+1))
MaryPug=$(($MaryPug+1))
MaryDalmation=$(($MaryDalmation+1))
JoeChihuahua=$(($JoeChihuahua+1))
MaryBoxer=$(($MaryBoxer+1))
JaneHusky=$(($JaneHusky+1))
JaneHusky=$(($JaneHusky+1))
JoeBulldog=$(($JoeBulldog+1))
JaneRidgeback=$(($JaneRidgeback+1))
MaryMalamute=$(($MaryMalamute+1))
MaryBoxer=$(($MaryBoxer+1))
JoeChow=$(($JoeChow+1))
PaulDoberman=$(($PaulDoberman+1))
PaulDoberman=$(($PaulDoberman+1))
PaulBernese=$(($PaulBernese+1))

Now we will put this into array element number 2.

Code:

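COMMAND[2]=$( cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }' ; echo " " )

Now we need a listing of the variables that have been built, with the commands to display the results. Note that uniq only removes adjacent duplicates, so MaryBoxer still appears twice in the listing; sorting first (for example with sort -u) would make it truly unique.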

Code:

cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}'
echo JoeWolfhound $JoeWolfhound
echo JoeBeagle $JoeBeagle
echo MaryPug $MaryPug
echo MaryDalmation $MaryDalmation
echo JoeChihuahua $JoeChihuahua
echo MaryBoxer $MaryBoxer
echo JaneHusky $JaneHusky
echo JoeBulldog $JoeBulldog
echo JaneRidgeback $JaneRidgeback
echo MaryMalamute $MaryMalamute
echo MaryBoxer $MaryBoxer
echo JoeChow $JoeChow
echo PaulDoberman $PaulDoberman
echo PaulBernese $PaulBernese

Again we will put this into the array, as element number 3.

Code:

COMMAND[3]=$( cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}' ; echo " " )

So now we just check the entire array, and we have a self-built script.

Code:

echo "${COMMAND[@]}"
JoeWolfhound=0
JoeWolfhound=0
JoeBeagle=0
MaryPug=0
MaryDalmation=0
JoeChihuahua=0
MaryBoxer=0
JaneHusky=0
JaneHusky=0
JoeBulldog=0
JaneRidgeback=0
MaryMalamute=0
MaryBoxer=0
JoeChow=0
PaulDoberman=0
PaulDoberman=0
PaulBernese=0
  JoeWolfhound=$(($JoeWolfhound+1))
JoeWolfhound=$(($JoeWolfhound+1))
JoeBeagle=$(($JoeBeagle+1))
MaryPug=$(($MaryPug+1))
MaryDalmation=$(($MaryDalmation+1))
JoeChihuahua=$(($JoeChihuahua+1))
MaryBoxer=$(($MaryBoxer+1))
JaneHusky=$(($JaneHusky+1))
JaneHusky=$(($JaneHusky+1))
JoeBulldog=$(($JoeBulldog+1))
JaneRidgeback=$(($JaneRidgeback+1))
MaryMalamute=$(($MaryMalamute+1))
MaryBoxer=$(($MaryBoxer+1))
JoeChow=$(($JoeChow+1))
PaulDoberman=$(($PaulDoberman+1))
PaulDoberman=$(($PaulDoberman+1))
PaulBernese=$(($PaulBernese+1))
  echo JoeWolfhound $JoeWolfhound
echo JoeBeagle $JoeBeagle
echo MaryPug $MaryPug
echo MaryDalmation $MaryDalmation
echo JoeChihuahua $JoeChihuahua
echo MaryBoxer $MaryBoxer
echo JaneHusky $JaneHusky
echo JoeBulldog $JoeBulldog
echo JaneRidgeback $JaneRidgeback
echo MaryMalamute $MaryMalamute
echo MaryBoxer $MaryBoxer
echo JoeChow $JoeChow
echo PaulDoberman $PaulDoberman
echo PaulBernese $PaulBernese

Now we simply pipe the array into bash. We will add the unset command to flush out the variables and start over.

Code:

unset COMMAND
COMMAND[1]=$( cat list |awk '{print $1 $2 "='0'" }' ; echo " " )
COMMAND[2]=$( cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }' ; echo " " )
COMMAND[3]=$( cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}' ; echo " " )
echo "${COMMAND[@]}"| bash | sort
JaneHusky 2
JaneRidgeback 1
JoeBeagle 1
JoeBulldog 1
JoeChihuahua 1
JoeChow 1
JoeWolfhound 2
MaryBoxer 2
MaryBoxer 2
MaryDalmation 1
MaryMalamute 1
MaryPug 1
PaulBernese 1
PaulDoberman 2

OK, so it would take a bit more work to print "Joe Wolfhound" instead of "JoeWolfhound", but you get the idea. Why write code to do the work when you can write code to figure out the work for you?

syg00 09-04-2010 09:58 PM

Better hope there aren't any (blank separated) names - say "German Shepherd"
(referencing the awk offerings)

HuJo 09-04-2010 10:09 PM

Wow
 
You guys are fast... these are greatly helpful!
So I'm looking at the response provided by crts (you rock, BTW) to try and understand what is happening and I'm wrapped around the axle on this:
Code:


    if (owner[$1 $2] == 0) {

is this a multidimensional array? (with no comma?) I'm new to awk (which should be obvious) and don't quite understand how this works.

Thanks again!

crts 09-04-2010 10:27 PM

Quote:

Originally Posted by HuJo (Post 4088358)
You guys are fast... these are greatly helpful!
So I'm looking at the response provided by crts (you rock, BTW) to try and understand what is happening and I'm wrapped around the axle on this:
Code:


    if (owner[$1 $2] == 0) {

is this a multidimensional array? (with no comma?) I'm new to awk (which should be obvious) and don't quite understand how this works.

Thanks again!

Hi,

This creates an array element whose index is the concatenation of whatever is in $1 and $2. So when the first line
Joe Wolfhound

is processed, it creates
JoeWolfhound

as the index. You can access the array's value at this index with
owner["JoeWolfhound"]

If this element does not exist yet, referencing it creates it with an empty value that compares equal to 0. In that case we increment the value and thereby mark this combination as counted. The next time we see Joe and his wolfhound the comparison evaluates to false, so the pair is not counted again. The counting itself happens in the array breed.

Hope this clears things up a bit.
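
As a rough sketch of the "with no comma?" question: awk also accepts a comma subscript, which joins the parts with the built-in SUBSEP character instead of gluing them together directly. The helper name count_keys and the two input lines are made up for illustration and are not from this thread; they only show a case where plain concatenation could merge two different pairs.

```shell
# Hypothetical helper: count distinct keys built two ways from stdin.
count_keys() {
    awk '
    {
        concat[$1 $2]++      # key is the plain concatenation, e.g. "JoeWolfhound"
        joined[$1, $2]++     # key is $1 SUBSEP $2, so the two fields stay distinguishable
    }
    END {
        for (k in concat) c++
        for (k in joined) j++
        print "concatenated keys:", c
        print "comma keys:", j
    }'
}

# "Ann aBoxer" and "Anna Boxer" both collapse to the concatenated key "AnnaBoxer".
printf 'Ann aBoxer\nAnna Boxer\n' | count_keys
```

With the sample data in this thread both forms would give the same counts; the comma form only matters when two different owner/breed pairs happen to concatenate to the same string.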

joec@home 09-04-2010 10:30 PM

Quote:

Originally Posted by syg00 (Post 4088353)
Better hope there aren't any (blank separated) names - say "German Shepherd"
(referencing the awk offerings)

Oh, we could go on about all the different problems that can happen with that one. On something I actually took the time to develop, I had to write a subroutine that checked for CaSeSeNsItIvE names so that the variables were not duplicated in different cases.

crts 09-04-2010 10:32 PM

Quote:

Originally Posted by syg00 (Post 4088353)
Better hope there aren't any (blank separated) names - say "German Shepherd"
(referencing the awk offerings)

Ok,

this also copes with "German Shepherds"
Code:

awk '
{
    if (owner[$0] == 0) {
      owner[$0]++;
      breed[$1]+=1;
    }
}
END {
    for (i in breed) {
      if (i != "") {
          print breed[i],i;
      }
    }
}' infile

Tested with
Code:

Joe Wolfhound
Joe Wolfhound
Joe Beagle
Mary Pug
Mary Dalmation
Joe Chihuahua
Mary Boxer
Jane Husky
Jane Husky
Joe Bulldog
Jane Ridgeback
Mary Malamute
Mary Boxer
Joe Chow
Paul Doberman
Paul Doberman
Paul Bernese
Joe German Shepard
Joe German Doberman

I think I made the last breed up. Just needed some sample data. Now Joe has 7 breeds.

HuJo 09-04-2010 10:54 PM

Awesome. That makes total sense to me now. I hadn't seen that trick in any of the on-line awk tutorials yet. Thanks again!

Just for the record (not that it makes a difference):
- I'm working mostly with csv files so the space isn't an issue (I just made up the example input file to make the question easier), but it's good to know anyway.
- Dobermans are actually a breed of German origin I believe... nice coincidence! :)
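
For the csv case, a minimal sketch might look like the following. It assumes the owner is in column 1 and the breed in column 2, and that no field contains a quoted or embedded comma (awk's -F, does not understand CSV quoting). The helper name breeds_per_owner_csv and the sample rows are hypothetical, and the output is piped through sort only because awk's for-in order is unspecified.

```shell
# Hypothetical helper: distinct breeds per owner from comma-separated input.
# Assumes owner in column 1, breed in column 2, and no quoted or embedded commas.
breeds_per_owner_csv() {
    awk -F, '
    !seen[$1, $2]++ { count[$1]++ }   # true only the first time this owner/breed pair appears
    END { for (o in count) print o, count[o] }
    ' "$@" | sort
}

# Breed names with spaces are fine here because the comma is the separator.
printf 'Joe,German Shepherd\nJoe,German Shepherd\nJoe,Beagle\nMary,Pug\n' | breeds_per_owner_csv
```

Called with a filename (for example breeds_per_owner_csv dogs.csv, where dogs.csv is a placeholder) it reads that file instead of standard input.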

grail 09-04-2010 11:46 PM

Well here is one that should satisfy even syg00 ;)
Code:

awk '!a[$0]++{b[$1]++}END{for(x in b) print x,"has",b[x],"breeds"}' infile
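
The one-liner relies on the !a[$0]++ idiom, so here is a spelled-out sketch of what it appears to do. The longer variable names, the helper name breeds_per_owner and the sample input are illustrative choices rather than anything from the post, and the trailing sort is added only to make the order predictable.

```shell
# Hypothetical helper: the same logic as the one-liner, spelled out.
breeds_per_owner() {
    awk '
    {
        # seen[$0] is unset (compares equal to 0) the first time a whole line appears;
        # the post-increment yields that old value, then bumps the stored count.
        if (seen[$0]++ == 0) {
            breeds[$1]++          # count this distinct line once for its owner
        }
    }
    END {
        for (owner in breeds) print owner, "has", breeds[owner], "breeds"
    }
    ' "$@" | sort
}

printf 'Joe Wolfhound\nJoe Wolfhound\nJoe German Shepherd\nMary Pug\n' | breeds_per_owner
```

Because the whole line is the key, two lines that differ only in spacing would be treated as different breeds.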


All times are GMT -5.