LinuxQuestions.org - [SOLVED] awk distinct count

awk distinct count

So, I've learned a great deal from the responses to my previous posting, particularly about arrays in awk (mighty thanks to all who responded). However, I'm now stuck on something I think calls for a nested loop, although I'm sure there's probably a much simpler solution.

Let's say I have a file like the following:

Joe Wolfhound
Joe Wolfhound
Joe Beagle
Mary Pug
Mary Dalmation
Joe Chihuahua
Mary Boxer
Jane Husky
Jane Husky
Joe Bulldog
Jane Ridgeback
Mary Malamute
Mary Boxer
Joe Chow
Paul Doberman
Paul Doberman
Paul Bernese

How do I find the number of breeds each person owns? I'm able to determine the number of unique breeds:
{ breed[$2] += 1 } END { for (i in breed) print i, breed[i] }

as well as the number of dogs per owner, but determining the number of distinct breeds per owner is giving me problems.

Thanks all in advance!

Hi,

this worked for me

Code:

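awk '
BEGIN {
    # Seed both arrays; the "" entry is skipped again in END.
    breed[""] = 0; owner[""] = 0
}
{
    # First time this owner/breed pair shows up: mark it and count it.
    if (owner[$1 $2] == 0) {
        owner[$1 $2]++
        breed[$1] += 1
    }
}
END {
    # Print the number of distinct breeds followed by the owner.
    for (i in breed) {
        if (i != "") {
            print breed[i], i
        }
    }
}' infile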

Code:

awk '{a=$1;$1=""}(!(breeds[a]~$0 )){
 breeds[a]=breeds[a]"|"$0}
END{for(i in breeds){m=split(breeds[i],b,"|"); print i,breeds[i],m-1 }}' file

And if you have Python,

Code:

from collections import defaultdict

breeds = defaultdict(list)
with open("file") as f:
    for line in f:
        # split on the first space only, so the breed may contain blanks
        name, breed = line.rstrip().split(" ", 1)
        breeds[name].append(breed)

for k, v in breeds.items():
    print(k, len(set(v)))

output

Code:

$ ./python.py
Joe 5
Mary 4
Jane 2
Paul 2

I have been playing with the concept of embedding LISP-style commands in bash. LISP is an early artificial intelligence programming language. The concept is to write scripts that write their own scripts. So let's start by declaring the variables so they are defined as integers with the value of zero. Your list is in a file called "list".

Code:

cat list |awk '{print $1 $2 "='0'" }'

JoeWolfhound=0
JoeWolfhound=0
JoeBeagle=0
MaryPug=0
MaryDalmation=0
JoeChihuahua=0
MaryBoxer=0
JaneHusky=0
JaneHusky=0
JoeBulldog=0
JaneRidgeback=0
MaryMalamute=0
MaryBoxer=0
JoeChow=0
PaulDoberman=0
PaulDoberman=0
PaulBernese=0

Now for a problem this complex we cannot simply pipe each command to bash, because the variables are local to each bash session and are lost as soon as that session ends. So we store the generated lines in element 1 of an array that lives in the current shell. The echo " " at the end is needed for one last line break.

Code:

for i in $(cat list) ; do COMMAND[1]=$( cat list |awk '{print $1 $2 "='0'" }' ; echo " " ) ; done

The cool thing is that instead of writing all this code, we are generating the code by examining the data. As such we can preview the proposed script as it is being developed. Now we go through and generate the statements that add up the counts.

Code:

cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }'

JoeWolfhound=$(($JoeWolfhound+1))
JoeWolfhound=$(($JoeWolfhound+1))
JoeBeagle=$(($JoeBeagle+1))
MaryPug=$(($MaryPug+1))
MaryDalmation=$(($MaryDalmation+1))
JoeChihuahua=$(($JoeChihuahua+1))
MaryBoxer=$(($MaryBoxer+1))
JaneHusky=$(($JaneHusky+1))
JaneHusky=$(($JaneHusky+1))
JoeBulldog=$(($JoeBulldog+1))
JaneRidgeback=$(($JaneRidgeback+1))
MaryMalamute=$(($MaryMalamute+1))
MaryBoxer=$(($MaryBoxer+1))
JoeChow=$(($JoeChow+1))
PaulDoberman=$(($PaulDoberman+1))
PaulDoberman=$(($PaulDoberman+1))
PaulBernese=$(($PaulBernese+1))

Now we will put this into element 2 of the array.

Code:

for i in $(cat list) ; do COMMAND[2]=$( cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }' ; echo " " ) ; done

Now we need a unique listing of the variables that have been built with the commands to display the results.

Code:

cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}'

echo JoeWolfhound $JoeWolfhound
echo JoeBeagle $JoeBeagle
echo MaryPug $MaryPug
echo MaryDalmation $MaryDalmation
echo JoeChihuahua $JoeChihuahua
echo MaryBoxer $MaryBoxer
echo JaneHusky $JaneHusky
echo JoeBulldog $JoeBulldog
echo JaneRidgeback $JaneRidgeback
echo MaryMalamute $MaryMalamute
echo MaryBoxer $MaryBoxer
echo JoeChow $JoeChow
echo PaulDoberman $PaulDoberman
echo PaulBernese $PaulBernese

Again we will put this into the array, this time as element 3.

Code:

for i in $(cat list) ; do COMMAND[3]=$( cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}' ; echo " " ) ; done

So now, if we just print the entire array, we have a self-built script.

Code:

echo "${COMMAND[@]}"

JoeWolfhound=0
JoeWolfhound=0
JoeBeagle=0
MaryPug=0
MaryDalmation=0
JoeChihuahua=0
MaryBoxer=0
JaneHusky=0
JaneHusky=0
JoeBulldog=0
JaneRidgeback=0
MaryMalamute=0
MaryBoxer=0
JoeChow=0
PaulDoberman=0
PaulDoberman=0
PaulBernese=0
  JoeWolfhound=$(($JoeWolfhound+1))
JoeWolfhound=$(($JoeWolfhound+1))
JoeBeagle=$(($JoeBeagle+1))
MaryPug=$(($MaryPug+1))
MaryDalmation=$(($MaryDalmation+1))
JoeChihuahua=$(($JoeChihuahua+1))
MaryBoxer=$(($MaryBoxer+1))
JaneHusky=$(($JaneHusky+1))
JaneHusky=$(($JaneHusky+1))
JoeBulldog=$(($JoeBulldog+1))
JaneRidgeback=$(($JaneRidgeback+1))
MaryMalamute=$(($MaryMalamute+1))
MaryBoxer=$(($MaryBoxer+1))
JoeChow=$(($JoeChow+1))
PaulDoberman=$(($PaulDoberman+1))
PaulDoberman=$(($PaulDoberman+1))
PaulBernese=$(($PaulBernese+1))
  echo JoeWolfhound $JoeWolfhound
echo JoeBeagle $JoeBeagle
echo MaryPug $MaryPug
echo MaryDalmation $MaryDalmation
echo JoeChihuahua $JoeChihuahua
echo MaryBoxer $MaryBoxer
echo JaneHusky $JaneHusky
echo JoeBulldog $JoeBulldog
echo JaneRidgeback $JaneRidgeback
echo MaryMalamute $MaryMalamute
echo MaryBoxer $MaryBoxer
echo JoeChow $JoeChow
echo PaulDoberman $PaulDoberman
echo PaulBernese $PaulBernese

Now we simply pipe the array into bash. We will add the unset command to flush out the variables and start over.

Code:

unset COMMAND
for i in $(cat list) ; do COMMAND[1]=$( cat list |awk '{print $1 $2 "='0'" }' ; echo " " ) ; done
for i in $(cat list) ; do COMMAND[2]=$( cat list |awk '{print $1 $2 "=$(($"$1 $2 "+1))" }' ; echo " " ) ; done
for i in $(cat list) ; do COMMAND[3]=$( cat list |awk '{print $1 $2}' | uniq | awk '{ print "echo "$0 " $"$0}' ; echo " " ) ; done
echo "${COMMAND[@]}"| bash | sort

JaneHusky 2
JaneRidgeback 1
JoeBeagle 1
JoeBulldog 1
JoeChihuahua 1
JoeChow 1
JoeWolfhound 2
MaryBoxer 2
MaryBoxer 2
MaryDalmation 1
MaryMalamute 1
MaryPug 1
PaulBernese 1
PaulDoberman 2

OK, so it would take a bit more work to print "Joe Wolfhound" instead of "JoeWolfhound", but you get the idea. Why write code to do the work, when you can write code to figure out the work for you?
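For comparison, the generated script above ends up tallying how often each owner/breed pair occurs, which is one step short of the distinct-breeds-per-owner count asked for. Here is a minimal Python sketch of that tally and of one way the per-owner count could be derived from it. The function names and the sample list are made up for illustration and are not part of the post above.

```python
from collections import Counter

def pair_tally(lines):
    """Count how many times each (owner, breed) pair occurs."""
    pairs = Counter()
    for line in lines:
        # partition splits on the first blank only, so "German Shepherd" stays whole
        owner, _, breed = line.strip().partition(" ")
        if owner and breed:
            pairs[(owner, breed)] += 1
    return pairs

def breeds_per_owner_from_tally(pairs):
    """Each distinct pair contributes exactly one breed to its owner."""
    return dict(Counter(owner for owner, _ in pairs))

# Hypothetical sample data
sample = ["Joe Wolfhound", "Joe Wolfhound", "Joe Beagle", "Mary Pug"]
tally = pair_tally(sample)
print(tally[("Joe", "Wolfhound")])         # 2
print(breeds_per_owner_from_tally(tally))  # {'Joe': 2, 'Mary': 1}
```

The first number matches what the generated bash script reports per pair. The second dictionary is the figure the original question was after.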

Better hope there aren't any (blank separated) names - say "German Shepherd"
(referencing the awk offerings)

You guys are fast... these are greatly helpful!
So I'm looking at the response provided by crts (you rock, BTW) to try and understand what is happening and I'm wrapped around the axle on this:

Code:

    if (owner[$1 $2] == 0) {

is this a multidimensional array? (with no comma?) I'm new to awk (which should be obvious) and don't quite understand how this works.

Thanks again!

Quote:

Originally Posted by HuJo (Post 4088358)

    if (owner[$1 $2] == 0) {

is this a multidimensional array? (with no comma?)

Hi,

no, it is an ordinary one-dimensional array. Writing $1 and $2 next to each other concatenates them into a single string, and that string is used as the index. So when the first line
Joe Wolfhound

is processed, it will create
JoeWolfhound

as the index. You can access the array's value at this index with
owner["JoeWolfhound"]

If this element does not exist yet, awk creates it on first reference with an empty value that compares equal to 0. In that case we increment the value and hence mark this combination as counted. The next time we see Joe and his wolfhound, the comparison evaluates to false and the pair does not get counted again. The counting itself happens in the array breed.

Hope this clears things up a bit.
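To make the index construction concrete, here is a small Python sketch of the same idea. It is only an illustration: concat_key and pair_key are hypothetical helper names, and the dictionary seen stands in for the awk array owner. It also shows one caveat of plain concatenation, namely that two different field pairs can yield the same string. In awk, the comma form owner[$1,$2] avoids this by joining the subscripts with SUBSEP.

```python
def concat_key(owner, breed):
    """Key built the way owner[$1 $2] builds it: plain concatenation."""
    return owner + breed

def pair_key(owner, breed):
    """Key that keeps the two fields apart, similar in spirit to owner[$1,$2]."""
    return (owner, breed)

seen = {}  # stands in for the awk array "owner"
for owner, breed in [("Joe", "Wolfhound"), ("Joe", "Wolfhound"), ("Joe", "Beagle")]:
    key = concat_key(owner, breed)
    # dict.get(key, 0) imitates awk treating a missing element as 0
    if seen.get(key, 0) == 0:
        seen[key] = 1
print(sorted(seen))  # ['JoeBeagle', 'JoeWolfhound']

# Without a separator, two different pairs can produce the same key:
print(concat_key("Ann", "abel") == concat_key("Anna", "bel"))  # True
print(pair_key("Ann", "abel") == pair_key("Anna", "bel"))      # False
```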

Quote:

Originally Posted by syg00 (Post 4088353)

Better hope there aren't any (blank separated) names - say "German Shepherd"
(referencing the awk offerings)

Oh, we could go on about all the different problems that can happen with that one. On something I actually took the time to develop, I had to write a subroutine that checked for CaSe SeNsItIvItY so that the variables were not duplicated in different cases.

Quote:

Originally Posted by syg00 (Post 4088353)

Better hope there aren't any (blank separated) names - say "German Shepherd"
(referencing the awk offerings)

Ok,

this also copes with "German Shepherds"

Code:

awk '
{
    if (owner[$0] == 0) {
        owner[$0]++
        breed[$1] += 1
    }
}
END {
    for (i in breed) {
        if (i != "") {
            print breed[i], i
        }
    }
}' infile

Tested with

Code:

Joe Wolfhound
Joe Wolfhound
Joe Beagle
Mary Pug
Mary Dalmation
Joe Chihuahua
Mary Boxer
Jane Husky
Jane Husky
Joe Bulldog
Jane Ridgeback
Mary Malamute
Mary Boxer
Joe Chow
Paul Doberman
Paul Doberman
Paul Bernese
Joe German Shepard
Joe German Doberman

I think I made the last breed up. Just needed some sample data. Now Joe has 7 breeds.

Awesome. That makes total sense to me now. I hadn't seen that trick in any of the on-line awk tutorials yet. Thanks again!

Just for the record (not that it makes a difference):
- I'm working mostly with csv files so the space isn't an issue (I just made up the example input file to make the question easier), but it's good to know anyway.
- Dobermans are actually a breed of German origin I believe... nice coincidence! :)
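Since the real data is said to be CSV, here is a hedged Python sketch of the same distinct count using the standard csv module. The column layout (owner first, breed second, no header row) is an assumption, as are the function name and the in-memory sample. The actual files may well look different.

```python
import csv
import io
from collections import defaultdict

def distinct_breeds_csv(handle):
    """Return {owner: number of distinct breeds} for rows of the form owner,breed."""
    breeds = defaultdict(set)
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        owner, breed = row[0].strip(), row[1].strip()
        breeds[owner].add(breed)  # a set keeps each breed once per owner
    return {owner: len(found) for owner, found in breeds.items()}

# Hypothetical sample; a real file would be passed as open(path, newline="")
sample = io.StringIO("Joe,Wolfhound\nJoe,Wolfhound\nJoe,German Shepherd\nMary,Pug\n")
print(distinct_breeds_csv(sample))  # {'Joe': 2, 'Mary': 1}
```

Because the fields are split on commas rather than blanks, a breed such as "German Shepherd" needs no special handling here.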

Well here is one that should satisfy even syg00 ;)

Code:

awk '!a[$0]++{b[$1]++}END{for(x in b) print x,"has",b[x],"breeds"}' infile
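The one-liner relies on a[$0]++ returning the old value, so !a[$0]++ is true only the first time a given line is seen. Spelled out in Python as a rough sketch, the same logic looks like this. The names breeds_per_owner, seen and counts are chosen here for illustration, and blank lines are skipped, which differs slightly from the awk version.

```python
from collections import defaultdict

def breeds_per_owner(lines):
    """Count distinct breeds per owner, treating each whole line as the key."""
    seen = set()               # plays the role of a[$0]
    counts = defaultdict(int)  # plays the role of b[$1]
    for raw in lines:
        line = raw.rstrip("\n")
        fields = line.split()
        if not fields:
            continue           # blank lines are skipped here (the awk version counts them)
        if line not in seen:   # same test as !a[$0]++ on first sight of a line
            seen.add(line)
            counts[fields[0]] += 1
    return dict(counts)

# Hypothetical sample in the thread's format
sample = [
    "Joe Wolfhound", "Joe Wolfhound",
    "Joe German Shepard", "Joe German Doberman",
    "Mary Boxer", "Mary Boxer",
]
for owner, n in breeds_per_owner(sample).items():
    print(owner, "has", n, "breeds")
# Joe has 3 breeds
# Mary has 1 breeds
```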