LinuxQuestions.org - [SOLVED] AWK associative array problem

AWK associative array problem

The following code works well on a file that I know has 3 fields:

Code:

gawk '
{
        arr1[$1]++
        arr2[$2]++
        arr3[$3]++
}
END {
        for(i in arr1) { print(i,":" ,arr1[i]) }
        print("===============================")
        for(i in arr2) { print(i,":" ,arr2[i]) }
        print("===============================")
        for(i in arr3) { print(i,":" ,arr3[i]) }
        print("===============================")
}' <infile

But is it possible to run this for all fields if the number of fields is not known beforehand? Thanks in advance.

Help us to help you. Provide a sample input file (10-15 lines will do). Construct a sample output file which corresponds to your sample input and post both samples here. With "InFile" and "OutFile" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin

Indeed - as Daniel pointed out - a sample input/output would be useful to fully understand your requirement. Anyway, here is an example using multi-dimensional arrays:

Code:

$ cat file
A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4
$ awk '
{
  for ( i = 1; i <= NF; i++)
    arr[i,$i]++
}

END {
  for ( combined in arr ) {
    split(combined,separated,SUBSEP)
    c[separated[1]]++
    f[separated[1]] ? f[separated[1]] = (f[separated[1]] " " separated[2]) : f[separated[1]] = separated[2]
  }

  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ )
      print _[j],":",arr[i,_[j]]
    print "==============================="
  }
}' file
A1 : 2
B1 : 1
C1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
===============================
A4 : 2
B4 : 1
===============================
B5 : 1
===============================
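
For anyone puzzling over the combined key: awk stores arr[i,$i] under a single string index made of the column number, the value of SUBSEP, and the field text, which is why the END block has to split() the index apart again. Here is a minimal sketch of just that step. The two-line sample input and the function name subsep_demo are made up for illustration and are not from the posts above:

```shell
# Sketch: show that arr[i,$i] is really one string key "i SUBSEP $i".
# subsep_demo is a hypothetical helper name; the input is a made-up sample.
subsep_demo() {
  printf 'A1 A2\nA1 B2\n' | awk '
  {
    for (i = 1; i <= NF; i++)
      arr[i, $i]++                    # same as arr[i SUBSEP $i]++
  }
  END {
    for (combined in arr) {
      split(combined, part, SUBSEP)   # part[1] = column, part[2] = value
      print part[1], part[2], arr[combined]
    }
  }' | LC_ALL=C sort                  # for-in order is unspecified, so sort
}

subsep_demo
```

Each output line shows the column number, the field value and the count, recovered from one combined index.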

I should have provided an input file and output but I thought the problem was very clear. Thanks colucix, for the input file and the answer. This is exactly what I was looking for. I am now trying to understand how your elegant solution works.

The output data is sorted, but only because of the order in which the data appears in the input file. If the input file is:

Code:

B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

The output becomes:

Code:

Z1 : 1
A1 : 2
B1 : 1
C1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
Z1 : 1
A3 : 1
B3 : 1
C3 : 2
===============================
Z1 : 1
A4 : 2
B4 : 1
===============================
B5 : 1
===============================

Can I get sorted output data?

Quote:

Originally Posted by rng (Post 5115253)

Can I get sorted output data?

Sorted by what? Name or count?

Daniel B. Martin

Two-dimensional arrays are a powerful language feature, but that kind of code can be difficult to write and difficult to read. At the expense of execution time, you can read the input multiple times and enjoy the simplicity of one-dimensional arrays.

With this InFile ...

Code:

B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

... this code ...

Code:

nc=$(awk '{if (NF>maxNF) maxNF=NF}; END{print maxNF}' "$InFile")
rm -f "$OutFile"
for (( c=1; c<=$nc; c++ ))
  do
    awk  -v "c=$c" '{if (c<=NF) {name[$c]=$c; count[$c]++}};
      END{for (j in name) print name[j],":",count[j]}' "$InFile" >>"$OutFile"
    echo "=======================" >>"$OutFile"
done

... produced this OutFile ...

Code:

Z1 : 1
A1 : 2
B1 : 1
C1 : 1
=======================
C2 : 1
Z1 : 1
A2 : 1
B2 : 2
=======================
B3 : 1
C3 : 2
Z1 : 1
A3 : 1
=======================
A4 : 2
B4 : 1
Z1 : 1
=======================
B5 : 1
=======================

Daniel B. Martin
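
Since the multi-pass approach already runs one awk per column, one way to get each block sorted by name would be to pipe that column's lines through sort(1). This is a rough sketch under stated assumptions, not a tested replacement for the script above. The function name count_sorted is hypothetical, the output goes to stdout instead of $OutFile, and a plain while loop stands in for the bash-only for (( )) loop:

```shell
# Sketch: one awk pass per column, each block sorted by name with sort(1).
# count_sorted is a hypothetical helper; it takes the input file as $1.
count_sorted() {
  infile=$1
  # Largest number of fields on any line ("+ 0" prints 0 for empty input).
  nc=$(awk 'NF > max { max = NF } END { print max + 0 }' "$infile")
  c=1
  while [ "$c" -le "$nc" ]; do
    awk -v c="$c" 'c <= NF { count[$c]++ }
      END { for (name in count) print name, ":", count[name] }' "$infile" |
      LC_ALL=C sort
    echo "======================="
    c=$((c + 1))
  done
}
```

Called as count_sorted "$InFile" >"$OutFile". To order by count instead of by name, the sort stage could become sort -k3,3nr, because the count is the third whitespace-separated field of each line.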

To sort arrays in awk you may use the asort function (a gawk extension). Following my previous example, you can change the last part to:

Code:

  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
  }

Instead of printing the content of the multi-dimensional array directly, we assign each whole output string to the array a, then sort its content and finally print it out.

Thanks for your reply. The solution looks elegant, but it produces some errors if there are blank fields (lines with fewer fields than others). The program that I tested is:

Code:

#! /bin/bash
awk '
{
  for ( i = 1; i <= NF; i++)
    arr[i,$i]++
}

END {
  for ( combined in arr ) {
    split(combined,separated,SUBSEP)
    c[separated[1]]++
    f[separated[1]] ? f[separated[1]] = (f[separated[1]] " " separated[2]) : f[separated[1]] = separated[2]
  }

  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
  }
}' <datafile.txt

The data files and outputs are as follows (the error lines are marked by me):

Code:



B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

output:

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z1 : 1
===============================
A4 : 2
B4 : 1
Z1 : 1
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
===============================
B4 : 1    <<<<<<<<<<<<< error: this output should not be there
B5 : 1
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
===============================


B1 B2 B3 B4
C1 C2 C3 C4
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

output (all correct):

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z1 : 1
===============================
A4 : 2
B4 : 1
C4 : 1
Z1 : 1
===============================


B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z2 Z3 Z4
A1 A2 A3 A4
A1 B2 C3 A4

output:

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z3 : 1
===============================
A4 : 2
B4 : 1
Z3 : 1    <<<<<<<< error: this output should not be there
Z4 : 1
===============================
B4 : 1    <<<<<<<< error: this output should not be there
B5 : 1
Z3 : 1    <<<<<<<< error: this output should not be there
Z4 : 1    <<<<<<<< error: this output should not be there
===============================


B1 B2 B3 B4 B5
C1 C2 C3 C4 C5
Z1 Z2 Z3 Z4 Z5
A1 A2 A3 A4 A5
A1 B2 C3 A4 Z5

output (all correct):

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z3 : 1
===============================
A4 : 2
B4 : 1
C4 : 1
Z4 : 1
===============================
A5 : 1
B5 : 1
C5 : 1
Z5 : 2
===============================

Oops, you're right! Mine was a greenhorn's mistake: the array a retains the values of the last fields of the (previous) longer lines. Just add a delete statement at the end of the for loop:

Code:

  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
    delete a
  }

Yes, it works perfectly now. Thanks again.
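
The stale-element effect can be reproduced in isolation. The following is a minimal sketch with made-up data, not the thread's script. The function name stale_demo and its 0/1 argument are hypothetical, and elements are counted with a for-in loop so that no gawk-only function is needed:

```shell
# Sketch: an array refilled with fewer items keeps its old extra elements
# unless it is cleared between passes. stale_demo is a hypothetical helper;
# pass 1 to clear the array after each pass, 0 to leave it alone.
stale_demo() {
  awk -v clear="$1" 'BEGIN {
    n[1] = 3; n[2] = 2                 # pass 1 stores 3 items, pass 2 only 2
    for (pass = 1; pass <= 2; pass++) {
      for (j = 1; j <= n[pass]; j++)
        a[j] = "pass" pass "-item" j   # overwrites a[1..n[pass]] only
      count = 0
      for (k in a) count++             # count what the array really holds
      print "pass", pass, "holds", count, "elements"
      if (clear) delete a              # the fix: empty the array
    }
  }'
}

stale_demo 0
stale_demo 1
```

Without the delete, the second pass still reports 3 elements because a[3] survives from the first pass. With it, the second pass reports 2.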

I thought I would just throw my 2 cents in ... colucix's solution is still using a 1-dimensional associative array. The SUBSEP-delimited array could just as easily use any other character not already in the values on either side of the delimiter. However, as of v4, gawk does indeed offer true multi-dimensional arrays (arrays of arrays).
(See the "Arrays of Arrays" section of the gawk manual for more details.)

Using a snippet from the script:

Code:

# current associative array
for ( i = 1; i <= NF; i++)
    arr[i,$i]++

# 2 dimensional array
for ( i = 1; i <= NF; i++)
    arr[i][$i]++

You no longer require the split command and can simply use your for loops to process the array.
The gotcha here would be making sure you sort the second dimension and not the first ;)

For the following test file:

Code:

A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4

I tried the following code:

Code:

#!/bin/bash
gawk '{
        for ( i = 1; i <= NF; i++)
                counter[i][$i]++
}
END {
        for (j=1; j<=length(counter); j++){
                for (i in counter[j]) {
                        print counter[j][i], ":",$counter[j][i];
                }
                print "========================"
        }
}' <$1

But it does not work properly. The counts are correct but not the labels. The output was:

Code:

2 : B2
1 : A1
1 : A1
========================
1 : A1
1 : A1
2 : B2
========================
1 : A1
2 : B2
1 : A1
========================
2 : B2
1 : A1
========================
1 : A1
========================
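
Going by that output, the likely culprit is the print statement. Its first argument, counter[j][i], is the count rather than the label, and $counter[j][i] is parsed as $(counter[j][i]), that is, the field of the last input line whose number equals the count. With A1 B2 C3 A4 as the last line, a count of 1 prints A1 and a count of 2 prints B2. The label is simply the loop index i. Below is a sketch of a corrected version, assuming gawk 4 or later. The function name count_columns and the asorti() call that sorts each column by name are additions for illustration, not part of the post above:

```shell
# Sketch (needs gawk 4+ for arrays of arrays): the label is the subscript,
# the count is the stored value. count_columns is a hypothetical helper
# that takes the input file as $1.
count_columns() {
  gawk '
  {
    for (i = 1; i <= NF; i++)
      counter[i][$i]++                  # counter[column][value] = count
  }
  END {
    ncols = length(counter)             # number of columns seen
    for (col = 1; col <= ncols; col++) {
      n = asorti(counter[col], names)   # names[1..n] = sorted values
      for (k = 1; k <= n; k++)
        print names[k], ":", counter[col][names[k]]
      print "========================"
    }
  }' "$1"
}
```

Called as count_columns infile. Sorting is applied to the second dimension (the values within one column) and never to the column numbers, which is the gotcha mentioned earlier.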

The following solution uses one-dimensional arrays.

With this InFile ...

Code:

A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4

... this awk ...

Code:

awk '{line[NR]=$0" "; if (nc<NF) nc=NF};
  END{for (col=1;col<=nc;col++) {delete count;
      for (row=1;row<=NR;row++)
        {fb=index(line[row]," ");
          if (fb>0) {name=substr(line[row],1,fb-1);
                    line[row]=substr(line[row],fb+1);
                    count[name]++}}
      n=asorti(count,b); for (j=1;j<=n;j++)
      print b[j],":",count[b[j]]
      print "==============="}}' <$InFile >$OutFile

... produced this OutFile ...

Code:

A1 : 2
B1 : 1
C1 : 1
===============
A2 : 1
B2 : 2
C2 : 1
===============
A3 : 1
B3 : 1
C3 : 2
===============
A4 : 2
B4 : 1
===============
B5 : 1
===============

Daniel B. Martin

Thanks for your solution.
How can we use a 2-dimensional array? I suspect it will be easier to understand.

Quote:

Originally Posted by rng (Post 5116598)

Thanks for your solution.
How can we use a 2-dimensional array? I suspect it will be easier to understand.

Other LQ members already posted solutions using 2-dimensional arrays. I did my one-dimensional solution to show a different approach. My computer runs a back-level awk so I cannot use the true two-dimensional language feature shown by others.

Daniel B. Martin
