LinuxQuestions.org

Linux - General: AWK associative array problem (https://www.linuxquestions.org/questions/linux-general-1/awk-associative-array-problem-4175494446/)

rng 02-10-2014 11:34 AM

AWK associative array problem
 
The following code runs well on a file where I know there are 3 fields:
Code:

gawk '
{
        arr1[$1]++
        arr2[$2]++
        arr3[$3]++
}
END {
                for(i in arr1) { print(i,":" ,arr1[i])}
                print("===============================")
                for(i in arr2) { print(i,":" ,arr2[i]) }
                print("===============================")
                for(i in arr3) { print(i,":" ,arr3[i]) }
                print("===============================")
               
}' <infile

But is it possible to run this for all fields if the number of fields is not known beforehand? Thanks in advance.

danielbmartin 02-10-2014 03:14 PM

Help us to help you. Provide a sample input file (10-15 lines will do). Construct a sample output file which corresponds to your sample input and post both samples here. With "InFile" and "OutFile" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin

colucix 02-10-2014 04:46 PM

Indeed - as Daniel pointed out - a sample input/output would be useful to fully understand your requirement. Anyway, here is an example using multi-dimensional arrays:
Code:

$ cat file
A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4

$ awk '
{
  for ( i = 1; i <= NF; i++)
    arr[i,$i]++
}

END {
  for ( combined in arr ) {
    split(combined,separated,SUBSEP)
    c[separated[1]]++
    f[separated[1]] ? f[separated[1]] = (f[separated[1]] " " separated[2]) : f[separated[1]] = separated[2]
  }

  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ )
      print _[j],":",arr[i,_[j]]
    print "==============================="
  }
}' file
A1 : 2
B1 : 1
C1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
===============================
A4 : 2
B4 : 1
===============================
B5 : 1
===============================
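
For readers following along, here is a minimal sketch of how the combined keys behave. The sample data and the function name count_by_column are made up for illustration. arr[i,$i] is an ordinary one-dimensional array whose index is the string i SUBSEP $i, so split() on SUBSEP recovers the column number and the value. The output is piped through sort only because the order of a for-in loop is unspecified.

```shell
# Hypothetical helper: count each value per column, then print the decoded keys.
count_by_column() {
  awk '
  { for (i = 1; i <= NF; i++) arr[i, $i]++ }   # index is the string: i SUBSEP $i
  END {
    for (key in arr) {
      split(key, part, SUBSEP)                 # part[1] = column, part[2] = value
      print "column " part[1] ", value " part[2] ", count " arr[key]
    }
  }' | sort                                    # for-in order is unspecified
}

printf 'A1 A2\nA1 B2\n' | count_by_column
```

With the two sample lines, A1 is counted twice in column 1, while A2 and B2 are counted once each in column 2.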


rng 02-10-2014 06:42 PM

I should have provided an input file and output but I thought the problem was very clear. Thanks colucix, for the input file and the answer. This is exactly what I was looking for. I am now trying to understand how your elegant solution works.

The output data is sorted, but only because of the order in which the data appears in the input file. If the input file is:
Code:

B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

The output becomes:
Code:

Z1 : 1
A1 : 2
B1 : 1
C1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
Z1 : 1
A3 : 1
B3 : 1
C3 : 2
===============================
Z1 : 1
A4 : 2
B4 : 1
===============================
B5 : 1
===============================

Can I get sorted output data?

danielbmartin 02-10-2014 09:28 PM

Quote:

Originally Posted by rng (Post 5115253)
Can I get sorted output data?

Sorted by what? Name or count?

Daniel B. Martin

danielbmartin 02-10-2014 09:36 PM

Two-dimensional arrays are a powerful language feature, but that kind of code can be difficult to write and difficult to read. At the expense of execution time, you may read the input multiple times and enjoy the simplicity of one-dimensional arrays.

With this InFile ...
Code:

B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

... this code ...
Code:

nc=$(awk '{if (NF>maxNF) maxNF=NF}; END{print maxNF}' "$InFile")
rm -f "$OutFile"   # -f: no error if the file does not exist yet
for (( c=1; c<=nc; c++ ))
  do
    awk  -v "c=$c" '{if (c<=NF) {name[$c]=$c; count[$c]++}};
      END{for (j in name) print name[j],":",count[j]}' "$InFile" >>"$OutFile"
    echo "=======================" >>"$OutFile"
done

... produced this OutFile ...
Code:

Z1 : 1
A1 : 2
B1 : 1
C1 : 1
=======================
C2 : 1
Z1 : 1
A2 : 1
B2 : 2
=======================
B3 : 1
C3 : 2
Z1 : 1
A3 : 1
=======================
A4 : 2
B4 : 1
Z1 : 1
=======================
B5 : 1
=======================

Daniel B. Martin

colucix 02-11-2014 02:37 AM

To sort arrays you may use the asort function (a gawk extension, not part of POSIX awk). Following my previous example, you can change the last part into:
Code:

  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
  }

Instead of printing the content of the multi-dimensional array directly, we assign each whole string to the array a, then sort its content and finally print it out.
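
A small sketch of what asort does on its own, assuming gawk is installed. The array contents and the name sort_demo are made up. asort sorts the values in place, renumbers the indices from 1 and returns the number of elements, so the loop can use that return value. Because the elements here are strings, the comparison is a string comparison on the whole "name : count" text, which amounts to sorting by name.

```shell
# Hypothetical demo; requires gawk because asort is a gawk extension.
sort_demo() {
  gawk 'BEGIN {
    a[1] = "Z1 : 1"; a[2] = "B1 : 1"; a[3] = "A1 : 2"
    n = asort(a)                 # values sorted in place, indices become 1..n
    for (j = 1; j <= n; j++) print a[j]
  }'
}

sort_demo
```

The three strings come out in the order A1, B1, Z1.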

rng 02-11-2014 06:30 AM

Thanks for your reply. The solution looks elegant, but it produces some errors if there are blank fields. The program that I tested is:
Code:

#! /bin/bash
awk '
{
  for ( i = 1; i <= NF; i++)
    arr[i,$i]++
}

END {
  for ( combined in arr ) {
    split(combined,separated,SUBSEP)
    c[separated[1]]++
    f[separated[1]] ? f[separated[1]] = (f[separated[1]] " " separated[2]) : f[separated[1]] = separated[2]
  }

 for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
  }
}' <datafile.txt

The data files and outputs are as follows (the error lines are marked by me):
Code:


B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

output:

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z1 : 1
===============================
A4 : 2
B4 : 1
Z1 : 1
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
===============================
B4 : 1    <<<<<<<<<<<<< error: this output should not be there
B5 : 1
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
===============================







B1 B2 B3 B4
C1 C2 C3 C4
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

output: (all correct):

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z1 : 1
===============================
A4 : 2
B4 : 1
C4 : 1
Z1 : 1
===============================





B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z2 Z3 Z4
A1 A2 A3 A4
A1 B2 C3 A4

OUTPUT:

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z3 : 1
===============================
A4 : 2
B4 : 1
Z3 : 1    <<<<<<<< error: this output should not be there
Z4 : 1
===============================
B4 : 1    <<<<<<<< error: this output should not be there
B5 : 1
Z3 : 1    <<<<<<<< error: this output should not be there
Z4 : 1    <<<<<<<< error: this output should not be there
===============================






B1 B2 B3 B4 B5
C1 C2 C3 C4 C5
Z1 Z2 Z3 Z4 Z5
A1 A2 A3 A4 A5
A1 B2 C3 A4 Z5

OUTPUT: ALL CORRECT:

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z3 : 1
===============================
A4 : 2
B4 : 1
C4 : 1
Z4 : 1
===============================
A5 : 1
B5 : 1
C5 : 1
Z5 : 2
===============================


colucix 02-11-2014 07:18 AM

Oops, you're right! Mine was a greenhorn's mistake: the array a still holds the trailing elements from a previous column that had more distinct values. Just add a delete statement at the end of the for loop:
Code:

  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
    delete a
  }
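
The effect is easy to reproduce in isolation. Here is a minimal sketch with made-up data (the function name show_groups and its argument are hypothetical): two groups are copied into the same scratch array a, and every element of a is printed after each group. The element count is taken with a for-in loop, which reports what length(a) reports in gawk.

```shell
# $1 = 1 to clear the scratch array between groups, 0 to leave it alone
show_groups() {
  awk -v clear="$1" 'BEGIN {
    group[1] = "p q r"; group[2] = "s"
    for (g = 1; g <= 2; g++) {
      n = split(group[g], word)
      for (j = 1; j <= n; j++) a[j] = word[j]   # overwrites only a[1..n]
      size = 0
      for (k in a) size++                       # same number length(a) gives in gawk
      line = ""
      for (j = 1; j <= size; j++) line = line (j > 1 ? " " : "") a[j]
      print line
      if (clear) delete a                       # drop leftovers before the next group
    }
  }'
}

show_groups 0   # second line still carries q and r from the first group
show_groups 1   # second line is just s
```

Without the delete, the shorter second group only overwrites a[1], so a[2] and a[3] survive and are printed again.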


rng 02-11-2014 10:35 AM

Yes, it works perfectly now. Thanks again.

grail 02-12-2014 02:36 AM

I thought I would just throw 2 cents in ... colucix's solution is still using a 1-dimensional associative array. The SUBSEP-delimited index could just as easily use any other character not already in the values on either side of the delimiter. However, as of v4, gawk does indeed offer true multi-dimensional arrays.
(See here for more details)

Using a snippet from the script:
Code:

# current associative array
for ( i = 1; i <= NF; i++)
    arr[i,$i]++

# 2 dimensional array
for ( i = 1; i <= NF; i++)
    arr[i][$i]++

You no longer require the split command and can simply use your for loops to process the array.
The gotcha here would be making sure you sort the second dimension and not the first ;)
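
A rough sketch of the whole script in that style, assuming gawk 4 or later. The function name count_sorted and the sample data are made up, and this is only one way to write it. The first dimension is the column number and the second is the label, so asorti is applied to the sub-array counter[col] to get the labels in order.

```shell
# Hypothetical helper; needs gawk 4+ for arr[i][j] and for asorti.
count_sorted() {
  gawk '
  { for (i = 1; i <= NF; i++) counter[i][$i]++ }
  END {
    ncol = length(counter)                 # number of first-level indices (columns)
    for (col = 1; col <= ncol; col++) {
      n = asorti(counter[col], names)      # sort the second dimension: the labels
      for (j = 1; j <= n; j++)
        print names[j], ":", counter[col][names[j]]
      print "==============================="
    }
  }' "$@"
}

printf 'B1 B2 B3\nC1 C2\nA1 B2\n' | count_sorted
```

asorti clears and refills names on every call, so nothing is left over from a column with more labels.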

rng 02-12-2014 08:00 PM

For the following test file:
Code:

A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4

I tried the following code:
Code:

#!/bin/bash
gawk '{
        for ( i = 1; i <= NF; i++)               
                    counter[i][$i]++
}
END {
        for (j=1; j<=length(counter); j++){
                for (i in counter[j]) {
                        print counter[j][i], ":",$counter[j][i];
                        }
                print "========================"
        }
}' <$1

But it does not work properly. The counts are correct but not the labels. The output was:
Code:

2 : B2
1 : A1
1 : A1
========================
1 : A1
1 : A1
2 : B2
========================
1 : A1
2 : B2
1 : A1
========================
2 : B2
1 : A1
========================
1 : A1
========================


danielbmartin 02-12-2014 08:28 PM

This solution uses one-dimensional arrays.

With this InFile ...
Code:

A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4

... this awk ...
Code:

awk '{line[NR]=$0" "; if (nc<NF) nc=NF};
  END{for (col=1;col<=nc;col++) {delete count;
      for (row=1;row<=NR;row++)
        {fb=index(line[row]," ");
          if (fb>0) {name=substr(line[row],1,fb-1); 
                    line[row]=substr(line[row],fb+1);
                    count[name]++}}
      n=asorti(count,b); for (j=1;j<=n;j++)
      print b[j],":",count[b[j]]
      print "==============="}}' <$InFile >$OutFile

... produced this OutFile ...
Code:

A1 : 2
B1 : 1
C1 : 1
===============
A2 : 1
B2 : 2
C2 : 1
===============
A3 : 1
B3 : 1
C3 : 2
===============
A4 : 2
B4 : 1
===============
B5 : 1
===============

Daniel B. Martin

rng 02-12-2014 09:37 PM

Thanks for your solution.
How can we use 2-dimensional array? I suspect it will be easier to understand.

danielbmartin 02-12-2014 09:58 PM

Quote:

Originally Posted by rng (Post 5116598)
Thanks for your solution.
How can we use 2-dimensional array? I suspect it will be easier to understand.

Other LQ members already posted solutions using 2-dimensional arrays. I did my one-dimensional solution to show a different approach. My computer runs a back-level awk so I cannot use the true two-dimensional language feature shown by others.

Daniel B. Martin



All times are GMT -5.