[SOLVED] AWK associative array problem

rng · 02-10-2014, 11:34 AM

The following code runs well on a file where I know there are 3 fields:

Code:

gawk '
{
	arr1[$1]++
	arr2[$2]++
	arr3[$3]++
}
END {
		for(i in arr1) { print(i,":" ,arr1[i])}
		print("===============================")
		for(i in arr2) { print(i,":" ,arr2[i]) }
		print("===============================")
		for(i in arr3) { print(i,":" ,arr3[i]) }
		print("===============================")
		
}' <infile

But is it possible to run this for all fields if the number of fields is not known beforehand? Thanks in advance.

danielbmartin · 02-10-2014, 03:14 PM

Help us to help you. Provide a sample input file (10-15 lines will do). Construct a sample output file which corresponds to your sample input and post both samples here. With "InFile" and "OutFile" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin

colucix · 02-10-2014, 04:46 PM

Indeed - as Daniel pointed out - a sample input/output would be useful to fully understand your requirement. Anyway, here is an example using multi-dimensional arrays:

Code:

$ cat file
A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4
$ awk '
{
  for ( i = 1; i <= NF; i++)
    arr[i,$i]++
}

END {
  for ( combined in arr ) {
    split(combined,separated,SUBSEP)
    c[separated[1]]++
    f[separated[1]] ? f[separated[1]] = (f[separated[1]] " " separated[2]) : f[separated[1]] = separated[2]
  }

  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ )
      print _[j],":",arr[i,_[j]]
    print "==============================="
  }
}' file
A1 : 2
B1 : 1
C1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
===============================
A4 : 2
B4 : 1
===============================
B5 : 1
===============================

rng · 02-10-2014, 06:42 PM

I should have provided an input file and output but I thought the problem was very clear. Thanks colucix, for the input file and the answer. This is exactly what I was looking for. I am now trying to understand how your elegant solution works.

The output data is sorted, but only because of the order in which the data appears in the input file. If the input file is:

Code:

B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

The output becomes:

Code:

Z1 : 1
A1 : 2
B1 : 1
C1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
Z1 : 1
A3 : 1
B3 : 1
C3 : 2
===============================
Z1 : 1
A4 : 2
B4 : 1
===============================
B5 : 1
===============================

Can I get sorted output data?

danielbmartin · 02-10-2014, 09:28 PM

Quote:

Originally Posted by rng

Can I get sorted output data?

Sorted by what? Name or count?

Daniel B. Martin

danielbmartin · 02-10-2014, 09:36 PM

Two-dimensional arrays are a powerful language feature, but that kind of code can be difficult to write and difficult to read. At the expense of execution time, you may read the input multiple times but enjoy the simplicity of one-dimensional arrays.

With this InFile ...

Code:

B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

... this code ...

Code:

nc=$(awk '{if (NF>maxNF) maxNF=NF}; END{print maxNF}' "$InFile")
rm -f "$OutFile"   # -f: no error if the file does not exist yet
for (( c=1; c<=$nc; c++ ))
  do
    awk  -v "c=$c" '{if (c<=NF) {name[$c]=$c; count[$c]++}}; 
      END{for (j in name) print name[j],":",count[j]}' "$InFile" >>"$OutFile"
    echo "=======================" >>"$OutFile"
done

... produced this OutFile ...

Code:

Z1 : 1
A1 : 2
B1 : 1
C1 : 1
=======================
C2 : 1
Z1 : 1
A2 : 1
B2 : 2
=======================
B3 : 1
C3 : 2
Z1 : 1
A3 : 1
=======================
A4 : 2
B4 : 1
Z1 : 1
=======================
B5 : 1
=======================

Daniel B. Martin

colucix · 02-11-2014, 02:37 AM

To sort arrays in gawk you may use the asort function (a gawk extension). Following my previous example, you can change the last part into:

Code:

  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
  }

Instead of printing the content of the multi-dimensional array directly, we assign each whole string to the array a, then we sort its content and finally print it out.

rng · 02-11-2014, 06:30 AM

Thanks for your reply. The solution looks elegant, but it produces some errors if there are blank fields. The program that I tested is:

Code:

#! /bin/bash
awk '
{
  for ( i = 1; i <= NF; i++)
    arr[i,$i]++
}

END {
  for ( combined in arr ) {
    split(combined,separated,SUBSEP)
    c[separated[1]]++
    f[separated[1]] ? f[separated[1]] = (f[separated[1]] " " separated[2]) : f[separated[1]] = separated[2]
  }

 for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
  }
}' <datafile.txt

The data files and outputs are as follows (the error lines are marked by me):

Code:

B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

output: 

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z1 : 1
===============================
A4 : 2
B4 : 1
Z1 : 1
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
===============================
B4 : 1    <<<<<<<<<<<<< error: this output should not be there
B5 : 1
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
===============================

B1 B2 B3 B4
C1 C2 C3 C4
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

output: (all correct): 

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z1 : 1
===============================
A4 : 2
B4 : 1
C4 : 1
Z1 : 1
===============================

B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z2 Z3 Z4
A1 A2 A3 A4
A1 B2 C3 A4

OUTPUT:

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z3 : 1
===============================
A4 : 2
B4 : 1
Z3 : 1     <<<<<<<< error: this output should not be there
Z4 : 1
===============================
B4 : 1     <<<<<<<< error: this output should not be there
B5 : 1
Z3 : 1     <<<<<<<< error: this output should not be there
Z4 : 1     <<<<<<<< error: this output should not be there
===============================

B1 B2 B3 B4 B5
C1 C2 C3 C4 C5
Z1 Z2 Z3 Z4 Z5
A1 A2 A3 A4 A5
A1 B2 C3 A4 Z5

OUTPUT: ALL CORRECT: 

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z3 : 1
===============================
A4 : 2
B4 : 1
C4 : 1
Z4 : 1
===============================
A5 : 1
B5 : 1
C5 : 1
Z5 : 2
===============================

colucix · 02-11-2014, 07:18 AM

Oops, you're right! Mine was a greenhorn's mistake: the array a retains leftover elements from the previous columns, which had more entries because they come from the longer lines. Just add a delete statement at the end of the for loop:

Code:

  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
    delete a
  }
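
To see why the delete matters, here is a minimal sketch that is separate from the script above. The helper name show_stale and the sample values are made up for illustration. It fills an array with three elements, then reuses it for a single element, with and without delete a in between:

```shell
#!/bin/sh
# Sketch only: show_stale and its sample values are hypothetical,
# they are not taken from the script in this thread.
show_stale() {
  awk -v clear="$1" '
  # Count elements with a loop so the sketch does not depend on length(array)
  function count(arr,    k, n) { n = 0; for (k in arr) n++; return n }
  BEGIN {
    n = split("x y z", src)                 # first pass: three elements
    for (j = 1; j <= n; j++) a[j] = src[j]
    print "after first pass:", count(a)

    if (clear) delete a                     # the fix: empty the array

    a[1] = "only"                           # second pass: one element
    print "after second pass:", count(a)
  }'
}

show_stale 0   # no delete: a[2] and a[3] survive, so 3 elements remain
show_stale 1   # with delete: only the new element remains
```

Without the delete, the second pass still reports 3 elements, which is the same effect as the extra lines printed for the shorter columns.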

rng · 02-11-2014, 10:35 AM

Yes, it works perfectly now. Thanks again.

grail · 02-12-2014, 02:36 AM

I thought I would just throw in my 2 cents ... colucix's solution is still using a one-dimensional associative array. The SUBSEP-delimited index could just as easily use any other character not already in the values on either side of the delimiter. However, as of version 4, gawk does indeed offer true multi-dimensional arrays (arrays of arrays).
(See here for more details)
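
The SUBSEP point can be checked with a small sketch. The helper name subsep_demo is made up for illustration; the sketch only shows that a comma-separated subscript is stored as one string key joined by SUBSEP:

```shell
#!/bin/sh
# Sketch only: subsep_demo is a hypothetical helper, not code from the thread.
subsep_demo() {
  awk 'BEGIN {
    arr[1, "A1"]++                       # "two-dimensional" style subscript
    key = 1 SUBSEP "A1"                  # the single string awk actually stores
    print ((key in arr) ? "same key" : "different key")

    for (k in arr) {                     # recover both parts from the one key
      split(k, part, SUBSEP)
      print "column=" part[1], "value=" part[2], "count=" arr[k]
    }
  }'
}

subsep_demo
```

This is why the earlier script needs split(combined,separated,SUBSEP) to get the column number and the value back out of each key.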

Using a snippet from the script:

Code:

# current associative array
for ( i = 1; i <= NF; i++)
    arr[i,$i]++

# 2 dimensional array
for ( i = 1; i <= NF; i++)
    arr[i][$i]++

You no longer require the split command and can simply use your for loops to process the array.
The gotcha here would be making sure you sort the second dimension and not the first.
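
Putting this together, a minimal sketch of the whole job with arrays of arrays might look like the following. It assumes gawk 4.0 or later is installed as gawk. The function name count_by_column is made up for illustration, and PROCINFO["sorted_in"] is used here to get the sorted order of the second dimension instead of asort:

```shell
#!/bin/bash
# Sketch only: assumes gawk 4.0 or later is installed as "gawk".
# count_by_column is a hypothetical helper name, not from the thread.
count_by_column() {
  gawk '
  {
    # counter[column][value] is a true array of arrays (gawk 4.0+)
    for (i = 1; i <= NF; i++)
      counter[i][$i]++
    if (NF > maxnf) maxnf = NF
  }
  END {
    # Walk every subarray in ascending string order of its index
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (col = 1; col <= maxnf; col++) {
      # name is the index (the label); counter[col][name] is the count
      for (name in counter[col])
        print name, ":", counter[col][name]
      print "==============================="
    }
  }' "$@"
}

# Sample run on the test file used earlier in the thread
count_by_column <<'EOF'
A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4
EOF
```

Inside the inner loop, name is the label and counter[col][name] is the count. Writing $counter[col][name] would instead be a field reference ($1, $2, ...), so the labels would come out wrong.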

rng · 02-12-2014, 08:00 PM

For the following test file:

Code:

A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4

I tried the following code:

Code:

#!/bin/bash
gawk '{
	for ( i = 1; i <= NF; i++)		
                     counter[i][$i]++
}
END {
	for (j=1; j<=length(counter); j++){
		for (i in counter[j]) {
			print counter[j][i], ":",$counter[j][i]; 
			}
		print "========================"
	}
}' <$1

But it does not work properly. The counts are correct but not the labels. The output was:

Code:

2 : B2
1 : A1
1 : A1
========================
1 : A1
1 : A1
2 : B2
========================
1 : A1
2 : B2
1 : A1
========================
2 : B2
1 : A1
========================
1 : A1
========================

danielbmartin · 02-12-2014, 08:28 PM

This solution uses one-dimensional arrays.

With this InFile ...

Code:

A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4

... this awk ...

Code:

awk '{line[NR]=$0" "; if (nc<NF) nc=NF};
  END{for (col=1;col<=nc;col++) {delete count;
      for (row=1;row<=NR;row++)
         {fb=index(line[row]," ");
          if (fb>0) {name=substr(line[row],1,fb-1);   
                     line[row]=substr(line[row],fb+1);
                     count[name]++}} 
      n=asorti(count,b); for (j=1;j<=n;j++)
      print b[j],":",count[b[j]]
      print "==============="}}' <$InFile >$OutFile

... produced this OutFile ...

Code:

A1 : 2
B1 : 1
C1 : 1
===============
A2 : 1
B2 : 2
C2 : 1
===============
A3 : 1
B3 : 1
C3 : 2
===============
A4 : 2
B4 : 1
===============
B5 : 1
===============

Daniel B. Martin

rng · 02-12-2014, 09:37 PM

Thanks for your solution.
How can we use a 2-dimensional array? I suspect it will be easier to understand.

danielbmartin · 02-12-2014, 09:58 PM

Quote:

Originally Posted by rng

Thanks for your solution.
How can we use a 2-dimensional array? I suspect it will be easier to understand.

Other LQ members already posted solutions using 2-dimensional arrays. I did my one-dimensional solution to show a different approach. My computer runs a back-level awk so I cannot use the true two-dimensional language feature shown by others.

Daniel B. Martin
