LinuxQuestions.org
Old 02-10-2014, 11:34 AM   #1
rng
Senior Member
 
Registered: Aug 2011
Posts: 1,198

Rep: Reputation: 47
AWK associative array problem


The following code works on a file that I know has 3 fields:
Code:
gawk '
{
	arr1[$1]++
	arr2[$2]++
	arr3[$3]++
}
END {
		for(i in arr1) { print(i,":" ,arr1[i])}
		print("===============================")
		for(i in arr2) { print(i,":" ,arr2[i]) }
		print("===============================")
		for(i in arr3) { print(i,":" ,arr3[i]) }
		print("===============================")
		
}' <infile
But is it possible to run this for all fields if the number of fields is not known beforehand? Thanks in advance.
 
Old 02-10-2014, 03:14 PM   #2
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Help us to help you. Provide a sample input file (10-15 lines will do). Construct a sample output file which corresponds to your sample input and post both samples here. With "InFile" and "OutFile" examples we can better understand your needs and also judge if our proposed solution fills those needs.

Daniel B. Martin
 
Old 02-10-2014, 04:46 PM   #3
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Indeed - as Daniel pointed out - a sample input/output would be useful to fully understand your requirement. Anyway, here is an example using multi-dimensional arrays:
Code:
$ cat file
A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4
$ awk '
{
  for ( i = 1; i <= NF; i++)
    arr[i,$i]++
}

END {
  for ( combined in arr ) {
    split(combined,separated,SUBSEP)
    c[separated[1]]++
    # append the value to this column's list; compare with "" so a field value of 0 is not overwritten
    f[separated[1]] = (f[separated[1]] == "" ? "" : f[separated[1]] " ") separated[2]
  }

  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ )
      print _[j],":",arr[i,_[j]]
    print "==============================="
  }
}' file
A1 : 2
B1 : 1
C1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
===============================
A4 : 2
B4 : 1
===============================
B5 : 1
===============================
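
The key point in the script above is that arr[i,$i] is still an ordinary one-dimensional array: awk joins the two subscripts into the single string i SUBSEP $i, and split(key, part, SUBSEP) takes them apart again. A minimal sketch of just that idea follows. It assumes only POSIX awk and sort; the function name count_by_column and the "column value count" output layout are made up for illustration and are not part of the original script.

```shell
# count_by_column FILE: print "column value count" for every distinct value
# in every column, however many fields each line has.
# Hypothetical helper name; uses only POSIX awk features plus sort.
count_by_column() {
  awk '
  {
    for (i = 1; i <= NF; i++)
      arr[i, $i]++               # same key as arr[i SUBSEP $i]
  }
  END {
    for (key in arr) {
      split(key, part, SUBSEP)   # part[1] = column number, part[2] = value
      print part[1], part[2], arr[key]
    }
  }' "$1" | sort -k1,1n -k2,2    # order by column number, then by value
}
```

Run on the four-line sample file, this prints lines such as "1 A1 2" and "5 B5 1". Ordering is left to sort because the order of a for (key in arr) loop is unspecified.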
 
Old 02-10-2014, 06:42 PM   #4
rng
Senior Member
 
Registered: Aug 2011
Posts: 1,198

Original Poster
Rep: Reputation: 47
I should have provided an input file and output but I thought the problem was very clear. Thanks colucix, for the input file and the answer. This is exactly what I was looking for. I am now trying to understand how your elegant solution works.

The output looks sorted, but only because of the order of the lines in the input file. If the input file is:
Code:
B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4
The output becomes:
Code:
Z1 : 1
A1 : 2
B1 : 1
C1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
Z1 : 1
A3 : 1
B3 : 1
C3 : 2
===============================
Z1 : 1
A4 : 2
B4 : 1
===============================
B5 : 1
===============================
Can I get sorted output data?

Last edited by rng; 02-10-2014 at 07:49 PM.
 
Old 02-10-2014, 09:28 PM   #5
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by rng
Can I get sorted output data?
Sorted by what? Name or count?

Daniel B. Martin
 
Old 02-10-2014, 09:36 PM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Two-dimensional arrays are a powerful language feature, but that kind of code can be difficult to write and difficult to read. At the expense of execution time, you may instead read the input multiple times and enjoy the simplicity of one-dimensional arrays.

With this InFile ...
Code:
B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4
... this code ...
Code:
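# find the largest field count on any line
nc=$(awk '{if (NF>maxNF) maxNF=NF}; END{print maxNF}' "$InFile")
rm -f "$OutFile"    # -f: no error if the file does not exist yet
for (( c=1; c<=nc; c++ ))
  do
    # one pass per column: count each distinct value found in column c
    awk  -v "c=$c" '{if (c<=NF) {name[$c]=$c; count[$c]++}};
      END{for (j in name) print name[j],":",count[j]}' "$InFile" >>"$OutFile"
    echo "=======================" >>"$OutFile"
done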
... produced this OutFile ...
Code:
Z1 : 1
A1 : 2
B1 : 1
C1 : 1
=======================
C2 : 1
Z1 : 1
A2 : 1
B2 : 2
=======================
B3 : 1
C3 : 2
Z1 : 1
A3 : 1
=======================
A4 : 2
B4 : 1
Z1 : 1
=======================
B5 : 1
=======================
Daniel B. Martin
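
Because the multi-pass approach above handles one column at a time, the "sorted by name or by count?" question can be left to the external sort command. The following is only a sketch under that assumption: the function name column_counts and its third argument are invented for this example, and it relies on nothing beyond POSIX awk and sort.

```shell
# column_counts FILE COL [name|count]
# Print "value : count" for column COL, ordered by value (the default)
# or by descending count. Hypothetical helper, not from the thread.
column_counts() {
  file=$1 col=$2 order=${3:-name}
  awk -v c="$col" 'c <= NF { count[$c]++ }
       END { for (v in count) print v, ":", count[v] }' "$file" |
  if [ "$order" = count ]; then
    sort -k3,3nr -k1,1      # highest count first, ties broken by name
  else
    sort -k1,1              # alphabetical by value
  fi
}
```

For example, column_counts "$InFile" 2 count would list the most frequent value of the second column first.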
 
Old 02-11-2014, 02:37 AM   #7
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
To sort arrays in gawk you can use the asort function (a gawk extension, not available in every awk). Following my previous example, you can change the last part to:
Code:
  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
  }
Instead of printing the content of the multi-dimensional array directly, we assign each whole string to the array a, then sort its contents and finally print them out.
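
To illustrate why whole "name : count" strings are sorted here: asort orders an array's values and renumbers the indices from 1, so the original names are lost, whereas asorti orders the indices and leaves the counts available for lookup. Both are gawk extensions. A small sketch with made-up data (the function name sort_demo is only for illustration):

```shell
# sort_demo: contrast asort() and asorti() on a tiny array of counts.
# Requires gawk; neither function exists in plain POSIX awk.
sort_demo() {
  gawk 'BEGIN {
    count["Z1"] = 1; count["A1"] = 2; count["B1"] = 1

    n = asort(count, vals)            # vals[1..n] = sorted VALUES; names are gone
    print vals[1], vals[2], vals[3]

    n = asorti(count, names)          # names[1..n] = sorted INDICES of count
    for (j = 1; j <= n; j++)
      print names[j], ":", count[names[j]]
  }'
}
```

The first line printed is "1 1 2" (counts only), followed by "A1 : 2", "B1 : 1" and "Z1 : 1".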
 
Old 02-11-2014, 06:30 AM   #8
rng
Senior Member
 
Registered: Aug 2011
Posts: 1,198

Original Poster
Rep: Reputation: 47
Thanks for your reply. The solution looks elegant, but it produces some errors if there are blank fields. The program that I tested is:
Code:
#! /bin/bash
awk '
{
  for ( i = 1; i <= NF; i++)
    arr[i,$i]++
}

END {
  for ( combined in arr ) {
    split(combined,separated,SUBSEP)
    c[separated[1]]++
    f[separated[1]] ? f[separated[1]] = (f[separated[1]] " " separated[2]) : f[separated[1]] = separated[2]
  }

 for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
  }
}' <datafile.txt
The data files and outputs are as follows (the error lines are marked by me):
Code:
B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

output: 

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z1 : 1
===============================
A4 : 2
B4 : 1
Z1 : 1
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
===============================
B4 : 1    <<<<<<<<<<<<< error: this output should not be there
B5 : 1
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
Z1 : 1    <<<<<<<<<<<<< error: this output should not be there
===============================







B1 B2 B3 B4
C1 C2 C3 C4
Z1 Z1 Z1 Z1
A1 A2 A3 A4
A1 B2 C3 A4

output: (all correct): 

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z1 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z1 : 1
===============================
A4 : 2
B4 : 1
C4 : 1
Z1 : 1
===============================





B1 B2 B3 B4 B5
C1 C2 C3
Z1 Z2 Z3 Z4
A1 A2 A3 A4
A1 B2 C3 A4

OUTPUT:

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z3 : 1
===============================
A4 : 2
B4 : 1
Z3 : 1     <<<<<<<< error: this output should not be there
Z4 : 1
===============================
B4 : 1     <<<<<<<< error: this output should not be there
B5 : 1
Z3 : 1     <<<<<<<< error: this output should not be there
Z4 : 1     <<<<<<<< error: this output should not be there
===============================






B1 B2 B3 B4 B5
C1 C2 C3 C4 C5
Z1 Z2 Z3 Z4 Z5
A1 A2 A3 A4 A5
A1 B2 C3 A4 Z5

OUTPUT: ALL CORRECT: 

A1 : 2
B1 : 1
C1 : 1
Z1 : 1
===============================
A2 : 1
B2 : 2
C2 : 1
Z2 : 1
===============================
A3 : 1
B3 : 1
C3 : 2
Z3 : 1
===============================
A4 : 2
B4 : 1
C4 : 1
Z4 : 1
===============================
A5 : 1
B5 : 1
C5 : 1
Z5 : 2
===============================
 
Old 02-11-2014, 07:18 AM   #9
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Oops, you're right! Mine was a greenhorn's mistake: the array a retains elements left over from a previous column that had more distinct values. Just add a delete statement at the end of the for loop:
Code:
  for ( i = 1; i <= length(c); i++ ) {
    split(f[i],_)
    for ( j = 1; j <= length(_); j++ ) {
      a[j] = ( _[j] " : " arr[i,_[j]] )
    }
    asort(a)
    for ( j = 1; j <= length(a); j++ ) {
      print a[j]
    }
    print "==============================="
    delete a
  }
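
The stale-element problem can be reproduced in isolation. The sketch below uses made-up data and invented names (stale_demo, fill, size): it fills the same array first with three items and then with one, and counts what is left with and without clearing. It uses split("", a), an older portable idiom that empties an array just as delete a does.

```shell
# stale_demo: show that refilling an array does not remove old elements.
# Hypothetical demo; uses only POSIX awk features.
stale_demo() {
  awk 'function fill(s,   n, i, tmp) {      # copy the words of s into global array a
         n = split(s, tmp, " ")
         for (i = 1; i <= n; i++) a[i] = tmp[i]
       }
       function size(   k, n) {             # count the elements currently in a
         n = 0
         for (k in a) n++
         return n
       }
       BEGIN {
         fill("x y z"); fill("p")
         print "without clearing:", size()  # a[2] and a[3] are left over
         split("", a)                       # empty the array
         fill("p")
         print "after clearing:", size()
       }'
}
```

This prints "without clearing: 3" and then "after clearing: 1", which is the same effect that produced the extra lines in the shorter columns.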
 
Old 02-11-2014, 10:35 AM   #10
rng
Senior Member
 
Registered: Aug 2011
Posts: 1,198

Original Poster
Rep: Reputation: 47
Yes, it works perfectly now. Thanks again.
 
Old 02-12-2014, 02:36 AM   #11
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,006

Rep: Reputation: 3191
I thought I would just throw in my 2 cents ... colucix's solution is still using a 1-dimensional associative array. The SUBSEP-delimited key could just as easily use any other character not already present in the values on either side of the delimiter. However, as of version 4, gawk does indeed offer true multi-dimensional arrays (arrays of arrays).
(See here for more details)

Using a snippet from the script:
Code:
# current associative array
for ( i = 1; i <= NF; i++)
    arr[i,$i]++

# 2 dimensional array
for ( i = 1; i <= NF; i++)
    arr[i][$i]++
You no longer need the split function and can simply use nested for loops to process the array.
The gotcha here is making sure you sort the second dimension and not the first.
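
As a rough sketch of how the true two-dimensional form might look end to end, with the second dimension sorted: in gawk 4.0 or later, PROCINFO["sorted_in"] controls the order in which for (x in array) visits indices, so setting it before walking arr[col] orders the values while the first dimension is still walked by a plain numeric loop. The function name counts_2d, the maxnf variable and the short separator are assumptions made for this example.

```shell
# counts_2d FILE: per-column value counts using gawk arrays of arrays.
# Hypothetical helper; needs gawk >= 4.0 (arr[i][j] and PROCINFO["sorted_in"]).
counts_2d() {
  gawk '
  {
    for (i = 1; i <= NF; i++)
      arr[i][$i]++
    if (NF > maxnf) maxnf = NF               # remember the widest line
  }
  END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # for-in visits indices in string order
    for (col = 1; col <= maxnf; col++) {     # first dimension: numeric loop
      for (val in arr[col])                  # second dimension: sorted values
        print val, ":", arr[col][val]
      print "=========="
    }
  }' "$1"
}
```

No split call and no helper arrays are needed, and nothing has to be cleared between columns because each arr[col] is its own array.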

Last edited by grail; 02-12-2014 at 02:37 AM.
 
Old 02-12-2014, 08:00 PM   #12
rng
Senior Member
 
Registered: Aug 2011
Posts: 1,198

Original Poster
Rep: Reputation: 47
For the following test file:
Code:
A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4
I tried the following code:
Code:
#!/bin/bash
gawk '{
	for ( i = 1; i <= NF; i++)		
                     counter[i][$i]++
}
END {
	for (j=1; j<=length(counter); j++){
		for (i in counter[j]) {
			print counter[j][i], ":",$counter[j][i]; 
			}
		print "========================"
	}
}' <$1
But it does not work properly. The counts are correct but not the labels. The output was:
Code:
2 : B2
1 : A1
1 : A1
========================
1 : A1
1 : A1
2 : B2
========================
1 : A1
2 : B2
1 : A1
========================
2 : B2
1 : A1
========================
1 : A1
========================
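
A likely explanation for the output above: in awk, $expr is a field reference, so $counter[j][i] means "the field whose number is the count". In the END block gawk still holds the last record (A1 B2 C3 A4), so a count of 1 prints A1 and a count of 2 prints B2. The label is simply the loop index i. A minimal corrected sketch follows (gawk 4.0 or later; the wrapper function label_counts is a made-up name for convenience):

```shell
# label_counts FILE: the script above with the print statement corrected.
# Hypothetical wrapper; requires gawk >= 4.0 for counter[i][j].
label_counts() {
  gawk '{
    for (i = 1; i <= NF; i++)
      counter[i][$i]++
  }
  END {
    for (j = 1; j <= length(counter); j++) {
      for (i in counter[j])
        print i, ":", counter[j][i]    # i is the label, counter[j][i] is the count
      print "========================"
    }
  }' "$1"
}
```

The order of lines within each column is still unspecified, since a plain for (i in counter[j]) loop does not promise any particular order.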
 
Old 02-12-2014, 08:28 PM   #13
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
This solution uses one-dimensional arrays.

With this InFile ...
Code:
A1 A2 A3 A4
B1 B2 B3 B4 B5
C1 C2 C3
A1 B2 C3 A4
... this awk ...
Code:
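awk '{ line[NR] = $0 " "; if (nc < NF) nc = NF }   # save each line; track the largest NF
  END {
    for (col = 1; col <= nc; col++) {
      delete count
      for (row = 1; row <= NR; row++) {
        fb = index(line[row], " ")                 # position of the first blank
        if (fb > 0) {
          name = substr(line[row], 1, fb-1)        # peel off the leading field
          line[row] = substr(line[row], fb+1)
          count[name]++
        }
      }
      n = asorti(count, b)                         # gawk: sort the names
      for (j = 1; j <= n; j++)
        print b[j], ":", count[b[j]]
      print "==============="
    }
  }' <"$InFile" >"$OutFile"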
... produced this OutFile ...
Code:
A1 : 2
B1 : 1
C1 : 1
===============
A2 : 1
B2 : 2
C2 : 1
===============
A3 : 1
B3 : 1
C3 : 2
===============
A4 : 2
B4 : 1
===============
B5 : 1
===============
Daniel B. Martin
 
Old 02-12-2014, 09:37 PM   #14
rng
Senior Member
 
Registered: Aug 2011
Posts: 1,198

Original Poster
Rep: Reputation: 47
Thanks for your solution.
How can we use a 2-dimensional array? I suspect it will be easier to understand.
 
Old 02-12-2014, 09:58 PM   #15
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by rng
Thanks for your solution.
How can we use a 2-dimensional array? I suspect it will be easier to understand.
Other LQ members already posted solutions using 2-dimensional arrays. I did my one-dimensional solution to show a different approach. My computer runs a back-level awk so I cannot use the true two-dimensional language feature shown by others.

Daniel B. Martin
