LinuxQuestions.org
Old 12-07-2011, 09:50 AM   #1
bldcerealkiller
LQ Newbie
 
Registered: Aug 2011
Posts: 16

Rep: Reputation: Disabled
Binning data with AWK


Hi everybody, I'm struggling to find a solution and hope you can help me.
I have a file that looks like this

...
5.059259 0
4.892560 2
4.795937 0
4.357734 1
4.432540 0
4.502526 3
4.468403 0
...

The first column ranges from 0 to 8 and the second column from 0 to 4.
Is it possible to count and output how often each number in the second column occurs within a defined range (let's say 0,5) of the first one?

The final output should look like this:

$1RANGE $2RANGE $3 recurrence/number of hits

[0] [0] [..]
[0,5] [0] [..]
.....
.....
[7] [0] [..]
[7,5] [0] [..]
.....
.....
[0] [1] [..]
[0,5] [1] [..]
......
.....
[7,5] [1] [..]
[8] [1] [..]
....
....
....
....
[0] [4] [..]
[0,5] [4] [..]
....
[7,5] [4] [..]
[8] [4] [..]

Thank you in advance for your reply and sorry if I wasn't too clear.
Regards
 
Old 12-07-2011, 10:30 AM   #2
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
If I understood you correctly, you are trying to build a histogram of the second field in your data, considering only records where the first field is within the desired range. (I am assuming range is inclusive, i.e. both endpoints are considered to be within the range.)

Try
Code:
awk -v min=0 -v max=5 '($1 >= min && $1 <= max) { n++ ; bin[$2]++ } END { for (i in bin) printf("%s %s %s %d %d %.3f%%\n", min, max, i, bin[i], n, 100.0*bin[i]/n) }' input-file
The output will contain lines of format: min max value occurrences samples-in-range percentage-of-occurrences%
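
As a quick sanity check, the one-liner above can be run on the seven sample records from the first post. This is only a sketch: it assumes those seven lines are the whole input, the `data` and `result` shell variables are names made up for this example, and the `sort` is added only because `for (i in bin)` visits the bins in an unspecified order.

```shell
# Hypothetical input: the seven sample records quoted in the first post.
data='5.059259 0
4.892560 2
4.795937 0
4.357734 1
4.432540 0
4.502526 3
4.468403 0'

# Same rule as the one-liner above, then sorted by the value column.
result=$(printf '%s\n' "$data" |
    awk -v min=0 -v max=5 '
        ($1 >= min && $1 <= max) { n++ ; bin[$2]++ }
        END {
            for (i in bin)
                printf("%s %s %s %d %d %.3f%%\n", min, max, i, bin[i], n, 100.0*bin[i]/n)
        }' | sort -k3,3n)

printf '%s\n' "$result"
```

Six of the seven records fall within 0..5 (5.059259 does not), so the first output line should read 0 5 0 3 6 50.000%, followed by one line each for the values 1, 2 and 3.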

If you have multiple ranges you wish to consider at once, put the ranges in another file, and use an associative array containing both range identifier and bin identifier. I prefer to use pipe (|) as the separator. (Since awk simulates multiple dimensions by merging keys, it is simpler to construct the key explicitly yourself, and then just consider the matching keys.)
Code:
awk -v rangefile=rangefile '
   BEGIN {
       ranges = 0
       while ((getline < rangefile) > 0)
           if (NF == 2 && $1 <= $2) {
               ranges++
               min[ranges] = $1
               max[ranges] = $2
           }
       close(rangefile)
   }

   (NF >= 2) {
       for (r = 1; r <= ranges; r++)
           if ($1 >= min[r] && $1 <= max[r]) {
               n[r]++
               hits[r "|" $2]++
           }
   }

   END {
       for (r = 1; r <= ranges; r++) {
           split("", h)
           rstr = r "|"
           rlen = length(rstr)
           for (i in hits)
               if (substr(i, 1, rlen) == rstr)
                   h[substr(i, 1+rlen)] += hits[i]

           for (i in h)
               printf("%s %s %s %d %d %.3f%%\n", min[r], max[r], i, h[i], n[r], 100.0*h[i]/n[r])
       }
   }
' input-file
The output format is the same for both scripts. The latter script is otherwise the same, except each record is compared against each defined range, and in the END rule, a loop picks only the entries in the histogram that belong to one specific range.
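
To make the explicit-key idea concrete, here is a smaller sketch of the same technique. It is not the script above: the two ranges are hard-coded instead of being read from a range file, the keys are taken apart with split() rather than substr(), and only the hit counts are printed. The range bounds and the `data` and `result` shell variables are assumptions made up for this example.

```shell
# Hypothetical input: the seven sample records quoted in the first post.
data='5.059259 0
4.892560 2
4.795937 0
4.357734 1
4.432540 0
4.502526 3
4.468403 0'

result=$(printf '%s\n' "$data" | awk '
    BEGIN {
        # Two made-up inclusive ranges, hard-coded for illustration.
        ranges = 2
        min[1] = 4.0 ; max[1] = 4.5
        min[2] = 4.5 ; max[2] = 5.5
    }
    {
        for (r = 1; r <= ranges; r++)
            if ($1 >= min[r] && $1 <= max[r])
                hits[r "|" $2]++        # explicit composite key "range|value"
    }
    END {
        for (key in hits) {
            split(key, part, "|")       # part[1] = range index, part[2] = value
            printf("%s %s %s %d\n", min[part[1]], max[part[1]], part[2], hits[key])
        }
    }' | LC_ALL=C sort)

printf '%s\n' "$result"
```

Each output line is min max value occurrences. Because the key is an ordinary string, any character that cannot appear in the data works as the separator.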

Let me know if you would like me to describe the inner workings of the longer one statement by statement.

Last edited by Nominal Animal; 12-07-2011 at 10:31 AM.
 
Old 12-07-2011, 12:40 PM   #3
bldcerealkiller
LQ Newbie
 
Registered: Aug 2011
Posts: 16

Original Poster
Rep: Reputation: Disabled
Thank you for your reply.
At the moment I'm not in front of my computer so I cannot test the script, but from what I see, the output of your script seems to differ from the one I need.
The idea is to get a file where all the possible combination of the variables in $1 matches the ones in $2 and in $3 the sum of all the hits of each single combination.
One problem I'm having is how to define the ranges for the values in $1. The values in $1 should be divided into small ranges, in contrast to the ones in $2, which are integers.

Again to be more clear the output should look like this

$1 ranges $2integers
[0 - 0.5] 0 #of hits for this match
[0.5 - 1] 0 #of hits for this match
[1 - 1.5] 0 #of hits for this match
[1.5 - 2 ] 0 #of hits for this match
...
...

Basically, if you consider this, you should get an 80-line output: 16 ranges (from 0 up to 8 in steps of 0.5) * 5 integers (0 to 4).
Thank you again for your support!

Regards
 
Old 12-07-2011, 01:15 PM   #4
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by bldcerealkiller
The idea is to get a file where all the possible combination of the variables in $1 matches the ones in $2 and in $3 the sum of all the hits of each single combination.
No, I'm sorry, I don't understand what you mean.

Could you show the exact output for your example input? No $1 or $2 or ... or [ or ], please; the exact actual output you need.

If you are looking for a generic two-dimensional histogram, you could use
Code:
awk -v xmin=4.1 -v xmax=5.1 -v xn=5 \
    -v ymin=-0.5 -v ymax=3.5 -v yn=4 '

    ($1 >= xmin && $1 < xmax && $2 >= ymin && $2 < ymax) {
        xi = int(xn * ($1 - xmin) / (xmax - xmin)) ; if (xi >= xn) xi = xn - 1
        yi = int(yn * ($2 - ymin) / (ymax - ymin)) ; if (yi >= yn) yi = yn - 1
        h[yi,xi]++
        n++
    }

    END {
        xscale = (xmax - xmin) / xn
        yscale = (ymax - ymin) / yn
        if (n > 0)
            percent = 100.0 / n
        else
            percent = 0.0
        for (yi = 0; yi < yn; yi++)
            for (xi = 0; xi < xn; xi++)
                printf("%.8g %.8g %.8g %.8g %d %.3f%%\n",
                       xmin + xscale * xi, xmin + xscale * (xi + 1),
                       ymin + yscale * yi, ymin + yscale * (yi + 1),
                       h[yi,xi], percent * h[yi,xi])
    }
' input-file > output-file
This one considers each record to be a 2D vector, x y. The xmin, xmax, xn and ymin, ymax, yn parameters define the grid (minimum, maximum, and number of bins along each axis). The output tells how many vectors pointed to each cell in the grid. The output format is xn*yn lines containing xmin xmax ymin ymax occurrences percentage%

Note that the minimum bound is included in the region, but the maximum bound is not. This means that given min=2 max=6, a value of 6 is outside the region. (Usually you just need to extend the maximum enough for a new cell. min=2 max=7 n=5 would give you integer cells.)
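
The half-open behaviour can be checked in isolation with the same index formula as the script above, here in one dimension only. This is just a sketch: the five test values and the `result` shell variable are made up for this example, and min=2 max=7 n=5 are the values from the note above.

```shell
# Five made-up test values, one per line; min, max and n as in the note above.
result=$(printf '%s\n' 2 2.9 3 6.5 7 | awk -v min=2 -v max=7 -v n=5 '
    {
        if ($1 >= min && $1 < max) {
            i = int(n * ($1 - min) / (max - min))   # zero-based bin index
            w = (max - min) / n                     # bin width
            printf("%s bin %d [%g, %g)\n", $1, i, min + w * i, min + w * (i + 1))
        } else
            printf("%s outside\n", $1)
    }')

printf '%s\n' "$result"
```

The value 2 lands in the first bin because the lower bound is included, 3 starts the second bin, and 7 is reported as outside because the upper bound is excluded.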

For 2D histograms, typically only the centerpoints of the sampled regions are shown, i.e.
Code:
                printf("%.8g %.8g %d %.3f%%\n",
                       xmin + xscale * (xi + 0.5),
                       ymin + yscale * (yi + 0.5),
                       h[yi,xi], percent * h[yi,xi])

Last edited by Nominal Animal; 12-07-2011 at 01:18 PM.
 
Old 12-08-2011, 03:27 PM   #5
bldcerealkiller
LQ Newbie
 
Registered: Aug 2011
Posts: 16

Original Poster
Rep: Reputation: Disabled
Thanks to one of my colleagues I've found what I was looking for.
Thank you anyway for the other examples!

Regards

Code:
awk '{c[int($1/.5),$2]++}END{for(f=0;f<=8;f+=.5)for(s=0;s<=4;s++)printf"%.1f %d %d \n",f,s,c[f/.5,s]}' inputfile
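
For anyone reading along, here is a spelled-out sketch of the same idea, under a few assumptions: the names step, xmax, ymax and nbins do not appear in the one-liner and are introduced only for readability, the seven sample records from the first post stand in for the input file, and the END loop runs over an integer bin index rather than adding 0.5 repeatedly. Each record is counted under the bin whose lower edge is int($1/step)*step.

```shell
# Hypothetical input: the seven sample records quoted in the first post.
data='5.059259 0
4.892560 2
4.795937 0
4.357734 1
4.432540 0
4.502526 3
4.468403 0'

result=$(printf '%s\n' "$data" | awk -v step=0.5 -v xmax=8 -v ymax=4 '
    # Count each record under (bin index of column 1, value of column 2).
    { c[int($1 / step), $2]++ }

    END {
        nbins = int(xmax / step)          # 16 when step=0.5 and xmax=8
        for (b = 0; b <= nbins; b++)      # integer loop index, lower edge is b*step
            for (s = 0; s <= ymax; s++)
                printf("%.1f %d %d\n", b * step, s, c[b, s])
    }')

printf '%s\n' "$result"
```

Like the one-liner, this prints one line per combination, 17 lower edges (0.0 to 8.0) times 5 integers, with a count of 0 for combinations that never occur. For the sample records, the line for lower edge 4.0 and value 0 shows a count of 2.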
 
  

