[SOLVED] Binning data with AWK

bldcerealkiller · 12-07-2011, 09:50 AM

Hi everybody, I'm struggling to find a way out and hope you can help me.
I have a file that looks like this

...
5.059259 0
4.892560 2
4.795937 0
4.357734 1
4.432540 0
4.502526 3
4.468403 0
...

The first column ranges from 0 to 8 and the second column from 0 to 4.
Is it possible to count and output the recurrence of the number in the second column within a defined range (let's say 0,5) of the first one?

The final output should look like this:

$1RANGE $2RANGE $3 recurrence/number of hits

[0] [0] [..]
[0,5] [0] [..]
.....
.....
[7] [0] [..]
[7,5] [0] [..]
.....
.....
[0] [1] [..]
[0,5] [1] [..]
......
.....
[7,5] [1] [..]
[8] [1] [..]
....
....
....
....
[0] [4] [..]
[0,5] [4] [..]
....
[7,5] [4] [..]
[8] [4] [..]

Thank you in advance for your reply and sorry if I wasn't too clear.
Regards

Nominal Animal · 12-07-2011, 10:30 AM

If I understood you correctly, you are trying to build a histogram of the second field in your data, considering only records where the first field is within the desired range. (I am assuming range is inclusive, i.e. both endpoints are considered to be within the range.)

Try

Code:

awk -v min=0 -v max=5 '($1 >= min && $1 <= max) { n++ ; bin[$2]++ } END { for (i in bin) printf("%s %s %s %d %d %.3f%%\n", min, max, i, bin[i], n, 100.0*bin[i]/n) }' input-file

The output will contain lines of format: min max value occurrences samples-in-range percentage-of-occurrences%
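As a quick check of that format, here is a small run on the seven sample records quoted in the question. The `data` and `result` variable names are only for illustration, and the trailing `sort` is my addition, because awk does not guarantee any particular order for `for (i in bin)`.

```shell
# The seven sample records from the question (the "..." lines are left out).
data='5.059259 0
4.892560 2
4.795937 0
4.357734 1
4.432540 0
4.502526 3
4.468403 0'

# Same one-liner as above, reading standard input instead of input-file.
# sort is added only to make the line order predictable.
result=$(printf '%s\n' "$data" | awk -v min=0 -v max=5 '($1 >= min && $1 <= max) { n++ ; bin[$2]++ } END { for (i in bin) printf("%s %s %s %d %d %.3f%%\n", min, max, i, bin[i], n, 100.0*bin[i]/n) }' | sort -k3,3n)
echo "$result"
# 0 5 0 3 6 50.000%
# 0 5 1 1 6 16.667%
# 0 5 2 1 6 16.667%
# 0 5 3 1 6 16.667%
```

The first record (5.059259) falls outside 0..5, so only six samples are counted.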

If you have multiple ranges you wish to consider at once, put the ranges in another file, and use an associative array containing both range identifier and bin identifier. I prefer to use pipe (|) as the separator. (Since awk simulates multiple dimensions by merging keys, it is simpler to construct the key explicitly yourself, and then just consider the matching keys.)

Code:

awk -v rangefile=rangefile '
   BEGIN {
       ranges = 0
       while ((getline < rangefile) > 0)
           if (NF == 2 && $1 <= $2) {
               ranges++
               min[ranges] = $1
               max[ranges] = $2
           }
       close(rangefile)
   }

   (NF >= 2) {
       for (r = 1; r <= ranges; r++)
           if ($1 >= min[r] && $1 <= max[r]) {
               n[r]++
               hits[r "|" $2]++
           }
   }

   END {
       for (r = 1; r <= ranges; r++) {
           split("", h)
           rstr = r "|"
           rlen = length(rstr)
           for (i in hits)
               if (substr(i, 1, rlen) == rstr)
                   h[substr(i, 1+rlen)] += hits[i]

           for (i in h)
               printf("%s %s %s %d %d %.3f%%\n", min[r], max[r], i, h[i], n[r], 100.0*h[i]/n[r])
       }
   }
' input-file

The output format is the same for both scripts. The latter script is otherwise the same, except each record is compared against each defined range, and in the END rule, a loop picks only the entries in the histogram that belong to one specific range.
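If the ranges follow a regular pattern, the range file does not have to be typed by hand. The following is only a sketch, assuming sixteen ranges of width 0.5 covering 0 to 8; the `ranges` variable name is just for illustration.

```shell
# Assumption: sixteen consecutive ranges of width 0.5, from 0.0 up to 8.0.
# Each output line is "min max", which is the layout the script above reads.
ranges=$(awk 'BEGIN { for (i = 0; i < 16; i++) printf("%.1f %.1f\n", 0.5 * i, 0.5 * (i + 1)) }')
echo "$ranges"
# To feed it to the script above:  echo "$ranges" > rangefile
```

Because the ranges are inclusive at both ends, a value that lies exactly on a boundary (say 0.5) is counted in both neighbouring ranges.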

Let me know if you would like me to describe the inner workings of the longer one statement by statement.

bldcerealkiller · 12-07-2011, 12:40 PM

Thank you for your reply.
At the moment I'm not in front of my computer so I cannot test the script, but from what I see it seems that the output I should get is actually different from the one you propose with your script.
The idea is to get a file where all the possible combination of the variables in $1 matches the ones in $2 and in $3 the sum of all the hits of each single combination.
One problem I'm running into is how to define the ranges for the values in $1. The values in $1 should be divided into small ranges, unlike the ones in $2, which are integers.

Again, to be clearer, the output should look like this:

$1 ranges $2integers
[0 - 0.5] 0 #of hits for this match
[0.5 - 1] 0 #of hits for this match
[1 - 1.5] 0 #of hits for this match
[1.5 - 2 ] 0 #of hits for this match
...
...

Basically, if you consider this, you should get a 16 (range from 0 up to 8 with 0.5 step) * 5 (integers 0 to 4) line output.
Thank you again for your support!

Regards

Nominal Animal · 12-07-2011, 01:15 PM

Quote:

Originally Posted by bldcerealkiller

The idea is to get a file where all the possible combination of the variables in $1 matches the ones in $2 and in $3 the sum of all the hits of each single combination.

No, I'm sorry, I don't understand what you mean.

Could you show the exact output for your example input? No $1 or $2 or ... or [ or ], please; the exact actual output you need.

If you are looking for a generic two-dimensional histogram, you could use

Code:

awk -v xmin=4.1 -v xmax=5.1 -v xn=5 \
    -v ymin=-0.5 -v ymax=3.5 -v yn=4 '

    ($1 >= xmin && $1 < xmax && $2 >= ymin && $2 < ymax) {
        xi = int(xn * ($1 - xmin) / (xmax - xmin)) ; if (xi >= xn) xi = xn - 1
        yi = int(yn * ($2 - ymin) / (ymax - ymin)) ; if (yi >= yn) yi = yn - 1
        h[yi,xi]++
        n++
    }

    END {
        xscale = (xmax - xmin) / xn
        yscale = (ymax - ymin) / yn
        if (n > 0)
            percent = 100.0 / n
        else
            percent = 0.0
        for (yi = 0; yi < yn; yi++)
            for (xi = 0; xi < xn; xi++)
                printf("%.8g %.8g %.8g %.8g %d %.3f%%\n",
                       xmin + xscale * xi, xmin + xscale * (xi + 1),
                       ymin + yscale * yi, ymin + yscale * (yi + 1),
                       h[yi,xi], percent * h[yi,xi])
    }
' input-file > output-file

This one considers each record to be a 2D vector, x y. The -v parameters define the grid (minimum, maximum, and number of bins for each axis). The output tells how many vectors fell into each cell of the grid. The output is xn*yn lines, each containing xmin xmax ymin ymax occurrences percentage%

Note that the minimum bound is included in the region, but the maximum bound is not. This means that given min=2 max=6, a value of 6 is outside the region. (Usually you just need to extend the maximum enough for a new cell. min=2 max=7 n=5 would give you integer cells.)
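To make the boundary behaviour concrete, here is a minimal sketch using the min=2 max=7 n=5 grid from the note, on a single axis only. The test values and the `cells` variable name are made up for illustration.

```shell
# Assumed grid from the note above: min=2, max=7, n=5, so each cell is one unit wide.
# The cell index uses the same formula as the script: int(n * (value - min) / (max - min)).
cells=$(printf '%s\n' 2 3.9 6 6.999 7 | awk -v min=2 -v max=7 -v n=5 '
    {
        if ($1 >= min && $1 < max)
            printf("%s -> cell %d\n", $1, int(n * ($1 - min) / (max - min)))
        else
            printf("%s -> outside\n", $1)
    }')
echo "$cells"
# 2 -> cell 0
# 3.9 -> cell 1
# 6 -> cell 4
# 6.999 -> cell 4
# 7 -> outside
```

The value 6 lands in the last cell because the maximum was extended to 7, while 7 itself is rejected by the `$1 < max` test.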

For 2D histograms, typically only the centerpoints of the sampled regions are shown, i.e.

Code:

                printf("%.8g %.8g %d %.3f%%\n",
                       xmin + xscale * (xi + 0.5),
                       ymin + yscale * (yi + 0.5),
                       h[yi,xi], percent * h[yi,xi])

bldcerealkiller · 12-08-2011, 03:27 PM

Thanks to one of my colleagues, I've found what I was looking for.
Thank you anyway for the other examples!

Regards

Code:

awk '{c[int($1/.5),$2]++}END{for(f=0;f<=8;f+=.5)for(s=0;s<=4;s++)printf"%.1f %d %d \n",f,s,c[f/.5,s]}' inputfile