The first column ranges from 0 to 8 and the second column from 0 to 4.
Is it possible to count and output the recurrence of the number in the second column within a defined range (let's say 0,5) of the first one?
If I understood you correctly, you are trying to build a histogram of the second field in your data, considering only records where the first field is within the desired range. (I am assuming range is inclusive, i.e. both endpoints are considered to be within the range.)
Try
Code:
awk -v min=0 -v max=5 '($1 >= min && $1 <= max) { n++ ; bin[$2]++ } END { for (i in bin) printf("%s %s %s %d %d %.3f%%\n", min, max, i, bin[i], n, 100.0*bin[i]/n) }' input-file
The output will contain lines of the format: min max value occurrences samples-in-range percentage-of-occurrences%
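As a quick check of the one-liner, here is a minimal sketch that wraps it in a shell function and runs it on a made-up six-line sample. The function name range_hist and the sample values are assumptions for illustration only, not data from this thread. Because awk visits array keys in an unspecified order, the output is piped through sort.

```shell
# Hypothetical wrapper around the one-liner; range_hist is an illustrative name.
# Usage: range_hist MIN MAX < input
range_hist() {
    awk -v min="$1" -v max="$2" '
        # Count only records whose first field lies in [min, max] (inclusive).
        ($1 >= min && $1 <= max) { n++; bin[$2]++ }
        END {
            for (i in bin)
                printf("%s %s %s %d %d %.3f%%\n", min, max, i, bin[i], n, 100.0*bin[i]/n)
        }'
}

# Made-up sample: first field in 0..8, second field an integer in 0..4.
printf '%s\n' '0.3 1' '1.7 1' '2.2 3' '4.9 1' '5.0 3' '6.4 2' |
    range_hist 0 5 | sort
```

With this sample, five records fall inside the range 0 to 5 (5.0 is included, 6.4 is not), so the two output lines should read "0 5 1 3 5 60.000%" and "0 5 3 2 5 40.000%".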
If you have multiple ranges you wish to consider at once, put the ranges in another file, and use an associative array containing both range identifier and bin identifier. I prefer to use pipe (|) as the separator. (Since awk simulates multiple dimensions by merging keys, it is simpler to construct the key explicitly yourself, and then just consider the matching keys.)
Code:
awk -v rangefile=rangefile '
    # Read the ranges (lines with exactly two fields, min max) before any input.
    BEGIN {
        ranges = 0
        while ((getline < rangefile) > 0)
            if (NF == 2 && $1 <= $2) {
                ranges++
                min[ranges] = $1
                max[ranges] = $2
            }
        close(rangefile)
    }

    # Test each record against every range; the key is "range|value".
    (NF >= 2) {
        for (r = 1; r <= ranges; r++)
            if ($1 >= min[r] && $1 <= max[r]) {
                n[r]++
                hits[r "|" $2]++
            }
    }

    END {
        for (r = 1; r <= ranges; r++) {
            split("", h)    # clear the per-range histogram
            rstr = r "|"
            rlen = length(rstr)
            # Collect only the keys that belong to range r.
            for (i in hits)
                if (substr(i, 1, rlen) == rstr)
                    h[substr(i, 1+rlen)] += hits[i]
            for (i in h)
                printf("%s %s %s %d %d %.3f%%\n", min[r], max[r], i, h[i], n[r], 100.0*h[i]/n[r])
        }
    }
' input-file
The output format is the same for both scripts. The latter script is otherwise the same, except each record is compared against each defined range, and in the END rule, a loop picks only the entries in the histogram that belong to one specific range.
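The key handling in the END rule can be tried in isolation. The following is a minimal sketch with made-up counts (the function name key_prefix_demo and all values are assumptions): it builds "range|value" keys by hand and then selects the entries for range 1 by comparing the key prefix. Including the pipe in the prefix is what keeps range 1 from also matching range 10.

```shell
# key_prefix_demo is an illustrative name; the counts are made up.
key_prefix_demo() {
    awk 'BEGIN {
        # Explicit composite keys: range index, a pipe, then the bin value.
        hits[1 "|" 0] = 4
        hits[1 "|" 3] = 2
        hits[2 "|" 3] = 7
        hits[10 "|" 0] = 9

        # Select the entries of range 1 by matching the key prefix "1|".
        rstr = 1 "|"
        rlen = length(rstr)
        for (k in hits)
            if (substr(k, 1, rlen) == rstr)
                print "range 1, value " substr(k, 1 + rlen) ": " hits[k]
    }'
}

key_prefix_demo | sort
```

Only the two entries stored under range 1 should be printed; the entries for ranges 2 and 10 are skipped because their keys do not start with "1|".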
Let me know if you would like me to describe the inner workings of the longer one statement by statement.
Last edited by Nominal Animal; 12-07-2011 at 10:31 AM.
Thank you for your reply.
I'm not at my computer at the moment, so I cannot test the script, but from what I can see, the output I need is different from what your script produces.
The idea is to get a file in which every possible combination of the values in $1 is matched with those in $2, with $3 holding the sum of the hits for each combination.
One problem I have is how to define the ranges for the values in $1. The values in $1 need to be grouped into small ranges, unlike the values in $2, which are integers.
To be clearer, the output should look like this:
$1 ranges $2 integers
[0 - 0.5] 0 #of hits for this match
[0.5 - 1] 0 #of hits for this match
[1 - 1.5] 0 #of hits for this match
[1.5 - 2 ] 0 #of hits for this match
...
...
Basically, this should give an output of 16 (range from 0 to 8 in steps of 0.5) * 4 (integers 0 to 4) lines.
Thank you again for your support!
The idea is to get a file in which every possible combination of the values in $1 is matched with those in $2, with $3 holding the sum of the hits for each combination.
No, I'm sorry, I don't understand what you mean.
Could you show the exact output for your example input? No $1 or $2 or ... or [ or ], please; the exact actual output you need.
If you are looking for a generic two-dimensional histogram, you could use
This one considers each record to be a 2D vector, x y. The bold parameters define the grid (minimum, maximum, and number of bins). The output tells how many vectors pointed to each cell in the grid. The output format is xn*yn lines containing xmin xmax ymin ymax occurrences percentage%
Note that the minimum bound is included in the region, but the maximum bound is not. This means that given min=2 max=6, a value of 6 is outside the region. (Usually you just need to extend the maximum enough for a new cell. min=2 max=7 n=5 would give you integer cells.)
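A minimal sketch of such a two-dimensional histogram, following the description above, might look like this. It is not the original script: the function name hist2d, the six parameter names, and the choice to compute the percentage relative to the samples that fall inside the grid are all assumptions made for illustration.

```shell
# hist2d is an illustrative name; all six parameters are assumptions.
# Usage: hist2d XMIN XMAX XN YMIN YMAX YN < input
hist2d() {
    awk -v xmin="$1" -v xmax="$2" -v xn="$3" \
        -v ymin="$4" -v ymax="$5" -v yn="$6" '
        BEGIN { dx = (xmax - xmin) / xn; dy = (ymax - ymin) / yn }
        # Half-open cells: the minimum bound is inside, the maximum is not.
        (NF >= 2 && $1 >= xmin && $1 < xmax && $2 >= ymin && $2 < ymax) {
            ix = int(($1 - xmin) / dx)
            iy = int(($2 - ymin) / dy)
            # Guard against floating-point rounding near the upper edge.
            if (ix >= xn) ix = xn - 1
            if (iy >= yn) iy = yn - 1
            count[ix "|" iy]++
            total++
        }
        END {
            # One line per cell, including empty cells: xn*yn lines in total.
            for (ix = 0; ix < xn; ix++)
                for (iy = 0; iy < yn; iy++) {
                    c = count[ix "|" iy] + 0
                    pct = (total > 0) ? 100.0 * c / total : 0
                    printf("%g %g %g %g %d %.3f%%\n",
                           xmin + ix*dx, xmin + (ix+1)*dx,
                           ymin + iy*dy, ymin + (iy+1)*dy, c, pct)
                }
        }'
}

# Made-up sample on a 2x2 grid; the last record has x equal to xmax and is excluded.
printf '%s\n' '0.25 0' '0.75 0' '0.5 1' '1.0 1' | hist2d 0 1 2 0 2 2
```

For the data described earlier in the thread (first column from 0 to 8 in steps of 0.5, second column the integers 0 to 4), a call such as hist2d 0 8 16 0 5 5 < input-file would produce 16 * 5 = 80 lines, since the integers 0 to 4 are five values. A first-column value of exactly 8 falls outside this half-open grid unless the maximum is extended.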
For 2D histograms, typically only the centerpoints of the sampled regions are shown, i.e.
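As an illustration of the centerpoint convention only, the following sketch converts lines in the six-column format described above (xmin xmax ymin ymax occurrences percentage%) into "xcenter ycenter occurrences". The function name to_centers, the three-column output layout, and the two sample lines are assumptions, not something taken from the thread.

```shell
# to_centers is an illustrative name. It reads lines of the form
#   xmin xmax ymin ymax occurrences percentage%
# and prints: xcenter ycenter occurrences
to_centers() {
    awk '(NF >= 5) { printf("%g %g %d\n", ($1 + $2) / 2, ($3 + $4) / 2, $5) }'
}

# Made-up input lines in the six-column format.
printf '%s\n' '0 0.5 0 1 1 33.333%' '0.5 1 1 2 2 66.667%' | to_centers
```

Each cell is then identified by the midpoint of its bounds, which is usually the more convenient form for plotting tools.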