LinuxQuestions.org - Average from values of fields

Average from values of fields

Hi,

I have a file with something as follows:

192.168.136.246 10 23 1
10.128.255.158 2 4 9
192.168.134.206 5 7 1
10.128.255.158 3 7 10

and so on...

The first field can repeat. I'd like to compute the average of each field per IP address.

For example, for 10.128.255.158 the output would be:
10.128.255.158 2.5 5.5 9.5

Could you please help me write a script with awk or Perl?

Thanks in advance!

What have you got so far?
I've edited my response after I'd slightly misread the complexity.

Ok, to average we need either to store all the values for a given IP and column before we calculate the average (e.g. for 10.128.255.158's first column we store [2, 3]), or to keep a rolling average: the current average plus how many values were used to calculate that figure (e.g. [2.5, 2]).

So you need a mapping per IP address, which maps to either:

a list of lists (array of arrays, whatever you want to call it), so we can collect all the values for each 'column' entry, e.g. 10.128.255.158->[ [2, 3], [4, 7], [9, 10] ].
For each new row found with the same IP, we retrieve any existing stored info and append the new entries to each 'column' list; e.g. after reading just the first 2 rows the above would have been 10.128.255.158->[ [2], [4], [9] ].
Once all rows have been read we can calculate the averages and maybe store them in a new mapping of IP->list of averages, with one entry per column.
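
A minimal awk sketch of this first option, for illustration only. POSIX awk has no nested arrays, so this assumes a composite key (IP, column, row index) in place of a list of lists; the function name avg_collect and the array names are made up for the example, and three value columns are assumed as in the sample data.

```shell
# Hypothetical helper: keep every raw value, compute the averages only at the end.
avg_collect() {
  awk 'NF >= 4 {
         k = ++count[$1]                    # k-th row seen for this IP
         for (c = 2; c <= 4; c++)
           vals[$1, c, k] = $c              # store the raw value, not a running sum
       }
       END {
         for (ip in count) {
           line = ip
           for (c = 2; c <= 4; c++) {
             sum = 0
             for (k = 1; k <= count[ip]; k++)
               sum += vals[ip, c, k]
             line = line " " (sum / count[ip])
           }
           print line
         }
       }' "$@"
}
```

Called as avg_collect data-file (or with the data on stdin). Memory use grows with every row read, which is the drawback of this option.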

or:

your original IP mapping is to a class/data structure/simple 2-element array/list per column that holds the current average and the count of values used so far (IP->[ [a1, n1], [a2, n2], [a3, n3] ]).
Again, for each new row found with the same IP, we retrieve any existing stored info, but instead of just putting a new element on the end of a list, we recover the previous total by multiplying the 2 stored values, then calculate the new average with a count of +1, and store the new average and count ([a1, n1] and new x becomes [ (a1*n1 + x) / (n1 + 1), n1 + 1 ]).
This data structure always holds the current average per IP per column, so once we've read the last row we're done. It also uses far less memory: usage is proportional to the number of columns times the number of IP addresses, whereas the other option grows with each extra row read.
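
The rolling-average option could look roughly like this in awk. Again just a sketch: avg_running, avg and cnt are names invented for the example, three value columns are assumed, and a single count per IP is kept rather than one per column, since every column of a row is updated together.

```shell
# Hypothetical helper: keep only the current average and count per IP.
avg_running() {
  awk 'NF >= 4 {
         n = cnt[$1]                        # rows already folded in (unset counts as 0)
         for (c = 2; c <= 4; c++)
           avg[$1, c] = (avg[$1, c] * n + $c) / (n + 1)
         cnt[$1] = n + 1
       }
       END {
         for (ip in cnt)
           print ip, avg[ip, 2], avg[ip, 3], avg[ip, 4]
       }' "$@"
}
```

For the sample data, the 10.128.255.158 entries go from [2, 1] to [2.5, 2] in the first column, matching the example figures above.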

Yes awk could easily do this. What have you tried and where are you getting stuck?

Since the OP has been quiet, here is an example awk script.

Code:

awk '(NF>=4) { n[$1]++
              v1[$1]+=$2
              v2[$1]+=$3
              v3[$1]+=$4
            }
        END { for (ip in n)
                  printf("%s %f %f %f (%d)\n", ip, v1[ip]/n[ip], v2[ip]/n[ip], v3[ip]/n[ip], n[ip])
            }' data-file

The IP addresses are the keys to the arrays. n counts the number of entries per IP address. (Incrementing an unset value yields 1, and adding something to an unset value yields the value itself, as per awk rules; in other words, unset is numerically equal to zero.)
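
A tiny demonstration of that unset-is-zero behaviour, wrapped in a function (unset_demo, a name made up here) so it can be run on its own:

```shell
# Hypothetical demo: array elements that were never assigned act as 0 in arithmetic.
unset_demo() {
  awk 'BEGIN {
         count["a"]++                       # unset element incremented: becomes 1
         sum["a"] += 7                      # unset element plus 7: becomes 7
         print count["a"], sum["a"]
       }'
}
```

Running unset_demo prints "1 7".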

The END rule will be processed after all the records (lines) have been processed. ip will loop through all keys in the n array, and therefore through all IP addresses, in no particular order. Since the v1, v2 and v3 arrays accumulate the sums of the fields, dividing by the number of summands will yield the average.

I added the number of occurrences at the end of the line in parentheses for illustration.
You can also format the fields to your needs (e.g. %.3f instead of just %f).
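
If the file has more than three value columns, the same idea can be generalized. The following is only a sketch: avg_fields is a made-up name, it uses %.3f as the example format, and it assumes every row has the same number of fields (a shorter row would simply contribute 0 to the columns it lacks).

```shell
# Hypothetical helper: average every field after the first, per IP, with %.3f output.
avg_fields() {
  awk 'NF >= 2 {
         n[$1]++
         if (NF > maxnf) maxnf = NF         # widest row seen so far
         for (c = 2; c <= NF; c++)
           sum[$1, c] += $c
       }
       END {
         for (ip in n) {
           printf "%s", ip
           for (c = 2; c <= maxnf; c++)
             printf " %.3f", sum[ip, c] / n[ip]
           printf " (%d)\n", n[ip]
         }
       }' "$@"
}
```

Since the loop over the keys comes out in no particular order, piping the result through sort (avg_fields data-file | sort) gives a stable listing.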