LinuxQuestions.org - [SOLVED] need help binning data for statistics with awk

hi,

i'm trying to use awk to take a single column of data in a file (let's say the file has 100 lines), bin it (let's say create a new bin every 20 lines, so there would be 5 bins), and then do some math, say take the average over the 20 numbers in each of the 5 bins.

i know how to do the math if there is only 1 bin holding all 100 data points, printed at, say, machine precision

Code:

awk -v N=1 '{sum+=$N} END {if (NR>0) printf "%.17g", sum/NR}' data_file.dat

i'm thinking that i should write this as a bash script that first breaks the data into bins using a while loop and then does the math, so in pseudo-code something like

Code:

#!/bin/bash

while read data_file.dat
do
  counter1 = 1
  counter2 = 1
  while [counter1 = 1 to 5]
  do
    while [counter2 = 1 to 20]
    do
      awk -v N=1 '{sum+=$N} END {if (NR>0) printf "%.17g", sum/NR}'
      counter2 ++
    done
    counter1 ++
  done
done

of course it doesn't work but you get the idea.
ps. i'm doing this because a friend who is a python coder can't get it right. should be pretty simple

thanks for your thoughts!

Todd

Perhaps something like

Code:

awk 'BEGIN {binsize=20; n=0; sum=0}
    {n++; sum+=$1; if (n%binsize==0) {print n, sum, sum/n; sum=0; n=0}}
    END {if (n>0) {print n, sum, sum/n}}' \
    <(seq 1 100)

I have tried to cover the case where the number of lines in the input file is not an exact multiple of the bin size.

You can use bash and awk. You just call the awk script from the bash script.

You can also use some of the builtin variables:

Code:

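awk -v bin=20 '{sum+=$1} !(NR%bin) {print bin,sum,sum/bin; sum=0} END {if (NR%bin) print NR%bin,sum,sum/(NR%bin)}' <(seq 1 105)

If you are using multiple files as input, change NR to FNR and END to ENDFILE (ENDFILE is a gawk extension), and reset sum to 0 after printing the partial bin so it does not leak into the next file.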
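
As a rough sketch of the multiple-file case: since ENDFILE only exists in gawk, the version below instead detects the start of a new file with FNR == 1, which should work in any POSIX awk. The function name file_bin_means and the helper emit are made up for illustration, and the data is assumed to be in column 1.

```shell
# file_bin_means is a hypothetical helper name, not part of awk or bash.
# Usage: file_bin_means BINSIZE FILE...
file_bin_means() {
    bin=$1
    shift
    awk -v bin="$bin" '
        # Print file name, count, sum and mean for the current bin, then reset it
        function emit(name) {
            if (n > 0) print name, n, sum, sum / n
            n = 0; sum = 0
        }
        # First line of a later file: print the leftover bin of the previous file
        FNR == 1 && NR > 1 { emit(prev) }
        { prev = FILENAME; n++; sum += $1 }
        # Bin is full
        n == bin { emit(FILENAME) }
        # Leftover bin of the last file
        END { emit(prev) }
    ' "$@"
}
```

Called as file_bin_means 20 a.dat b.dat, each output line starts with the file the bin came from, so a partial bin at the end of a.dat is never mixed with the first lines of b.dat.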

yep yep, next i plan to use other awk built-in functions...

not to look a gift horse in the mouth, but...

grail, for the 1st 20 data points in the 1st bin your awk script does not yield the same answer that my awk script gets for those 1st 20 data points. i also ran my awk script on the next set of 20 data points in the 2nd bin and it was different too. btw, my awk script gives the same answer that excel gives for the 1st and 2nd bins of 20 data points. my awk script just doesn't iterate over 5 bins; i had to do that by hand. eventually i plan to run a script on 100000 data points over 100 bins, so i can't really do that by hand.

allend, your script gives "almost" the same answer as mine and excel for each of the bins of 20 data points. the precision is only out to four decimal places. so i tried writing the print statement as i wrote in mine

Code:

printf "%.17g"

now your script gave the correct precision, but it didn't do the division of the answer by n (in this example n=20). when i do the division by 20 by hand i get the right answer. so somehow i need to include the 20. i also tried setting n in the

Code:

sum/n

term equal to 20 and "force" the division but that also didn't work?

i've "shamed" my python buddy into figuring it out, thanks for your guys' help!

Todd
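
On the precision problem above: the modified command was not posted, so the exact cause is unknown, but unlike print, printf needs one conversion in the format string for every value, commas between the format and the values, and an explicit newline. A minimal sketch along those lines, assuming the data is in column 1 (bin_means_precise is a made-up name):

```shell
# bin_means_precise is a hypothetical helper name.
# Usage: bin_means_precise BINSIZE < data_file.dat
bin_means_precise() {
    awk -v bin="$1" '
        { n++; sum += $1 }
        # Full bin: one conversion per value, newline inside the format string
        n == bin { printf "%d %.17g %.17g\n", n, sum, sum / n; n = 0; sum = 0 }
        # Leftover lines that did not fill a whole bin
        END { if (n > 0) printf "%d %.17g %.17g\n", n, sum, sum / n }
    '
}
```

Each output line holds the count, the sum and the mean of one bin. For the planned 100000 data points in 100 bins the bin size would be 1000, i.e. bin_means_precise 1000 < data_file.dat.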

Just checking, but you did alter the script to look at the 5th column and not the 1st one as ours does because we only have a single column of data?
My script was producing the same results as allend based on the sequence option.

Are you able to post a sample of the data, perhaps 25 - 30 entries? (of course obscure anything sensitive)
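
One way to narrow down a mismatch like this, sketched under the assumption that the data file has the column of interest at a known position: feed only the first bin's worth of lines to the single-bin command from the first post, and compare that with the first bin from a binned run on the same column. check_first_bin is a made-up name.

```shell
# check_first_bin is a hypothetical helper name.
# Usage: check_first_bin COLUMN BINSIZE FILE
check_first_bin() {
    col=$1; bin=$2; file=$3
    # Reference: single-bin command from the first post, given only the first BINSIZE lines
    ref=$(head -n "$bin" "$file" | awk -v N="$col" '{sum+=$N} END {if (NR>0) printf "%.17g", sum/NR}')
    # Binned run on the same column, stopping after the first full bin
    got=$(awk -v col="$col" -v bin="$bin" '{n++; sum+=$col} n==bin {printf "%.17g\n", sum/n; exit}' "$file")
    echo "reference=$ref binned=$got"
    # Exit status 0 only when both means print identically
    [ "$ref" = "$got" ]
}
```

If the two values differ when COLUMN is 1 but agree when it is 5, the binned scripts were simply reading the wrong column.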