need help binning data for statistics with awk
hi,
i'm trying to use awk to take a single column of data in a file (let's say the file has 100 lines), bin it (let's say create a new bin every 20 lines, so there would be 5 bins), and then do math, say take the average over the 20 numbers in each of the 5 bins. i know how to do the math if there is only 1 bin with all 100 data points, at say machine precision:
Code:
awk -v N=1 '{sum+=$N} END {if (NR>0) printf "%.17g", sum/NR}' data_file.dat
Code:
#!/bin/bash
ps. i'm doing this because a friend who is a python coder can't get it right. should be pretty simple.
thanks for your thoughts!
Todd |
Perhaps something like
Code:
awk 'BEGIN {binsize=20;n=0;sum=0}; \
|
You can use bash and awk. You just call the awk script from the bash script.
|
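As a rough illustration of calling awk from a bash script, here is a minimal sketch. It is not the script posted earlier in the thread; the function name `bin_average` and its arguments are made up for this example. It assumes the values are in column 1 and keeps an explicit counter `n` for the current bin.

```shell
#!/bin/bash
# Hypothetical helper: bin_average FILE [BINSIZE]
# Prints the average of column 1 for every BINSIZE lines of FILE
# (use "-" as FILE to read standard input).
bin_average() {
    awk -v binsize="${2:-20}" '
        { sum += $1; n++ }                  # accumulate the current bin
        n == binsize {                      # bin is full: report it, then reset
            printf "%.17g\n", sum / n
            sum = 0; n = 0
        }
        END { if (n > 0) printf "%.17g\n", sum / n }   # leftover partial bin
    ' "$1"
}

# Example: 100 generated values (1..100) in five bins of 20
tmp=$(mktemp)
seq 1 100 > "$tmp"
bin_average "$tmp" 20
rm -f "$tmp"
```

The bin size is passed in with `-v` so the awk program itself never needs editing; for 100000 points in 100 bins the call would be `bin_average data_file.dat 1000`.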
You can also use some of the builtin variables:
Code:
awk -vbin=20 '{sum+=$1}!(NR%bin){print bin,sum,sum/bin;sum=0}END{if(NR%bin)print NR%bin,sum,sum/(NR%bin)}' <(seq 1 105)
|
yep yep, next i plan to use other awk built-in functions...
|
not to look a gift horse in the mouth, but...
grail, for the 1st 20 data points in the 1st bin your awk script does not yield the same answer that my awk script gets for those 1st 20 data points. i also ran my awk script on the next set of 20 data points in the 2nd bin and it was different too. btw, my awk script gives the same answer that excel gives for the 1st bin and 2nd bin of 20 data points. my awk script just doesn't iterate over 5 bins. i had to do that by hand. eventually i plan to run a script on 100000 data points over 100 bins, so i can't really do that by hand.
allend, your script does "almost" give the same answer as mine and excel for each of the bins of 20 data points. the precision is only out to four decimal places. so i tried writing the print statement as i wrote in mine
Code:
printf "%.17g"
Code:
sum/n
|
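On the precision point: awk's `print` formats non-integer numbers with the built-in `OFMT` variable, which defaults to `"%.6g"`, so a plain `print sum/n` shows only about six significant digits. `printf` uses whatever format you give it, but it needs a comma between the format string and the value, and an explicit `\n`. A small sketch of the difference (the function name `show_precision` is just for this example):

```shell
#!/bin/bash
# Hypothetical demo: the same value printed with print (OFMT) and with printf.
show_precision() {
    awk 'BEGIN {
        x = 1 / 3
        print x                  # uses OFMT, default "%.6g" -> 0.333333
        printf "%.17g\n", x      # 17 significant digits; note the comma and \n
    }'
}

show_precision
```

So in a binning script the print line would become something like `printf "%.17g\n", sum/n`. Setting `OFMT="%.17g"` in a `BEGIN` block should also work if you would rather keep using `print`.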
i've "shamed" my python buddy into figuring it out, thanks for your guys' help!
Todd |
Just checking, but did you alter the script to look at the 5th column and not the 1st one, as ours does? We only have a single column of data.
My script was producing the same results as allend based on the sequence option. Are you able to post a sample of the data, perhaps 25 - 30 entries? (of course obscure anything sensitive) |
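If the column really is the difference, one way to avoid editing the script each time is to pass the column number in as a variable. This is only a sketch; `bin_average_col` and the sample input are invented for illustration, and it assumes whitespace-separated fields.

```shell
#!/bin/bash
# Hypothetical helper: bin_average_col COLUMN BINSIZE
# Reads data from standard input and averages the chosen column per bin.
bin_average_col() {
    awk -v col="$1" -v bin="$2" '
        { sum += $col }                                       # $col picks the field by number
        NR % bin == 0 { printf "%.17g\n", sum / bin; sum = 0 }
        END { if (NR % bin) printf "%.17g\n", sum / (NR % bin) }   # leftover partial bin
    '
}

# Example: five columns where only the 5th holds the values 1..40
seq 1 40 | awk '{ print "a", "b", "c", "d", $1 }' | bin_average_col 5 20
```

With the same data in column 1 instead, `bin_average_col 1 20` should give identical averages, which makes it easy to check whether a mismatch comes from the column choice.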