LinuxQuestions.org
Old 12-05-2017, 03:25 PM   #1
atjurhs
Member
 
Registered: Aug 2012
Posts: 220

Rep: Reputation: Disabled
need help binning data for statistics with awk


hi,

i'm trying to use awk to take a single column of data in a file (let's say the file has 100 lines), bin it (let's say create a new bin every 20 lines, so there would be 5 bins), and then do math, say take the average over the 20 numbers in each of the 5 bins.

i know how to do the math if there is only 1 bin with the 100 data points at, say, machine precision

Code:
awk -v N=1 '{sum+=$N} END {if (NR>0) printf "%.17g", sum/NR}' data_file.dat
i'm thinking that i should write this as a bash script that first breaks the data into bins using a while loop and then does the math, so in pseudo-code something like

Code:
#!/bin/bash

while read data_file.dat
do
counter1 = 1
counter2 = 1
while[counter1 = 1 to 5]
  do
    while[counter2 = 1 to 20]
       do 
         awk -v N=1 '{sum+=$N} END {if (NR>0) printf "%.17g", sum/NR}'
       counter2 ++
     done
   counter1 ++
done
done
of course it doesn't work but you get the idea.
ps. i'm doing this because a friend who is a python coder can't get it right. should be pretty simple

thanks for your thoughts!

Todd

Last edited by atjurhs; 12-05-2017 at 04:42 PM.
 
Old 12-05-2017, 07:36 PM   #2
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 5,027

Rep: Reputation: 1761
Perhaps something like
Code:
awk 'BEGIN {binsize=20;n=0;sum=0}; \
    {n++;sum+=$1;if (n%binsize==0) {print n,sum,sum/n;sum=0;n=0}}; \
    END {if (n>0) {print n,sum,sum/n} }' \
    <(seq 1 100)
I have tried to cover the case where the number of lines in the input file is not an exact multiple of the bin size.
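As a quick sanity check of the one-liner above, here is a minimal sketch of the same logic fed through a pipe (so it does not rely on bash process substitution), together with the output it should give for the numbers 1 to 100. The variable name `bins` is just for illustration; `binsize` is passed with `-v` instead of being set in `BEGIN`.

```shell
# Same binning logic as above: count lines, accumulate column 1,
# report and reset whenever a bin of "binsize" lines is full.
bins=$(seq 1 100 | awk -v binsize=20 '
    { n++; sum += $1 }
    n == binsize { print n, sum, sum / n; sum = 0; n = 0 }   # full bin: report and reset
    END { if (n > 0) print n, sum, sum / n }                 # leftover partial bin, if any
')
echo "$bins"
# 20 210 10.5
# 20 610 30.5
# 20 1010 50.5
# 20 1410 70.5
# 20 1810 90.5
```

Each output line is count, sum, mean for one bin. With 100 input lines and a bin size of 20 the `END` branch prints nothing, because no partial bin is left over.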

Last edited by allend; 12-05-2017 at 07:40 PM.
 
Old 12-05-2017, 11:47 PM   #3
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,482

Rep: Reputation: 997
You can use bash and awk. You just call the awk script from the bash script.
 
Old 12-06-2017, 02:11 AM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,627

Rep: Reputation: 2943
You can also use some of the builtin variables:
Code:
awk -vbin=20 '{sum+=$1}!(NR%bin){print bin,sum,sum/bin;sum=0}END{if(NR%bin)print NR%bin,sum,sum/(NR%bin)}' <(seq 1 105)
If you are using multiple files as input, change the NR to FNR and END to ENDFILE (ENDFILE is specific to gawk).
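For awks without ENDFILE, a possible portable variant is sketched below. This is not grail's script: it keeps its own line counter `n` and flushes the leftover partial bin whenever `FNR == 1` signals that a new file has started, resetting `sum` so nothing leaks into the next file. The function name `bin_means` and the file names in the usage line are hypothetical.

```shell
# bin_means FILE...: per-file bin averages of column 1, POSIX awk only.
# The bin size is fixed at 20 here to match the thread.
bin_means() {
    awk -v bin=20 '
        # A new file is starting: flush the partial bin left over from the previous file.
        FNR == 1 && n > 0 { print n, sum, sum / n; sum = 0; n = 0 }
        { sum += $1; n++ }
        # Full bin: report count, sum, mean and start over.
        n == bin { print n, sum, sum / n; sum = 0; n = 0 }
        # Partial bin of the last file.
        END { if (n > 0) print n, sum, sum / n }
    ' "$@"
}
```

Usage would look like `bin_means first.dat second.dat`. Bins never span two files, which is the behaviour FNR plus ENDFILE is meant to give in gawk.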
 
1 members found this post helpful.
Old 12-06-2017, 09:08 AM   #5
atjurhs
Member
 
Registered: Aug 2012
Posts: 220

Original Poster
Rep: Reputation: Disabled
yep yep, next i plan to use other awk built-in functions...
 
Old 12-06-2017, 11:48 AM   #6
atjurhs
Member
 
Registered: Aug 2012
Posts: 220

Original Poster
Rep: Reputation: Disabled
not to look a gift horse in the mouth, but...

grail, for the 1st 20 data points in the 1st bin your awk script does not yield the same answer that my awk script gets for those 1st 20 data points. i also ran my awk script on the next set of 20 data points in the 2nd bin and it was different too. btw, my awk script gives the same answer that excel gives for the 1st bin and 2nd bin of 20 data points. my awk script just doesn't iterate over 5 bins. i had to do that by hand. eventually i plan to run a script on 100000 data points over 100 bins, so i can't really do that by hand.

allend, your script "almost" gives the same answer as mine and excel for each of the bins of 20 data points. the precision is only out to four decimal places. so i tried writing the print statement as i wrote in mine

Code:
 printf "%.17g"
now your script gave the correct precision, but it didn't do the division of the answer by n (in this example n=20). when i do the division by 20 by hand i get the right answer. so somehow i need to include the 20. i also tried setting n in the
Code:
sum/n
term equal to 20 and "force" the division but that also didn't work?
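
A note on the two symptoms described above, offered as a likely explanation rather than a diagnosis, since the exact modified command was not posted. awk's `print` formats non-integer numbers with OFMT, which defaults to `%.6g`, so means come out with only six significant digits. `printf` prints one value per format specifier and ignores extra arguments, so a lone `"%.17g"` followed by several values shows only the first one. A minimal sketch with made-up three-line input (the variable names `short` and `full` are illustrative):

```shell
# print uses OFMT (default "%.6g"), so the mean 4/3 is cut to six significant digits.
short=$(printf '1\n1\n2\n' | awk '{ sum += $1 } END { print sum / NR }')

# printf: one specifier per value, plus an explicit newline at the end.
full=$(printf '1\n1\n2\n' | awk '{ sum += $1 } END { printf "%d %.17g %.17g\n", NR, sum, sum / NR }')

echo "$short"   # 1.33333
echo "$full"    # count, sum, mean; the mean now carries 17 significant digits
```

In the binning one-liners the same change would mean replacing each `print n,sum,sum/n` with `printf "%d %.17g %.17g\n", n, sum, sum/n`.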

Last edited by atjurhs; 12-06-2017 at 11:51 AM.
 
Old 12-06-2017, 03:09 PM   #7
atjurhs
Member
 
Registered: Aug 2012
Posts: 220

Original Poster
Rep: Reputation: Disabled
i've "shamed" my python buddy into figuring it out, thanks for your guys' help!

Todd
 
Old 12-06-2017, 09:34 PM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,627

Rep: Reputation: 2943
Just checking, but you did alter the script to look at the 5th column and not the 1st one as ours does because we only have a single column of data?
My script was producing the same results as allend based on the sequence option.

Are you able to post a sample of the data, perhaps 25 - 30 entries? (of course obscure anything sensitive)
 
  

