LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

08-13-2014, 10:15 AM   #1
papori
LQ Newbie
 
Registered: Feb 2011
Posts: 23

Rep: Reputation: 0
standard deviation R vs. bash


Hi,
I am trying to calculate the standard deviation of a vector.
I wrote a short R script that reads a vector and calculates the sd.
Code:
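#!/usr/bin/env Rscript
# Read a file of numbers (path given as first argument) and print its sd.
args <- commandArgs(TRUE)
openfile <- args[1]
md <- read.table(openfile)

x <- as.numeric(unlist(md))
sd(x)  # sd() uses denominator n - 1, i.e. the sample standard deviation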
I run it from the terminal like this:
./sd.script.R vec

The example vector is this:
1404208
1470129
1384566
1572675
1450707
1410318
1458955
1462355
1469413
1467187

The output is this:
51702.08

I get the same value when I use the STDEV function in Excel.
I know that STDEV in Excel is based on a sample.
But I didn't find anything about sampling in the R documentation for sd (http://stat.ethz.ch/R-manual/R-patch...s/html/sd.html)

Anyway, I did a lot of analysis using the R function, and now I want to change my script to work only with bash operations but give the same results...

I found:
Code:
awk '{sum+=$1; sumsq+=$1*$1}END{print sqrt(sumsq/NR - (sum/NR)**2)}' vec
and
Code:
awk '{sum+=$1; array[NR]=$1} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/NR)}' vec
and many others..
But they all calculate the sd differently from R, and their output is:
49048.9

I guess the result is different because these do not treat the data as a sample the way R does..
And this is a big difference for the analysis..

Any idea how to get the same results?

BTW, the vector I showed as an example is a smaller version of what I actually have. My vectors have 600+ values.

Thanks!
 
08-13-2014, 12:02 PM   #2
weibullguy
ReliaFree Maintainer
 
Registered: Aug 2004
Location: Kalamazoo, Michigan
Distribution: Slackware-current, Cross Linux from Scratch, Gentoo
Posts: 2,812
Blog Entries: 1

Rep: Reputation: 259
Unless you have every observation that will ever be made of the population, the sample standard deviation is being estimated whether you use Excel, R, or Bash.

The unbiased estimate of the sample variance for i.i.d. data is:

1 / (N - 1) * sum((x_i - x_bar)^2)

where N is the sample size
x_i is the ith observation in the sample
x_bar is the mean of the sample
the summation is over [1, N]

Taking the square root of this gives you the corrected estimate of the sample standard deviation.
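
A minimal awk sketch of the formula above, assuming the input has one number per line. The function name sample_sd is made up for illustration and is not from the thread. It stores the values on the first pass, then computes the mean and the squared deviations in the END block, dividing by N - 1.

```shell
# Two-pass sample standard deviation (divides by N - 1).
# sample_sd is a hypothetical helper; it reads a file argument or stdin.
sample_sd() {
    awk '{ sum += $1; v[NR] = $1 }
         END {
             # The N - 1 divisor is undefined for fewer than two values.
             if (NR < 2) { print "need at least 2 values" > "/dev/stderr"; exit 1 }
             mean = sum / NR
             for (i = 1; i <= NR; i++) ss += (v[i] - mean) ^ 2
             printf "%.2f\n", sqrt(ss / (NR - 1))
         }' "$@"
}
```

With the ten example values from post #1 saved in a file called vec, `sample_sd vec` prints 51702.08, the same value R reports.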

In both of the Bash examples you provide, the biased variance is being estimated. Taking the square root of the biased estimate of the variance results in the uncorrected sample standard deviation. To get the unbiased variance, you need to change the divisor of the sum of squared deviations from NR to (NR - 1); the mean is still sum/NR.
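
One caveat for the first one-liner in post #1: replacing every NR with (NR - 1) in sumsq/NR - (sum/NR)**2 would also change the mean, which is not what the formula asks for. A hedged sketch of a one-pass version that keeps the mean intact follows. The function name sample_sd_onepass is an assumption for illustration, and `^` is used instead of `**` because it is the exponent operator POSIX awk defines.

```shell
# One-pass sample standard deviation:
#   variance = (sumsq - sum^2 / N) / (N - 1)
# sample_sd_onepass is a hypothetical helper; it reads a file argument or stdin.
sample_sd_onepass() {
    awk '{ sum += $1; sumsq += $1 * $1 }
         END {
             # Print nothing when there are fewer than two values.
             if (NR > 1)
                 printf "%.2f\n", sqrt((sumsq - sum ^ 2 / NR) / (NR - 1))
         }' "$@"
}
```

On the example vector this also prints 51702.08. The one-pass form subtracts two large, nearly equal numbers, so for data with a very large mean and a small spread the two-pass form is the safer choice.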

Last edited by weibullguy; 08-13-2014 at 12:07 PM.
 
08-13-2014, 12:17 PM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,256

Rep: Reputation: 2686
As a side note, no bash was used in either example ... awk is its own language and command and does not require bash to produce any output
 
08-13-2014, 12:27 PM   #4
papori
LQ Newbie
 
Registered: Feb 2011
Posts: 23

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by grail
As a side note, no bash was used in either example ... awk is its own language and command and does not require bash to produce any output
Thanks for the comment!

---------- Post added 08-13-14 at 11:28 AM ----------

Quote:
Originally Posted by weibullguy
Unless you have every observation that will ever be made of the population, the sample standard deviation is being estimated whether you use Excel, R, or Bash.

The unbiased estimate of the sample variance for i.i.d. data is:

1 / (N - 1) * sum((x_i - x_bar)^2)

where N is the sample size
x_i is the ith observation in the sample
x_bar is the mean of the sample
the summation is over [1, N]

Taking the square root of this gives you the corrected estimate of the sample standard deviation.

In both of the Bash examples you provide, the biased variance is being estimated. Taking the square root of the biased estimate of the variance results in the uncorrected sample standard deviation. To get the unbiased variance, you need to change the divisor of the sum of squared deviations from NR to (NR - 1); the mean is still sum/NR.
Thanks for the explanation!!
Solved my problem!
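
On how large the difference is: the two estimates differ only by the factor sqrt(N / (N - 1)), so a result computed with the NR divisor can be rescaled rather than recomputed. The sketch below assumes you already have the uncorrected value and the number of observations. The function name pop_to_sample_sd is hypothetical.

```shell
# Rescale an uncorrected SD (divisor N) to the sample SD (divisor N - 1).
# pop_to_sample_sd is a hypothetical helper: $1 = uncorrected SD, $2 = N.
pop_to_sample_sd() {
    awk -v sd="$1" -v n="$2" 'BEGIN {
        # Same N - 1 restriction as the sample SD itself.
        if (n < 2) { print "N must be at least 2" > "/dev/stderr"; exit 1 }
        printf "%.2f\n", sd * sqrt(n / (n - 1))
    }'
}
```

For the ten-value example, `pop_to_sample_sd 49048.9 10` prints 51702.08. With N = 10 the factor is about 1.054, while for vectors of 600 values it is about 1.0008, so the gap between the two estimates shrinks to under 0.1 percent.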
 
  

