[SOLVED] standard deviation R vs. bash

papori · 08-13-2014, 09:15 AM

Hi,
I am trying to calculate the standard deviation of a vector.
I wrote a short R script that reads a vector and calculates the sd.

Code:

#!/usr/bin/env Rscript
args<-commandArgs(TRUE)
openfile <- args[1]
md=read.table(openfile)

x=as.numeric(unlist(md))
sd(x)

I am executing it from the terminal like this:
./sd.script.R vec

The example vector is this:
1404208
1470129
1384566
1572675
1450707
1410318
1458955
1462355
1469413
1467187

The output is this:
51702.08

I get the same result when I use the STDEV function in Excel.
I know that STDEV in Excel is based on a sample,
but I didn't find anything about sampling in the description of the R function (http://stat.ethz.ch/R-manual/R-patch...s/html/sd.html).

Anyway, I did a lot of analysis using the R function, and now I want to change my script to work only with bash operations but give the same results...

I found:

Code:

awk '{sum+=$1; sumsq+=$1*$1}END{print sqrt(sumsq/NR - (sum/NR)**2)}' vec

and

Code:

awk '{sum+=$1; array[NR]=$1} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/NR)}' vec

and many others.
But they all calculate the sd differently from R, and their output is:
49048.9

I guess the result is different because they are not sampling the data as R does.
And this is a big difference for the analysis.

Any idea how to get the same results?

BTW, the vector I showed as an example is a smaller version of what I actually have. My vectors have 600+ values.

Thanks!

weibullguy · 08-13-2014, 11:02 AM

Unless you have every observation that will ever be made of the population, you are estimating the standard deviation from a sample whether you use Excel, R, or Bash.

The unbiased estimate of the variance for i.i.d. data is:

1 / (N - 1) * sum((x_i - x_bar)^2)

where N is the sample size
x_i is the ith observation in the sample
x_bar is the mean of the sample
the summation is over [1, N]

Taking the square root of this gives you the corrected estimate of the sample standard deviation.
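
As a rough sketch of how much the N - 1 correction matters: dividing by N instead of N - 1 gives the uncorrected figure, and the two differ only by the factor sqrt(N / (N - 1)). The helper name bessel_factor is made up for illustration; 49048.9 is the uncorrected value reported earlier in the thread for the ten-value example.

```shell
# Hypothetical helper: prints sqrt(N / (N - 1)) for a sample of size N (N > 1).
bessel_factor() {
    awk -v n="$1" 'BEGIN { printf "%.6f\n", sqrt(n / (n - 1)) }'
}

bessel_factor 10     # ten values, as in the example vector
bessel_factor 600    # roughly the size of the real vectors

# Rescale the uncorrected result for the ten-value example.
awk 'BEGIN { printf "%.2f\n", 49048.9 * sqrt(10 / 9) }'
```

For the ten values the factor is about 1.054, which turns 49048.9 into 51702.08. For 600 values it is about 1.0008, so the two estimates nearly coincide on vectors of that size.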

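In both of the Bash examples you provide, the biased variance is being estimated. Taking the square root of the biased estimate of the variance results in the uncorrected sample standard deviation. To get the unbiased variance, divide the sum of squared deviations by (NR - 1) instead of NR. In your second example that means changing only the final sumsq/NR to sumsq/(NR - 1); the mean is still sum/NR. In your first example, swapping NR for (NR - 1) is not enough, because that shortcut formula only holds with NR; it becomes sqrt((sumsq - sum*sum/NR)/(NR - 1)).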
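
A minimal sketch of both awk one-liners with the N - 1 divisor, wrapped in shell functions so they can be compared side by side. The function names sd_onepass and sd_twopass are made up for illustration. Note that in the running-sums version it is not enough to swap NR for (NR - 1) inside the original expression: the sum of squared deviations has to be formed as sumsq - sum*sum/NR first and then divided by NR - 1.

```shell
# Hypothetical helper: one pass, keeping a running sum and sum of squares.
sd_onepass() {
    awk '{ sum += $1; sumsq += $1 * $1 }
         END { if (NR > 1) printf "%.2f\n", sqrt((sumsq - sum * sum / NR) / (NR - 1)) }' "$@"
}

# Hypothetical helper: store the values, compute the mean, then sum the squared deviations.
sd_twopass() {
    awk '{ sum += $1; val[NR] = $1 }
         END {
             if (NR > 1) {
                 mean = sum / NR
                 for (i = 1; i <= NR; i++) ss += (val[i] - mean) ^ 2
                 printf "%.2f\n", sqrt(ss / (NR - 1))
             }
         }' "$@"
}

# The ten values from the example vector; both calls print 51702.08.
values='1404208 1470129 1384566 1572675 1450707 1410318 1458955 1462355 1469413 1467187'
printf '%s\n' $values | sd_onepass
printf '%s\n' $values | sd_twopass
```

Either function also accepts a file name, e.g. sd_twopass vec. The one-pass form subtracts two large numbers and can lose precision when the values are large relative to their spread, so the two-pass form is the safer default.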

grail · 08-13-2014, 11:17 AM

As a side note, no bash was used in either example ... awk is its own language and command and does not require bash to produce any output

papori · 08-13-2014, 11:27 AM

Thanks for the comment!

---------- Post added 08-13-14 at 11:28 AM ----------

Thanks for the explanation!!
Solved my problem!