statistical analysis for selected lines of data using awk

vjramana · 08-17-2011, 11:58 PM

I have a file containing 10,000 rows of data in three columns, and I use awk to calculate the average and standard deviation of all 10,000 values in the third column. That part works without problems.

The code is as below:

Quote:

#!/usr/bin/awk -f

{sum+=$3; array[NR]=$3}
END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}
print sum/NR, " ", sqrt(sumsq/NR)," ",2*(sqrt(sumsq/NR))}

But now I want to calculate the same statistics on selected rows only: lines 6001 to 10000 of the third column. I have tried hard to find a way to do this, without success.
I have tried the code below:

Quote:

#!/usr/bin/awk -f

{array[NR]=$3}
#{print}

END { for(x=6001;x<=NR;x++)
{ print array[x] }
{sum += array[x]}
{print sum}

}

#{sumsq+=((array[x]-(sum/4000))**2);}
#print sum/4000, " ", sqrt(sumsq/4000)," ",2*(sqrt(sumsq/4000))

Can anyone suggest how to alter this code so that I can do the calculation over lines 6001 to 10000 only?
Thank you.

trist007 · 08-18-2011, 12:55 AM

Code:

NR > 6000 && NR <= 10000

NR = the number of the current record (by default, the line number)
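A minimal sketch of how such a pattern can select the rows before anything is summed. The function name `range_mean` is made up for illustration, and the line range is the one from the original question:

```shell
# Hypothetical helper: range_mean FILE
# Prints the mean of column 3 over lines 6001..10000 of FILE.
range_mean() {
    awk 'NR > 6000 && NR <= 10000 { sum += $3; n++ }   # runs only for the selected lines
         END { if (n > 0) print sum / n }' "$1"        # n is the number of lines actually used
}
```

It would be called as `range_mean data.txt`, where `data.txt` stands in for the real data file.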

David the H. · 08-18-2011, 01:12 AM

Please use [code][/code] tags around your code, to preserve formatting and to improve readability. Don't use quote tags, as they do not protect whitespace.

So just where are you having problems? As far as I can see, there's nothing wrong with the array loop. I don't really know enough about the math to evaluate the formulae at the end, however, or how you want to apply them.

The only thing I can suggest offhand is to keep a running count of the values added so far, so you don't have to hard-code the divisor. This minor modification worked for me in a quick test on a sample file. To average the lines 11-50, for example:

Code:

#!/usr/bin/awk -f

{ array[NR] = $3 }

END { for ( x=11 ; x<=50 ; x++ )
       { sum += array[x] ; tot++ ; }

       { print "sum is:" , sum , "total is:" , tot , "average is:" , sum / tot ; }
}

I'm sure you can add in the rest as necessary.
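For reference, one way "the rest" might look, as a sketch only. It keeps the array-and-END-loop layout and adds the population standard deviation from the original script. The function name `range_stats` and the `first`/`last` variables are assumptions introduced here, and `^` is used for the square because it is the POSIX exponentiation operator:

```shell
# Hypothetical helper: range_stats FILE FIRST LAST
# Prints mean, population standard deviation and twice that deviation
# for column 3 of lines FIRST..LAST.
range_stats() {
    awk -v first="$2" -v last="$3" '
        { array[NR] = $3 }                       # store every value of column 3
        END {
            if (last > NR) last = NR             # do not read past the end of the file
            for (x = first; x <= last; x++) { sum += array[x]; tot++ }
            if (tot == 0) exit                   # nothing was selected
            mean = sum / tot
            for (x = first; x <= last; x++) sumsq += (array[x] - mean) ^ 2
            sd = sqrt(sumsq / tot)
            print mean, sd, 2 * sd
        }' "$1"
}
```

For the original question the call would be something like `range_stats data.txt 6001 10000`, with `data.txt` standing in for the real file.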

vjramana · 08-18-2011, 02:33 AM

Thank you so much, David. Your code helped me a lot. The post by trist007 was also helpful.

Thanks again.

AnanthaP · 08-19-2011, 08:22 PM

An improvement:
"sigma of (x minus x-bar) squared, divided by n" reduces to "(sigma of x squared, divided by n) minus (x-bar squared)".

i.e., you needn't put each row in an array for what is effectively a second pass.
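Written out for the population variance, with n values x_i and mean x-bar (notation assumed here, not taken from the posts above), the reduction goes as follows:

```latex
\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
  &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n}x_i \;+\; \bar{x}^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; 2\bar{x}^2 \;+\; \bar{x}^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; \bar{x}^2
\end{aligned}
\]
```

The middle step uses the fact that the sum of the x_i divided by n is x-bar itself.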

Quote:

#!/usr/bin/awk -f

{sum+=$3; square_sum+=$3*$3; array[NR]=$3} # assigning to array[] is redundant
END {
print sqrt(square_sum/NR - (sum/NR)*(sum/NR))
}

PS:I don't remember awk having exponentiation.
OK

David the H. · 08-19-2011, 09:12 PM

@AnanthaP

The problem the OP stated, however, was that he wanted to evaluate only a subsection of entries. This means we need some way to a) match only the desired range of lines, and b) count only the number of lines matched.

Sure, you can put most of the work into the main block instead; just add an NR match and a running-count variable to the above. But it seems to me that using an intermediate array and doing the heavy stuff in the END block makes the code a little easier to read and work with.
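For comparison, a sketch of that main-block variant: an NR match, a running count, and the sum-of-squares shortcut, with no array. The function name `range_stats_onepass` and the `first`/`last` variables are made up for this example. The clamp on `var` is an added precaution, since the shortcut formula can come out slightly negative through floating-point rounding:

```shell
# Hypothetical helper: range_stats_onepass FILE FIRST LAST
# Prints mean and population standard deviation of column 3
# for lines FIRST..LAST, reading the file once and storing nothing.
range_stats_onepass() {
    awk -v first="$2" -v last="$3" '
        NR >= first && NR <= last { sum += $3; sumsq += $3 * $3; n++ }
        END {
            if (n == 0) exit                 # nothing was selected
            mean = sum / n
            var = sumsq / n - mean * mean    # mean of squares minus square of mean
            if (var < 0) var = 0             # guard against rounding below zero
            print mean, sqrt(var)
        }' "$1"
}
```

Memory use stays constant however long the file is, at the cost of doing the arithmetic in the main block.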

It looks like I also need to repeat what I mentioned above. DO NOT USE "QUOTE" TAGS AROUND CODE! Quote tags don't preserve formatting. Please use CODE tags ([code][/code]), which do.

AnanthaP · 08-20-2011, 05:21 AM

To "David_the_H"

Selecting a subset of records has nothing to do with avoiding a redundant use of an array once the required records are selected. The first (selecting a subset) is straight awk technique, which is fine and I assume it is understood by the OP. The second has to do with basic statistics exercises (STAT101). As the number of records grows, efficiency matters more, and an avoidable array can hit RAM limits and push the process into swap. I think there is value in suggesting this method (to the OP).

As to why I used QUOTE tags instead of CODE tags: it's because you need to "go advanced" to use CODE tags, which I saw as unnecessary for a 3-line code snippet. If you think CODE tags need to be used more often (which seems reasonable), how about getting the button placed in the "quick reply" box? Empirically, you would then expect more posters to use the CODE tag when appropriate.

I expect that I won't get a special lecture about putting this suggestion in the suggestions thread.

By the way, when we use awk to select a subset of records by a pattern, what will NR return: the number of selected records, or the total number of records in the files in the argument list (ref: NR, FNR etc.)? I'll be trying it out.

OK
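On the NR question: NR counts every record awk has read, whether or not a pattern matched it, so the number of selected records needs its own counter. A small sketch to check this, where the function name `count_demo` and the `NR > 2` pattern are chosen only for illustration:

```shell
# Hypothetical helper: count_demo FILE
# Shows NR at END next to a counter of the records that matched a pattern.
count_demo() {
    awk 'NR > 2 { n++ }                                   # n counts only the matching records
         END { print "NR =", NR, "matched =", n + 0 }' "$1"   # n + 0 prints 0 if nothing matched
}
```

On a five-line file this prints `NR = 5 matched = 3`.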