 LinuxQuestions.org statistical analysis for selected lines of data using awk

08-18-2011, 12:58 AM   #1
vjramana
Member

Registered: Sep 2009
Posts: 89

Rep:
statistical analysis for selected lines of data using awk

I have a file containing 10000 rows of data in three columns, and I use awk to calculate the average and standard deviation of the third column over all 10000 rows. That works without any problem.

The code is as below:
Quote:
    #!/usr/bin/awk -f
    {sum+=$3; array[NR]=$3}
    END {
        for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}
        print sum/NR, " ", sqrt(sumsq/NR)," ",2*(sqrt(sumsq/NR))
    }
But now I want to calculate the same thing only for selected lines, 6001 to 10000, of the third column. I have run out of ideas for how to do this.
I have tried the code below:
Quote:
    #!/usr/bin/awk -f
    {array[NR]=$3}
    #{print}
    END {
        for(x=6001;x<=NR;x++)
        { print array[x] }
        {sum += array[x]}
        {print sum}
    }
    #{sumsq+=((array[x]-(sum/4000))**2);}
    #print sum/4000, " ", sqrt(sumsq/4000)," ",2*(sqrt(sumsq/4000))
Can anyone suggest where to alter this code so that I can calculate over lines 6001 to 10000 only?
Thank you.

08-18-2011, 01:55 AM   #2
trist007
Senior Member

Registered: May 2008
Distribution: Slackware
Posts: 1,033

Rep:
Code: `NR > 6000 && NR < 10001`
NR = number of the current record (line)

Last edited by trist007; 08-18-2011 at 02:14 AM.

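For context, a condition on NR like this goes in front of an awk action, so the action runs only on the matching lines. A minimal sketch of that idea, wrapped in a shell function so it is easy to try; the function name `range_mean`, the variables `first`, `last` and `n`, and the file name `data.txt` in the usage line are all made up for illustration and are not from the thread:

```sh
# range_mean FIRST LAST FILE: average of column 3 over lines FIRST..LAST.
# (Hypothetical helper; names are illustrative only.)
range_mean() {
    awk -v first="$1" -v last="$2" '
        # The pattern restricts the action to the selected lines;
        # n counts how many lines were actually used.
        NR >= first && NR <= last { sum += $3; n++ }
        END { if (n > 0) print sum / n }
    ' "$3"
}
```

Something like `range_mean 6001 10000 data.txt` would then average the third column over the last 4000 lines of the OP's file.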
08-18-2011, 02:12 AM   #3
David the H.
Bash Guru

Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian + kde 4 / 5
Posts: 6,837

Rep:
Please use [code][/code] tags around your code, to preserve formatting and to improve readability. Don't use quote tags, as they do not protect whitespace.

So just where are you having problems? As far as I can see, there's nothing wrong with the array loop. I don't really know enough about the math to evaluate the formulae at the end, however, or how you want to apply them.

The only thing I can suggest offhand is to keep a running count of how many numbers have been added so far, so you don't have to hard-code it.

This minor modification worked for me in a quick test on a sample file. To average the lines 11-50, for example:

Code:
```
#!/usr/bin/awk -f

{ array[NR] = $3 }

END {
    for ( x=11 ; x<=50 ; x++ ) {
        sum += array[x] ; tot++ ;
    }
    { print "sum is:" , sum , "total is:" , tot , "average is:" , sum / tot ; }
}
```
I'm sure you can add in the rest as necessary.

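One way "the rest" might be filled in, as a sketch only: the same stored array, a counter instead of a hard-coded 4000, and the OP's population standard deviation (dividing by the count). The function name `range_stats` and the variables `first`, `last`, `mean` and `sd` are assumptions for illustration, and `^` is used for the square because it is the POSIX awk spelling of the OP's `**`.

```sh
# range_stats FIRST LAST FILE: mean, std. deviation and 2*std. deviation
# of column 3 over lines FIRST..LAST. (Hypothetical helper name.)
range_stats() {
    awk -v first="$1" -v last="$2" '
        { array[NR] = $3 }                      # store column 3 of every line
        END {
            # First loop: sum and count of the selected lines.
            for (x = first; x <= last && x <= NR; x++) { sum += array[x]; tot++ }
            if (tot == 0) exit                  # nothing selected
            mean = sum / tot
            # Second loop: squared deviations from the mean of the selection.
            for (x = first; x <= last && x <= NR; x++) sumsq += (array[x] - mean) ^ 2
            sd = sqrt(sumsq / tot)              # population form, as in post #1
            print mean, sd, 2 * sd
        }
    ' "$3"
}
```

For the OP's case this would be called as `range_stats 6001 10000 data.txt`, where `data.txt` stands in for the real file name.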
08-18-2011, 03:33 AM   #4
vjramana
Member

Registered: Sep 2009
Posts: 89

Original Poster
Rep:
Thank you so much David. Your code helped me a lot. The post by trist007 was also helpful to me. Thanks again.

08-19-2011, 09:22 PM   #5
AnanthaP
Member

Registered: Jul 2004
Location: Chennai, India
Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC
Posts: 857

Rep:
An improvement:
"sigma of (x minus x-bar) squared, all divided by n" reduces to "(sigma x-squared divided by n) minus (x-bar squared)".

i.e.
you needn't put each row in an array for what is effectively a second pass.
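Spelled out, the identity referred to here is the usual expansion of the population variance, with n values x_i and mean x-bar (dividing by n, as the OP's script does):

```latex
\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
  = \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n}x_i \;+\; \bar{x}^2
  = \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; \bar{x}^2,
\qquad \bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i .
```

The standard deviation is the square root of either side, so only the running sum, the running sum of squares, and the count are needed.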

Quote:
    #!/usr/bin/awk -f
    # No array[NR]=$3 here: assigning to array[] is redundant.
    {sum+=$3; square_sum+=$3*$3}
    END { print sqrt(square_sum/NR - (sum/NR)*(sum/NR)) }
PS:I don't remember awk having exponentiation.
OK

Last edited by AnanthaP; 08-19-2011 at 09:25 PM.

08-19-2011, 10:12 PM   #6
David the H.
Bash Guru

Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian + kde 4 / 5
Posts: 6,837

Rep:
@AnanthaP

The problem the OP stated, however, was that he wanted to evaluate only a subsection of entries. This means we need some way to a) match only the desired range of lines, and b) count only the number of lines matched.

Sure, you can put most of the work into the main block instead, just add an NR match and a running-count variable to the above, but it seems to me that using an intermediate array and doing the heavy stuff in the END block makes the code a little easier to read and work with.

It looks like I also need to repeat what I mentioned above. DO NOT USE "QUOTE" TAGS AROUND CODE! Quote tags don't preserve formatting. Please use CODE tags ([code][/code]), which do.

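For comparison, the main-block variant described here (an NR match plus a running count, combined with the sum-of-squares formula from post #5) might look roughly like the sketch below. The function name `range_stats_onepass` and the variables `first`, `last`, `n` and `var` are assumptions for illustration, not code from either poster.

```sh
# range_stats_onepass FIRST LAST FILE: mean and std. deviation of column 3
# over lines FIRST..LAST, without storing the rows. (Hypothetical helper name.)
range_stats_onepass() {
    awk -v first="$1" -v last="$2" '
        # a) match only the desired lines, b) count only the lines matched
        NR >= first && NR <= last { n++; sum += $3; sumsq += $3 * $3 }
        END {
            if (n == 0) exit                   # nothing selected
            mean = sum / n
            var = sumsq / n - mean * mean      # population variance
            if (var < 0) var = 0               # guard against rounding just below zero
            print mean, sqrt(var)
        }
    ' "$3"
}
```

The trade-off, as far as I can tell: this version needs no array and a single pass, but subtracting two large, nearly equal numbers can lose precision when the mean is large compared with the spread, which the two-loop array version avoids.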
08-20-2011, 06:21 AM   #7
AnanthaP
Member

Registered: Jul 2004
Location: Chennai, India
Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC
Posts: 857

Rep:
To "David_the_H"

Selecting a subset of records has nothing to do with avoiding a redundant array once the required records have been selected.

The first (selecting a subset) is straightforward awk technique, which is fine, and I assume the OP understands it.

The second is a basic statistics exercise (STAT101). As the number of records grows, efficiency matters more: an avoidable array can hit RAM limits and push the process into swap. I think there is value in suggesting this method to the OP.

As to why I used QUOTE tags instead of CODE tags: you need to "go advanced" to use CODE tags, which I saw as unnecessary for a 3-line code snippet. If you think CODE tags need to be used more often (which seems reasonable), how about getting them placed in the "quick reply" box? Empirically you would then expect more posters to use the CODE tag when appropriate. I expect that I won't get a special lecture about putting this suggestion in the suggestions thread.

By the way, when we use awk to select a subset of records by a pattern, what will NR return? The number of selected records, or the total number of records in the files in the argument list (ref: NR, FNR etc.)? I'll be trying it out.

OK

Last edited by AnanthaP; 08-20-2011 at 06:23 AM.
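On that last question: NR counts every record awk has read so far, whether or not a pattern matched it (and it keeps counting across multiple input files, while FNR restarts for each file). So the number of selected records has to be kept in a separate counter. A small sketch to check this on any file; the function name `count_demo` and the counter `n` are illustrative only:

```sh
# count_demo FIRST LAST FILE: show NR at the end next to the number of
# lines the pattern actually selected. (Hypothetical helper name.)
count_demo() {
    awk -v first="$1" -v last="$2" '
        NR >= first && NR <= last { n++ }      # n counts matched lines only
        END { print "NR=" NR, "selected=" (n + 0) }
    ' "$3"
}
```

On the OP's 10000-line file, `count_demo 6001 10000 data.txt` should report NR as 10000 and 4000 selected lines, which is why dividing by NR in the END block would give the wrong mean for a subset.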

 Tags awk


All times are GMT -5.
