statistical analysis for selected lines of data using awk
I have a file containing 10000 rows of data in three columns, and I use awk to calculate the average and standard deviation over all 10000 values in the third column. That part works without problems.
The code is below:
Quote:
#!/usr/bin/awk -f
{sum+=$3; array[NR]=$3}
END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}
print sum/NR, " ", sqrt(sumsq/NR)," ",2*(sqrt(sumsq/NR))}
But now I want to calculate the same thing only for a selected range of lines, 6001 to 10000, in the third column. I have worn myself out trying to find a way to do this.
I have tried with this code below
Please use [code][/code] tags around your code, to preserve formatting and to improve readability. Don't use quote tags, as they do not protect whitespace.
So just where are you having problems? As far as I can see, there's nothing wrong with the array loop. I don't really know enough about the math to evaluate the formulae at the end, however, or how you want to apply them.
The only thing I can suggest offhand is to keep a running count of how many numbers have been added so far, so you don't have to hard-code the divisor. This minor modification worked for me in a quick test on a sample file. To average lines 11-50, for example:
Code:
#!/usr/bin/awk -f
{ array[NR] = $3 }
END {
    for ( x=11 ; x<=50 ; x++ )
        { sum += array[x] ; tot++ ; }
    print "sum is:" , sum , "total is:" , tot , "average is:" , sum / tot
}
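For the standard deviation the OP asked about, the same array approach can be extended with a second loop over the selected range. The following is only a sketch under some assumptions: `range_stats` and its arguments are hypothetical names, the range is passed in with `-v` rather than hard-coded, and the standard deviation uses the population form (dividing by the count), as the original script does. It uses `^` rather than `**` for the square, since `**` is not accepted by every awk.

```shell
# range_stats FILE FIRST LAST  (hypothetical helper, not from the thread)
# Prints mean, population standard deviation and twice that deviation
# for column 3 of input lines FIRST..LAST.
range_stats() {
    awk -v first="$2" -v last="$3" '
        { value[NR] = $3 }                  # remember every third-column value by line number
        END {
            if (last > NR) last = NR        # clamp the range to the lines actually read
            for (x = first + 0; x <= last; x++) { sum += value[x]; n++ }
            if (n == 0) { print "no lines in range"; exit 1 }
            mean = sum / n
            for (x = first + 0; x <= last; x++) sumsq += (value[x] - mean) ^ 2
            sd = sqrt(sumsq / n)            # population form: divide by n, not n - 1
            print mean, sd, 2 * sd
        }
    ' "$1"
}
```

For the OP's case this would be called as `range_stats datafile 6001 10000`, where `datafile` stands in for the real file name.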
The problem the OP stated, however, was that he wanted to evaluate only a subsection of entries. This means we need some way to a) match only the desired range of lines, and b) count only the number of lines matched.
Sure, you can put most of the work into the main block instead: just add an NR match and a running-count variable to the above. But it seems to me that using an intermediate array and doing the heavy lifting in the END block makes the code a little easier to read and work with.
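For comparison, a main-block version of the averaging step might look roughly like this. It is a sketch only: `range_mean`, `first`, `last` and `count` are illustrative names, and the early `exit` is an optional extra that stops reading once the range has passed.

```shell
# range_mean FILE FIRST LAST  (hypothetical helper, not from the thread)
# Prints the average of column 3 over input lines FIRST..LAST.
range_mean() {
    awk -v first="$2" -v last="$3" '
        NR >= first && NR <= last { sum += $3; count++ }   # NR match plus running count
        NR > last { exit }                                 # past the range: skip to END
        END { if (count) print sum / count }               # print nothing if no line matched
    ' "$1"
}
```

No array is needed here because the average only depends on the running sum and the running count.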
It looks like I also need to repeat what I mentioned above. DO NOT USE "QUOTE" TAGS AROUND CODE! Quote tags don't preserve formatting. Please use CODE tags ([code][/code]), which do.
Selecting a subset of records has nothing to do with avoiding a redundant array once the required records are selected. The first (selecting a subset) is plain awk technique, which is fine and which I assume the OP understands. The second is a matter of basic statistics (STAT101). As the number of records grows, efficiency matters more, and an avoidable array can hit RAM limits and push the process into swap. I think there is value in suggesting this method to the OP.
As to why I used QUOTE tags instead of CODE tags: it's because you need to "go advanced" to use CODE tags, which I saw as unnecessary for a three-line code snippet. If you think CODE tags should be used more often (which seems reasonable), how about getting a CODE button placed in the "quick reply" box? Empirically, you would then expect more posters to use the CODE tag when appropriate.
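The array-free snippet this post refers to is not reproduced in the thread, so the following is an assumption about the method rather than the poster's actual code. The usual STAT101 identity is that the population variance equals the mean of the squares minus the square of the mean, which needs only a running sum, a running sum of squares and a count. `stream_stats` is a hypothetical name.

```shell
# stream_stats FILE FIRST LAST  (hypothetical helper, not from the thread)
# Single pass, no array: mean, population standard deviation, twice the deviation.
stream_stats() {
    awk -v first="$2" -v last="$3" '
        NR >= first && NR <= last { sum += $3; sumsq += $3 * $3; n++ }
        END {
            if (n == 0) { print "no lines in range"; exit 1 }
            mean = sum / n
            var = sumsq / n - mean * mean   # E[x^2] - (E[x])^2
            if (var < 0) var = 0            # guard against tiny negative rounding error
            sd = sqrt(var)
            print mean, sd, 2 * sd
        }
    ' "$1"
}
```

One caveat: when the values are large compared with their spread, subtracting two nearly equal numbers can lose precision, so the two-loop array version may be the safer choice for such data.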
I expect that I won't get a special lecture about putting this suggestion in the suggestions thread.
By the way, when we use awk to select a subset of records by a pattern, what will NR return: the number of selected records, or the total number of records in the files in the argument list (ref: NR, FNR, etc.)? I'll be trying it out.
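A quick way to check this is to count the matched lines separately and print both numbers in the END block. NR counts every record read so far across all input files, whether or not a pattern matched it, while FNR restarts at 1 for each file, so the number of selected records needs its own counter. The names `count_selected` and `selected` below are illustrative only.

```shell
# count_selected FILE FIRST LAST  (hypothetical helper, not from the thread)
# Shows that NR is the total number of records read, not the number matched.
count_selected() {
    awk -v first="$2" -v last="$3" '
        NR >= first && NR <= last { selected++ }               # only lines in the range
        END { print "NR:", NR, "selected:", selected + 0 }     # + 0 prints 0 if nothing matched
    ' "$1"
}
```

On a 10-line file, `count_selected file 3 10` prints `NR: 10 selected: 8`.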