LinuxQuestions.org
Old 08-17-2011, 11:58 PM   #1
vjramana
Member
 
Registered: Sep 2009
Posts: 88

Rep: Reputation: 0
statistical analysis for selected lines of data using awk


I have a file containing 10000 rows of data in three columns, and I use awk to calculate the average and standard deviation of the third column over all 10000 rows. That works without any problem.

The code is as below:
Quote:
#!/usr/bin/awk -f

{sum+=$3; array[NR]=$3}
END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}
print sum/NR, " ", sqrt(sumsq/NR)," ",2*(sqrt(sumsq/NR))}
But now I want to calculate the same thing for selected rows only, lines 6001 to 10000 of the third column. I am tired of trying to find a way to do this.
I have tried the code below:
Quote:
#!/usr/bin/awk -f

{array[NR]=$3}
#{print}

END { for(x=6001;x<=NR;x++)
{ print array[x] }
{sum += array[x]}
{print sum}

}

#{sumsq+=((array[x]-(sum/4000))**2);}
#print sum/4000, " ", sqrt(sumsq/4000)," ",2*(sqrt(sumsq/4000))
Can anyone suggest where to alter this code so that I can calculate over lines 6001 to 10000?
Thank you.
 
Old 08-18-2011, 12:55 AM   #2
trist007
Member
 
Registered: May 2008
Distribution: Slackware
Posts: 974

Rep: Reputation: 56
Code:
NR > 6000 && NR < 10001
NR = number of the current record (line)
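
As a rough sketch of how that pattern could be used, the NR test can sit in front of the main block so that only the selected lines are accumulated. The function name range_mean and its two arguments are hypothetical, chosen here for illustration; the data is assumed to arrive on standard input with the value of interest in column 3.

```shell
# range_mean FIRST LAST  (hypothetical helper, not from the thread)
# Prints the average of column 3 over records FIRST..LAST read from stdin.
range_mean() {
    awk -v first="$1" -v last="$2" '
        # The pattern restricts this block to the requested record range.
        NR >= first && NR <= last { sum += $3; n++ }
        # n counts only the selected records, so it is the right divisor.
        END { if (n > 0) print sum / n }
    '
}
```

It would be called as something like range_mean 6001 10000 < datafile, where datafile stands for the three-column file.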

Last edited by trist007; 08-18-2011 at 01:14 AM.
 
1 members found this post helpful.
Old 08-18-2011, 01:12 AM   #3
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1947
Please use [code][/code] tags around your code, to preserve formatting and to improve readability. Don't use quote tags, as they do not protect whitespace.

So just where are you having problems? As far as I can see, there's nothing wrong with the array loop. I don't really know enough about the math to evaluate the formulae at the end, however, or how you want to apply them.

The only thing I can suggest offhand is to keep a running count of the values added so far, so you don't have to hard-code the divisor. This minor modification worked for me in a quick test on a sample file. To average the lines 11-50, for example:
Code:
#!/usr/bin/awk -f

{ array[NR] = $3 }

END { for ( x=11 ; x<=50 ; x++ )
       { sum += array[x] ; tot++ ; }

       { print "sum is:" , sum , "total is:" , tot , "average is:" , sum / tot ; }
}
I'm sure you can add in the rest as necessary.
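
One possible way to add in the rest, keeping the array approach: a second loop over the same range gives the sum of squared deviations once the mean is known. This is only a sketch under some assumptions. The name range_stats and its arguments are made up for illustration, the standard deviation is the population form (divide by the count) as in the original script, and `^` is used for exponentiation because it is the operator POSIX awk defines.

```shell
# range_stats FIRST LAST  (hypothetical helper, not from the thread)
# Prints mean, population standard deviation, and twice that deviation
# for column 3 over records FIRST..LAST read from stdin.
range_stats() {
    awk -v first="$1" -v last="$2" '
        { value[NR] = $3 }                      # store every value by record number
        END {
            if (last > NR) last = NR            # clamp the range to the data read
            for (x = first; x <= last; x++) { sum += value[x]; n++ }
            if (n == 0) exit                    # nothing selected, nothing to print
            mean = sum / n
            for (x = first; x <= last; x++) sumsq += (value[x] - mean) ^ 2
            sd = sqrt(sumsq / n)
            print mean, sd, 2 * sd
        }
    '
}
```

For the original question this would be run as range_stats 6001 10000 < datafile, with datafile standing in for the real file name.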
 
1 members found this post helpful.
Old 08-18-2011, 02:33 AM   #4
vjramana
Member
 
Registered: Sep 2009
Posts: 88

Original Poster
Rep: Reputation: 0
Thank you so much, David. Your code helped me a lot. The post by trist007 was also useful to me.

Thanks again.
 
Old 08-19-2011, 08:22 PM   #5
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC
Posts: 627

Rep: Reputation: 137
An improvement:
The mean squared deviation, "sigma of (x minus x-bar) squared, divided by n", reduces to "(sigma of x squared, divided by n) minus (x-bar squared)".

i.e.
you needn't put each row in an array for what is effectively a second pass.

Quote:
#!/usr/bin/awk -f

# A running sum and sum of squares suffice; assigning to array[] is redundant.
{sum+=$3; square_sum+=$3*$3}
END {
print sqrt(square_sum/NR - (sum/NR)*(sum/NR))
}
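
For anyone who wants to check the reduction, here is the algebra written out. The symbols are introduced here only for illustration: n is the number of values, x_i the i-th value, and x-bar the mean. This is the population form (divide by n), which matches the scripts in this thread.

```latex
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2
  &= \frac{1}{n}\sum_{i=1}^{n}x_i^2
     \;-\; 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n}x_i
     \;+\; \bar{x}^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; 2\bar{x}^2 \;+\; \bar{x}^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; \bar{x}^2 ,
  \qquad\text{where } \bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i .
\end{aligned}
```

The standard deviation is the square root of this quantity, which is what the END block prints.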
PS:I don't remember awk having exponentiation.
OK

Last edited by AnanthaP; 08-19-2011 at 08:25 PM.
 
Old 08-19-2011, 09:12 PM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1947
@AnanthaP

The problem the OP stated, however, was that he wanted to evaluate only a subsection of entries. This means we need some way to a) match only the desired range of lines, and b) count only the number of lines matched.

Sure, you can put most of the work into the main block instead: just add an NR match and a running-count variable to the above. But it seems to me that using an intermediate array and doing the heavy stuff in the END block makes the code a little easier to read and work with.
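
For comparison, a sketch of what the main-block version might look like: an NR match, a running count, and running sums, with no array. The name range_stats_onepass and its arguments are hypothetical, and the guard against a slightly negative variance is an added precaution, not something from the posts above.

```shell
# range_stats_onepass FIRST LAST  (hypothetical helper, not from the thread)
# Prints mean and population standard deviation of column 3 over
# records FIRST..LAST read from stdin, without storing the values.
range_stats_onepass() {
    awk -v first="$1" -v last="$2" '
        # NR match plus running count and sums, all in the main block.
        NR >= first && NR <= last { n++; sum += $3; sumsq += $3 * $3 }
        END {
            if (n == 0) exit                  # no records fell in the range
            mean = sum / n
            var = sumsq / n - mean * mean     # mean of squares minus square of mean
            if (var < 0) var = 0              # rounding can push this just below zero
            print mean, sqrt(var)
        }
    '
}
```

Memory use stays constant however many lines are selected, at the cost of a formula that is a little less obvious to read than the two-loop version.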

It looks like I also need to repeat what I mentioned above. DO NOT USE "QUOTE" TAGS AROUND CODE! Quote tags don't preserve formatting. Please use CODE tags ([code][/code]), which do.
 
Old 08-20-2011, 05:21 AM   #7
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC
Posts: 627

Rep: Reputation: 137
To "David_the_H"

Selecting a subset of records has nothing to do with avoiding a redundant array once the required records are selected. The first (selecting a subset) is straight awk technique, which is fine and which I assume the OP understands. The second is a basic statistics exercise (STAT101). As the number of records grows, efficiency matters more: an avoidable array can hit RAM limits and push the process into swap. I think there is value in suggesting this method (to the OP).

As to why I used QUOTE tags instead of CODE tags: you need to "go advanced" to use CODE tags, which I saw as unnecessary for a 3-line code snippet. If you think CODE tags need to be used more often (which seems reasonable), how about getting a CODE button placed in the "quick reply" box? Empirically, you would then expect more posters to use the CODE tag when appropriate.

I expect that I won't get a special lecture about putting this suggestion in the suggestions thread.

By the way, when we use awk to select a subset of records by a pattern, what will NR return: the number of selected records, or the total number of records in the files in the argument list (ref: NR, FNR etc.)? I'll be trying it out.
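
A quick way to try it out, as a sketch: count the matching records in a separate variable and print it next to NR in the END block. NR counts every record read so far across all input files (FNR restarts at 1 for each file), whether or not a pattern matched it, so a separate counter is needed for the selected records. The name count_demo is made up for this example.

```shell
# count_demo FIRST LAST  (hypothetical helper, not from the thread)
# Shows that NR counts all records read, while a separate counter
# is needed for the records a pattern actually selected.
count_demo() {
    awk -v first="$1" -v last="$2" '
        NR >= first && NR <= last { matched++ }
        # matched + 0 forces a numeric 0 when no record matched.
        END { print "NR=" NR, "matched=" (matched + 0) }
    '
}
```

Fed ten lines with a range of 3 to 6, this reports NR as 10 and matched as 4.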

OK

Last edited by AnanthaP; 08-20-2011 at 05:23 AM.
 
  

