LinuxQuestions.org
Old 09-01-2010, 04:07 PM   #1
HuJo
LQ Newbie
 
Registered: Sep 2010
Posts: 5

Rep: Reputation: 0
Looping through csv file to calculate avg/min/max


I have a csv file (id, loc, timestamp, impressions) that I need to derive some low-level calculations from. I'm OK with basic sort and grep operations, but this one is challenging my feeble scripting skills (normally I'd throw it into a spreadsheet, but in this case I'd ultimately like to create a script that I can automate). I've already used cat/sort to order the csv file first on location and then on timestamp, so it looks like this:

Abt001, AA, 2009-01-01 10:51:00, 231
Cup002, AA, 2009-01-01 10:54:00, 3
Saf001, AB, 2009-01-01 10:51:00, 25
Tho001, AC, 2009-01-01 10:51:00, 79

Det004, ZZ, 2009-01-01 12:23:00, 12
Ver006, ZZ, 2009-01-01 23:02:00, 341
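
For reference, the sort step was something along these lines (data.csv standing in for my real file):

Code:
$ sort -t, -k2,2 -k3,3 data.csv > sorted.csv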

What I'd like to do is output to a file a record of the total, min, and max number of impressions for each location, e.g.:

AA 6504 3 231
AB 25 25 25
AC 3792 5 109

ZZ 355 12 341

I can't quite figure out how to do this, and I'm kind of stuck on how to approach it; I think the combined steps of looping through the locations along with the min/max/sum calculations are throwing me. Eventually I'd also like to do additional calculations like averages, etc., but I think I can figure those out on my own if I can just get this part down. Any help would be nice. Thanks!
 
Old 09-01-2010, 05:20 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
Hi, welcome to LQ!

Something like this?

Code:
$ cat location.txt 
Abt001, AA, 2009-01-01 10:51:00, 231
Cup002, AA, 2009-01-01 10:54:00, 3
Saf001, AB, 2009-01-01 10:51:00, 25
Tho001, AC, 2009-01-01 10:51:00, 79
Det004, ZZ, 2009-01-01 12:23:00, 12
Ver006, ZZ, 2009-01-01 23:02:00, 341
Code:
$ cat loc.awk
BEGIN{
  FS=","
  printf "%s\t%6s\t%6s\t%6s\n" , "Loc", "total","min","max"
}
{
  loc[$2,$4]=$4   # remember every impression value seen per location
  total[$2]+=$4   # running total per location
}
END{
  for (k in total){
    for (i in loc){
      split(i,sep,"\034")   # "\034" is awk's default SUBSEP
      if( sep[1] == k){
        tmp[sep[2]]=1       # collect this location's values as array indices
      }
    }
    num=asorti(tmp, tmp2)   # gawk extension: sort the indices into tmp2
    printf "%s:\t%6d\t%6d\t%6d\n", k,total[k],tmp2[1],tmp2[num]
    delete tmp
    delete tmp2
  }
}
Code:
$ awk -f loc.awk location.txt 
Loc      total     min     max
 ZZ:       353      12     341
 AA:       234     231       3
 AB:        25      25      25
 AC:        79      79      79


Cheers,
Tink
 
Old 09-01-2010, 05:24 PM   #3
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 5,051

Rep: Reputation: 852
I don't understand where the 2nd field comes from?

I would use cut to get the second field, then use grep on each unique id, and then use awk to do the calculations.

Here's my stab at it:
Code:
[schneidz@hyper temp]$ cat test
Abt001, AA, 2009-01-01 10:51:00, 231
Cup002, AA, 2009-01-01 10:54:00, 3
Saf001, AB, 2009-01-01 10:51:00, 25
Tho001, AC, 2009-01-01 10:51:00, 79
Det004, ZZ, 2009-01-01 12:23:00, 12
Ver006, ZZ, 2009-01-01 23:02:00, 341
[schneidz@hyper temp]$ for item in `cut -d , -f 2 test | sort | uniq`
do
 min=`grep $item test | sort -n -t , -k 4 | awk -F , '{print $4}' | head -n 1`
 max=`grep $item test | sort -n -t , -k 4 | awk -F , '{print $4}' | tail -n 1`
 sum=`grep $item test | awk -F , '{a = a + $4} END {print a}'`
 echo item = $item :: min = $min :: max = $max :: sum = $sum
done
item = AA :: min = 3 :: max = 231 :: sum = 234
item = AB :: min = 25 :: max = 25 :: sum = 25
item = AC :: min = 79 :: max = 79 :: sum = 79
item = ZZ :: min = 12 :: max = 341 :: sum = 353

Last edited by schneidz; 09-01-2010 at 10:36 PM.
 
Old 09-01-2010, 10:23 PM   #4
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
Just noticed two shortcomings with my initial solution ...
a) the locations are being output in arbitrary order
b) min and max were (stupidly) determined by dictionary order
rather than numerically ...
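
To see what dictionary order does to numbers, compare a plain string sort with a numeric one:

Code:
$ printf '3\n231\n25\n' | sort
231
25
3
$ printf '3\n231\n25\n' | sort -n
3
25
231
asorti() sorts its indices as strings by default, which is why 231 came out as the "min" for AA above.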
Here's the improved version:
Code:
$ less loc.awk
BEGIN{
  FS=","
  printf "%s\t%6s\t%6s\t%6s\n" , "Loc", "total","min","max"
}
{
  loc[$2,$4]=$4
  total[$2]+=$4
}
END{
  for (k in total){
    for (i in loc){
      split(i,sep,"\034")
      if( sep[1] == k){
        tmp[sprintf( "%06d", sep[2])]=1   # zero-pad so string order == numeric order
      }
    }
    num=asorti(tmp, tmp2)
    ordered[k]=sprintf("%s:\t%6d\t%6d\t%6d", k,total[k],tmp2[1],tmp2[num])   # buffer the line, keyed by location
    delete tmp
    delete tmp2
  }
  num=asorti(ordered, new)   # sort the location keys for ordered output
  for(i=1;i<=num;i++){
    print ordered[new[i]]
  }
}
Code:
$ awk -f loc.awk location.txt 
Loc      total     min     max
 AA:       234       3     231
 AB:        25      25      25
 AC:        79      79      79
 ZZ:       353      12     341


Cheers,
Tink
 
Old 09-01-2010, 10:55 PM   #5
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
There's just nothing like a day off work. Too much time on my hands,
and too much fun playing with numbers =}

Adding functionality for averages ...
Code:
$ cat location.txt 
Abt001, AA, 2009-01-01 10:51:00, 231
Feg001, ZZ, 2009-01-01 12:22:00, 91
Cup002, AA, 2009-01-01 10:54:00, 3
Saf001, AB, 2009-01-01 10:51:00, 25
Tho001, AC, 2009-01-01 10:51:00, 79
Dug002, AA, 2009-01-01 10:54:00, 97
Chi002, AC, 2009-01-01 10:52:00, 79
Det004, ZZ, 2009-01-01 12:23:00, 12
Ver006, ZZ, 2009-01-01 23:02:00, 341
Code:
$ cat loc.awk
BEGIN{
  FS=","
  printf "%s\t%6s\t%6s\t%6s\t%6s\n" , "Loc", "total","min","max","avg"
}
{
  loc[$2,$4]+=1   # count occurrences so duplicate values still count toward the average
  total[$2]+=$4
}
END{
  for (k in total){
    dup=0
    for (i in loc){
      split(i,sep,"\034")
      if( sep[1] == k){
        tmp[sprintf( "%06d", sep[2])]=1
        dup+=loc[k,sep[2]]   # dup ends up as the number of records for this location
      }
    }
    num=asorti(tmp, tmp2)
    ordered[k]=sprintf("%s:\t%6d\t%6d\t%6d\t%6.2f", k,total[k],tmp2[1],tmp2[num],total[k]/dup)
    delete tmp
    delete tmp2
  }
  num=asorti(ordered, new)
  for(i=1;i<=num;i++){
    print ordered[new[i]]
  }
}
Code:
$ awk -f loc.awk location.txt 
Loc      total     min     max     avg
 AA:       331       3     231  110.33
 AB:        25      25      25   25.00
 AC:       158      79      79   79.00
 ZZ:       444      12     341  148.00


Cheers,
Tink
 
Old 09-01-2010, 10:58 PM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
Quote:
Originally Posted by schneidz View Post
for item in `cut -d , -f 2 test | sort | uniq`
do
min=`grep $item test | sort -n -t , -k 4 | awk -F , '{print $4}' | head -n 1`
max=`grep $item test | sort -n -t , -k 4 | awk -F , '{print $4}' | tail -n 1`
sum=`grep $item test | awk -F , '{a = a + $4} END {print a}'`
echo item = $item :: min = $min :: max = $max :: sum = $sum
done
This is a super slow script: the for loop makes a first pass through the file, and then for every unique location you run grep/sort/awk/head/tail over the same file three more times (inside the loop).
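
If you want to keep the shell loop, you could at least collapse each location's three pipelines into a single awk call; something like this (untested sketch, same test file):

Code:
for item in `cut -d , -f 2 test | sort | uniq`
do
 grep $item test | awk -F , -v item=$item '
  NR == 1      { min = $4 + 0; max = $4 + 0 }  # seed min/max from the first match
  $4 + 0 < min { min = $4 + 0 }
  $4 + 0 > max { max = $4 + 0 }
               { sum += $4 }
  END { print "item = " item " :: min = " min " :: max = " max " :: sum = " sum }'
done
That is still one grep per location, so a single pass in awk alone would beat it, but it drops most of the processes.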
 
1 member found this post helpful.
Old 09-01-2010, 11:08 PM   #7
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
And here's a Python alternative:

Code:
import csv
from collections import defaultdict

file="file"                                      # path to the csv
reader = csv.reader(open(file),delimiter=',')
uniq=defaultdict(list)
for row in reader:
    # group the impression counts (last column) by location (2nd column)
    uniq[row[1].strip()].append( int(row[-1].strip()) )

for i,j in uniq.iteritems():                     # Python 2; use items() on Python 3
    print i,max(j),min(j),sum(j)
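
If the six-line sample above is saved as "file", the output would look something like this (the order is arbitrary, since dicts are unordered):

Code:
$ python loc.py
AA 231 3 234
AC 79 79 79
AB 25 25 25
ZZ 341 12 353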
 
Old 09-01-2010, 11:20 PM   #8
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 910
Quote:
Originally Posted by ghostdog74 View Post
And here's a Python alternative

Code:
import csv
from collections import defaultdict
file="file"
reader = csv.reader(open(file),delimiter=',')
uniq=defaultdict(list)
for row in reader:
    uniq[row[1].strip()].append( int(row[-1].strip()) )

for i,j in uniq.iteritems():
    print i,max(j),min(j),sum(j)

Pretty :}

I produced a slightly larger input file, and timed
all three versions ;}
Code:
$ time awk -f loc.awk location.txt 
Loc      total     min     max     avg
 AA:    102279       3     231  110.33
 AB:      7725      25      25   25.00
 AC:     48822      79      79   79.00
 ZZ:    137196      12     341  148.00

real    0m0.012s
user    0m0.010s
sys     0m0.003s
$ time loc.py
AA 231 3 102279
AC 79 79 48822
AB 25 25 7725
ZZ 341 12 137196

real    0m0.064s
user    0m0.053s
sys     0m0.003s
$ time loc.sh 
item = AA :: min = 3 :: max = 231 :: sum = 102279
item = AB :: min = 25 :: max = 25 :: sum = 7725
item = AC :: min = 79 :: max = 79 :: sum = 48822
item = ZZ :: min = 12 :: max = 341 :: sum = 137196

real    0m0.107s
user    0m0.073s
sys     0m0.083s

Cheers,
Tink
 
Old 09-02-2010, 02:39 AM   #9
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,423

Rep: Reputation: 2823
Well, just as an alternative; it needs some prettying up, but I am sure you will follow:
Code:
#!/usr/bin/awk -f
# single pass over the file: track sum, min and max per location as we read

BEGIN{ FS="," }

{ SUM[$2]+=$4 }                 # running total per location

MAX[$2]<$4{MAX[$2]=$4}          # new maximum

!($2 in MIN){MIN[$2]=$4}        # first value seen seeds the minimum

MIN[$2] > $4{MIN[$2]=$4}        # new minimum

END{
    num=asorti(SUM,new)         # gawk extension: sort the location keys
    for(i=1;i<=num;i++)
        print new[i],MIN[new[i]],MAX[new[i]],SUM[new[i]]
}
 
Old 09-02-2010, 07:52 AM   #10
HuJo
LQ Newbie
 
Registered: Sep 2010
Posts: 5

Original Poster
Rep: Reputation: 0
Holy smokes... it's like a room full of genies! You guys are amazing! That's way more help than I expected. I'm off to play with this now! Can't thank all of you enough!
 
Old 09-02-2010, 10:25 AM   #11
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 5,051

Rep: Reputation: 852
Quote:
Originally Posted by ghostdog74 View Post
This is a super slow script: the for loop makes a first pass through the file, and then for every unique location you run grep/sort/awk/head/tail over the same file three more times (inside the loop).
I didn't say it was a speed demon. I figured the OP would run it, then boil a pot of coffee, brush their teeth, go to bed, and wake up the next morning with the output redirected to a file.

(+ 1 rep)
 
  

