LinuxQuestions.org
Go Job Hunting at the LQ Job Marketplace
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
Search this Thread
Old 06-25-2012, 09:37 PM   #1
micyew
LQ Newbie
 
Registered: Jan 2011
Posts: 17

Rep: Reputation: 0
Sum a column


Hi guys,

I've the follow csv files...

maaasw1;Total;1
maaasw2;Total;5
mbbbsw1;Total;2
mbbbsw3;Total;3
mcccsw3;Total;6
mcccsw4;Total;5

I would like to sum the digit with the following output:

aaa;6
bbb;5
ccc;11

I was thinking of using awk...But how?? Thanks a million in advance...
 
Old 06-25-2012, 10:25 PM   #2
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 636

Rep: Reputation: 372Reputation: 372Reputation: 372Reputation: 372
Hi.

How about
Code:
$ cat in.txt 
maaasw1;Total;1
maaasw2;Total;5
mbbbsw1;Total;2
mbbbsw3;Total;3
mcccsw3;Total;6
mcccsw4;Total;5
$ awk -F';' '{T[gensub(/m(.*)sw[0-9]+/,"\\1",1,$1)]+=$3} END{for(t in T) print t";"T[t] }' in.txt 
aaa;6
bbb;5
ccc;11
 
Old 06-25-2012, 11:04 PM   #3
micyew
LQ Newbie
 
Registered: Jan 2011
Posts: 17

Original Poster
Rep: Reputation: 0
You are genius...Can you briefly explain what the command does?
 
Old 06-25-2012, 11:25 PM   #4
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942
@firstfire: gensub() only works with gawk, not any other awk. Please do not rely on awk being a symlink to gawk.

@micyew: I'd use the slightly more verbose
Code:
awk '#
    BEGIN {
        # Accept any newline convention; ignore leading and trailing whitespace.
        RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"

        # Fields are separated by semicolons.
        FS = ";"

        # For output, use newline and semicolons.
        ORS = "\n"
        OFS = ";"
    }

    # For each input record:
    {
        key = $1
        sub(/^m/, "", key)    # Remove leading m
        sub(/sw1$/, "", key)  # Remove trailing sw1

        # add count in third field.
        sum[key] += $3
    }

    # After all records have been processed:
    END {
        # Loop over all unique keys we have, and print the sum.
        for (key in sum)
            printf("%s%s%s%s", key, OFS, sum[key], ORS)
    }
' input-file > output-file
The above outputs the results in essentially random order. If you want to keep the initial order, use
Code:
awk '#
    BEGIN {
        # Accept any newline convention; ignore leading and trailing whitespace.
        RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"

        # Fields are separated by semicolons.
        FS = ";"

        # For output, use newline and semicolons.
        ORS = "\n"
        OFS = ";"
    }

    # For each input record:
    {
        key = $1
        sub(/^m/, "", key)    # Remove leading m
        sub(/sw1$/, "", key)  # Remove trailing sw1

        # Add unique keys to keylist
        if (!(key in keylist))
            keylist[++keys] = key

        # add count in third field.
        sum[key] += $3
    }

    # After all records have been processed:
    END {
        # Loop over all unique keys we have, and print the sum.
        for (k = 1; k <= keys; k++) {
            key = keylist[k]
            printf("%s%s%s%s", key, OFS, sum[key], ORS)
        }
    }
' input-file > output-file
Firstfire's GNU awk script (use gawk) does have the benefit that GNU awk has support for sorting; with just two added statements (an asorti() to sort the array keys into a new array, and picking the key from that array into a temporary variable in the loop) you can have the output sorted.
 
Old 06-25-2012, 11:55 PM   #5
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 636

Rep: Reputation: 372Reputation: 372Reputation: 372Reputation: 372
Quote:
Originally Posted by micyew View Post
Can you briefly explain what the command does?
First, we set semicolon ';' as the field separator (-F';'). Now each line consists of three fields, of which we are interested in first and third (e.g. $1="maaasw1", $3=1 for the first line). Using gensub() we strip extra characters from $1 to obtain the key (e.g. "aaa"), which is then used as an index of the associative (that is indexed by strings) array T (table). Using command like T["aaa"]+=1 we sum up values corresponding to the key "aaa". Finally, in the END{} rule we print out array T.

Here is a readable version of the one-liner
Code:
#!/usr/bin/gawk -f

BEGIN{ FS = ";" }

{
	key=gensub(/m(.*)sw[0-9]+/,"\\1",1,$1)
	T[key]+=$3
}

END{
	for(t in T) 
		print t";"T[t]
}
Save it to some file, say sum.awk and call either as
Code:
$ gawk -f ./sum.awk in.txt 
aaa;6
bbb;5
ccc;11
or
Code:
# Make executable. Run once.
$ chmod +x ./sum.awk
$ ./sum.awk in.txt 
aaa;6
bbb;5
ccc;11
Hope, it is more clear now. See `man awk', `info gawk' or numerous online tutorials for further details on syntax.

@NominalAnimal: you are right, I should explicitly write `gawk' instead of `awk', thanks.

Last edited by firstfire; 06-25-2012 at 11:56 PM.
 
Old 06-26-2012, 02:06 AM   #6
pan64
Senior Member
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 4,838

Rep: Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287
Quote:
Originally Posted by Nominal Animal View Post
# Accept any newline convention; ignore leading and trailing whitespace.
RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"
Probably we can open another thread to discuss it: such a complex RS can degrade performance.
 
Old 06-26-2012, 03:52 AM   #7
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942
Quote:
Originally Posted by pan64 View Post
Probably we can open another thread to discuss it: such a complex RS can degrade performance.
If that matters, then you're using the wrong tool. Consider this trivial awk script:
Code:
#!/usr/bin/gawk -f
BEGIN { RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*" }
{ for (i = 1; i <= NF; i++) chars += length($i) }
END { printf("%d printable characters.\n", chars) }
with the red bits varied. I generated an input file of roughly one and a half megabytes in size, over 10000 lines, with varying number of tokens per line, varying line lengths, using only linefeeds as newlines. Here are the timing results (all in real time):
Code:
Seconds  awk         RS     LANG and LC_ALL (in environment)
0.026    mawk-1.3.3  \n     C
0.026    mawk-1.3.3  \n     en_US.utf8
0.043    gawk-3.1.8  \n     C
0.066    gawk-3.1.8  long   C
0.101    gawk-3.1.8  \n     en_US.utf8
0.123    gawk-3.1.8  long   en_US.utf8
0.313    mawk-1.3.3  long   en_US.utf8
0.313    mawk-1.3.3  long   C
and long meaning exactly "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*" .

As you can see, mawk-1.3.3 is by far the fastest, but only when using a simple record separator. GNU gawk-3.1.8 is much more sensitive to locale than the record separator; the overhead is about 0.013 seconds per megabyte of input on my machine. You cannot really compare the relative changes in run time, since the work the script does will drastically affect that, and this one does no real work.

Simply picking the best awk variant for the task will yield a much bigger difference in run time.

Last edited by Nominal Animal; 06-26-2012 at 03:58 AM.
 
1 members found this post helpful.
Old 06-26-2012, 04:19 AM   #8
pan64
Senior Member
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 4,838

Rep: Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287
Hm. What a surprise. Thanks
 
Old 06-26-2012, 04:38 AM   #9
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942Reputation: 942
Quote:
Originally Posted by pan64 View Post
What a surprise.
The variance in mawk runtime was quite a large surprise to me. I'm running on x86-64 and 64-bit binaries, so the results might be different on 32-bit x86. Feel free to run your own benchmarks, and post them here. I would be surprised if we saw any kind of clear trends at all; small changes in versions, scripts, or environment variables are going to significantly change the run time, more or less chaotically.

If you need predictable, efficient timing, you need to make sure you use the proper algorithms. Python is not really suitable for this, because its I/O is slow. I personally also avoid the C standard library too; it is quite slow (in the cases where I/O throughput does matter), although much faster than Python. Perl is pretty fast, but I don't like the syntax, and compiled languages with efficient libraries should prove at least a little bit faster.

In my case, I use awk for these kinds of situations, because the scripts are easy to write and maintain, and I can make them robust, so they won't choke on strange input.

In this current thread, I suspect the original data is from Filemaker or similar application -- the data is not CSV, its semicolon-separated-values --, and they tend to use whatever newline convention they feel like. I think I've seen all four in real-world files. A single different newline character(s) among the output is not rare at all.
 
Old 06-26-2012, 12:41 PM   #10
pan64
Senior Member
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 4,838

Rep: Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287Reputation: 1287
ok, I made a test on a 700 MB video,
Code:
Seconds  awk         RS   
6        gawk 3.1.7  \n
17       gawk 3.1.7  long
1.5      mawk 1.3.3  \n
1.5      mawk 1.3.3  long
but mawk could not find printable chars.

4 MB plain text file:
Code:
0.3      gawk 3.1.7  \n
0.5      gawk 3.1.7  long
0.1      mawk 1.3.3  \n
2.6      mawk 1.3.3  long
this result is more or less similar. I tested it with C locale


and one more thing: the number of printable chars are not the same, so the time of for cycle probably not the same...

Last edited by pan64; 06-26-2012 at 12:43 PM.
 
Old 06-26-2012, 02:30 PM   #11
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,541

Rep: Reputation: 1919Reputation: 1919Reputation: 1919Reputation: 1919Reputation: 1919Reputation: 1919Reputation: 1919Reputation: 1919Reputation: 1919Reputation: 1919Reputation: 1919
Usual story of more than one way to skin this cat:
Code:
awk -F"[ms;]*" '{_[$2]+=$NF}END{for(i in _)print i,_[i]}' OFS=";" file
Can always sort the array if the output order is important.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
compare second column of a file then print the first column of it in a ne fil if true java_girl Linux - Newbie 2 03-16-2012 04:50 AM
[SOLVED] Sum numbers in a column in a specific block Alkass Programming 3 07-15-2011 01:12 PM
[PHP] Sum of a column by key kazuni Programming 4 11-09-2009 08:54 AM


All times are GMT -5. The time now is 11:59 PM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration