LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   Sum a column (http://www.linuxquestions.org/questions/programming-9/sum-a-column-4175413365/)

micyew 06-25-2012 09:37 PM

Sum a column
 
Hi guys,

I have the following csv file...

maaasw1;Total;1
maaasw2;Total;5
mbbbsw1;Total;2
mbbbsw3;Total;3
mcccsw3;Total;6
mcccsw4;Total;5

I would like to sum the numbers, producing the following output:

aaa;6
bbb;5
ccc;11

I was thinking of using awk... but how? Thanks a million in advance!

firstfire 06-25-2012 10:25 PM

Hi.

How about
Code:

$ cat in.txt
maaasw1;Total;1
maaasw2;Total;5
mbbbsw1;Total;2
mbbbsw3;Total;3
mcccsw3;Total;6
mcccsw4;Total;5
$ awk -F';' '{T[gensub(/m(.*)sw[0-9]+/,"\\1",1,$1)]+=$3} END{for(t in T) print t";"T[t] }' in.txt
aaa;6
bbb;5
ccc;11


micyew 06-25-2012 11:04 PM

You are a genius... Can you briefly explain what the command does?

Nominal Animal 06-25-2012 11:25 PM

@firstfire: gensub() only works in gawk, not in other awk implementations. Please do not rely on awk being a symlink to gawk.

@micyew: I'd use the slightly more verbose
Code:

awk '#
    BEGIN {
        # Accept any newline convention; ignore leading and trailing whitespace.
        RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"

        # Fields are separated by semicolons.
        FS = ";"

        # For output, use newline and semicolons.
        ORS = "\n"
        OFS = ";"
    }

    # For each input record:
    {
        key = $1
        sub(/^m/, "", key)    # Remove leading m
        sub(/sw[0-9]+$/, "", key)  # Remove trailing swN (sw1, sw2, ...)

        # add count in third field.
        sum[key] += $3
    }

    # After all records have been processed:
    END {
        # Loop over all unique keys we have, and print the sum.
        for (key in sum)
            printf("%s%s%s%s", key, OFS, sum[key], ORS)
    }
' input-file > output-file

The above outputs the results in essentially random order. If you want to keep the initial order, use
Code:

awk '#
    BEGIN {
        # Accept any newline convention; ignore leading and trailing whitespace.
        RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"

        # Fields are separated by semicolons.
        FS = ";"

        # For output, use newline and semicolons.
        ORS = "\n"
        OFS = ";"
    }

    # For each input record:
    {
        key = $1
        sub(/^m/, "", key)    # Remove leading m
        sub(/sw[0-9]+$/, "", key)  # Remove trailing swN (sw1, sw2, ...)

        # Record each key the first time it is seen. (keylist is indexed
        # by number, so membership must be tested against sum instead.)
        if (!(key in sum))
            keylist[++keys] = key

        # add count in third field.
        sum[key] += $3
    }

    # After all records have been processed:
    END {
        # Loop over all unique keys we have, and print the sum.
        for (k = 1; k <= keys; k++) {
            key = keylist[k]
            printf("%s%s%s%s", key, OFS, sum[key], ORS)
        }
    }
' input-file > output-file

Firstfire's GNU awk script (use gawk) does have the benefit that GNU awk supports sorting: with just two added statements (an asorti() call to sort the array keys into a new array, and picking each key from that array via a temporary variable in the loop) you can have the output sorted.
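To illustrate the sorted-output idea without requiring gawk (asorti() is GNU-only), here is a sketch that sorts the collected keys with a small insertion sort in the END rule, using the sample data from this thread:

```shell
# Sketch: sorted output without gawk extensions. asorti() is GNU awk only,
# so this version sorts the collected keys with a small insertion sort in
# the END rule. Sample data is the input from this thread.
printf 'maaasw1;Total;1\nmaaasw2;Total;5\nmbbbsw1;Total;2\nmbbbsw3;Total;3\nmcccsw3;Total;6\nmcccsw4;Total;5\n' |
awk -F';' '
    {
        key = $1
        sub(/^m/, "", key)         # remove leading m
        sub(/sw[0-9]+$/, "", key)  # remove trailing swN
        if (!(key in sum))         # first occurrence of this key?
            keylist[++keys] = key
        sum[key] += $3
    }
    END {
        # Insertion sort of keylist[1..keys].
        for (i = 2; i <= keys; i++) {
            k = keylist[i]
            for (j = i - 1; j >= 1 && keylist[j] > k; j--)
                keylist[j + 1] = keylist[j]
            keylist[j + 1] = k
        }
        for (i = 1; i <= keys; i++)
            print keylist[i] ";" sum[keylist[i]]
    }'
```

This prints aaa;6, bbb;5, ccc;11, in sorted key order. With gawk, the insertion sort can be replaced by a single asorti(sum, sorted) call and a loop over sorted[].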

firstfire 06-25-2012 11:55 PM

Quote:

Originally Posted by micyew (Post 4711922)
Can you briefly explain what the command does?

First, we set the semicolon ';' as the field separator (-F';'). Now each line consists of three fields, of which we are interested in the first and third (e.g. $1="maaasw1", $3=1 for the first line). Using gensub() we strip the extra characters from $1 to obtain the key (e.g. "aaa"), which is then used as an index into the associative array T (a table; associative means indexed by strings rather than numbers). With a statement like T["aaa"]+=1 we sum up the values corresponding to the key "aaa". Finally, in the END{} rule we print out the array T.

Here is a readable version of the one-liner
Code:

#!/usr/bin/gawk -f

BEGIN{ FS = ";" }

{
        key=gensub(/m(.*)sw[0-9]+/,"\\1",1,$1)
        T[key]+=$3
}

END{
        for(t in T)
                print t";"T[t]
}

Save it to some file, say sum.awk, and call it either as
Code:

$ gawk -f ./sum.awk in.txt
aaa;6
bbb;5
ccc;11

or
Code:

# Make executable. Run once.
$ chmod +x ./sum.awk
$ ./sum.awk in.txt
aaa;6
bbb;5
ccc;11

Hope it is clearer now. See `man awk', `info gawk' or numerous online tutorials for further details on the syntax.

@NominalAnimal: you are right, I should explicitly write `gawk' instead of `awk', thanks.

pan64 06-26-2012 02:06 AM

Quote:

Originally Posted by Nominal Animal (Post 4711935)
# Accept any newline convention; ignore leading and trailing whitespace.
RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*"

Probably we can open another thread to discuss it: such a complex RS can degrade performance.

Nominal Animal 06-26-2012 03:52 AM

Quote:

Originally Posted by pan64 (Post 4712006)
Probably we can open another thread to discuss it: such a complex RS can degrade performance.

If that matters, then you're using the wrong tool. Consider this trivial awk script:
Code:

#!/usr/bin/gawk -f
BEGIN { RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*" }
{ for (i = 1; i <= NF; i++) chars += length($i) }
END { printf("%d printable characters.\n", chars) }

with the RS value varied. I generated an input file of roughly one and a half megabytes, over 10000 lines, with a varying number of tokens per line and varying line lengths, using only linefeeds as newlines. Here are the timing results (all real time):
Code:

Seconds  awk        RS    LANG and LC_ALL (in environment)
0.026    mawk-1.3.3  \n    C
0.026    mawk-1.3.3  \n    en_US.utf8
0.043    gawk-3.1.8  \n    C
0.066    gawk-3.1.8  long  C
0.101    gawk-3.1.8  \n    en_US.utf8
0.123    gawk-3.1.8  long  en_US.utf8
0.313    mawk-1.3.3  long  en_US.utf8
0.313    mawk-1.3.3  long  C

and long meaning exactly "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*" .

As you can see, mawk-1.3.3 is by far the fastest, but only when using a simple record separator. GNU gawk-3.1.8 is much more sensitive to the locale than to the record separator; the overhead is about 0.013 seconds per megabyte of input on my machine. You cannot really compare the relative changes in run time, since the work the script does will drastically affect that, and this one does no real work.

Simply picking the best awk variant for the task will yield a much bigger difference in run time.
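For anyone who wants to try this at home, a minimal sketch of one such run (sample.txt and its contents are made up here; a real benchmark needs a much larger file and repeated runs):

```shell
# Sketch of one benchmark run. sample.txt is a made-up input that
# deliberately mixes newline conventions (\r\n, \r, \n); real benchmarks
# need a much larger file and several repetitions.
printf ' foo bar \r\nbaz\rqux quux\n' > sample.txt

time awk 'BEGIN { RS = "[\t\v\f ]*(\r\n|\n\r|\r|\n)[\t\v\f ]*" }
          { for (i = 1; i <= NF; i++) chars += length($i) }
          END { printf("%d printable characters.\n", chars) }' sample.txt
```

Repeat with gawk and mawk explicitly, and with LANG=C versus LANG=en_US.utf8 exported, to reproduce the rows of the table above.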

pan64 06-26-2012 04:19 AM

Hm. What a surprise. Thanks

Nominal Animal 06-26-2012 04:38 AM

Quote:

Originally Posted by pan64 (Post 4712283)
What a surprise.

The variance in mawk runtime was quite a large surprise to me. I'm running on x86-64 and 64-bit binaries, so the results might be different on 32-bit x86. Feel free to run your own benchmarks, and post them here. I would be surprised if we saw any kind of clear trends at all; small changes in versions, scripts, or environment variables are going to significantly change the run time, more or less chaotically.

If you need predictable, efficient timing, you need to make sure you use the proper algorithms. Python is not really suitable for this, because its I/O is slow. I personally avoid the C standard library too; it is quite slow (in the cases where I/O throughput does matter), although much faster than Python. Perl is pretty fast, but I don't like the syntax, and compiled languages with efficient libraries should prove at least a little bit faster.

In my case, I use awk for these kinds of situations, because the scripts are easy to write and maintain, and I can make them robust, so they won't choke on strange input.

In this current thread, I suspect the original data is from FileMaker or a similar application -- the data is not CSV, it's semicolon-separated values -- and such applications tend to use whatever newline convention they feel like. I think I've seen all four in real-world files. A stray, different newline sequence in the output is not rare at all.

pan64 06-26-2012 12:41 PM

OK, I made a test on a 700 MB video file:
Code:

Seconds  awk         RS
6        gawk 3.1.7  \n
17       gawk 3.1.7  long
1.5      mawk 1.3.3  \n
1.5      mawk 1.3.3  long

but mawk could not find printable chars.

4 MB plain text file:
Code:

Seconds  awk         RS
0.3      gawk 3.1.7  \n
0.5      gawk 3.1.7  long
0.1      mawk 1.3.3  \n
2.6      mawk 1.3.3  long

These results are more or less similar. I tested with the C locale.


And one more thing: the numbers of printable chars are not the same, so the time spent in the for loop is probably not the same either...

grail 06-26-2012 02:30 PM

Usual story of more than one way to skin this cat:
Code:

awk -F"[ms;]*" '{_[$2]+=$NF}END{for(i in _)print i,_[i]}' OFS=";" file
Can always sort the array if the output order is important.
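For instance, one portable way to fix the order (no gawk-only features needed) is to pipe the output through sort(1); a sketch using the sample data from this thread:

```shell
# grail's one-liner with the output sorted by key via sort(1),
# using the sample data from this thread written to a file named "file".
printf 'maaasw1;Total;1\nmaaasw2;Total;5\nmbbbsw1;Total;2\nmbbbsw3;Total;3\nmcccsw3;Total;6\nmcccsw4;Total;5\n' > file

# -t';' uses the semicolon as the sort field delimiter; -k1,1 sorts on the key.
awk -F"[ms;]*" '{_[$2]+=$NF} END{for(i in _) print i,_[i]}' OFS=";" file | sort -t';' -k1,1
```

This prints aaa;6, bbb;5, ccc;11 regardless of the (unspecified) order in which awk's for-in loop visits the array.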
