Sum a column
Hi guys,
I have the following csv file...
Code:
maaasw1;Total;1
maaasw2;Total;5
mbbbsw1;Total;2
mbbbsw3;Total;3
mcccsw3;Total;6
mcccsw4;Total;5
I would like to sum the digits, with the following output:
Code:
aaa;6
bbb;5
ccc;11
I was thinking of using awk... But how?? Thanks a million in advance... |
Hi.
How about
Code:
$ cat in.txt
maaasw1;Total;1
maaasw2;Total;5
mbbbsw1;Total;2
mbbbsw3;Total;3
mcccsw3;Total;6
mcccsw4;Total;5 |
You are a genius... Can you briefly explain what the command does?
|
@firstfire: gensub() only works with gawk, not any other awk. Please do not rely on awk being a symlink to gawk.
@micyew: I'd use the slightly more verbose
Code:
awk '# ... ' |
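The verbose script is truncated above; a portable-awk sketch of the same idea (no gensub(), so it also runs under mawk and nawk; the fixed m&lt;name&gt;sw&lt;digit&gt; layout of field 1 is an assumption) would be:

```shell
# Sample input from the question (file name assumed).
printf '%s\n' 'maaasw1;Total;1' 'maaasw2;Total;5' 'mbbbsw1;Total;2' \
              'mbbbsw3;Total;3' 'mcccsw3;Total;6' 'mcccsw4;Total;5' > in.txt

awk -F ';' '
  {
    # Drop the leading "m" and the trailing "sw<digit>" from field 1.
    name = substr($1, 2, length($1) - 4)
    sum[name] += $3
  }
  END {
    for (name in sum)
      printf "%s;%d\n", name, sum[name]
  }' in.txt | sort
```

This prints aaa;6, bbb;5 and ccc;11; substr() works in every awk, at the cost of hard-coding how many trailing characters to strip.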
Quote:
gensub() only works with gawk, not any other awk. Please do not rely on awk being a symlink to gawk.
Here is a readable version of the one-liner:
Code:
#!/usr/bin/gawk -f
...
Code:
$ gawk -f ./sum.awk in.txt
Code:
# Make executable. Run once.
$ chmod +x ./sum.awk
@NominalAnimal: you are right, I should explicitly write `gawk' instead of `awk', thanks. |
Quote:
Code:
#!/usr/bin/gawk -f
...
Code:
Seconds   awk   RS   LANG and LC_ALL (in environment)
...
As you can see, mawk-1.3.3 is by far the fastest, but only when using a simple record separator. GNU gawk-3.1.8 is much more sensitive to the locale than to the record separator; the overhead is about 0.013 seconds per megabyte of input on my machine. You cannot really compare the relative changes in run time, since the work the script does will drastically affect it, and this one does no real work. Simply picking the best awk variant for the task will yield a much bigger difference in run time. |
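The benchmark script and timing table are truncated in the archive; a minimal harness for taking such timings yourself (a sketch: the file name, record, size, and locale names are assumptions, and mawk or gawk can be substituted for awk) could be:

```shell
# Generate roughly 3 MB of semicolon-separated records (assumed sample data).
awk 'BEGIN { for (i = 0; i < 200000; i++) print "maaasw1;Total;1" }' > big.txt

# Time a do-nothing pass under two locales; swap in mawk or gawk here
# to reproduce the variant comparison discussed above.
time LC_ALL=C awk '{ }' big.txt
time LC_ALL=en_US.UTF-8 awk '{ }' big.txt
```

A do-nothing program isolates the read/split overhead, which is exactly what the locale and record-separator comparison is about.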
Hm. What a surprise. Thanks
|
Quote:
Hm. What a surprise. Thanks
If you need predictable, efficient timing, you need to make sure you use the proper algorithms. Python is not really suitable for this, because its I/O is slow. I personally avoid the C standard library too; it is quite slow in the cases where I/O throughput does matter, although much faster than Python. Perl is pretty fast, but I don't like the syntax, and compiled languages with efficient libraries should prove at least a little bit faster. In my case, I use awk for these kinds of situations, because the scripts are easy to write and maintain, and I can make them robust, so they won't choke on strange input.
In this current thread, I suspect the original data is from FileMaker or a similar application -- the data is not CSV, it's semicolon-separated values -- and such applications tend to use whatever newline convention they feel like. I think I've seen all four in real-world files. A single line with a different newline convention in the middle of the data is not rare at all. |
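Mixed newline conventions are easy to guard against in awk itself; a sketch that strips a trailing carriage return before summing (input inlined for the demo):

```shell
# CRLF and LF lines mixed; sub() on $0 drops the CR, and modifying $0
# makes awk re-split the fields, so $3 is clean before it is summed.
printf 'maaasw1;Total;1\r\nmaaasw2;Total;5\n' |
awk -F ';' '{ sub(/\r$/, ""); sum += $3 } END { print sum }'
```

Without the sub(), the first record's $3 would be "1\r", which some awks would still coerce to 1, but relying on that is exactly the kind of fragility being warned about.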
ok, I made a test on a 700 MB video file:
Code:
Seconds   awk          RS
...
and on a 4 MB plain text file:
Code:
Seconds   awk          RS
0.3       gawk 3.1.7   \n
And one more thing: the number of printable chars is not the same, so the time of the for loop is probably not the same either... |
Usual story of more than one way to skin this cat:
Code:
awk -F"[ms;]*" '{_[$2]+=$NF}END{for(i in _)print i,_[i]}' OFS=";" file |
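Here `-F"[ms;]*"` treats runs of m, s and ; as field separators, so maaasw1;Total;1 splits into an empty $1, the group name in $2, and the count in $NF. A quick run against the sample data (file name assumed):

```shell
# Sample input from the question.
printf '%s\n' 'maaasw1;Total;1' 'maaasw2;Total;5' 'mbbbsw1;Total;2' \
              'mbbbsw3;Total;3' 'mcccsw3;Total;6' 'mcccsw4;Total;5' > file

# Sum $NF per group name in $2; OFS=";" makes print emit name;sum.
awk -F"[ms;]*" '{_[$2]+=$NF}END{for(i in _)print i,_[i]}' OFS=";" file | sort
```

This prints aaa;6, bbb;5 and ccc;11 (sorted here, since for-in order is unspecified). Note the separator regex also eats the s and m characters inside names, which happens to be harmless for this data but would break on names containing them.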