[SOLVED] Script to sum values across columns if they have the same row title

kmkocot · 08-02-2012, 05:42 PM

Hi all,

I have a large file that looks like this:

Code:

I would like to get the summation for all the values associated with 0007, 0010, 0011, and 0015 and produce an output file that looks like this:

Code:

Can anyone suggest a straightforward way to implement this?

Thanks so much!
Kevin

grail · 08-02-2012, 08:54 PM

What is your attempt and where are you stuck?

schneidz · 08-02-2012, 09:27 PM

awk would be the most straightforward way. It can match each line against a regular expression and sum a field from the lines that match.

kmkocot · 08-03-2012, 11:09 AM

Got something that works! Thanks all!

Code:

sed -n '/^0007/p' number_of_positions_cleaned_in_each_file.txt | awk '{ sum+=$2} END {print sum}'

schneidz · 08-03-2012, 11:22 AM

Quote:

Originally Posted by kmkocot

Got something that works! Thanks all!

Code:

sed -n '/^0007/p' number_of_positions_cleaned_in_each_file.txt | awk '{ sum+=$2} END {print sum}'

Good job. This might be more efficient, since awk can do the pattern matching itself and the sed step is not needed:

Code:

awk '/^0007/ { sum+=$2} END {print sum}' kmkocot.txt

David the H. · 08-04-2012, 12:13 AM

I think the OP made a good-faith effort, so here's how I would solve it. We use an array instead of a simple variable.

Code:

gawk 'BEGIN{ FS=OFS="\t" ; PROCINFO["sorted_in"]="@ind_num_asc" } { sum[$1]+=$2 } END{ for ( i in sum ){ print int( i ) , sum[i] } }' infile.txt

This relies on array sorting, a feature found only in GNU awk version 4+, so I called gawk explicitly. On many Linux distributions it is the default awk.

Code:

BEGIN{ FS=OFS="\t" ; PROCINFO["sorted_in"]="@ind_num_asc" }

This sets the input and output delimiters to tab, and gawk's (again v4+) internal array sorting to index-numeric-ascending. Otherwise the for loop walks the array in an arbitrary order that has nothing to do with the order of the input.

http://www.gnu.org/software/gawk/man...y-Sorting.html

We could use the asorti function instead, but I find this way to be easier.
If you're using a version of awk that doesn't support sorting, then the easiest option is probably to just pipe the output through sort -n -k1 afterwards.
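For anyone without gawk 4+, here is a rough sketch of that sort-based fallback. The sample lines fed in by printf are made up for illustration (the real file's contents aren't shown here), and I'm assuming tab-separated input as above.

```shell
# Hypothetical tab-separated sample data, piped in instead of read from a file.
# The awk part uses only POSIX features; the ordering is done by sort afterwards.
result=$(printf '0007\t3\n0015\t4\n0007\t5\n0010\t1\n' |
  awk 'BEGIN { FS = OFS = "\t" }
       { sum[$1] += $2 }                         # accumulate column 2 per key
       END { for (i in sum) print i, sum[i] }' | # arbitrary order at this point
  sort -n -k1,1)                                 # numeric sort on the first field

printf '%s\n' "$result"
```

Note that the keys keep their leading zeroes here, because the index is printed as-is rather than passed through int().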

Code:

{ sum[$1]+=$2 }

Run through every line and store the values in an array, with indexes based on field 1. Every line that has the same $1 will have its $2 value added to that entry. This is exactly like the "sum+=$2" the OP used, but allows for tracking multiple arbitrary values.

Code:

END{ for ( i in sum ){ print int( i ) , sum[i] } }

At the end of the file, loop through the array. Print the index (the $1 field) and the final value for that array entry. I used the int function on the i values to strip off the leading zeroes first. It's also possible to use printf with the %d format specifier to format the output in the same way.
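A minimal sketch of the printf variant, again with made-up sample lines and assuming tab-separated input. The ordering is left to sort here so that it doesn't depend on gawk's sorted_in setting.

```shell
# Hypothetical sample data; %d converts the string index "0007" to the number 7.
result=$(printf '0007\t3\n0015\t4\n0007\t5\n' |
  awk 'BEGIN { FS = "\t" }
       { sum[$1] += $2 }
       END { for (i in sum) printf "%d\t%d\n", i, sum[i] }' |
  sort -n)

printf '%s\n' "$result"
```

Since %d prints integers only, this would truncate any fractional sums; %s or %g for the second field would be the safer choice if the values aren't whole numbers.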

grail · 08-04-2012, 09:35 AM

How about:

Code:

awk -vidx="0007" '{sum[$1]+=$2}END{print sum[idx]}' file

Tinkster · 08-05-2012, 07:19 PM

Moved: This thread is more suitable in <programming> and has been moved accordingly to help your thread/question get the exposure it deserves.