LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   Script to sum values across columns if they have the same row title (https://www.linuxquestions.org/questions/programming-9/script-to-sum-values-across-columns-if-they-have-the-same-row-title-4175420069/)

kmkocot 08-02-2012 05:42 PM

Script to sum values across columns if they have the same row title
 
Hi all,

I have a large file that looks like this:
Code:

0007        10
0007        2
0007        14
0010        13
0010        3
0010        6
0010        15
0010        8
0010        24
0010        9
0010        1
0010        5
0010        14
0010        4
0010        1
0010        30
0011        6
0011        15
0011        15
0011        8
0011        15
0011        14
0011        8
0011        16
0011        27
0011        3
0011        5
0011        18
0011        6
0011        3
0011        7
0015        14
0015        20
0015        15
0015        5
0015        27
0015        13
0015        20
0015        28
0015        5
0015        6
0015        2
0015        14
0015        13
0015        7
0015        3

I would like to get the summation for all the values associated with 0007, 0010, 0011, and 0015 and produce an output file that looks like this:
Code:

7        26
10        133
11        166
15        192

Can anyone suggest a straightforward way to implement this?

Thank so much!
Kevin

grail 08-02-2012 08:54 PM

What is your attempt and where are you stuck?

schneidz 08-02-2012 09:27 PM

awk would be the most straightforward way. it could match the regular expressions and sum the fields that match that regex.

kmkocot 08-03-2012 11:09 AM

Got something that works! Thanks all!

Code:

sed -n '/^0007/p' number_of_positions_cleaned_in_each_file.txt | awk '{ sum+=$2} END {print sum}'

schneidz 08-03-2012 11:22 AM

Quote:

Originally Posted by kmkocot (Post 4745114)
Got something that works! Thanks all!

Code:

sed -n '/^0007/p' number_of_positions_cleaned_in_each_file.txt | awk '{ sum+=$2} END {print sum}'

good job. this mite be more efficient
Code:

awk '/^0007/ { sum+=$2} END {print sum}' kmkocot.txt

David the H. 08-04-2012 12:13 AM

I think the OP made a good-faith effort, so here's how I would solve it. We use an array instead of a simple variable.

Code:

gawk 'BEGIN{ FS=OFS="\t" ; PROCINFO["sorted_in"]="@ind_num_asc" } { sum[$1]+=$2 } END{ for ( i in sum ){ print int( i ) , sum[i] } }' infile.txt
This relies on a feature found only in gnu awk version 4+, array sorting, so I called it specifically. On most Linuxes it should be the default awk.

Code:

BEGIN{ FS=$OFS="\t" ; PROCINFO["sorted_in"]="@ind_num_asc" }
This sets the input and output delimiters to tab, and gawk's (again v4+) internal array sorting to index-numeric-ascending. Otherwise the final output will be random in respect to the input.

http://www.gnu.org/software/gawk/man...y-Sorting.html

We could use the asorti function instead, but I find this way to be easier.
If you're using a version of awk that doesn't support sorting, then the easiest option is probably to just pipe the output through sort -n -k1 afterwards.

Code:

{ sum[$1]+=$2 }
Run through every line and store the values in an array, With indexes based on field 1. Every line that has the same $1 will have it's $2 value added to that entry. This is exactly like the "sum+=$2" the OP used, but allows for tracking multiple arbitrary values.

Code:

END{ for ( i in sum ){ print int( i ) , sum[i] } }
At the end of the file loop through the array. Print the index ( the $1 fields ), and the final value for that array entry. I used the int function on the i values to strip off the leading zeroes first. It's also possible to use printf and the %d tokens to format the output in the same way.

grail 08-04-2012 09:35 AM

How about:
Code:

awk -vidx="0007" '{sum[$1]+=$2}END{print sum[idx]}' file

Tinkster 08-05-2012 07:19 PM

Moved: This thread is more suitable in <programming> and has been moved accordingly to help your thread/question get the exposure it deserves.


All times are GMT -5. The time now is 02:43 AM.