Sum up value of columns in 4 files using shell script

reach.sree@gmail.com · 06-06-2012, 07:41 PM

Sorry my earlier question might not have been clear enough. Hoping this is clearer

I have records like below in 4 different files which have about 10000 records each, all records unique and sorted based on column 2.

1 2 3 4 5 6 7 8
-------------------------------------------------
SR|1010478|000044590|Montfort Dr|Dallas|4|1|2|
SR|1014759|000105790|Holmes Dr|Connecticut|2|2|3|
SR|1016609|000108901|Mark Dr|Houston|1|3|2|
SR|1017503|000111262|Preston Rd|Chicago|3|4|1|

1 2 3 4 5 6 7 8
----------------------------------------------
SR|1010478|000044590|Montfort Dr|Dallas|3|1|1|
SR|1014759|000105790|Holmes Dr|Connecticut|2|2|2|
SR|1016609|000108901|Mark Dr|Houston|1|3|3|
SR|1017503|000111262|Preston Rd|Chicago|3|4|0|

I want to get column 2 of each file(like 1010478) and sum up values of columns 6,7,8 of each file and the whole process should not take more than 15-30 min. So my output should be a different file with records as below.

Total record count = 4
1 2 3 4 5 6 7 8
---------------------------
SR|1010478|000044590|Montfort Dr|Dallas|7|2|3|
SR|1014759|000105790|Holmes Dr|Connecticut|4|4|5|
SR|1016609|000108901|Mark Dr|Houston|2|6|5|
SR|1017503|000111262|Preston Rd|Chicago|6|8|1|
Can someone help me with this

pixellany · 06-06-2012, 08:20 PM

Homework?

Regardless, I'm not following what you want----eg what exactly is column 4 in the second group? Is it maybe the number of occurrences of the string in column 2?

to get the number of occurences of a pattern in a file, you can do this:

Code:

grep 'pattern' filename | wc -l

(this assumes one occurence of the pattern per line)

reach.sree@gmail.com · 06-06-2012, 08:59 PM

Hi pixellany,

Thanks for the prompt reply. It is not the number of occurrences of string in column 2. the string in column 2 has a count in column 6. I want to get the string in column 2 find it in all 4 files and sum the counts in column 6. Please look at the edited question for more clarity.

regards

pixellany · 06-06-2012, 09:23 PM

Let me play it back to see if I understand.

There are 4 files. Each one has a series of entries in the format shown. Find all the entries with the same pattern in field #2,and then add up the numerical values in the last 3 fields.

Assuming this is correct, then here is how I would start:

concatenate (combine) the files into one file
sort on the 2nd field
in a loop:
---extract the values in the last three fields
---move to the next line and see if field 2 is still the same---if so, add the values in the last three fields to the running total
---when field 2 changes, then write out the desired output and then start accumulating again

Actually, it may wind up being at least two nested loops---the inner loop runs for each new value in field 2

Nominal Animal · 06-06-2012, 10:33 PM

Forty thousand records should not take more than a few seconds using awk. Save the following as say sum678:

Code:

#!/usr/bin/awk -f
# Setup rule:
BEGIN {

    # Accept any newline convention. Each line is a record.
    RS = "(\r\n|\n\r|\r|\n)"

    # Fields are separated by pipe characters.
    FS = "|"

    # No keys known yet.
    keys = 0
}

# Rule applied to each record:
{
    # Second field is the unique key.
    k = $2

    if (k in key) {
        # Already known key. Just sum.
        sum6[k] += $6
        sum7[k] += $7
        sum8[k] += $8
    } else {
        # Add new key into the key array (indexed 1 to keys),
        key[++keys] = k
        # but the data in associative arrays under that key.
        common[k] = $1 "|" $2 "|" $3 "|" $4 "|" $5
        sum6[k] = $6
        sum7[k] = $7
        sum8[k] = $8
    }
}

# Final rule:
END {
    for (i = 1; i <= keys; i++) {
        k = key[i]
        printf("%s|%.0f|%.0f|%.0f|\n", common[k], sum6[k], sum7[k], sum8[k])
    }
}

Make it executable using chmod a+rx sum678, and run it:

Code:

./sum678 file1 file2 ... fileN > results

Note that the sums are output as rounded to nearest integers. If you want general numeric format, use %g instead of %.0f. For other formatting options, check out printf-related parts in the GNU Awk User's Manual.

reach.sree@gmail.com · 06-07-2012, 05:48 PM

Thanks all. I went with
awk -F"|" '{a[$2]+=$6;b[$2]=$1 FS $2 FS $3 FS $4 FS $5 FS a[$2] FS} END {for ( i in b) { print b[i]}}' file1 file2 file3 file4

bigearsbilly · 06-07-2012, 07:29 PM

that's pretty ugly, I'd make you do it again