LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   CSV file - Duplicate records need merging | BASH? (https://www.linuxquestions.org/questions/programming-9/csv-file-duplicate-records-need-merging-%7C-bash-649452/)

lmedland 06-15-2008 12:16 PM

CSV file - Duplicate records need merging | BASH?
 
Hi,

I hope you can help.

I'm looking to take a CSV file which will have two fields:-

Customer Number
Balance

The file will have duplicate records which need merging along with the value of the second field (balance). In theory you could have the following file:

Code:

Customer Number, Balance
001, 40.00
002, 30.00
003, 50.00
004, 10.56
005, 32.01
001, 72.00
006, 11.98
003, 10.00
003, 111.11
007, 101.11
003, 18.11

I would need them merged into a separate CSV file so that only one record would be found for each customer number, this having the "balance" of the values sum'd.

So the above example would look like this:

Code:

Customer Number, Balance
001, 112.00
002, 30.00
003, 189.22
etc etc.....

I was looking at BASH commands to do this, I have had some experience with SORT and JOIN to do a similar process but not merging duplicate records.

Thanks for your help.

colucix 06-15-2008 12:43 PM

This awk code should do the trick:
Code:

BEGIN{ FS = ","; getline }
{
 # populate array with customer as index and sum of balances as values
 balance[$1] = balance[$1] + $2
}
END{ # sort the indices of the array and put them in array indices
    n = asorti(balance,indices)

    # print the sorted array
    for (i = 1; i <= n; i++)
        printf "%s, %5.2f\n", indices[i], balance[indices[i]]
}


osor 06-15-2008 12:45 PM

How about:
Code:

awk -F, '{a[$1]+=$2}END{for(val in a) printf("%s, %.2f\n", val, a[val])}'
Beat me to it.

lmedland 06-16-2008 08:02 AM

Thats fantastic. Many thanks for the reply.

Would it be too much to ask just to explain a little as to what is going on in the syntax please? I assume the $ is not a variable but the field location?

colucix 06-16-2008 08:34 AM

Quote:

Originally Posted by lmedland (Post 3186183)
Would it be too much to ask just to explain a little as to what is going on in the syntax please? I assume the $ is not a variable but the field location?

Yes. In awk the $1, $2, $3... refer to the field number, using FS (an internal built-in variable) as separator. In my version I put FS equal to comma inside the awk code, while in osor's version it is specified with the -F option.

In more details:
Code:

BEGIN{ FS = ","; getline }
this is the BEGIN section of awk, that is the code executed once at the beginning of the process. The getline statement serves to skip the header of the file.
Code:

balance[$1] = balance[$1] + $2
this assign values to the array "balance" using as index the customer number (first field) and as value the cumulated balance (second field). In awk indices of arrays can be any string, not only numeric.
Code:

n = asorti(balance,indices)
this sorts the indices of the array balance and put the sorted indices into a new array called "indices". This function is available since GNU awk 3.1.2. For older versions there are workarounds.
Code:

for (i = 1; i <= n; i++)
        printf "%s, %5.2f\n", indices[i], balance[indices[i]]

This finally prints out the desired values using the format specified in the printf statement.

A must read for learning awk programming is the official guide GAWK: Effective AWK Programming.

lmedland 06-16-2008 09:06 AM

Thank you. And...really stupid question..........

How do I run the above script?

colucix 06-16-2008 09:20 AM

Save it in a file, e.g. sumbalance.awk, then run awk with -f option:
Code:

awk -f sumbalance.awk file
where file is the original file to process. The output is sent to standard output, but you can always redirect it to a file.

lmedland 06-16-2008 10:26 AM

I'm getting

Quote:

line 14: function asorti never defined
:confused:

osor 06-16-2008 10:55 AM

So you do not have GNU awk? In general, it is a pain to do sorting with nawk itself, but you can always pipe out:
Code:

awk 'BEGIN{FS = ",";getline}{balance[$1]+=$2}
END{for(val in balance) printf "%s, %.2f\n", val, balance[val]}' filename | sort | cat <(echo "Customer Number, Balance") -

(where filename is the file you want processed.)

colucix 06-16-2008 11:04 AM

Maybe you have an older version of gawk, try
Code:

awk --version
to find out. If you have version prior to 3.1.2 try the following
Code:

BEGIN{ FS = ","; getline }
{
 # populate array with customer as index and sum of balances as values
 balance[$1] = balance[$1] + $2
}
END{ # sort the indices of the array and put them in array indices
    ii = 1
    for (i in balance) {
        indices[ii] = i
        ii++
    }
    n = asort(indices)
   
    # print the sorted array
    for (i = 1; i <= n; i++)
        printf "%s, %5.2f\n", indices[i], balance[indices[i]]
}

Edit: osor, beat me this time, but different solution ;)

lmedland 06-16-2008 11:06 AM

I had to install gawk but the script worked.

Thank you so much for your help!! :D

lalo4080 08-12-2008 09:15 AM

Merge and sum lines problem
 
I have a file containing this lines:

15 11Aug20080102 test.com/info/ zone1 test.com/
81 11Aug20080200 test.com/info/ zone1 test.com/
29 11Aug20080301 test.com/info/ zone1 test.com/
49 11Aug20080400 test.com/info/ zone1 test.com/
18 11Aug20080205 test.com/info/ zone1 test.com.mx/
23 11Aug20080316 test.com/info/ zone1 test.com.mx/

how can I merge this lines and sum the values on the first column but printing the first date ocurrence?

desired output:

174 11Aug20080102 test.com/info/ zone1 test.com/
41 11Aug20080102 test.com/info/ zone1 test.com.mx/

I just tried to modified the code from this post but I haven't done it work.

Cheers,
EM

colucix 08-12-2008 12:50 PM

Not sure about the "first date occurrence" but you can assign the different fields to different arrays using the last field as index. When assigning the date you can previously check if the array element is null, then assign the date in field 2. I'm thinking about something like this:
Code:

{
 sum[$NF] = sum[$NF] + $1
 if ( date[$NF] == "" )
    date[$NF] = $2
 f3[$NF] = $3
 f4[$NF] = $4
}
END{for (i in sum)
    printf "%4d %s %s %s %s\n",sum[i],date[i],f3[i],f4[i],i
}

However, in you example you print out the first date in absolute, not the first date of each file listed in the last column. In that case you have to modify the code to not create the date array, but a single variable.

The output of the above code is:
Code:

$ awk -f test.awk file
 174 11Aug20080102 test.com/info/ zone1 test.com/
  41 11Aug20080205 test.com/info/ zone1 test.com.mx/


raj_19_427 10-10-2008 04:54 AM

Try This
 
{
{OFS = FS = ","}

x[$1] = x[$1]+$2
}

END{
for (i in x)
printf "%s, %5.2f\n", i,x[i]
}

czezz 12-09-2010 05:13 AM

Hi,
Im trying to reuse this code on this same input file but under Solaris with gawk

Although the translation by profesor Colucix is done perfectly, I dont know how to modify it a bit.
The result I need is slight different:
The input file.txt:
Code:

Customer Number, Balance
001, 40.00
002, 30.00
003, 50.00
004, 10.56
005, 32.01
001, 72.00
006, 11.98
003, 10.00
003, 111.11
007, 101.11
003, 18.11

result:
Code:

001, 40.00, 72.00
002, 30.00
003, 50.00, 10.00, 111.11, 18.11
etc etc.....



All times are GMT -5. The time now is 08:23 PM.