LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   CSV | GAWK | Record merge problem! (http://www.linuxquestions.org/questions/programming-9/csv-%7C-gawk-%7C-record-merge-problem-659045/)

lmedland 07-29-2008 11:35 AM

CSV | GAWK | Record merge problem!
 
Hi,

I'm looking to take a CSV file which will have three fields:-

Customer Number
Balance
Payor ID

The file will have duplicate records which need merging along with the value of the second field (balance).

In theory you could have the following file:

Code:

customer, balance, payor
001, 200.00, Microsoft
002, 100.00, Apple
002, 100.00, Apple
003, 800.00, IBM
004, 250.00, Microsoft
001, 100.00, Microsoft

Please see my previous post and the excellent answers I got. This worked great with 2 fields but I need the third field (Payor ID) to be printf out with the results so it would like this..

Code:

customer, balance, payor
001, 300.00, Microsoft
002, 200.00, Apple
003, 800.00, IBM
004, 250.00, Microsoft

In cases where there is a different payor ID for the same customer id a "XXX" would need to be listed instead of a payor...

Code:

005, 500.00, XXX
Here is my current code which only printf's the two fields...

Code:

BEGIN{ FS = ","; getline }
{
 # populate array with customer as index and sum of balances as values
 balance[$1] = balance[$1] + $2
}
END{ # sort the indices of the array and put them in array indices
    n = asorti(balance,indices)

    # print the sorted array
    for (i = 1; i <= n; i++)
        printf "%s, %5.2f\n", indices[i], balance[indices[i]]
}

I have tried all permutations to get this to work and read the man page for awk many times so I'm desperate for some help.

Many thanks!

radoulov 07-29-2008 01:57 PM

Trying to avoid GNU Awk only features:

Code:

awk -F, 'END {
  while (++i <= c)   
    printf "%s,%.2f,%s\n",\
    p[i], s[p[i]], t[p[i]] 
  }
NR == 1 {
  print
  next
  }
{
  if (!_[$1]++) {
    p[++c] = $1
    t[$1] = $3
    }
  if (_[$1] > 1 && $3 != t[$1])
    t[$1] = " XXX" 
  s[$1] += $2
  }' filename

The above code tries to preserve the original order of the field1.

With GNU Awk:

Code:

awk -F, 'NR == 1 { print; next }
{ r[$1] += $2; t[$1] = (_[$1] ++ && $3 != t[$1]) ? " XXX" : $3 }
END { n = asorti(r, idx)
while (++i <= n)
  printf "%s, %.2f,%s\n", idx[i], r[idx[i]], t[idx[i]]
  }' filename


lmedland 07-30-2008 06:08 AM

Thanks radoulov that worked great.:cool:

However, is it possible where there are different payor ID's for the same customer id instead of the XXX could the routine prinf a new line for each payor ID associated with a customer?

So in the example below:-

Code:

customer, balance, payor
001, 200.00, Microsoft
002, 100.00, Apple
002, 100.00, Apple
003, 800.00, IBM
004, 250.00, Microsoft
001, 100.00, Microsoft
001, 100.00, 001
001, 100.00, Apple

You would get as a result:-

Code:

customer, balance, payor
001, 300.00, Microsoft
001, 100.00, 001
001, 100.00, Apple
002, 200.00, Apple
003, 800.00, IBM
004, 250.00, Microsoft

I hop this makes sense, sorry to mess you around but I thought it wouldn't be possible to do this but perhaps it is?

Thank you.:confused:

radoulov 07-30-2008 06:58 AM

Try this, I don't know if the payor order matters:

[GNU Awk]

Code:

awk -F', *' 'NR == 1 { print; next }
{ _[$1, $3] += $2 }
END {
  n = asorti(_, __)
  while (++i <= n) {
    split(__[i], t, SUBSEP)
    printf "%s, %.2f, %s\n",
      t[1], _[__[i]], t[2]   
      }
    }' filename


lmedland 07-30-2008 08:10 AM

Quote:

Originally Posted by radoulov (Post 3230850)
Try this, I don't know if the payor order matters:

[GNU Awk]

Code:

awk -F', *' 'NR == 1 { print; next }
{ _[$1, $3] += $2 }
END {
  n = asorti(_, __)
  while (++i <= n) {
    split(__[i], t, SUBSEP)
    printf "%s, %.2f, %s\n",
      t[1], _[__[i]], t[2]   
      }
    }' filename


You are most excellent! Thank you!!


All times are GMT -5. The time now is 01:51 PM.