LinuxQuestions.org
Help answer threads with 0 replies.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 06-15-2008, 12:16 PM   #1
lmedland
LQ Newbie
 
Registered: Jun 2008
Location: England
Posts: 21

Rep: Reputation: 15
Question CSV file - Duplicate records need merging | BASH?


Hi,

I hope you can help.

I'm looking to take a CSV file which will have two fields:-

Customer Number
Balance

The file will have duplicate records which need merging along with the value of the second field (balance). In theory you could have the following file:

Code:
Customer Number, Balance
001, 40.00
002, 30.00
003, 50.00
004, 10.56
005, 32.01
001, 72.00
006, 11.98
003, 10.00
003, 111.11
007, 101.11
003, 18.11
I would need them merged into a separate CSV file so that only one record would be found for each customer number, this having the "balance" of the values sum'd.

So the above example would look like this:

Code:
Customer Number, Balance
001, 112.00
002, 30.00
003, 189.22
etc etc.....
I was looking at BASH commands to do this, I have had some experience with SORT and JOIN to do a similar process but not merging duplicate records.

Thanks for your help.
 
Old 06-15-2008, 12:43 PM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984
This awk code should do the trick:
Code:
BEGIN{ FS = ","; getline }
{
 # populate array with customer as index and sum of balances as values
 balance[$1] = balance[$1] + $2
}
END{ # sort the indices of the array and put them in array indices
     n = asorti(balance,indices)

     # print the sorted array
     for (i = 1; i <= n; i++)
         printf "%s, %5.2f\n", indices[i], balance[indices[i]]
}
 
Old 06-15-2008, 12:45 PM   #3
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 79
How about:
Code:
awk -F, '{a[$1]+=$2}END{for(val in a) printf("%s, %.2f\n", val, a[val])}'
Beat me to it.

Last edited by osor; 06-15-2008 at 12:46 PM.
 
Old 06-16-2008, 08:02 AM   #4
lmedland
LQ Newbie
 
Registered: Jun 2008
Location: England
Posts: 21

Original Poster
Rep: Reputation: 15
Question

Thats fantastic. Many thanks for the reply.

Would it be too much to ask just to explain a little as to what is going on in the syntax please? I assume the $ is not a variable but the field location?
 
Old 06-16-2008, 08:34 AM   #5
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984
Quote:
Originally Posted by lmedland View Post
Would it be too much to ask just to explain a little as to what is going on in the syntax please? I assume the $ is not a variable but the field location?
Yes. In awk the $1, $2, $3... refer to the field number, using FS (an internal built-in variable) as separator. In my version I put FS equal to comma inside the awk code, while in osor's version it is specified with the -F option.

In more details:
Code:
BEGIN{ FS = ","; getline }
this is the BEGIN section of awk, that is the code executed once at the beginning of the process. The getline statement serves to skip the header of the file.
Code:
balance[$1] = balance[$1] + $2
this assign values to the array "balance" using as index the customer number (first field) and as value the cumulated balance (second field). In awk indices of arrays can be any string, not only numeric.
Code:
n = asorti(balance,indices)
this sorts the indices of the array balance and put the sorted indices into a new array called "indices". This function is available since GNU awk 3.1.2. For older versions there are workarounds.
Code:
for (i = 1; i <= n; i++)
         printf "%s, %5.2f\n", indices[i], balance[indices[i]]
This finally prints out the desired values using the format specified in the printf statement.

A must read for learning awk programming is the official guide GAWK: Effective AWK Programming.
 
Old 06-16-2008, 09:06 AM   #6
lmedland
LQ Newbie
 
Registered: Jun 2008
Location: England
Posts: 21

Original Poster
Rep: Reputation: 15
Question

Thank you. And...really stupid question..........

How do I run the above script?
 
Old 06-16-2008, 09:20 AM   #7
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984
Save it in a file, e.g. sumbalance.awk, then run awk with -f option:
Code:
awk -f sumbalance.awk file
where file is the original file to process. The output is sent to standard output, but you can always redirect it to a file.
 
Old 06-16-2008, 10:26 AM   #8
lmedland
LQ Newbie
 
Registered: Jun 2008
Location: England
Posts: 21

Original Poster
Rep: Reputation: 15
Question

I'm getting

Quote:
line 14: function asorti never defined
 
Old 06-16-2008, 10:55 AM   #9
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 79
So you do not have GNU awk? In general, it is a pain to do sorting with nawk itself, but you can always pipe out:
Code:
awk 'BEGIN{FS = ",";getline}{balance[$1]+=$2}
END{for(val in balance) printf "%s, %.2f\n", val, balance[val]}' filename | sort | cat <(echo "Customer Number, Balance") -
(where filename is the file you want processed.)

Last edited by osor; 06-16-2008 at 11:09 AM.
 
Old 06-16-2008, 11:04 AM   #10
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984
Maybe you have an older version of gawk, try
Code:
awk --version
to find out. If you have version prior to 3.1.2 try the following
Code:
BEGIN{ FS = ","; getline }
{
 # populate array with customer as index and sum of balances as values
 balance[$1] = balance[$1] + $2
}
END{ # sort the indices of the array and put them in array indices
     ii = 1
     for (i in balance) {
         indices[ii] = i
         ii++
     }
     n = asort(indices)
     
     # print the sorted array
     for (i = 1; i <= n; i++)
         printf "%s, %5.2f\n", indices[i], balance[indices[i]]
}
Edit: osor, beat me this time, but different solution

Last edited by colucix; 06-16-2008 at 11:08 AM.
 
Old 06-16-2008, 11:06 AM   #11
lmedland
LQ Newbie
 
Registered: Jun 2008
Location: England
Posts: 21

Original Poster
Rep: Reputation: 15
I had to install gawk but the script worked.

Thank you so much for your help!!
 
Old 08-12-2008, 09:15 AM   #12
lalo4080
LQ Newbie
 
Registered: Aug 2008
Posts: 4

Rep: Reputation: 0
Merge and sum lines problem

I have a file containing this lines:

15 11Aug20080102 test.com/info/ zone1 test.com/
81 11Aug20080200 test.com/info/ zone1 test.com/
29 11Aug20080301 test.com/info/ zone1 test.com/
49 11Aug20080400 test.com/info/ zone1 test.com/
18 11Aug20080205 test.com/info/ zone1 test.com.mx/
23 11Aug20080316 test.com/info/ zone1 test.com.mx/

how can I merge this lines and sum the values on the first column but printing the first date ocurrence?

desired output:

174 11Aug20080102 test.com/info/ zone1 test.com/
41 11Aug20080102 test.com/info/ zone1 test.com.mx/

I just tried to modified the code from this post but I haven't done it work.

Cheers,
EM
 
Old 08-12-2008, 12:50 PM   #13
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984Reputation: 1984
Not sure about the "first date occurrence" but you can assign the different fields to different arrays using the last field as index. When assigning the date you can previously check if the array element is null, then assign the date in field 2. I'm thinking about something like this:
Code:
{
 sum[$NF] = sum[$NF] + $1
 if ( date[$NF] == "" )
    date[$NF] = $2
 f3[$NF] = $3
 f4[$NF] = $4
}
END{for (i in sum)
     printf "%4d %s %s %s %s\n",sum[i],date[i],f3[i],f4[i],i
}
However, in you example you print out the first date in absolute, not the first date of each file listed in the last column. In that case you have to modify the code to not create the date array, but a single variable.

The output of the above code is:
Code:
$ awk -f test.awk file
 174 11Aug20080102 test.com/info/ zone1 test.com/
  41 11Aug20080205 test.com/info/ zone1 test.com.mx/
 
Old 10-10-2008, 04:54 AM   #14
raj_19_427
LQ Newbie
 
Registered: Oct 2008
Posts: 1

Rep: Reputation: 0
Try This

{
{OFS = FS = ","}

x[$1] = x[$1]+$2
}

END{
for (i in x)
printf "%s, %5.2f\n", i,x[i]
}
 
Old 12-09-2010, 05:13 AM   #15
czezz
Member
 
Registered: Nov 2004
Distribution: Slackware/Solaris
Posts: 952

Rep: Reputation: 44
Hi,
Im trying to reuse this code on this same input file but under Solaris with gawk

Although the translation by profesor Colucix is done perfectly, I dont know how to modify it a bit.
The result I need is slight different:
The input file.txt:
Code:
Customer Number, Balance
001, 40.00
002, 30.00
003, 50.00
004, 10.56
005, 32.01
001, 72.00
006, 11.98
003, 10.00
003, 111.11
007, 101.11
003, 18.11
result:
Code:
001, 40.00, 72.00
002, 30.00
003, 50.00, 10.00, 111.11, 18.11
etc etc.....

Last edited by czezz; 12-09-2010 at 05:26 AM.
 
  


Reply

Tags
awk, bash, csv, duplicate, merge, records


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
print duplicate records only vakharia Mahesh Linux - General 1 05-24-2007 02:19 PM
bash script - remove header row from csv file pljvaldez Programming 5 08-30-2006 11:05 AM
Bash - Deleting duplicate records Wire323 Programming 5 12-04-2005 08:51 AM
Detecting duplicate keys in records. carl.waldbieser Programming 15 09-15-2005 06:24 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 12:17 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration