Programming
This forum is for all programming questions. The question does not have to be directly related to Linux and any language is fair game.
06-15-2008, 12:16 PM
|
#1
|
LQ Newbie
Registered: Jun 2008
Location: England
Posts: 21
Rep:
|
CSV file - Duplicate records need merging | BASH?
Hi,
I hope you can help.
I'm looking to take a CSV file which will have two fields:
Customer Number
Balance
The file will have duplicate records that need merging, with the values of the second field (Balance) combined. In theory you could have the following file:
Code:
Customer Number, Balance
001, 40.00
002, 30.00
003, 50.00
004, 10.56
005, 32.01
001, 72.00
006, 11.98
003, 10.00
003, 111.11
007, 101.11
003, 18.11
I need them merged into a separate CSV file so that there is only one record per customer number, with its balance values summed.
So the above example would look like this:
Code:
Customer Number, Balance
001, 112.00
002, 30.00
003, 189.22
etc etc.....
I was looking at Bash commands to do this. I have some experience with sort and join for similar processing, but not for merging duplicate records.
Thanks for your help.
06-15-2008, 12:43 PM
|
#2
|
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
|
This awk code should do the trick:
Code:
BEGIN{ FS = ","; getline }
{
# populate array with customer as index and sum of balances as values
balance[$1] = balance[$1] + $2
}
END{ # sort the indices of the array and put them in array indices
n = asorti(balance,indices)
# print the sorted array
for (i = 1; i <= n; i++)
printf "%s, %5.2f\n", indices[i], balance[indices[i]]
}
06-15-2008, 12:45 PM
|
#3
|
HCL Maintainer
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450
Rep:
|
How about:
Code:
awk -F, '{a[$1]+=$2}END{for(val in a) printf("%s, %.2f\n", val, a[val])}'
Beat me to it.
Last edited by osor; 06-15-2008 at 12:46 PM.
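One caveat with the one-liner as posted: the header line gets summed as data ("Customer Number" ends up in the array with a balance of 0.00). A minimal sketch, assuming the sample file from the first post is saved as file.csv, that skips the header with NR > 1 and sorts the otherwise unordered END output:

```shell
# Build the sample file from the first post, then run osor's one-liner with
# a header skip (NR > 1) added and the output piped through sort.
cat > file.csv <<'EOF'
Customer Number, Balance
001, 40.00
002, 30.00
003, 50.00
004, 10.56
005, 32.01
001, 72.00
006, 11.98
003, 10.00
003, 111.11
007, 101.11
003, 18.11
EOF
awk -F, 'NR > 1 { a[$1] += $2 }
END { for (val in a) printf("%s, %.2f\n", val, a[val]) }' file.csv | sort
```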
06-16-2008, 08:02 AM
|
#4
|
LQ Newbie
Registered: Jun 2008
Location: England
Posts: 21
Original Poster
Rep:
|
That's fantastic, many thanks for the reply.
Would it be too much to ask for a short explanation of what is going on in the syntax, please? I assume the $ is not a variable but the field location?
06-16-2008, 08:34 AM
|
#5
|
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
|
Quote:
Originally Posted by lmedland
Would it be too much to ask just to explain a little as to what is going on in the syntax please? I assume the $ is not a variable but the field location?
|
Yes. In awk, $1, $2, $3... refer to the fields of each record, split using FS (a built-in variable) as the separator. In my version I set FS to a comma inside the awk code, while in osor's version it is specified with the -F option.
In more detail:
Code:
BEGIN{ FS = ","; getline }
This is the BEGIN section of awk, that is, code executed once before any records are read. The getline statement skips the header line of the file.
Code:
balance[$1] = balance[$1] + $2
This assigns values to the array "balance", using the customer number (first field) as the index and accumulating the balance (second field) as the value. In awk, array indices can be any string, not only numbers.
Code:
n = asorti(balance,indices)
This sorts the indices of the array balance and puts the sorted indices into a new array called "indices". The asorti function is available since GNU awk 3.1.2; for older versions there are workarounds.
Code:
for (i = 1; i <= n; i++)
printf "%s, %5.2f\n", indices[i], balance[indices[i]]
This finally prints out the desired values using the format specified in the printf statement.
A must read for learning awk programming is the official guide GAWK: Effective AWK Programming.
06-16-2008, 09:06 AM
|
#6
|
LQ Newbie
Registered: Jun 2008
Location: England
Posts: 21
Original Poster
Rep:
|
Thank you. And... a really stupid question: how do I run the above script?
06-16-2008, 09:20 AM
|
#7
|
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
|
Save it in a file, e.g. sumbalance.awk, then run awk with the -f option:
Code:
awk -f sumbalance.awk file
where file is the original file to process. The output is sent to standard output, but you can always redirect it to a file.
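Since asorti needs a recent gawk, here is a sketch of a complete run using a portable variant of the script (an assumption on my part, not colucix's exact code): POSIX awk does the summing, sort(1) orders the lines, and the result is redirected to merged.csv.

```shell
# Full run: build the sample CSV, sum balances per customer with POSIX awk
# (no asorti needed), sort the output, and redirect it to merged.csv.
cat > file.csv <<'EOF'
Customer Number, Balance
001, 40.00
002, 30.00
003, 50.00
004, 10.56
005, 32.01
001, 72.00
006, 11.98
003, 10.00
003, 111.11
007, 101.11
003, 18.11
EOF

cat > sumbalance.awk <<'EOF'
BEGIN { FS = ","; getline }    # getline skips the header line
{ balance[$1] += $2 }          # accumulate per customer number
END { for (c in balance) printf "%s, %5.2f\n", c, balance[c] }
EOF

awk -f sumbalance.awk file.csv | sort > merged.csv
cat merged.csv
```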
06-16-2008, 10:26 AM
|
#8
|
LQ Newbie
Registered: Jun 2008
Location: England
Posts: 21
Original Poster
Rep:
|
I'm getting
Quote:
line 14: function asorti never defined
|
06-16-2008, 10:55 AM
|
#9
|
HCL Maintainer
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450
Rep:
|
So you do not have GNU awk? In general, it is a pain to do sorting with nawk itself, but you can always pipe out:
Code:
awk 'BEGIN{FS = ",";getline}{balance[$1]+=$2}
END{for(val in balance) printf "%s, %.2f\n", val, balance[val]}' filename | sort | cat <(echo "Customer Number, Balance") -
(where filename is the file you want processed.)
Last edited by osor; 06-16-2008 at 11:09 AM.
06-16-2008, 11:04 AM
|
#10
|
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
|
Maybe you have an older version of gawk; run gawk --version to find out. If you have a version prior to 3.1.2, try the following:
Code:
BEGIN{ FS = ","; getline }
{
# populate array with customer as index and sum of balances as values
balance[$1] = balance[$1] + $2
}
END{ # sort the indices of the array and put them in array indices
ii = 1
for (i in balance) {
indices[ii] = i
ii++
}
n = asort(indices)
# print the sorted array
for (i = 1; i <= n; i++)
printf "%s, %5.2f\n", indices[i], balance[indices[i]]
}
Edit: osor beat me this time, but with a different solution.
Last edited by colucix; 06-16-2008 at 11:08 AM.
06-16-2008, 11:06 AM
|
#11
|
LQ Newbie
Registered: Jun 2008
Location: England
Posts: 21
Original Poster
Rep:
|
I had to install gawk but the script worked.
Thank you so much for your help!! 
08-12-2008, 09:15 AM
|
#12
|
LQ Newbie
Registered: Aug 2008
Posts: 4
Rep:
|
Merge and sum lines problem
I have a file containing these lines:
15 11Aug20080102 test.com/info/ zone1 test.com/
81 11Aug20080200 test.com/info/ zone1 test.com/
29 11Aug20080301 test.com/info/ zone1 test.com/
49 11Aug20080400 test.com/info/ zone1 test.com/
18 11Aug20080205 test.com/info/ zone1 test.com.mx/
23 11Aug20080316 test.com/info/ zone1 test.com.mx/
How can I merge these lines and sum the values in the first column, while printing the first date occurrence?
desired output:
174 11Aug20080102 test.com/info/ zone1 test.com/
41 11Aug20080102 test.com/info/ zone1 test.com.mx/
I just tried to modify the code from this post, but I haven't been able to make it work.
Cheers,
EM
|
|
|
08-12-2008, 12:50 PM
|
#13
|
LQ Guru
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509
|
Not sure about the "first date occurrence", but you can assign the different fields to different arrays using the last field as the index. When assigning the date, you can first check whether the array element is empty and, if so, assign the date from field 2. I'm thinking of something like this:
Code:
{
sum[$NF] = sum[$NF] + $1
if ( date[$NF] == "" )
date[$NF] = $2
f3[$NF] = $3
f4[$NF] = $4
}
END{for (i in sum)
printf "%4d %s %s %s %s\n",sum[i],date[i],f3[i],f4[i],i
}
However, in your example you print the first date in the whole file, not the first date of each site listed in the last column. In that case you have to modify the code not to create the date array, but a single variable.
The output of the above code is:
Code:
$ awk -f test.awk file
174 11Aug20080102 test.com/info/ zone1 test.com/
41 11Aug20080205 test.com/info/ zone1 test.com.mx/
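The single-variable change mentioned above can be sketched like this (assuming the sample lines are saved as access.txt; firstdate is a name I've picked):

```shell
# Build the sample input, then keep the first date seen anywhere in the
# file and print it on every merged line, summing column 1 per last field.
cat > access.txt <<'EOF'
15 11Aug20080102 test.com/info/ zone1 test.com/
81 11Aug20080200 test.com/info/ zone1 test.com/
29 11Aug20080301 test.com/info/ zone1 test.com/
49 11Aug20080400 test.com/info/ zone1 test.com/
18 11Aug20080205 test.com/info/ zone1 test.com.mx/
23 11Aug20080316 test.com/info/ zone1 test.com.mx/
EOF
awk '{
  sum[$NF] += $1
  if (firstdate == "") firstdate = $2   # first date in the whole file
  f3[$NF] = $3; f4[$NF] = $4
}
END { for (i in sum) printf "%4d %s %s %s %s\n", sum[i], firstdate, f3[i], f4[i], i }' access.txt | sort -rn
```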
10-10-2008, 04:54 AM
|
#14
|
LQ Newbie
Registered: Oct 2008
Posts: 1
Rep:
|
Try this:
Code:
BEGIN{ OFS = FS = "," }
# skip the header line
NR == 1 { next }
{
x[$1] = x[$1] + $2
}
END{
for (i in x)
printf "%s, %5.2f\n", i, x[i]
}
12-09-2010, 05:13 AM
|
#15
|
Member
Registered: Nov 2004
Distribution: Slackware/Solaris
Posts: 952
Rep:
|
Hi,
I'm trying to reuse this code on the same input file, but under Solaris with gawk.
Although colucix's script works perfectly as posted, I don't know how to modify it for my case.
The result I need is slightly different:
The input file.txt:
Code:
Customer Number, Balance
001, 40.00
002, 30.00
003, 50.00
004, 10.56
005, 32.01
001, 72.00
006, 11.98
003, 10.00
003, 111.11
007, 101.11
003, 18.11
result:
Code:
001, 40.00, 72.00
002, 30.00
003, 50.00, 10.00, 111.11, 18.11
etc etc.....
Last edited by czezz; 12-09-2010 at 05:26 AM.
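For what czezz is after, here is a sketch (not from the thread) that appends each balance to the customer's line instead of summing, assuming the input above is saved as file.txt:

```shell
# Build the sample input, then collect all balances per customer number on
# one comma-separated line, skipping the header with NR > 1. The FS regex
# ', *' strips the space after each comma.
cat > file.txt <<'EOF'
Customer Number, Balance
001, 40.00
002, 30.00
003, 50.00
004, 10.56
005, 32.01
001, 72.00
006, 11.98
003, 10.00
003, 111.11
007, 101.11
003, 18.11
EOF
awk -F', *' 'NR > 1 { line[$1] = ($1 in line) ? line[$1] ", " $2 : $1 ", " $2 }
END { for (c in line) print line[c] }' file.txt | sort
```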