[SOLVED] sum the third field of csv file ignoring the commas in double quotes

umix · 10-08-2012, 06:49 AM

(Source_file : first row header)
col1,col2,col3,col4
abc,"xy,1M",20,b
xyz,"ab,2N",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b

Output:
Sum of the third field=110

Thanks in advance

porphyry5 · 10-08-2012, 08:43 AM

Quote:

Originally Posted by umix

(Source_file : first row header)
col1,col2,col3,col4
abc,"xy,1M",20,b
xyz,"ab,2N",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b

Output:
Sum of the third field=110

Thanks in advance

Code:

~ $ cat j.csv
col1,col2,col3,col4
abc,"xy,1M",20,b
xyz,"ab,2N",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b
~ $ awk 'BEGIN {FS=","; tot=0} {tot=tot+$4; next} END {print tot}' j.csv
110
~ $

To get literally "Sum of the third field=110" as output, the END statement should be

Code:

END {print "Sum of the third field="tot}

grail · 10-08-2012, 09:00 AM

My question would be if the 'col2' field could perhaps contains more / less data and hence the column being summed may not be the same, ie. that they all happen to be in the fourth comma
separated field at present, like

Code:

col1,col2,col3,col4
abc,"xy,1M,4m",20,b
xyz,"ab",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b

porphyry5 · 10-08-2012, 09:22 AM

Quote:

Originally Posted by grail

My question would be if the 'col2' field could perhaps contains more / less data and hence the column being summed may not be the same, ie. that they all happen to be in the fourth comma
separated field at present, like

Code:

col1,col2,col3,col4
abc,"xy,1M,4m",20,b
xyz,"ab",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b

Code:

awk 'BEGIN {FS=","; tot=0} {tot=tot+$(NF-1); next} END {print tot}' j.csv
110

But that assumes there is always one more field in the line than the desired field, so 6 of one, half-dozen of the other.

grail · 10-08-2012, 09:27 AM

Quote:

But that assumes there is always one more field in the line than the desired field, so 6 of one, half-dozen of the other.

Well this also assumes the OP's query around ignoring the commas in double quotes is saying that quotes only appear in 'col2'.

I am curious though, why the use of 'next' in your script?

porphyry5 · 10-08-2012, 11:16 AM

Quote:

Originally Posted by grail

I am curious though, why the use of 'next' in your script?

Because I'm not that certain of awk's default behavior, and I often state what to do next when it is unnecessary. I guess 'next' is only required when there are subsidiary statements one wishes to skip.

David the H. · 10-08-2012, 11:32 AM

This is the kind of problem gawk's FPAT feature was designed to handle:

http://www.gnu.org/software/gawk/man...y-Content.html

porphyry5 · 10-08-2012, 02:14 PM

Quote:

Originally Posted by David the H.

This is the kind of problem gawk's FPAT feature was designed to handle:

http://www.gnu.org/software/gawk/man...y-Content.html

That feature is a recent addition to gawk, only in version 4 and up. I'm still using 3.1.8, and my system's man gawk doesn't even mention FPAT. But good to know, so when a final slackware 14 is released, I'll be able to use it.

colucix · 10-08-2012, 03:40 PM

For older versions of awk, assuming the double quotes are well balanced and you're not intersted in the content of the quoted fields, you can simply remove them and split the record by the remaining commas, e.g.

Code:

awk '{gsub(/"[^"]+"/,""); split($0,c,","); sum+=c[3]} END{print sum}' file

Anyway, awk is not the right tool to parse tricky CSV files (specifically you cannot build a regexp FS to exclude commas inside quotes). The perl or python modules for parsing CSV files are more suitable to accomplish these tasks.

umix · 10-09-2012, 01:53 AM

Thank you all for your inputs....It works and is very nice experience. I come to know UNIX is very vast and interesting subject to learn.