LinuxQuestions.org - [SOLVED] sum the third field of csv file ignoring the commas in double quotes

- Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)

- - sum the third field of csv file ignoring the commas in double quotes (https://www.linuxquestions.org/questions/linux-newbie-8/sum-the-third-field-of-csv-file-ignoring-the-commas-in-double-quotes-4175431107/)

umix	10-08-2012 06:49 AM

sum the third field of csv file ignoring the commas in double quotes

(Source_file : first row header)
col1,col2,col3,col4
abc,"xy,1M",20,b
xyz,"ab,2N",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b

Output:
Sum of the third field=110

Thanks in advance

porphyry5

10-08-2012 08:43 AM

Quote:

Originally Posted by umix (Post 4800155)

(Source_file : first row header)
col1,col2,col3,col4
abc,"xy,1M",20,b
xyz,"ab,2N",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b

Output:
Sum of the third field=110

Thanks in advance

Code:

~ $ cat j.csv

col1,col2,col3,col4

abc,"xy,1M",20,b

xyz,"ab,2N",25,b

fgh,"uv,1M",30,b

abc,"xy,1M",35,b

~ $ awk 'BEGIN {FS=","; tot=0} {tot=tot+$4; next} END {print tot}' j.csv

110

~ $

To get literally "Sum of the third field=110" as output, the END statement should be

Code:

END {print "Sum of the third field="tot}

grail

10-08-2012 09:00 AM

My question would be if the 'col2' field could perhaps contains more / less data and hence the column being summed may not be the same, ie. that they all happen to be in the fourth comma
separated field at present, like

Code:

col1,col2,col3,col4

abc,"xy,1M,4m",20,b

xyz,"ab",25,b

fgh,"uv,1M",30,b

abc,"xy,1M",35,b

porphyry5

10-08-2012 09:22 AM

Quote:

Originally Posted by grail (Post 4800257)

Code:

col1,col2,col3,col4

abc,"xy,1M,4m",20,b

xyz,"ab",25,b

fgh,"uv,1M",30,b

abc,"xy,1M",35,b

Code:

awk 'BEGIN {FS=","; tot=0} {tot=tot+$(NF-1); next} END {print tot}' j.csv

110

But that assumes there is always one more field in the line than the desired field, so 6 of one, half-dozen of the other.

grail

10-08-2012 09:27 AM

Quote:

But that assumes there is always one more field in the line than the desired field, so 6 of one, half-dozen of the other.

Well this also assumes the OP's query around ignoring the commas in double quotes is saying that quotes only appear in 'col2'.

I am curious though, why the use of 'next' in your script?

porphyry5

10-08-2012 11:16 AM

Quote:

Originally Posted by grail (Post 4800286)

I am curious though, why the use of 'next' in your script?

Because I'm not that certain of awk's default behavior, and I often state what to do next when it is unnecessary. I guess 'next' is only required when there are subsidiary statements one wishes to skip.

David the H.

10-08-2012 11:32 AM

This is the kind of problem gawk's FPAT feature was designed to handle:

http://www.gnu.org/software/gawk/man...y-Content.html

porphyry5

10-08-2012 02:14 PM

Quote:

Originally Posted by David the H. (Post 4800413)

This is the kind of problem gawk's FPAT feature was designed to handle:

http://www.gnu.org/software/gawk/man...y-Content.html

That feature is a recent addition to gawk, only in version 4 and up. I'm still using 3.1.8, and my system's man gawk doesn't even mention FPAT. But good to know, so when a final slackware 14 is released, I'll be able to use it.

colucix

10-08-2012 03:40 PM

For older versions of awk, assuming the double quotes are well balanced and you're not intersted in the content of the quoted fields, you can simply remove them and split the record by the remaining commas, e.g.

Code:

awk '{gsub(/"[^"]+"/,""); split($0,c,","); sum+=c[3]} END{print sum}' file

Anyway, awk is not the right tool to parse tricky CSV files (specifically you cannot build a regexp FS to exclude commas inside quotes). The perl or python modules for parsing CSV files are more suitable to accomplish these tasks.

umix	10-09-2012 01:53 AM

Thank you all for your inputs....It works and is very nice experience. I come to know UNIX is very vast and interesting subject to learn.

All times are GMT -5. The time now is 11:24 AM.