LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   sum the third field of csv file ignoring the commas in double quotes (https://www.linuxquestions.org/questions/linux-newbie-8/sum-the-third-field-of-csv-file-ignoring-the-commas-in-double-quotes-4175431107/)

umix 10-08-2012 06:49 AM

sum the third field of csv file ignoring the commas in double quotes
 
(Source_file : first row header)
col1,col2,col3,col4
abc,"xy,1M",20,b
xyz,"ab,2N",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b

Output:
Sum of the third field=110


Thanks in advance

porphyry5 10-08-2012 08:43 AM

Quote:

Originally Posted by umix (Post 4800155)
(Source_file : first row header)
col1,col2,col3,col4
abc,"xy,1M",20,b
xyz,"ab,2N",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b

Output:
Sum of the third field=110


Thanks in advance

Code:

~ $ cat j.csv
col1,col2,col3,col4
abc,"xy,1M",20,b
xyz,"ab,2N",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b
~ $ awk 'BEGIN {FS=","; tot=0} {tot=tot+$4; next} END {print tot}' j.csv
110
~ $

To get literally "Sum of the third field=110" as output, the END statement should be
Code:

END {print "Sum of the third field="tot}

grail 10-08-2012 09:00 AM

My question would be if the 'col2' field could perhaps contains more / less data and hence the column being summed may not be the same, ie. that they all happen to be in the fourth comma
separated field at present, like
Code:

col1,col2,col3,col4
abc,"xy,1M,4m",20,b
xyz,"ab",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b


porphyry5 10-08-2012 09:22 AM

Quote:

Originally Posted by grail (Post 4800257)
My question would be if the 'col2' field could perhaps contains more / less data and hence the column being summed may not be the same, ie. that they all happen to be in the fourth comma
separated field at present, like
Code:

col1,col2,col3,col4
abc,"xy,1M,4m",20,b
xyz,"ab",25,b
fgh,"uv,1M",30,b
abc,"xy,1M",35,b


Code:

awk 'BEGIN {FS=","; tot=0} {tot=tot+$(NF-1); next} END {print tot}' j.csv
110

But that assumes there is always one more field in the line than the desired field, so 6 of one, half-dozen of the other.

grail 10-08-2012 09:27 AM

Quote:

But that assumes there is always one more field in the line than the desired field, so 6 of one, half-dozen of the other.
Well this also assumes the OP's query around ignoring the commas in double quotes is saying that quotes only appear in 'col2'.

I am curious though, why the use of 'next' in your script?

porphyry5 10-08-2012 11:16 AM

Quote:

Originally Posted by grail (Post 4800286)
I am curious though, why the use of 'next' in your script?

Because I'm not that certain of awk's default behavior, and I often state what to do next when it is unnecessary. I guess 'next' is only required when there are subsidiary statements one wishes to skip.

David the H. 10-08-2012 11:32 AM

This is the kind of problem gawk's FPAT feature was designed to handle:

http://www.gnu.org/software/gawk/man...y-Content.html

porphyry5 10-08-2012 02:14 PM

Quote:

Originally Posted by David the H. (Post 4800413)
This is the kind of problem gawk's FPAT feature was designed to handle:

http://www.gnu.org/software/gawk/man...y-Content.html

That feature is a recent addition to gawk, only in version 4 and up. I'm still using 3.1.8, and my system's man gawk doesn't even mention FPAT. But good to know, so when a final slackware 14 is released, I'll be able to use it.

colucix 10-08-2012 03:40 PM

For older versions of awk, assuming the double quotes are well balanced and you're not intersted in the content of the quoted fields, you can simply remove them and split the record by the remaining commas, e.g.
Code:

awk '{gsub(/"[^"]+"/,""); split($0,c,","); sum+=c[3]} END{print sum}' file
Anyway, awk is not the right tool to parse tricky CSV files (specifically you cannot build a regexp FS to exclude commas inside quotes). The perl or python modules for parsing CSV files are more suitable to accomplish these tasks.

umix 10-09-2012 01:53 AM

Thank you all for your inputs....It works and is very nice experience. I come to know UNIX is very vast and interesting subject to learn.


All times are GMT -5. The time now is 11:24 AM.