LinuxQuestions.org - awk question on handling *.CSV "text fields" in awk

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - awk question on handling *.CSV "text fields" in awk (https://www.linuxquestions.org/questions/programming-9/awk-question-on-handling-%2A-csv-text-fields-in-awk-626706/)

awk question on handling *.CSV "text fields" in awk

I'm still more comfortable using sed than awk, and tried to reverse a solution I gave on another post.

Code:

cat example.csv

"Jim Jackson",100,10/12/04

"John Doe",150,10/12/08

"Jane Smith",50,10/13/08

 > awk -F, 'BEGIN {OFS="\n"; ORS="\n\n"} $1 ~ /^".*"$/ {$1 = gensub(/""/,"\"","g",gensub(/"(.*)"$/,"\\1","g",$1)); print $0 }' example.csv



Jim Jackson

100

10/12/04



John Doe

150

10/12/08



Jane Smith

50

10/13/08

I was wondering if there was a better, or more standard way in awk to remove the double quotes surrounding a text field, such as converting from some types of csv files to another form.

If you have an entry like: John "Skinny" Benson, oocalc will save a csv file with ""Skinny"" in the text field, so I modified the awk command a bit to handle this situation.

with your sample,

Code:

# awk -F"," '{gsub(/"/,"",$1);print $1,$2,$3}' file

Jim Jackson 100 10/12/04

John Doe 150 10/12/08

Jane Smith 50 10/13/08

however, i don't think this is what you meant right? more of something like this perhaps? with commas in between double quotes? then it will be different story

Code:

"Jim , Jackson",100,10/12/04

"John , Doe",150,10/12/08

"Jane , Smith",50,10/13/08

No, the first field contains a name. This is a contrived example but I don't think that a persons name in a spread sheet column is something that would be very odd.

It occurred to me that such a table might contain nicknames which would naturally be enclosed in quotes, so I entered three lines in oocalc and saved it as a .csv file:

Code:

cat example2.csv

"Jim Smith",010,10/10/08

"John Doe",121,10/11/08

"James ""Skinny"" Johnson",023,10/12/08

Quote:

Originally Posted by from post#1

If you have an entry like: John "Skinny" Benson, oocalc will save a csv file with ""Skinny"" in the text field, so I modified the awk command a bit to handle this situation.

Notice that I transformed the input sample to a form that has one line per field and an empty line between records.
For the answer to someone elses post, I included an awk command to translate from this type of file structure to a csv file. After testing it, I thought that doing the opposite would be a good learning exercise.

You solution doesn't work for the third entry. However it did show me how to modify it to do things differently:

Code:

awk -F, 'BEGIN {OFS="\n"; ORS="\n\n"} $1 ~ /^".*"$/ {gsub(/^"/,"",$1);gsub(/"$/,"",$1);gsub(/""/,"\"",$1); print $1"\n"$2"\n"$3 }' example2.csv

I had been looking at gensub and substr and thinking of functions returning a string and doing something like F1(F2()). Your example shows me how to use a F1()F2()F3() form by acting on the field variable "$1" directly.

Having to assign one of the fields so that the OFS and ORS variables would be updated was something I hadn't realized before today.
compare:

Code:

> echo -e 'ab cd ef\ngh ij kl\nmn op qr\n' | awk 'BEGIN { OFS="," } {print }'

ab cd ef

gh ij kl

mn op qr



> echo -e 'ab cd ef\ngh ij kl\nmn op qr\n' | awk 'BEGIN { OFS="," } {$1=$1;print }'

ab,cd,ef

gh,ij,kl

mn,op,qr

Thanks for your help.

no problem. i didn't help much either. just so you might want to know, gsub() takes in alternation (regexp) so similar subs can be shortened.

Code:

awk '{ gsub(/pat1|pat2|pat3/,"replacement" ) }' file

Your code doesn't won't work when any field other then the first is quoted, and you have to know the number of fields for it to work. This will get around both these problems, however its harder to work with this since it doesn't really split the file into meaningful records/fields.

Code:

gawk -F'""' '{gsub(/"/,"",$1);print $0(RT=="\n"?"\n":"")}' OFS='"' RS='"*,"*|\n' example.csv

If you just wanted a more condensed version of what you wrote, and don't care about the issues I mentioned above, then maybe something like this..

Code:

gawk -F'"*,"*|^"' '{gsub(/""/,"\"",$2);print $2,$3,$4}' OFS='\n' ORS='\n\n' example.csv

Anyways, if you're actually doing this to parse csv files (not just to learn awk) I recommend using something like python/perl/ruby, they will be a lot more powerful at handling this.

Thank you for the generalized solution. Yes, I am looking at this to learn more about AWK. I haven't gotten around to learning the perl language yet. When I use a flat file to store records, I use a tab separated file. The only real example I have is storing file backup information at work and then producing a pdf catalog from these files. This is using cygwin on Win2000.

jschiwal,

Can you please contact me via e-Mail?

Ta,
Tink

Quote:

Originally Posted by Tinkster (Post 3124520)

jschiwal,

Can you please contact me via e-Mail?

Ta,
Tink

i have a file that looks like

john
3
5
4
Mary
7
8
4
Bob
2
6
4

i need to add the integers individually so the output should be

john
12
Mary
19
Bob
12

i have a command that adds all of them but i need to get sum before it gets a character.

awk '{s+=$1} END {print s}' filename

Please help?

Why not raise your own question instead of hijacking an ancient one???