LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   awk question on handling *.CSV "text fields" in awk (https://www.linuxquestions.org/questions/programming-9/awk-question-on-handling-%2A-csv-text-fields-in-awk-626706/)

jschiwal 03-08-2008 09:39 PM

awk question on handling *.CSV "text fields" in awk
 
I'm still more comfortable using sed than awk, and tried to reverse a solution I gave on another post.
Code:

cat example.csv
"Jim Jackson",100,10/12/04
"John Doe",150,10/12/08
"Jane Smith",50,10/13/08
 > awk -F, 'BEGIN {OFS="\n"; ORS="\n\n"} $1 ~ /^".*"$/ {$1 = gensub(/""/,"\"","g",gensub(/"(.*)"$/,"\\1","g",$1)); print $0 }' example.csv

Jim Jackson
100
10/12/04

John Doe
150
10/12/08

Jane Smith
50
10/13/08

I was wondering if there was a better, or more standard way in awk to remove the double quotes surrounding a text field, such as converting from some types of csv files to another form.

If you have an entry like: John "Skinny" Benson, oocalc will save a csv file with ""Skinny"" in the text field, so I modified the awk command a bit to handle this situation.

ghostdog74 03-09-2008 12:22 AM

with your sample,
Code:

# awk -F"," '{gsub(/"/,"",$1);print $1,$2,$3}' file
Jim Jackson 100 10/12/04
John Doe 150 10/12/08
Jane Smith 50 10/13/08

however, i don't think this is what you meant right? more of something like this perhaps? with commas in between double quotes? then it will be different story
Code:

"Jim , Jackson",100,10/12/04
"John , Doe",150,10/12/08
"Jane , Smith",50,10/13/08


jschiwal 03-09-2008 01:19 AM

No, the first field contains a name. This is a contrived example but I don't think that a persons name in a spread sheet column is something that would be very odd.

It occurred to me that such a table might contain nicknames which would naturally be enclosed in quotes, so I entered three lines in oocalc and saved it as a .csv file:
Code:

cat example2.csv
"Jim Smith",010,10/10/08
"John Doe",121,10/11/08
"James ""Skinny"" Johnson",023,10/12/08

Quote:

Originally Posted by from post#1
If you have an entry like: John "Skinny" Benson, oocalc will save a csv file with ""Skinny"" in the text field, so I modified the awk command a bit to handle this situation.

Notice that I transformed the input sample to a form that has one line per field and an empty line between records.
For the answer to someone elses post, I included an awk command to translate from this type of file structure to a csv file. After testing it, I thought that doing the opposite would be a good learning exercise.

You solution doesn't work for the third entry. However it did show me how to modify it to do things differently:
Code:

awk -F, 'BEGIN {OFS="\n"; ORS="\n\n"} $1 ~ /^".*"$/ {gsub(/^"/,"",$1);gsub(/"$/,"",$1);gsub(/""/,"\"",$1); print $1"\n"$2"\n"$3 }' example2.csv
I had been looking at gensub and substr and thinking of functions returning a string and doing something like F1(F2()). Your example shows me how to use a F1()F2()F3() form by acting on the field variable "$1" directly.

Having to assign one of the fields so that the OFS and ORS variables would be updated was something I hadn't realized before today.
compare:
Code:

> echo -e 'ab cd ef\ngh ij kl\nmn op qr\n' | awk 'BEGIN { OFS="," } {print }'
ab cd ef
gh ij kl
mn op qr

> echo -e 'ab cd ef\ngh ij kl\nmn op qr\n' | awk 'BEGIN { OFS="," } {$1=$1;print }'
ab,cd,ef
gh,ij,kl
mn,op,qr

Thanks for your help.

ghostdog74 03-09-2008 03:27 AM

no problem. i didn't help much either. just so you might want to know, gsub() takes in alternation (regexp) so similar subs can be shortened.
Code:

awk '{ gsub(/pat1|pat2|pat3/,"replacement" ) }' file

angrybanana 03-09-2008 02:54 PM

Your code doesn't won't work when any field other then the first is quoted, and you have to know the number of fields for it to work. This will get around both these problems, however its harder to work with this since it doesn't really split the file into meaningful records/fields.

Code:

gawk -F'""' '{gsub(/"/,"",$1);print $0(RT=="\n"?"\n":"")}' OFS='"' RS='"*,"*|\n' example.csv
If you just wanted a more condensed version of what you wrote, and don't care about the issues I mentioned above, then maybe something like this..
Code:

gawk -F'"*,"*|^"' '{gsub(/""/,"\"",$2);print $2,$3,$4}' OFS='\n' ORS='\n\n' example.csv
Anyways, if you're actually doing this to parse csv files (not just to learn awk) I recommend using something like python/perl/ruby, they will be a lot more powerful at handling this.

jschiwal 03-09-2008 06:25 PM

Thank you for the generalized solution. Yes, I am looking at this to learn more about AWK. I haven't gotten around to learning the perl language yet. When I use a flat file to store records, I use a tab separated file. The only real example I have is storing file backup information at work and then producing a pdf catalog from these files. This is using cygwin on Win2000.

Tinkster 04-17-2008 03:53 PM

jschiwal,

Can you please contact me via e-Mail?

Ta,
Tink

anantdhar 05-27-2010 06:03 AM

Quote:

Originally Posted by Tinkster (Post 3124520)
jschiwal,

Can you please contact me via e-Mail?

Ta,
Tink

i have a file that looks like

john
3
5
4
Mary
7
8
4
Bob
2
6
4


i need to add the integers individually so the output should be

john
12
Mary
19
Bob
12

i have a command that adds all of them but i need to get sum before it gets a character.

awk '{s+=$1} END {print s}' filename

Please help?

grail 05-27-2010 06:23 AM

Why not raise your own question instead of hijacking an ancient one???


All times are GMT -5. The time now is 11:39 PM.