LinuxQuestions.org
Share your knowledge at the LQ Wiki.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 06-01-2016, 03:48 PM   #1
CJenR
LQ Newbie
 
Registered: Jun 2016
Posts: 5

Rep: Reputation: Disabled
CSV file - Fetch column value which has commas - using awk


I have been trying to retrieve column values from CSV files using shell script.

My CSV file record is as follows:
1234,some text,"some text, with comma",text,alphanumeric,"again with, comma",text

My code: awk -F"," '{if(NR == 3) print $3}' FileName

I used comma as the field separator, but it does not get me the third column data. Can someone please help me
 
Old 06-01-2016, 04:48 PM   #2
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora
Posts: 4,137

Rep: Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263Reputation: 1263
Sure it is. Your third field is:

Code:
"some text
If you don't want it to consider commas inside quotes you need to use a different tool than regex, or else you need to preprocess the file before passing it to awk.

Edit: The reason you can't use regular expressions is that further down in your file is going to be:

Code:
1234,some text,"some text, with \"excaped, quotes\" and commas",text,alphanumeric,"again with, comma",text

Last edited by smallpond; 06-01-2016 at 04:55 PM.
 
Old 06-01-2016, 04:53 PM   #3
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,774

Rep: Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211
That turns out to be a non-trivial task, as you have found. Here is a function taken from an awk script that I wrote. I believe it handles all cases of quoted strings, escaped literal quote characters, embedded quoted newlines, etc., properly. It might need a llittle tweaking to work outside of the complete script where I use it. In particular, I notice that the variable stderr is never set to the value "/dev/stderr".
Code:
####
# parser(string,delim): Parse a CSV string with a given delimiter
# into a global fields[] array and return the number of fields found.
#
# The delimiter defaults to "," if not specified. Any leading or
# trailing spaces are conidered part of the field.
#
# Released to the public domain 1 June 2016 by Robert Nichols.
####
function parser(s,delim,   nc,aa,n,argnum,val,quoted) {
    argnum = 0; val = ""; quoted = 0
    if(delim == "")  delim=","
    nc = split(s, aa, "")
    for(n = 1; n <= nc+1; ++n) {
        if(n > nc || aa[n] == delim) {
            if(quoted) {
                if(n > nc) {    # quoted newline
                    val = val "\n"
                    if((getline s) < 1) {
                        print "EOF with unterminated quote" >stderr
                        fields[++argnum] = val
                        Errflg = 1
                        break
                    }
                    nc = split(s, aa, "")
                    n = 0
                    continue
                }
            }
            else {
                fields[++argnum] = val
                val = ""
                continue
            }
        }
        else if(aa[n] == "\"") {
            if(!quoted) { ++quoted; continue }
            if(aa[n+1] == "\"")  ++n    # literal quote char
            else { quoted = 0; continue }
        }
        val = val aa[n]
    }
    return argnum
}

Last edited by rknichols; 06-01-2016 at 04:57 PM.
 
2 members found this post helpful.
Old 06-01-2016, 05:46 PM   #4
CJenR
LQ Newbie
 
Registered: Jun 2016
Posts: 5

Original Poster
Rep: Reputation: Disabled
Yes, I want it to consider commas inside quotes.

@rknichols: I will look into your code. But just checking if there are simpler ways to do it.

Is it possible to preprocess it line by line and apply awk? or I need to preprocess the whole file (if so, Can I convert my existing csv file's field separator to be pipe (|) instead of comma). I have some 4000 records in the file so which is better?

I tried to use gsub:
Code:
awk -F '"' '{NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}}NF==1{gsub(",","|")}1' OFS='"' FileName
This gives me the expected file data as pipe separated but I want to know if I can get specific column data for each row using awk.
 
Old 06-01-2016, 06:04 PM   #5
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 6,552

Rep: Reputation: 872Reputation: 872Reputation: 872Reputation: 872Reputation: 872Reputation: 872Reputation: 872
If using gnu awk:
http://www.gnu.org/software/gawk/man...y-Content.html
 
Old 06-01-2016, 06:05 PM   #6
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,367

Rep: Reputation: 2747Reputation: 2747Reputation: 2747Reputation: 2747Reputation: 2747Reputation: 2747Reputation: 2747Reputation: 2747Reputation: 2747Reputation: 2747Reputation: 2747
Have you tried using the FPAT variable? http://www.gnu.org/software/gawk/man...ing-By-Content

Edit: Too slow!
 
Old 06-01-2016, 09:10 PM   #7
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,774

Rep: Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211
Quote:
Originally Posted by CJenR View Post
Yes, I want it to consider commas inside quotes.

@rknichols: I will look into your code. But just checking if there are simpler ways to do it.

Is it possible to preprocess it line by line and apply awk? or I need to preprocess the whole file (if so, Can I convert my existing csv file's field separator to be pipe (|) instead of comma). I have some 4000 records in the file so which is better?
You can process line by line as long as there are no embedded newline characters in the quoted strings. Since awk already works line by line, that will be no easier or harder than processing the whole file. Changing the field separator without affecting the commas in quoted strings will be just as hard as parsing the file as is. The problem is the quoted strings, which could potentially contain anything, including literal double quote chatracters.
Quote:
I tried to use gsub:
Code:
awk -F '"' '{NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}}NF==1{gsub(",","|")}1' OFS='"' FileName
This gives me the expected file data as pipe separated but I want to know if I can get specific column data for each row using awk.
That will fail if there are any literal double quote characters (which are represented in a CSV file by repeating that character) in a quoted string, as in this line, which has 4 fields:
Code:
123,"Tuesday","Amy said, ""Hello""",456
[EDIT]I see now that it does work if I take out that leading "{" character that was causing a syntax error and prevented me from trying it. As long as there are no embedded pipe characters in the quoted strings, it's fine. If there are, you are back to square 1.

There are lots of simpler methods if you aren't trying make a general solution that works for any valid CSV file.

Last edited by rknichols; 06-01-2016 at 09:47 PM.
 
Old 06-01-2016, 09:27 PM   #8
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120Reputation: 4120
Well done - I gave up after the last time we broached this. For simple, well-defined cases FPAT works fine, but that function looks the goods ...
 
Old 06-02-2016, 06:12 AM   #9
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369Reputation: 369Reputation: 369Reputation: 369
I am most likely missing something (it is HOT today!), but wouldn't this also work?

Code:
echo '1234,some text,"some text, with comma",text,alphanumeric,"again with, comma",text' | sed -r 's/,([^ ])/|\1/g' | awk -F "|" '{ print $3 }'
"some text, with comma"
echo '1234,some text,"some text, with comma",text,alphanumeric,"again with, comma",text' | sed -r 's/,([^ ])/|\1/g' | awk -F "|" '{ print $6 }'
"again with, comma"

Seems to work with another test case:
Code:
echo '123,"Tuesday","Amy said, ""Hello""",456' | sed -r 's/,([^ ])/|\1/g' | awk -F "|" '{ print $3 }'
"Amy said, ""Hello"""
This was an interesting thing to try to solve, happy for any pointers as to where my regex might fail.

Best regards,
HMW

Last edited by HMW; 06-02-2016 at 06:13 AM.
 
1 members found this post helpful.
Old 06-02-2016, 08:21 AM   #10
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,774

Rep: Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211
Quote:
Originally Posted by HMW View Post
Code:
echo '1234,some text,"some text, with comma",text,alphanumeric,"again with, comma",text' | sed -r 's/,([^ ])/|\1/g' | awk -F "|" '{ print $3 }'
"some text, with comma"
echo '1234,some text,"some text, with comma",text,alphanumeric,"again with, comma",text' | sed -r 's/,([^ ])/|\1/g' | awk -F "|" '{ print $6 }'
"again with, comma"
That's yet another limited case, where the embedded comma is always followed by a space. Try it with
Code:
echo '123,"f(x,y)",456'
 
1 members found this post helpful.
Old 06-02-2016, 08:33 AM   #11
HMW
Member
 
Registered: Aug 2013
Location: Sweden
Distribution: Debian, Arch, Red Hat, CentOS
Posts: 773
Blog Entries: 3

Rep: Reputation: 369Reputation: 369Reputation: 369Reputation: 369
Quote:
Originally Posted by rknichols View Post
That's yet another limited case, where the embedded comma is always followed by a space. Try it with
Code:
echo '123,"f(x,y)",456'
Yes, you're right. It fails on this. I knew it was too easy, dang!

Best regards,
HMW
 
Old 06-02-2016, 08:56 AM   #12
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,774

Rep: Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211
Also, ir is legal for a field delimiter comma to be preceded or followed by a space.
Code:
123, xyz ,456
Whether that white space is to be considered part of the field is not universally defined. The function I posted does include the white space in the field. CSV can look simple, but the full syntax not quite that easy to parse.
 
Old 06-02-2016, 09:48 AM   #13
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191
Since others have mentioned it, here is the awk solution:
Code:
awk '{print $3"|"$6}' FPAT="([^,]+)|(\"[^\"]+\")" file
 
1 members found this post helpful.
Old 06-02-2016, 10:19 AM   #14
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,774

Rep: Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211
Quote:
Originally Posted by grail View Post
Since others have mentioned it, here is the awk solution:
Code:
awk '{print $3"|"$6}' FPAT="([^,]+)|(\"[^\"]+\")" file
Which works for simple CSV files, but fails if there are embedded literal quote characters, as in
Code:
111,222,"Amy said, ""Hello""",444,555,666
 
Old 06-02-2016, 11:09 AM   #15
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,005

Rep: Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191Reputation: 3191
hmmm .. true, I guess at the end of the day though we can all come up with something that a given solution will fail on
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to print lines in csv file if 1 csv column field = "text". There are 10 column (;) in csv file nexuslinux Linux - Newbie 9 04-22-2016 11:35 PM
Remove Column from .CSV using AWK command gaurvrishi Linux - Newbie 13 04-03-2015 11:14 AM
How to to strip commas from csv file but keep the fields separated? keenboy Linux - General 6 08-05-2013 06:05 AM
awk - rearrange column data in csv file to match columns wolverene13 Programming 9 12-21-2011 04:55 AM
Parsing a comma separated CSV file where fields have commas in to trickyflash Linux - General 7 03-26-2009 03:30 PM

LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie

All times are GMT -5. The time now is 06:04 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration