LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 05-16-2011, 09:58 PM   #1
Jerry Mcguire
Member
 
Registered: Jul 2009
Location: Hong Kong SAR
Distribution: RedHat, Fedora
Posts: 156

Rep: Reputation: 19
awk with CSV files


In AWK, is there any good way to make reference to fields for CSV files?

The difficulties are:
1) Some double-quoted values may contain commas.
2) Some field may be empty. i.e. concatenated commas.

e.g.
Code:
1,"Good Work, Ltd.",,1.3,
2,"Bad Day Company",1,0
so simply
Code:
FS = ","
won't work.

How do you work around to process CSVs with such trouble?

Thanks.
 
Old 05-16-2011, 10:58 PM   #2
GlennsPref
Senior Member
 
Registered: Apr 2004
Location: Brisbane, Australia
Distribution: pclinuxos slackware64 tails kali
Posts: 3,372
Blog Entries: 33

Rep: Reputation: 217Reputation: 217Reputation: 217
Hi, I use awk to get my bandwidth usage from my Broadband isp.

I need to extract field 4, and then remove non-numerical chars.

I hope this link helps you figure it out for your situation. $n depicts column number

http://www.ibm.com/developerworks/ai.../section5.html

My script looks like this... I'm no expert, but I hope this helps.

Code:
#!/bin/bash
echo "vmc-sum"
#echo "shell set to noclobber, remove old file if it exists" 
if [ -f /home/glenn/tmp/calls02.csv ]
then
  rm -f /home/glenn/tmp/calls02.csv
fi
awk -F "," ' { print $4 } ' /home/glenn/Downloads/calls_2011*.csv > /home/glenn/tmp/calls02.csv
#echo "shell set to noclobber, remove old file if it exists" 
if [ -f /home/glenn/tmp/calls03.csv ]
then
  rm -f /home/glenn/tmp/calls03.csv
fi
# remove mobile phone numbers from list, pretexted with + sign. Needs to be done before removal of other non-integer chars
# Mobile numbers (sms) in older vodafone .csv files are also in collumn four($4).
sed -e '/\+/d' -e s/'[^0-9]'/""/g /home/glenn/tmp/calls02.csv > /home/glenn/tmp/calls03.csv
echo
awk 'END { print "You have logged on " NR " times this month." }' /home/glenn/tmp/calls03.csv
#echo "the file, /home/glenn/tmp/calls03.csv, ready to import to OO-Calc"  #useful result, saves hours of manual editing.GW
#echo "shell set to noclobber, remove old file if it exists"
if [ -f /home/glenn/tmp/calls04.csv ]
then
  rm -f /home/glenn/tmp/calls04.csv
fi
echo
awk 'BEGIN {ORS=" "} //' /home/glenn/tmp/calls03.csv > /home/glenn/tmp/calls04.csv
echo "sum-calls"
echo
# echo "call sum-calls script, inspired by wlayton27 Atomicmpc.com.au forums."
     sh /home/glenn/bin/sum-calls
#mv /home/glenn/Downloads/calls_201*.csv /home/glenn/Documents/personalBusiness/Vodafone-bills/
#exit=0 # you might want to see the screen.
echo
date && whoami
here is a tutorial to work through...Using grep, sed, and awk
http://www.ibm.com/developerworks/ai.../section5.html
Regards Glenn

Last edited by GlennsPref; 05-16-2011 at 11:03 PM. Reason: Typos. and add tute. ;)
 
Old 05-16-2011, 11:47 PM   #3
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,576
Blog Entries: 31

Rep: Reputation: 1195Reputation: 1195Reputation: 1195Reputation: 1195Reputation: 1195Reputation: 1195Reputation: 1195Reputation: 1195Reputation: 1195
There is a good description of the CSV format (actually formats -- thee is no standard) here.

As well as the difficulties described in the OP, CSV files may include newlines in fields -- tricky for line-orientated awk -- and may not populate empty fields at the end of the record. Here's the input part of an awk script that can cope with both of those situations; it was written to parse an Outlook Contacts CSV:
Code:
#!/usr/bin/awk -f

# Converts Outlook Contacts CSV file from stdin to gnokii phonebook raw format on stdout

# A description of CSV conventions: http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
# Outlook does not populate empty fields after the last populated field.

# From the gnokii man page:
# The raw phonebook format is very simple.  Each line represents one entry. 
# Fields are separated by semicolons.  Semicolons aren't allowed inside a field.

# This script converts semicolons to colons and replaces new lines with \n

BEGIN {
    # Get number of fields from the header
    # (assumed not to contain commas or newlines in the field names)
    FS = ","
    my_getline()
    header_NF = NF
    print "header_NF is " header_NF
}

function catenate_with_space ( base, addition )
{
    if ( length( base ) == 0 ) return addition
    return base addition
}

function get_field ()
{
    print "Started get_field()"
    if ( length( $0 ) == 0 ) { 
        print "Setting EoR empty field"
        fields[ n_field ] = ""
        return
    }
    char = substr( $0, 1, 1 )
    print "Character is " char
    if ( char == "," ) {
        fields[ n_field ] = ""
        $0 = substr( $0, 2 )
        return
    }
    if ( char != "\"" ) {
        print "Line " NR ": char is '" char "' not \" as expected" > "/dev/stderr"
        exit 1
    }
    for ( i = 2; ; i++ )
    {
        if ( length( $0 ) == i - 1 ) { 
            my_getline() 
            i = 1
            fields[ n_field ] = fields[ n_field ] "\\n"
            if ( length( $0 ) == 0 ) continue
        }
        char = substr( $0, i, 1 )
        print "Character is " char
        if ( char == "\"" )
        {
            if ( length( $0 ) > i ) {
                next_char = substr( $0, i + 1, 1 )
              if ( next_char == "," ) {
                    # End of field; discard following "," and return
                    $0 = substr( $0, i + 2 )
                    print "Field " n_field " is " fields[ n_field ] 
                    return
                }
                else if (next_char == "\"" ) {
                    # Is doubled " which is an embedded "
                    i++
                }
                else {
                    print "Line " NR ": character after field ended is '" next_char "' not , as expected" > "/dev/stderr"
                    exit 1
                }
            }
            else {
                # End of field and end of record
                $0 = ""
                print "Field " n_field " is " fields[ n_field ] 
                return
            }
        }
        else if ( char == ";" ) char = ":"
        fields[ n_field ] = fields[ n_field ] char
    }
    
}

function get_record ()
{
    n_field = 1
    while ( n_field < header_NF ) {
        get_field()
        n_field++
    }
}

function my_getline ()
{
    if ( getline <= 0 ) {
        m = "unexpected EOF or error"
        m = (m ": " ERRNO)
        print m > "/dev/stderr"
        exit 1
    }
}

function write_record ()
{
    <as required>
}

{
    print "Processing record " NR
    get_record()
    write_record()
}
 
Old 05-17-2011, 01:00 AM   #4
Jerry Mcguire
Member
 
Registered: Jul 2009
Location: Hong Kong SAR
Distribution: RedHat, Fedora
Posts: 156

Original Poster
Rep: Reputation: 19
I did something like this:
Code:
{
        j = 1;
        field[j] = ""
        grouping = 0
        for (i = 1; i <= NF; i++) {

                if ($(i) ~ /^\".*[^\"]$/) {
                        grouping ++
                }
                else if ($(i) ~ /^[^\"].*\"$/) {
                        grouping --
                }

                if (field[j] != "") {
                        field[j] = field[j] "," $(i)
                }
                else {
                        field[j] = $(i)
                }

                if (grouping == 0) {
                        j++
                        field[j] = ""
                }
        }

        # now field[*] should carry the correctly delimited fields
        #     j-1 is the actual number of fields in that line
        for (i = 1; i < j ; i++) print "[" i "]" field[i]
}
However, it is limited to getting around the 2 difficulties I originally posted.

Last edited by Jerry Mcguire; 05-17-2011 at 01:17 AM.
 
Old 05-17-2011, 01:17 AM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,437

Rep: Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842
I was along the same lines as only fixing the double quote scenario:
Code:
#!/usr/bin/awk -f

BEGIN{  OFS = FS = ","  }

/"/{
    for(i=1;i<=NF;i++){
        if($i ~ /^"[^"]+$/){
            for(x=i+1;x<=NF;x++){
                $i=$i","$x
                if($i ~ /"$/){
                    z = x - (i + 1) + 1
                    for(y=i+1;y<=NF;y++)
                        $y = $(y + z)
                    break
                }
            }
            NF = NF - z
            i=x
        }
    }
}

# replace below for your printing options.
1
 
Old 05-17-2011, 06:39 AM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,437

Rep: Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842Reputation: 2842
And another alternative similar to your own:
Code:
#!/usr/bin/awk -f

BEGIN { FS = ","    }

{
    y = 1

    for(x=1; x <= NF; x++)
        if($x !~ /"[^"]+$/ && y)
            field[NR] = $x
        else{
            if(y){
                field[NR] = $x","$(x+1)
                x++
            }
            else
                field[NR] = field[NR]","$x

            y = field[NR] ~ /"[^"]+$/?0:1

        }

# do your stuff with field array here
}
 
1 members found this post helpful.
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
awk and csv file Firebar Programming 9 11-18-2010 05:00 AM
awk question on handling *.CSV "text fields" in awk jschiwal Programming 8 05-27-2010 06:23 AM
Using awk/sed to convert linefeed to csv, with some formatting jaykup Programming 1 04-03-2009 05:18 PM
AWK: change a particular field in a csv file help help help!!!! haydar68 Programming 20 08-03-2008 01:10 AM
Comparing two csv files and write different record in third CSV file irfanb146 Linux - Newbie 3 06-30-2008 09:15 PM


All times are GMT -5. The time now is 08:38 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration