LinuxQuestions.org - awk with CSV files

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - awk with CSV files (https://www.linuxquestions.org/questions/programming-9/awk-with-csv-files-881103/)

awk with CSV files

In AWK, is there any good way to make reference to fields for CSV files?

The difficulties are:
1) Some double-quoted values may contain commas.
2) Some field may be empty. i.e. concatenated commas.

e.g.

Code:

1,"Good Work, Ltd.",,1.3,

2,"Bad Day Company",1,0

so simply

Code:

FS = ","

won't work.

How do you work around to process CSVs with such trouble?

Thanks.

Hi, I use awk to get my bandwidth usage from my Broadband isp.

I need to extract field 4, and then remove non-numerical chars.

I hope this link helps you figure it out for your situation. $n depicts column number

http://www.ibm.com/developerworks/ai.../section5.html

My script looks like this... I'm no expert, but I hope this helps.

Code:

#!/bin/bash

echo "vmc-sum"

#echo "shell set to noclobber, remove old file if it exists" 

if [ -f /home/glenn/tmp/calls02.csv ]

then

  rm -f /home/glenn/tmp/calls02.csv

fi

awk -F "," ' { print $4 } ' /home/glenn/Downloads/calls_2011*.csv > /home/glenn/tmp/calls02.csv

#echo "shell set to noclobber, remove old file if it exists" 

if [ -f /home/glenn/tmp/calls03.csv ]

then

  rm -f /home/glenn/tmp/calls03.csv

fi

# remove mobile phone numbers from list, pretexted with + sign. Needs to be done before removal of other non-integer chars

# Mobile numbers (sms) in older vodafone .csv files are also in collumn four($4).

sed -e '/\+/d' -e s/'[^0-9]'/""/g /home/glenn/tmp/calls02.csv > /home/glenn/tmp/calls03.csv

echo

awk 'END { print "You have logged on " NR " times this month." }' /home/glenn/tmp/calls03.csv

#echo "the file, /home/glenn/tmp/calls03.csv, ready to import to OO-Calc"  #useful result, saves hours of manual editing.GW

#echo "shell set to noclobber, remove old file if it exists"

if [ -f /home/glenn/tmp/calls04.csv ]

then

  rm -f /home/glenn/tmp/calls04.csv

fi

echo

awk 'BEGIN {ORS=" "} //' /home/glenn/tmp/calls03.csv > /home/glenn/tmp/calls04.csv

echo "sum-calls"

echo

# echo "call sum-calls script, inspired by wlayton27 Atomicmpc.com.au forums."

    sh /home/glenn/bin/sum-calls

#mv /home/glenn/Downloads/calls_201*.csv /home/glenn/Documents/personalBusiness/Vodafone-bills/

#exit=0 # you might want to see the screen.

echo

date && whoami

here is a tutorial to work through...Using grep, sed, and awk
http://www.ibm.com/developerworks/ai.../section5.html
Regards Glenn

There is a good description of the CSV format (actually formats -- thee is no standard) here.

As well as the difficulties described in the OP, CSV files may include newlines in fields -- tricky for line-orientated awk -- and may not populate empty fields at the end of the record. Here's the input part of an awk script that can cope with both of those situations; it was written to parse an Outlook Contacts CSV:

Code:

#!/usr/bin/awk -f



# Converts Outlook Contacts CSV file from stdin to gnokii phonebook raw format on stdout



# A description of CSV conventions: http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm

# Outlook does not populate empty fields after the last populated field.



# From the gnokii man page:

# The raw phonebook format is very simple.  Each line represents one entry. 

# Fields are separated by semicolons.  Semicolons aren't allowed inside a field.



# This script converts semicolons to colons and replaces new lines with \n



BEGIN {

    # Get number of fields from the header

    # (assumed not to contain commas or newlines in the field names)

    FS = ","

    my_getline()

    header_NF = NF

    print "header_NF is " header_NF

}



function catenate_with_space ( base, addition )

{

    if ( length( base ) == 0 ) return addition

    return base addition

}



function get_field ()

{

    print "Started get_field()"

    if ( length( $0 ) == 0 ) { 

        print "Setting EoR empty field"

        fields[ n_field ] = ""

        return

    }

    char = substr( $0, 1, 1 )

    print "Character is " char

    if ( char == "," ) {

        fields[ n_field ] = ""

        $0 = substr( $0, 2 )

        return

    }

    if ( char != "\"" ) {

        print "Line " NR ": char is '" char "' not \" as expected" > "/dev/stderr"

        exit 1

    }

    for ( i = 2; ; i++ )

    {

        if ( length( $0 ) == i - 1 ) { 

            my_getline() 

            i = 1

            fields[ n_field ] = fields[ n_field ] "\\n"

            if ( length( $0 ) == 0 ) continue

        }

        char = substr( $0, i, 1 )

        print "Character is " char

        if ( char == "\"" )

        {

            if ( length( $0 ) > i ) {

                next_char = substr( $0, i + 1, 1 )

              if ( next_char == "," ) {

                    # End of field; discard following "," and return

                    $0 = substr( $0, i + 2 )

                    print "Field " n_field " is " fields[ n_field ] 

                    return

                }

                else if (next_char == "\"" ) {

                    # Is doubled " which is an embedded "

                    i++

                }

                else {

                    print "Line " NR ": character after field ended is '" next_char "' not , as expected" > "/dev/stderr"

                    exit 1

                }

            }

            else {

                # End of field and end of record

                $0 = ""

                print "Field " n_field " is " fields[ n_field ] 

                return

            }

        }

        else if ( char == ";" ) char = ":"

        fields[ n_field ] = fields[ n_field ] char

    }

    

}



function get_record ()

{

    n_field = 1

    while ( n_field < header_NF ) {

        get_field()

        n_field++

    }

}



function my_getline ()

{

    if ( getline <= 0 ) {

        m = "unexpected EOF or error"

        m = (m ": " ERRNO)

        print m > "/dev/stderr"

        exit 1

    }

}



function write_record ()

{

    <as required>

}



{

    print "Processing record " NR

    get_record()

    write_record()

}

I did something like this:

Code:

{

        j = 1;

        field[j] = ""

        grouping = 0

        for (i = 1; i <= NF; i++) {



                if ($(i) ~ /^\".*[^\"]$/) {

                        grouping ++

                }

                else if ($(i) ~ /^[^\"].*\"$/) {

                        grouping --

                }



                if (field[j] != "") {

                        field[j] = field[j] "," $(i)

                }

                else {

                        field[j] = $(i)

                }



                if (grouping == 0) {

                        j++

                        field[j] = ""

                }

        }



        # now field[*] should carry the correctly delimited fields

        #    j-1 is the actual number of fields in that line

        for (i = 1; i < j ; i++) print "[" i "]" field[i]

}

However, it is limited to getting around the 2 difficulties I originally posted.

I was along the same lines as only fixing the double quote scenario:

Code:

#!/usr/bin/awk -f



BEGIN{  OFS = FS = ","  }



/"/{

    for(i=1;i<=NF;i++){

        if($i ~ /^"[^"]+$/){

            for(x=i+1;x<=NF;x++){

                $i=$i","$x

                if($i ~ /"$/){

                    z = x - (i + 1) + 1

                    for(y=i+1;y<=NF;y++)

                        $y = $(y + z)

                    break

                }

            }

            NF = NF - z

            i=x

        }

    }

}



# replace below for your printing options.

1

And another alternative similar to your own:

Code:

#!/usr/bin/awk -f



BEGIN { FS = ","    }



{

    y = 1



    for(x=1; x <= NF; x++)

        if($x !~ /"[^"]+$/ && y)

            field[NR] = $x

        else{

            if(y){

                field[NR] = $x","$(x+1)

                x++

            }

            else

                field[NR] = field[NR]","$x



            y = field[NR] ~ /"[^"]+$/?0:1



        }



# do your stuff with field array here

}