LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   awk with CSV files (https://www.linuxquestions.org/questions/programming-9/awk-with-csv-files-881103/)

Jerry Mcguire 05-16-2011 09:58 PM

awk with CSV files
 
In AWK, is there any good way to make reference to fields for CSV files?

The difficulties are:
1) Some double-quoted values may contain commas.
2) Some field may be empty. i.e. concatenated commas.

e.g.
Code:

1,"Good Work, Ltd.",,1.3,
2,"Bad Day Company",1,0

so simply
Code:

FS = ","
won't work.

How do you work around to process CSVs with such trouble?

Thanks.

GlennsPref 05-16-2011 10:58 PM

Hi, I use awk to get my bandwidth usage from my Broadband isp.

I need to extract field 4, and then remove non-numerical chars.

I hope this link helps you figure it out for your situation. $n depicts column number

http://www.ibm.com/developerworks/ai.../section5.html

My script looks like this... I'm no expert, but I hope this helps.

Code:

#!/bin/bash
echo "vmc-sum"
#echo "shell set to noclobber, remove old file if it exists"
if [ -f /home/glenn/tmp/calls02.csv ]
then
  rm -f /home/glenn/tmp/calls02.csv
fi
awk -F "," ' { print $4 } ' /home/glenn/Downloads/calls_2011*.csv > /home/glenn/tmp/calls02.csv
#echo "shell set to noclobber, remove old file if it exists"
if [ -f /home/glenn/tmp/calls03.csv ]
then
  rm -f /home/glenn/tmp/calls03.csv
fi
# remove mobile phone numbers from list, pretexted with + sign. Needs to be done before removal of other non-integer chars
# Mobile numbers (sms) in older vodafone .csv files are also in collumn four($4).
sed -e '/\+/d' -e s/'[^0-9]'/""/g /home/glenn/tmp/calls02.csv > /home/glenn/tmp/calls03.csv
echo
awk 'END { print "You have logged on " NR " times this month." }' /home/glenn/tmp/calls03.csv
#echo "the file, /home/glenn/tmp/calls03.csv, ready to import to OO-Calc"  #useful result, saves hours of manual editing.GW
#echo "shell set to noclobber, remove old file if it exists"
if [ -f /home/glenn/tmp/calls04.csv ]
then
  rm -f /home/glenn/tmp/calls04.csv
fi
echo
awk 'BEGIN {ORS=" "} //' /home/glenn/tmp/calls03.csv > /home/glenn/tmp/calls04.csv
echo "sum-calls"
echo
# echo "call sum-calls script, inspired by wlayton27 Atomicmpc.com.au forums."
    sh /home/glenn/bin/sum-calls
#mv /home/glenn/Downloads/calls_201*.csv /home/glenn/Documents/personalBusiness/Vodafone-bills/
#exit=0 # you might want to see the screen.
echo
date && whoami

here is a tutorial to work through...Using grep, sed, and awk
http://www.ibm.com/developerworks/ai.../section5.html
Regards Glenn

catkin 05-16-2011 11:47 PM

There is a good description of the CSV format (actually formats -- thee is no standard) here.

As well as the difficulties described in the OP, CSV files may include newlines in fields -- tricky for line-orientated awk -- and may not populate empty fields at the end of the record. Here's the input part of an awk script that can cope with both of those situations; it was written to parse an Outlook Contacts CSV:
Code:

#!/usr/bin/awk -f

# Converts Outlook Contacts CSV file from stdin to gnokii phonebook raw format on stdout

# A description of CSV conventions: http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
# Outlook does not populate empty fields after the last populated field.

# From the gnokii man page:
# The raw phonebook format is very simple.  Each line represents one entry.
# Fields are separated by semicolons.  Semicolons aren't allowed inside a field.

# This script converts semicolons to colons and replaces new lines with \n

BEGIN {
    # Get number of fields from the header
    # (assumed not to contain commas or newlines in the field names)
    FS = ","
    my_getline()
    header_NF = NF
    print "header_NF is " header_NF
}

function catenate_with_space ( base, addition )
{
    if ( length( base ) == 0 ) return addition
    return base addition
}

function get_field ()
{
    print "Started get_field()"
    if ( length( $0 ) == 0 ) {
        print "Setting EoR empty field"
        fields[ n_field ] = ""
        return
    }
    char = substr( $0, 1, 1 )
    print "Character is " char
    if ( char == "," ) {
        fields[ n_field ] = ""
        $0 = substr( $0, 2 )
        return
    }
    if ( char != "\"" ) {
        print "Line " NR ": char is '" char "' not \" as expected" > "/dev/stderr"
        exit 1
    }
    for ( i = 2; ; i++ )
    {
        if ( length( $0 ) == i - 1 ) {
            my_getline()
            i = 1
            fields[ n_field ] = fields[ n_field ] "\\n"
            if ( length( $0 ) == 0 ) continue
        }
        char = substr( $0, i, 1 )
        print "Character is " char
        if ( char == "\"" )
        {
            if ( length( $0 ) > i ) {
                next_char = substr( $0, i + 1, 1 )
              if ( next_char == "," ) {
                    # End of field; discard following "," and return
                    $0 = substr( $0, i + 2 )
                    print "Field " n_field " is " fields[ n_field ]
                    return
                }
                else if (next_char == "\"" ) {
                    # Is doubled " which is an embedded "
                    i++
                }
                else {
                    print "Line " NR ": character after field ended is '" next_char "' not , as expected" > "/dev/stderr"
                    exit 1
                }
            }
            else {
                # End of field and end of record
                $0 = ""
                print "Field " n_field " is " fields[ n_field ]
                return
            }
        }
        else if ( char == ";" ) char = ":"
        fields[ n_field ] = fields[ n_field ] char
    }
   
}

function get_record ()
{
    n_field = 1
    while ( n_field < header_NF ) {
        get_field()
        n_field++
    }
}

function my_getline ()
{
    if ( getline <= 0 ) {
        m = "unexpected EOF or error"
        m = (m ": " ERRNO)
        print m > "/dev/stderr"
        exit 1
    }
}

function write_record ()
{
    <as required>
}

{
    print "Processing record " NR
    get_record()
    write_record()
}


Jerry Mcguire 05-17-2011 01:00 AM

I did something like this:
Code:

{
        j = 1;
        field[j] = ""
        grouping = 0
        for (i = 1; i <= NF; i++) {

                if ($(i) ~ /^\".*[^\"]$/) {
                        grouping ++
                }
                else if ($(i) ~ /^[^\"].*\"$/) {
                        grouping --
                }

                if (field[j] != "") {
                        field[j] = field[j] "," $(i)
                }
                else {
                        field[j] = $(i)
                }

                if (grouping == 0) {
                        j++
                        field[j] = ""
                }
        }

        # now field[*] should carry the correctly delimited fields
        #    j-1 is the actual number of fields in that line
        for (i = 1; i < j ; i++) print "[" i "]" field[i]
}

However, it is limited to getting around the 2 difficulties I originally posted.

grail 05-17-2011 01:17 AM

I was along the same lines as only fixing the double quote scenario:
Code:

#!/usr/bin/awk -f

BEGIN{  OFS = FS = ","  }

/"/{
    for(i=1;i<=NF;i++){
        if($i ~ /^"[^"]+$/){
            for(x=i+1;x<=NF;x++){
                $i=$i","$x
                if($i ~ /"$/){
                    z = x - (i + 1) + 1
                    for(y=i+1;y<=NF;y++)
                        $y = $(y + z)
                    break
                }
            }
            NF = NF - z
            i=x
        }
    }
}

# replace below for your printing options.
1


grail 05-17-2011 06:39 AM

And another alternative similar to your own:
Code:

#!/usr/bin/awk -f

BEGIN { FS = ","    }

{
    y = 1

    for(x=1; x <= NF; x++)
        if($x !~ /"[^"]+$/ && y)
            field[NR] = $x
        else{
            if(y){
                field[NR] = $x","$(x+1)
                x++
            }
            else
                field[NR] = field[NR]","$x

            y = field[NR] ~ /"[^"]+$/?0:1

        }

# do your stuff with field array here
}



All times are GMT -5. The time now is 09:20 PM.