sed or awk for csv file clarifying

tnrooster · 12-01-2011, 10:01 AM

Hello all ,

I've been tasked with accepting an odd csv feed from one of my clients.

Right now it has 26k rows, it will grow over time.

The alpha values are encapsulated with double quotes, the numeric values are not. Fields separated by commas.

6 or 7, for now, entries have a carriage return somewhere in the data. Addresses mainly, which throws off importing these into a database. I can obviously scan through and find the lines , manually deleting the useless carriage returns and all is well.

But I need this as a bash piece so I can forget about it.

Example:

Code:

1234,"Johnny's Pizza","111 Main St.","Yankee Town","PA","11111","US","Johnny Contact"
1235."Johnny's Too"," 112 Main St.
ste 2 ","Yankee Town","PA","11111","US","Johnny Other"

Also there is a header in the file.

The header values are not wrapped with quotes.

So the only qualifier I can come up with is no carriage returns after an odd number of quotes.

Is there any way to do this with sed or awk?

Or something else?

Thanks, Joe

tronayne · 12-01-2011, 10:10 AM

Ah, the lovely CVS file.

Here's a little shell program, dos2unx that will clean things up:

Code:

!/bin/sh
#
# dos2unx file [file...]
#
# Converts text files (names specified on command line) from MS-DOS
# format to UNIX format.  Essentially, gets rid of all newlines (\n),
# since linefeeds (\l) are all it needs.

if [ $# -lt 1 ]
then
        echo usage: dos2unx file [file ...]
        exit 1
fi

for FILE
do
        echo -n "dos2unx: converting ${FILE} ... "
        tr -d '\r' < ${FILE} > /tmp/conv$$
        rm -f ${FILE}
        cp -f /tmp/conv$$ ${FILE}
        rm -f /tmp/conv$$
        echo "done"
done

Try it on something you don't care about (like, you know, a copy of that file).

Hope this helps some.

tnrooster · 12-01-2011, 10:52 AM

Well that might help on some other files and I may have misspoke as the new line is what I need to get rid of.

The files are transfered, fed, through my FTP. I was to understand that took care of the different styles of line feeds?

My issue is the 6 to 7 rows that have line feed/carriage returns in the data field.

I need to filter just those out.

firstfire · 12-01-2011, 01:36 PM

Hi.
Try this perl script:

Code:

while(<>){
        s/(, *\"[^"]+)\n/$1/m;
        print;
}

`m' here stands for multi-line.

I use combination of comma and quote as indication of beginning of a string.
This code will not work if there are newline in a string in first column, because there are no comma before it.

Hope this helps.

catkin · 12-03-2011, 09:58 AM

Here's essential awk to parse such a file, including newlines in fields. It is taken from a utility that parses Outlook CSVs. Irrelevant code has been snipped from BEGIN. The msg function is not essential to how it works. The core of the functionality is in the get_field function. Header_NF is the number of fields as defined by the header. The my_getline function is essentially awk's getline but returns an error if there should be another line (field with embedded newline immediately before end of file)

Code:

BEGIN {

    # Get number of fields from the header
    # (assumed not to contain commas or newlines in the field names)
    FS = "," 
    my_getline()
    Header_NF = NF 

    # Loop over all lines in CSV
    while ( ( getline < Input_file ) > 0 ) {
        get_record()
    }
}

function my_getline ()
{
    if ( ( getline < Input_file ) <= 0 ) {
        msg( "E", "Unexpected end of file or error: " ERRNO )
    } 
    Record_num++
}

function get_record ( \
    n_field )
{
    n_field = 1
    while ( n_field < Header_NF ) {
        get_field( n_field )
        msg( "D90", "Record " Record_num ", field " n_field ": " Field[ n_field] )
        n_field++
    }
}

function get_field ( n_field )
{
    msg( "D100", "Started get_field()" )
    Field[ n_field ] = ""
    if ( length( $0 ) == 0 ) { 
        msg( "D100", "Setting EoR empty field" )
        return
    }
    char = substr( $0, 1, 1 )
    msg( "D100", "Character is " char )
    if ( char == "," ) {
        $0 = substr( $0, 2 )
        return
    }
    if ( char != "\"" ) {
        msg( "E", "Line " Record_num ": char is '" char "' not \" as expected" )
        exit 1
    }
    for ( i = 2; ; i++ )
    {
        if ( length( $0 ) == i - 1 ) { 
            my_getline() 
            i = 1
            Field[ n_field ] = Field[ n_field ] "\\n"
            if ( length( $0 ) == 0 ) continue
        }
        char = substr( $0, i, 1 )
        msg( "D100", "Character is " char )
        if ( char == "\"" )
        {
            if ( length( $0 ) > i ) {
                next_char = substr( $0, i + 1, 1 )
                if ( next_char == "," ) {
                    # End of field; discard following "," and return
                    $0 = substr( $0, i + 2 )
                    msg( "D100", "Field " n_field " is " Field[ n_field ] )
                    return
                }
                else if (next_char == "\"" ) {
                    # Is doubled " which is an embedded "
                    i++
                }
                else {
                    msg( "E", \
                        "Line " Record_num ": char is '" char "' not \" as expected" )
                }
            }
            else {
                # End of field and end of record
                $0 = ""
                msg( "D100", "Field " n_field " is " Field[ n_field ] )
                return
            }
        }
        Field[ n_field ] = Field[ n_field ] char
    }
    
}