12-01-2011, 10:01 AM   #1
tnrooster
sed or awk for CSV file clarifying


Hello all,

I've been tasked with accepting an odd CSV feed from one of my clients.

Right now it has 26k rows, and it will grow over time.

The alpha values are encapsulated in double quotes; the numeric values are not. Fields are separated by commas.

Six or seven entries (for now) have a carriage return somewhere in the data, mainly in addresses, which throws off importing these rows into a database. I can obviously scan through, find the lines, and manually delete the useless carriage returns, and all is well.

But I need this as a bash piece so I can forget about it.

Example:

Code:
1234,"Johnny's Pizza","111 Main St.","Yankee Town","PA","11111","US","Johnny Contact"
1235,"Johnny's Too"," 112 Main St.
ste 2 ","Yankee Town","PA","11111","US","Johnny Other"
Also there is a header in the file.

The header values are not wrapped with quotes.

So the only qualifier I can come up with is that no carriage return should come after an odd number of quotes; an odd count means a quoted field is still open.

Is there any way to do this with sed or awk?

Or something else?
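
Something like this is what I'm imagining: count the double quotes and only end a record when the count is even, since an odd count means a quoted field is still open. A rough, untested awk sketch (the file names are just examples):

Code:
awk '{
    buf = buf $0
    # gsub() returns the number of quotes accumulated so far in this record
    if ( gsub( /"/, "\"", buf ) % 2 == 0 ) {
        print buf    # even count: the record is complete
        buf = ""
    }
    # odd count: a quoted field is still open, keep joining lines
}' feed.csv > fixed.csv
That joins the pieces with nothing in between, which is fine here since I just want the breaks gone.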

Thanks, Joe
 
12-01-2011, 10:10 AM   #2
tronayne
Ah, the lovely CSV file.

Here's a little shell program, dos2unx, that will clean things up:
Code:
#!/bin/sh
#
# dos2unx file [file...]
#
# Converts text files (names specified on command line) from MS-DOS
# format to UNIX format.  Essentially, gets rid of all carriage
# returns (\r), since linefeeds (\n) are all UNIX needs.

if [ $# -lt 1 ]
then
        echo "usage: dos2unx file [file ...]"
        exit 1
fi

for FILE
do
        echo -n "dos2unx: converting ${FILE} ... "
        tr -d '\r' < "${FILE}" > /tmp/conv$$
        rm -f "${FILE}"
        cp -f /tmp/conv$$ "${FILE}"
        rm -f /tmp/conv$$
        echo "done"
done
Try it on something you don't care about (like, you know, a copy of that file).
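
Something like this (file names here are only illustrative):
Code:
cp feed.csv feed-copy.csv
./dos2unx feed-copy.csv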

Hope this helps some.
 
12-01-2011, 10:52 AM   #3
tnrooster (Original Poster)
Well, that might help on some other files, but I may have misspoken: the newline is what I need to get rid of.

The files are transferred through my FTP. I was under the impression that took care of the different line-ending styles?

My issue is the 6 or 7 rows that have a line feed/carriage return inside a data field.

I need to filter just those out.
 
12-01-2011, 01:36 PM   #4
firstfire
Hi.
Try this perl script:
Code:
while (<>) {
        # A line that ends inside a quoted field looks like: comma,
        # optional spaces, opening quote, then no closing quote before
        # end of line.  Drop that trailing newline so the next input
        # line gets joined onto this one.
        s/(, *\"[^"]+)\n/$1/m;
        print;
}
`m' here stands for multi-line.

I use the combination of a comma and a quote as an indication of the beginning of a string. This code will not work if there is a newline in a string in the first column, because there is no comma before it.
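
Assuming the script is saved as, say, join_lines.pl (a name I just made up), it works as a filter:
Code:
perl join_lines.pl broken.csv > fixed.csv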

Hope this helps.
 
12-03-2011, 09:58 AM   #5
catkin
Here's the essential awk to parse such a file, including newlines in fields. It is taken from a utility that parses Outlook CSVs; irrelevant code has been snipped from BEGIN, and the msg function is not essential to how it works. The core of the functionality is in the get_field function. Header_NF is the number of fields as defined by the header. The my_getline function is essentially awk's getline, but it raises an error if there should be another line (that is, a field with an embedded newline immediately before end of file).
Code:
BEGIN {

    # Input_file names the CSV to read; it must be set externally
    # (see the usage note after the code)

    # Get number of fields from the header
    # (assumed not to contain commas or newlines in the field names)
    FS = ","
    my_getline()
    Header_NF = NF

    # Loop over all lines in CSV
    while ( ( getline < Input_file ) > 0 ) {
        get_record()
    }
}

function my_getline ()
{
    if ( ( getline < Input_file ) <= 0 ) {
        msg( "E", "Unexpected end of file or error: " ERRNO )
    } 
    Record_num++
}

function get_record ( \
    n_field )
{
    n_field = 1
    while ( n_field <= Header_NF ) {    # <=, so the last field is parsed too
        get_field( n_field )
        msg( "D90", "Record " Record_num ", field " n_field ": " Field[ n_field ] )
        n_field++
    }
}

function get_field ( n_field )
{
    msg( "D100", "Started get_field()" )
    Field[ n_field ] = ""
    if ( length( $0 ) == 0 ) { 
        msg( "D100", "Setting EoR empty field" )
        return
    }
    char = substr( $0, 1, 1 )
    msg( "D100", "Character is " char )
    if ( char == "," ) {
        $0 = substr( $0, 2 )
        return
    }
    if ( char != "\"" ) {
        msg( "E", "Line " Record_num ": char is '" char "' not \" as expected" )
        exit 1
    }
    for ( i = 2; ; i++ )
    {
        if ( length( $0 ) == i - 1 ) {
            # End of this physical line: the field continues on the
            # next one.  Record the embedded newline as a literal "\n".
            my_getline()
            Field[ n_field ] = Field[ n_field ] "\\n"
            if ( length( $0 ) == 0 ) {
                i = 0    # empty continuation line; loop to read another
                continue
            }
            i = 1
        }
        char = substr( $0, i, 1 )
        msg( "D100", "Character is " char )
        if ( char == "\"" )
        {
            if ( length( $0 ) > i ) {
                next_char = substr( $0, i + 1, 1 )
                if ( next_char == "," ) {
                    # End of field; discard following "," and return
                    $0 = substr( $0, i + 2 )
                    msg( "D100", "Field " n_field " is " Field[ n_field ] )
                    return
                }
                else if (next_char == "\"" ) {
                    # Is doubled " which is an embedded "
                    i++
                }
                else {
                    msg( "E", \
                        "Line " Record_num ": char after closing \" is '" next_char "' not , as expected" )
                }
            }
            else {
                # End of field and end of record
                $0 = ""
                msg( "D100", "Field " n_field " is " Field[ n_field ] )
                return
            }
        }
        Field[ n_field ] = Field[ n_field ] char
    }
}

# Minimal stand-in for the utility's msg function (the original was
# snipped from the post): print errors to stderr and stop; discard
# debug ("D...") messages.
function msg ( level, text )
{
    if ( substr( level, 1, 1 ) == "E" ) {
        print "ERROR: " text > "/dev/stderr"
        exit 1
    }
}
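
Because all the work happens in BEGIN, awk never reads a main input file, so (assuming the program is saved as parse_csv.awk, a name I made up) it can be run like this:
Code:
awk -v Input_file=feed.csv -f parse_csv.awk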
 
  

