LinuxQuestions.org
View the Most Wanted LQ Wiki articles.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices



Reply
 
Search this Thread
Old 05-24-2011, 12:38 PM   #1
PenguinJr
LQ Newbie
 
Registered: May 2011
Posts: 7

Rep: Reputation: 0
Question AWK / SED - Parsing a CSV file with comma delimiter, and some extra needs.


Hi,

I'm requiring your help for a problem that got me busy for almost the whole day : I have an input file in CSV format with comma delimiters :

Quote:
A,1,John,Carpenter,27,,"Carpenter, John",+169457,,BDG
A,1,James,Hill,61,SRR,"Hill, James",,,"FFB, BFB, NSO"
A,4,Eva,Brown,31,,"Brown, Eva",000916,D'Aboville & D'Aboville,SNX
And here is what I'm trying to get :

Quote:
'A','1','John','Carpenter','27','','Carpenter, John','+169457','','BDG'
'A','1','James','Hill','61','SRR','Hill, James','','','FFB, BFB, NSO'
'A','4','Eva','Brown','31','','Brown, Eva','000916','D''Aboville & D''Aboville','SNX'
In other words, I'm trying to apply the following rules :
1) All fields separated by a comma (including blank ones) must be simple-quoted, unless this (these) comma(s) belongs to a double-quoted field.
2) All simple-quotes contained within a field must be doubled.
3) If a field is already encapsulated in double-quotes, replace them by simple quotes.

My main problem is that I don't know how to code in AWK or SED the fact to ignore comma(s) within double-quotes when parsing the fields. I'm pretty sure that once this step is done, I can do the rest.

Many thanks,
PenguinJr

Last edited by PenguinJr; 05-24-2011 at 12:43 PM.
 
Old 05-24-2011, 02:10 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 23,005
Blog Entries: 11

Rep: Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903
While parsing CSV can be done in awk, it's not straight forward
or trivial; I'd recommend using perl, python or some language that
has modules built for that very purpose.


Cheers,
Tink
 
Old 05-24-2011, 02:17 PM   #3
MTK358
LQ 5k Club
 
Registered: Sep 2009
Posts: 6,443
Blog Entries: 3

Rep: Reputation: 714Reputation: 714Reputation: 714Reputation: 714Reputation: 714Reputation: 714Reputation: 714
I also thought that it might be a better idea to use a "full" programming language for this, but I didn't want to take this thread off the zero reply list.

Code:
print "'"
for every character in the line
    is it a comma?
        print "','"
    or is it a double quote?
        skip forward until the next double quote, printing each character.
    or is it a single quote?
        print "''"
    it's neither of the above
        print the current character
print "'"
And by the way, they're "single" quotes, not "simple" quotes.
 
Old 05-24-2011, 02:26 PM   #4
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,604

Rep: Reputation: 446Reputation: 446Reputation: 446Reputation: 446Reputation: 446
Hi,

this worked with your sample data. Not sure if there is a better, shorter way to do it in awk.
Code:
#!/usr/bin/gawk -f
BEGIN { FS="," }
{	i=0
	c=","
	while(i++<NF) {
		if (i==NF) { c="" }
		gsub("'","''",$i)
		if ($i~/^"/) {
			sub("\"","'",$i)
			a=!a
		}
		if ($i~/"$/) {
			sub("\"","'",$i)
			a=!a
			printf "%s%s",$i,c
			continue
		}
		if (!a) {
			printf "'%s'%s",$i,c
			continue
		}			
		printf "%s%s",$i,c
	}
	printf "\n"
}
And the result:
Code:
$ cat file
A,1,John,Carpenter,27,,"Carpenter, John",+169457,,BDG
A,1,James,Hill,61,SRR,"Hill, James",,,"FFB, BFB, NSO"
A,4,Eva,Brown,31,,"Brown, Eva",000916,D'Aboville & D'Aboville,SNX 
$ 
$ ./awk.scr file
'A','1','John','Carpenter','27','','Carpenter, John','+169457','','BDG'
'A','1','James','Hill','61','SRR','Hill, James','','','FFB, BFB, NSO'
'A','4','Eva','Brown','31','','Brown, Eva','000916','D''Aboville & D''Aboville','SNX '
$

Last edited by crts; 05-24-2011 at 02:27 PM.
 
1 members found this post helpful.
Old 05-24-2011, 02:57 PM   #5
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,563
Blog Entries: 29

Rep: Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179
There's an awk script to parse a CSV file that might give some ideas in this LQ post.
 
Old 05-24-2011, 03:53 PM   #6
PenguinJr
LQ Newbie
 
Registered: May 2011
Posts: 7

Original Poster
Rep: Reputation: 0
Thank you for all these accurate and fast answers, I'm already in the awk manual trying to figure out how crts script is working, and also checking catkin's link.
Regards,
PenguinJr
 
Old 05-24-2011, 04:01 PM   #7
sundialsvcs
Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 5,455

Rep: Reputation: 1172Reputation: 1172Reputation: 1172Reputation: 1172Reputation: 1172Reputation: 1172Reputation: 1172Reputation: 1172Reputation: 1172
Feel free to do it the hard way, but that's what Perl is for. The Linux/Unix environments give you an extremely rich toolkit: use it well.
 
Old 05-24-2011, 06:37 PM   #8
Birei
LQ Newbie
 
Registered: Nov 2010
Posts: 17

Rep: Reputation: 6
Hi,

Using 'Perl':
Code:
$ cat script.pl
use strict;
use warnings;
use autodie;

@ARGV == 1 or die "Usage: perl $0 input-file\n";
open my $fh, "<", $ARGV[0];
my (@f, $i);

while ( <$fh> ) {
	chomp;
	@f = split /,/;
	for ( 0 .. $#f ) {
		next unless exists $f[$_];
		$i = $_;
		if ( index( $f[$i], qw(") ) != -1 ) {
			while ( ++$i <= $#f and index( $f[$i], qw(") ) == -1 ) {
				$f[$_] = join ",", $f[$_], splice @f, $i, 1; 
			}
			$f[$_] = join ",", $f[$_], splice @f, ( $i > $#f ? $#f : $i ), 1; 
		}
		$f[$_] =~ s/'/$& x 2/ge;
		$f[$_] =~ tr/"//d;
		$_ = $i;
	}

	print +(join ",", map( { s/.*/'$&'/; $_ } @f) ), "\n";
}
close $fh;
$ perl script.pl infile
'A','1','John','Carpenter','27','','Carpenter, John','+169457','','BDG'
'A','1','James','Hill','61','SRR','Hill, James','','','FFB, BFB, NSO'
'A','4','Eva','Brown','31','','Brown, Eva','000916','D''Aboville & D''Aboville','SNX '
Regards,
Birei
 
Old 05-24-2011, 07:28 PM   #9
atavus1
LQ Newbie
 
Registered: Jul 2010
Location: Bogota
Distribution: fedora
Posts: 7

Rep: Reputation: 0
...

Last edited by atavus1; 05-24-2011 at 11:26 PM.
 
  


Reply

Tags
awk, csv, parsing, sed


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] awk or sed to use CSV as input and XML as template and output to a single file bridrod Linux - Newbie 6 03-13-2012 08:00 PM
Parsing through a Nagios config file to extract info w/ Sed, Awk, Vi, etc. chudster Linux - General 3 10-14-2010 03:18 AM
Using awk/sed to convert linefeed to csv, with some formatting jaykup Programming 1 04-03-2009 06:18 PM
Parsing a comma separated CSV file where fields have commas in to trickyflash Linux - General 7 03-26-2009 04:30 PM
Manipulating comma delimited CSV - newbie help with sed etc jonnymorris Programming 16 09-19-2008 07:14 AM


All times are GMT -5. The time now is 08:22 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration