LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
LinkBack Search this Thread
Old 01-11-2012, 10:19 PM   #16
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 3,665

Rep: Reputation: 204Reputation: 204Reputation: 204

Here's a gawk script that might do the job for you. Note that this is gawk specific, since it uses a non-posix awk extension to the match() function.

Code:
BEGIN {
# Define C as the "csv" file field delimiter. (Usually a comma, but ...)
  C=","
# Define D as the delimiter with which to replace the input delimiter
  D=":"
# Define Q as a string containing a single quote (Not really necessary)
  Q="\""
}
# Main loop. Note the only $0 (the whole input line) is used.
# I.e., FS is not needed.
{
# Break the input into parts[1]"part[2]"part[3] where
# part[[1] ends with a C, and part[3] starts with a C.
# (Note tha the latter condition precludes the last field being
# the only quoted string in the line.)
  np=match($0, "(.*" C ")" Q "([^" Q "]*)" Q "(" C ".*)", part)
# Did we find a paired quote?
  if (np) {
# # Yes, we did. Replace any C by a D in the part preceeding
# the first quote and the part following the matching quote. (Note
# that this ignores any quotes following the first matched quote.)
    gsub(C,D,part[1])
    gsub(C,D,part[3])
    Out=part[1] Q part[2] Q part[3]
  }
  else {
# If here, no matched quotes in the line. Replace all C characters
# the new delimiter.
    Out=$0
    gsub(C,D,Out)
  }
# Testing output
  printf("\nNR=%d\n Input: '%s'\nOutput: '%s'\n",NR,$0,Out)
# Output to "out.csv" (Example)
# print Out > "out.csv"
}
And here's the test output:
Code:
$ gawk -f csv_parse.gawk quoted_csv.dat 

NR=1
 Input: '123.456.789.100,Authoritative,"blah.com, blah.biz, blah.gov",filler,,,,,'
Output: '123.456.789.100:Authoritative:"blah.com, blah.biz, blah.gov":filler:::::'

NR=2
 Input: '192.168.1.100,Recursive,"dummy.mil",,,,,,'
Output: '192.168.1.100:Recursive:"dummy.mil"::::::'

NR=3
 Input: '192.169.2.100,Authoritative,"metoo.com",more,stuff,out,here,to,ignore'
Output: '192.169.2.100:Authoritative:"metoo.com":more:stuff:out:here:to:ignore'

NR=4
 Input: '10.10.0.1,Recursive,"THEM.gov, us.com",,,,,,'
Output: '10.10.0.1:Recursive:"THEM.gov, us.com"::::::'

NR=5
 Input: '10.0.1.1,Recursive,"what.gov",Joseph,,Joanne,ignore,stuff,here'
Output: '10.0.1.1:Recursive:"what.gov":Joseph::Joanne:ignore:stuff:here'

NR=6
 Input: '10.0.0.2,Recursive,"UHOH.TV",,,,,,'
Output: '10.0.0.2:Recursive:"UHOH.TV"::::::'
 
Old 01-11-2012, 11:33 PM   #17
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Mint
Posts: 5,401

Rep: Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110
As David said earlier, if you have latest version of awk you can get your final result very easily. Do you have version 4 of gawk?
Quote:
Grails response converted some of the 3rd column commas to colons but left the others.
This would require there not to be commas after the first and second columns (which of course then changes how you define a column)
 
Old 01-11-2012, 11:53 PM   #18
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Mint
Posts: 5,401

Rep: Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110
btw. Here is what it could like in current version of gawk:
Code:
#!/usr/bin/awk -f

BEGIN{
    FPAT = "([^,]+)|(\"[^\"]+\")"
    OFS = ":"
}

{
    gsub(/\"/,"",$3)
    n = split($3, third, " *, *")

    for( i = 1; i <= n; i++ )
        print $1,$2,third
}
Once executable simply run:
Code:
./script.awk file
 
Old 01-12-2012, 07:37 AM   #19
oly_r
LQ Newbie
 
Registered: Dec 2011
Posts: 11

Original Poster
Rep: Reputation: Disabled
The problem comes that there is no standard formatting in the 3rd or subsequent columns. there is NO input validation done or QC on the data. I had believed that the 3rd column did have quotes around it but after a few of these examples were tried i found out differently. there can be 0-* entries in the 3rd column, the most i have actually found is 6. It is about 30k lines due to the way it was done.

All that is guaranteed is that the third column starts after the second comma and there are 6 commas after the 3rd column data. I'm not sure how the import function on the spreadsheet reliable breaks the 3rd column data out of the line but it does appear to.

192.168.1.1,[Authoritative|Recursive],{total mishmash of data here including possible commas, semi-colons, quotes, or even "TBD"},,,,,,

I will be trying the gawk and other options shortly. Unfortunately i don't have access to both the linux (segregated network) and windows (internet enabled) systems at the same time.

I really appreciate the help, I have learned quite a bit.
 
Old 01-12-2012, 08:12 PM   #20
danielbmartin
Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 291

Rep: Reputation: 60
Quote:
Originally Posted by oly_r View Post
... there can be 0-* entries in the 3rd column, the most i have actually found is 6.
Code:
cat $InFile \
|tr -d \” \
|tr -d ' ' \
|rev \
|cut -d, -f7- \
|rev \
|sed 's/$/,~,~,~,~,~,~/' \
|tr '[:upper:]' '[:lower:]' \
> $Work3

cat $Work3 |cut -d, -f1,2,3  > $Work4 
cat $Work3 |cut -d, -f1,2,4 >> $Work4 
cat $Work3 |cut -d, -f1,2,5 >> $Work4 
cat $Work3 |cut -d, -f1,2,6 >> $Work4 
cat $Work3 |cut -d, -f1,2,7 >> $Work4 
cat $Work3 |cut -d, -f1,2,8 >> $Work4 
cat $Work3 |cut -d, -f1,2,9 >> $Work4 
 
cat $Work4 |grep -v '~' |tr ',' ':' |sort  > $OutFile
Daniel B. Martin
 
Old 01-13-2012, 07:09 AM   #21
oly_r
LQ Newbie
 
Registered: Dec 2011
Posts: 11

Original Poster
Rep: Reputation: Disabled
Ok update time.

I combined a couple of the suggestions and came up with
Quote:
cat < $InFile |sed 's/"[[:digit:]]{3}-[[:digit:]]{4},*[^"]"//g' |
rev |cut -d ',' -f7- |rev | sed -e 's/,/:/' -e 's/,/:/' > $OutFile
I had to specify the phone number option because several lines had 2 column entries like "555-1212, Jane Doe"


Now the only headache is from the idiots that used lf/cr in the 3rd column. The entry in the file looks like:
Quote:
123.456.789.100,Authoritative,”blah.com
blah.biz
blah.gov”,filler,,,,
 
Old 01-19-2012, 11:47 AM   #22
oly_r
LQ Newbie
 
Registered: Dec 2011
Posts: 11

Original Poster
Rep: Reputation: Disabled
I've marked it solved since i have a working fix. I really appreciate all the posts.
 
Old 01-19-2012, 02:27 PM   #23
danielbmartin
Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Ubuntu
Posts: 291

Rep: Reputation: 60
Quote:
Originally Posted by oly_r View Post
Now the only headache is from the idiots that used lf/cr in the 3rd column. The entry in the file looks like:
123.456.789.100,Authoritative,”blah.com
blah.biz
blah.gov”,filler,,,,
Combining ideas from www.linuxhowtos.org, LQ member firstfire, and my own fevered imagination, I came up with a way to "clean up" a file which has these unwanted NewLine characters. It relies on the premise that every legitimate line begins with two numerics.

Code:
cat < $InFile                                  \
|sed -r ':a; $!N;s/\n([^0-9]{2})/,\1/;ta; P;D' \
> $OutFile
Daniel B. Martin
 
1 members found this post helpful.
Old 01-25-2012, 08:53 AM   #24
oly_r
LQ Newbie
 
Registered: Dec 2011
Posts: 11

Original Poster
Rep: Reputation: Disabled
Thanks, i just noticed there was an update. I appreciate the add-on, i'll try it shortly.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are Off
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] AWK / SED - Parsing a CSV file with comma delimiter, and some extra needs. PenguinJr Programming 8 05-24-2011 06:28 PM
Adding a Comma to the end of every nth line in Vi (or sed). Euler2 Linux - Newbie 6 10-12-2009 09:38 AM
How to delete Comma in a comma separated file with double quotes as quote character pklcnu Linux - Newbie 2 03-24-2009 05:50 PM
using sed to remove line in a comma-delimited file seefor Programming 4 03-10-2009 03:35 PM


All times are GMT -5. The time now is 11:16 AM.

Main Menu
 
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: @linuxquestions
Open Source Consulting | Domain Registration