LinuxQuestions.org
Old 01-10-2012, 08:13 AM   #1
oly_r
Member
 
Registered: Dec 2011
Posts: 31

Rep: Reputation: Disabled
bash/sed/awk to convert commas not in quotes in a line with many commas


OK, I've been dealing with sorting/converting/massaging files of DNS entries. I had been using the CSV file that was given to me, which has no input validation. One of the columns has comma-separated entries, so I've been reading it back into the spreadsheet so I could export it as colon-separated instead of comma-separated. I believe that I should be able to do a sed conversion of the lines, but I can't get it to ignore text between the quotes.

example of what it looks like.

123.456.789.100,Authoritative,”blah.com, blah.biz, blah.gov”,filler,,,,,
192.168.1.100,Recursive,”dummy.mil”,,,,,,
192.169.2.100,Authoritative,”metoo.com”,more,stuff,out,here,to,ignore
10.10.0.1,Recursive,”THEM.gov, us.com”,,,,,,
10.0.1.1,Recursive,”what.gov”,Joseph,,Joanne,ignore,stuff,here
10.0.0.2,Recursive,”UHOH.TV”,,,,,,

I don't believe I can't just use sed to change the file to use : or ; instead of , between the fields outside the quotes.
 
Old 01-10-2012, 09:48 AM   #2
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,398
Blog Entries: 2

Rep: Reputation: 908
For reasons stemming from the type of regex engine used by sed (and other regex-using tools such as Perl and Awk), pattern matching involving balanced delimiters such as parentheses and quotes is very difficult, and generally can't be done with a single regular expression. That is why there are special modules for Perl and other languages that use regular expressions for parsing CSV formatted files.
If your data is sufficiently consistent, you can probably make some assumptions about where fields start and end by matching the ordered combinations of quotes and commas, where a pattern like /,"/ signals the start of a new field and a /",/ signals the end of a field. If the content of any quote-delimited field may contain a leading or closing comma, then you have a bigger problem, and a ready-built CSV parser module might be a better choice.

--- rod.
 
Old 01-10-2012, 10:18 AM   #3
oly_r
Member
 
Registered: Dec 2011
Posts: 31

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by theNbomr
For reasons stemming from the type of regex engine used by sed (and other regex-using tools such as Perl and Awk), pattern matching involving balanced delimiters such as parentheses and quotes is very difficult, and generally can't be done with a single regular expression. That is why there are special modules for Perl and other languages that use regular expressions for parsing CSV formatted files.
If your data is sufficiently consistent, you can probably make some assumptions about where fields start and end by matching the ordered combinations of quotes and commas, where a pattern like /,"/ signals the start of a new field and a /",/ signals the end of a field. If the content of any quote-delimited field may contain a leading or closing comma, then you have a bigger problem, and a ready-built CSV parser module might be a better choice.

--- rod.
OK, that makes me feel a little better. I do not need to do this in a single expression or command, and the format is very specific. I only care about the first 3 fields from this file, and the third one is the only one that I don't want to change the commas in. CRAP, I can just change the first 2 commas on any line; well no, that doesn't handle keeping the 3rd field together if it has commas. (Yes, I'm actually typing what I'm thinking at the moment.)

Rick.
 
Old 01-10-2012, 10:49 AM   #4
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,729

Rep: Reputation: 590
Input file:
Code:
123.456.789.100,Authoritative,”blah.com, blah.biz, blah.gov”,filler,,,,,
192.168.1.100,Recursive,”dummy.mil”,,,,,,
192.169.2.100,Authoritative,”metoo.com”,more,stuff,out,here,to,ignore
10.10.0.1,Recursive,”THEM.gov, us.com”,,,,,,
10.0.1.1,Recursive,”what.gov”,Joseph,,Joanne,ignore,stuff,here
10.0.0.2,Recursive,”UHOH.TV”,,,,,,
Try this...
Code:
sed 's/”.*”//g' "$InFile" > "$OutFile"
Output file:
Code:
123.456.789.100,Authoritative,,filler,,,,,
192.168.1.100,Recursive,,,,,,,
192.169.2.100,Authoritative,,more,stuff,out,here,to,ignore
10.10.0.1,Recursive,,,,,,,
10.0.1.1,Recursive,,Joseph,,Joanne,ignore,stuff,here
10.0.0.2,Recursive,,,,,,,
 
Old 01-10-2012, 11:34 AM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,719

Rep: Reputation: 3034
Close, Daniel, but you will run into issues if there is more than one set of quotes on the line, for example:
Code:
123.456.789.100,Authoritative,”blah.com, blah.biz", stuff to keep, "blah.gov”,filler,,,,,
Here the output will be the same as you have already shown, yet 'stuff to keep' should have stayed.
Whereas negating your search can help here:
Code:
sed 's/"[^"]*"//g' file
Although I do not believe this is what the OP requires either.
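
To make the difference between the greedy `.*` and the negated class `[^"]*` concrete, here is a small sketch. It assumes the file uses plain ASCII double quotes (the samples pasted in this thread show typographic ” characters, which would have to be substituted into the patterns), and the variable and function names are made up for the illustration.

```shell
#!/bin/sh
# Sample line with two quoted fields. ASCII quotes are an assumption.
line='1.2.3.4,Authoritative,"blah.com, blah.biz", stuff to keep, "blah.gov",filler'

# Greedy: .* runs from the first quote to the LAST quote on the line,
# so everything between the two quoted fields is swallowed as well.
strip_greedy() {
    sed 's/".*"//g'
}

# Negated class: [^"]* cannot cross a quote, so each quoted field
# is matched (and removed) on its own.
strip_each() {
    sed 's/"[^"]*"//g'
}

printf '%s\n' "$line" | strip_greedy   # 1.2.3.4,Authoritative,,filler
printf '%s\n' "$line" | strip_each     # 1.2.3.4,Authoritative,, stuff to keep, ,filler
```

Both commands still delete the quoted fields rather than protecting them, so this only illustrates the matching behaviour, not the conversion asked for.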

Oly, please show what the output should look like using your current example?
 
Old 01-10-2012, 11:51 AM   #6
oly_r
Member
 
Registered: Dec 2011
Posts: 31

Original Poster
Rep: Reputation: Disabled
Sorry about that, I should have included the after.

I would like it to look like this, based on the original example.

123.456.789.100:Authoritative:”blah.com, blah.biz, blah.gov”
192.168.1.100:Recursive:”dummy.mil”
192.169.2.100:Authoritative:”metoo.com”
10.10.0.1:Recursive:”THEM.gov, us.com”
10.0.1.1:Recursive:”what.gov”
10.0.0.2:Recursive:”UHOH.TV”




I will actually be modifying this result with scripted commands to end up with one IP, server type, and domain per line, separated by ":". I also convert the domains to all lowercase. This part I have already set up, since I manually modified the initial file in the spreadsheet program.

123.456.789.100:Authoritative:blah.com
123.456.789.100:Authoritative:blah.biz
123.456.789.100:Authoritative:blah.gov
192.168.1.100:Recursive:dummy.mil
192.169.2.100:Authoritative:metoo.com
10.10.0.1:Recursive:them.gov
10.10.0.1:Recursive:us.com
10.0.1.1:Recursive:what.gov
10.0.0.2:Recursive:uhoh.tv
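
For reference, the second step described above (one IP, server type, and domain per line, lowercased) could be sketched in awk roughly as follows. This is not the OP's actual script, which is not shown in the thread; it assumes the intermediate file is already colon-separated with ASCII double quotes around the third field, and `expand_domains` is a made-up name.

```shell
#!/bin/sh
# Read ip:type:"dom1, dom2" records on stdin and print one
# ip:type:domain record per domain, with the domain lowercased.
expand_domains() {
    awk -F: '{
        domains = $3
        gsub(/"/, "", domains)          # drop the surrounding quotes
        n = split(domains, d, /, */)    # split on a comma plus optional spaces
        for (i = 1; i <= n; i++)
            print $1 ":" $2 ":" tolower(d[i])
    }'
}

printf '%s\n' '10.10.0.1:Recursive:"THEM.gov, us.com"' | expand_domains
# 10.10.0.1:Recursive:them.gov
# 10.10.0.1:Recursive:us.com
```

Using ":" as the awk field separator only works while the addresses are IPv4; IPv6 addresses contain colons and would need a different delimiter.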
 
Old 01-10-2012, 02:19 PM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,729

Rep: Reputation: 590
This is "brute force" code but I think it does what you want.
Sometimes "brute force" is easier to understand and faster to execute
than a single sed with an elaborate RegEx.

Code:
tr ',' ':' < "$InFile" | cut -d: -f1,2 > "$Work1"
tr ',' ':' < "$InFile" | cut -d: -f3- | sed 's/”:.*$/”/' | tr ':' ',' > "$Work2"
paste -d':' "$Work1" "$Work2" > "$OutFile"
Daniel B. Martin
 
Old 01-11-2012, 05:03 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,719

Rep: Reputation: 3034
Maybe something like:
Code:
sed -e 's/,/:/' -e 's/,/:/' -e 's/"[^"]*$/"/' file
My sedfu is a little weak so someone can probably do the first 2 in one step
 
Old 01-11-2012, 09:33 AM   #9
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,508

Rep: Reputation: 1811
Quote:
Originally Posted by theNbomr
For reasons stemming from the type of regex engine used by sed (and other regex-using tools such as Perl and Awk), pattern matching involving balanced delimiters such as parentheses and quotes is very difficult, and generally can't be done with a single regular expression.
Hold on, handling quotes is much easier than handling parentheses: parens can be nested, quotes can't.

Quote:
Originally Posted by oly_r
I don't believe I can't just use sed to change the file to use : or ; instead of , between the fields outside the quotes.
Well it's awkward to do loops in sed, so here is some awk:
Code:
awk -F\" -v OFS=\" '{for (i = 1; i <= NF; i += 2) gsub(",", ":", $i); print}' file
Quote:
Originally Posted by grail
My sedfu is a little weak so someone can probably do the first 2 in one step
Code:
-e 's/,/:/' -e 's/,/:/'
-e 's/,\([^,]*\),/:\1:/'
Hmm, one step actually came out longer than 2 steps...
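
The awk approach above relies on the fact that, once a line is split on the quote character, the odd-numbered pieces lie outside the quotes and the even-numbered pieces inside. A short sketch to make that visible, assuming ASCII double quotes in the data; the function names are invented for the example.

```shell
#!/bin/sh
line='10.10.0.1,Recursive,"THEM.gov, us.com",,,,,,'

# Print each quote-delimited piece with its number and position.
show_pieces() {
    awk -F'"' '{
        for (i = 1; i <= NF; i++)
            printf "%d %s [%s]\n", i, (i % 2 ? "outside" : "inside"), $i
    }'
}

# Replace commas only in the odd-numbered (outside) pieces, then
# print the record rejoined with the quote character as OFS.
colons_outside_quotes() {
    awk -F'"' -v OFS='"' '{
        for (i = 1; i <= NF; i += 2)
            gsub(",", ":", $i)
        print
    }'
}

printf '%s\n' "$line" | show_pieces
# 1 outside [10.10.0.1,Recursive,]
# 2 inside [THEM.gov, us.com]
# 3 outside [,,,,,,]
printf '%s\n' "$line" | colons_outside_quotes
# 10.10.0.1:Recursive:"THEM.gov, us.com"::::::
```

Note that this converts every comma outside the quotes, including the trailing filler fields, and it cannot protect a third column that arrives without quotes.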
 
Old 01-11-2012, 10:31 AM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian + kde 4 / 5
Posts: 6,842

Rep: Reputation: 2004
Recent versions of gawk (from version 4?) include FPAT/patsplit, designed specifically for handling this kind of situation.

http://www.gnu.org/software/gawk/man...y-Content.html
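
As a rough illustration of the FPAT idea, the sketch below describes what a field looks like (either a run of non-comma characters or a quoted string) instead of what separates fields. It needs GNU awk with FPAT support, assumes ASCII double quotes, and uses a made-up function name; plain awk or mawk will not understand FPAT.

```shell
#!/bin/sh
# Requires gawk: FPAT is a GNU awk extension.
# A field is either a run of non-commas or a double-quoted string.
first_three_fields() {
    gawk -v OFS=: '
        BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
        { print $1, $2, $3 }
    '
}
```

Feeding it `10.10.0.1,Recursive,"THEM.gov, us.com",,,,,,` should print `10.10.0.1:Recursive:"THEM.gov, us.com"`. Because this pattern skips empty fields, it is only safe here on the assumption that the first three columns are never empty, and an unquoted third column containing commas would still be split.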
 
Old 01-11-2012, 12:30 PM   #11
oly_r
Member
 
Registered: Dec 2011
Posts: 31

Original Poster
Rep: Reputation: Disabled
I have tried these fixes, and danielbmartin's worked for all the lines that did have quotes. Grail's response converted some of the 3rd column commas to colons but left the others. So I found that not all the 3rd column entries were enclosed in quotes. Now I'm wondering, using danielbmartin's concept of splitting the line, is there a way to cut the LAST 6 fields off of a line? So, looking at my samples:

Quote:
123.456.789.100,Authoritative,”blah.com, blah.biz, blah.gov”,filler,,,,,
192.168.1.100,Recursive,”dummy.mil”,,,,,,
192.169.2.100,Authoritative,”metoo.com”,more,stuff,out,here,to,ignore
10.10.0.1,Recursive,”THEM.gov, us.com”,,,,,,
I've tried a couple of variants of the following and it is not working. I'm sure others here can point out how stupid this looks, but I'm very frustrated.
Quote:
sed 's/.*,.*,.*,.*,.*,.*,.*$//'

FYI, we are trying to get the people that have made the data available to us to either make sure there isn't comma separated data in the 3rd column or use a different delimiter. Unfortunately they don't seem to care.
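
On the sed attempt above: the first `.*` is free to start at the beginning of the line, so the whole line matches and is deleted. One possible fix, offered as a sketch rather than a tested solution for the real file, is to anchor exactly six "comma plus non-commas" groups at the end of the line. It assumes every line carries exactly six trailing fields after the third column, and `drop_last_six` is a made-up name.

```shell
#!/bin/sh
# Delete the last six comma-separated fields from every input line.
# \(,[^,]*\)\{6\}$ reads: six groups of "a comma followed by any
# non-comma characters", anchored at the end of the line.
drop_last_six() {
    sed 's/\(,[^,]*\)\{6\}$//'
}

printf '%s\n' '192.168.1.100,Recursive,"dummy.mil",,,,,,' | drop_last_six
# 192.168.1.100,Recursive,"dummy.mil"
printf '%s\n' '10.0.1.1,Recursive,what.gov, x.org,Joseph,,Joanne,ignore,stuff,here' | drop_last_six
# 10.0.1.1,Recursive,what.gov, x.org
```

Since the pattern counts from the end of the line, it does not matter whether the third column is quoted or how many commas it contains.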
 
Old 01-11-2012, 01:34 PM   #12
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,729

Rep: Reputation: 590
Quote:
Originally Posted by oly_r
... Now I'm wondering, using danielbmartin's concept of splitting the line, is there a way to cut the LAST 6 fields off of a line?
Please elaborate your question(s) by providing a sample input file and a corresponding output file.

Perhaps this is what you seek:
Code:
sed 's/”,.*$/”/' "$InFile" > "$OutFile"
Daniel B. Martin
 
Old 01-11-2012, 01:57 PM   #13
oly_r
Member
 
Registered: Dec 2011
Posts: 31

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by danielbmartin
Please elaborate your question(s) by providing a sample input file and a corresponding output file.

Perhaps this is what you seek:
Code:
sed 's/”,.*$/”/' "$InFile" > "$OutFile"
Daniel B. Martin
Is there a way to take

192.168.1.100,Recursive,”dummy.mil”,,,,,,

and trim off the end, starting at the 6th comma from the end ($) counting backwards, to get

192.168.1.100,Recursive,”dummy.mil”

Last edited by oly_r; 01-11-2012 at 02:05 PM.
 
Old 01-11-2012, 02:16 PM   #14
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,729

Rep: Reputation: 590
Quote:
Originally Posted by oly_r
Is there a way to take

192.168.1.100,Recursive,”dummy.mil”,,,,,,

and trim off the end, starting at the 6th comma from the end ($) counting backwards, to get

192.168.1.100,Recursive,”dummy.mil”
Consider this:
Code:
rev "$InFile" | cut -d ',' -f7- | rev > "$OutFile"
Daniel B. Martin
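
Building on the rev/cut/rev idea, the following sketch chains it with the "first two commas" substitution suggested earlier in the thread, so the third column stays intact whether or not it is quoted. It assumes each line has exactly six fields after the third column and that `rev` (from util-linux or busybox) is installed; `to_colon_records` is a hypothetical name.

```shell
#!/bin/sh
# 1. rev | cut -f7- | rev drops the last six comma-separated fields.
# 2. The two sed substitutions turn the first two commas into colons;
#    any commas left over belong to the third column.
to_colon_records() {
    rev | cut -d ',' -f7- | rev | sed -e 's/,/:/' -e 's/,/:/'
}

printf '%s\n' '192.169.2.100,Authoritative,metoo.com, other.org,more,stuff,out,here,to,ignore' | to_colon_records
# 192.169.2.100:Authoritative:metoo.com, other.org
```

A line with fewer than seven fields comes out empty from the cut step, so input with a varying number of trailing columns would need a different approach.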
 
Old 01-11-2012, 08:01 PM   #15
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,729

Rep: Reputation: 590
The third field in your sample input file ...
Quote:
123.456.789.100,Authoritative,”blah.com, blah.biz, blah.gov”,filler,,,,,
192.168.1.100,Recursive,”dummy.mil”,,,,,,
192.169.2.100,Authoritative,”metoo.com”,more,stuff,out,here,to,ignore
10.10.0.1,Recursive,”THEM.gov, us.com”,,,,,,
... is always a quoted string containing 1, 2, or 3 comma-delimited items. Is the maximum number 3? If not, is the maximum number known? What is it?

Daniel B. Martin
 
  

