[SOLVED] bash/sed/awk to convert commas not in quotes in a line with many commas
Ok, I've been dealing with sorting/converting/massaging files of DNS entries. I had been using the CSV file that was given to me, which has no input validation. One of the columns has comma-separated entries, so I've been reading the file back into the spreadsheet so I could export it as colon-separated instead of comma-separated. I believe I should be able to do a sed conversion of the lines, but I can't get it to ignore text between the quotes.
For reasons stemming from the type of regex engine used by sed (and other regex-using tools such as Perl and Awk), pattern matching involving balanced delimiters such as parentheses and quotes is very difficult, and generally can't be done with a single regular expression. That is why there are special modules for Perl and other languages that use regular expressions for parsing CSV formatted files.
If your data is sufficiently consistent, you can probably make some assumptions about where fields start and end by matching the ordered combinations of quotes and commas, where a pattern like /,"/ signals the start of a new field and a /",/ signals the end of a field. If the content of any quote-delimited field may contain a leading or closing comma, then you have a bigger problem, and a ready-built CSV parser module might be a better choice.
--- rod.
Ok, that makes me feel a little better. I do not need to do this in a single expression or command, and the format is very specific. I only care about the first 3 fields from this file, and the third one is the only one that I don't want to change the commas in. CRAP, I can just change the first 2 commas on any line. Well, no, that doesn't handle keeping the 3rd field together if it has commas. (Yes, I'm actually typing what I'm thinking at the moment.)
I will actually be modifying this result with scripted commands to end up with one IP, server type, and domain per line, separated by ":". I also convert the domains to all lowercase. This part I have already set up, since I manually modified the initial file in the spreadsheet program.
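For reference, that last step could look roughly like the sketch below. It assumes the first three colon-separated fields are the IP, the server type, and the (possibly quoted) comma-separated domain list, in that order; the sample data isn't shown in the thread, so that order and the function name one_domain_per_line are guesses. It would also mangle IPv6 addresses, since those contain colons.

```shell
# Hypothetical helper: expand "ip:type:dom1,dom2" into one line per domain.
one_domain_per_line() {
  awk -F: '{
    list = $3
    gsub(/"/, "", list)                  # drop the quotes around the domain list
    n = split(list, dom, ",")            # one array entry per domain
    for (i = 1; i <= n; i++) {
      d = dom[i]
      gsub(/^[ \t]+|[ \t]+$/, "", d)     # trim stray blanks around each domain
      if (d != "") print $1 ":" $2 ":" tolower(d)
    }
  }'
}
```

It reads lines on stdin, e.g. `one_domain_per_line < converted.txt`, where converted.txt is a placeholder file name.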
This is "brute force" code but I think it does what you want.
Sometimes "brute force" is easier to understand and faster to execute
than a single sed with an elaborate RegEx.
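The code from this post isn't reproduced here. As a rough sketch of the general split-the-line idea (not necessarily what was originally posted), awk can split each line on the double-quote character: the odd-numbered pieces are then outside the quotes and the even-numbered pieces are inside. This assumes quotes are balanced on every line; convert_by_splitting is a made-up name.

```shell
# Hypothetical helper: turn commas outside double quotes into colons.
convert_by_splitting() {
  awk 'BEGIN { FS = OFS = "\"" }         # split and rejoin on the quote character
  {
    # Pieces 1, 3, 5, ... lie outside the quotes; only those are changed.
    for (i = 1; i <= NF; i += 2) gsub(/,/, ":", $i)
    print
  }'
}
```

Lines with no quotes at all are a single odd-numbered piece, so every comma in them is converted.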
Quote:
For reasons stemming from the type of regex engine used by sed (and other regex-using tools such as Perl and Awk), pattern matching involving balanced delimiters such as parentheses and quotes is very difficult, and generally can't be done with a single regular expression.
Hold on, handling quotes is much easier than handling parentheses: parens can be nested, quotes can't.
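To illustrate: because quotes don't nest, "outside the quotes" just means "preceded by an even number of quote characters", and a regular expression can express that. A minimal sed sketch, assuming quotes are balanced on every line (outside_commas_to_colons is a made-up name):

```shell
# Hypothetical helper: replace commas that are preceded by an even number of quotes.
outside_commas_to_colons() {
  # \([^"]*"[^"]*"\)*  skips zero or more complete "..." pairs from the line start,
  # [^",]*             then text containing neither a quote nor a comma,
  # ,                  then the comma to convert. Loop until nothing is left to change.
  sed -e ':a' -e 's/^\(\([^"]*"[^"]*"\)*[^",]*\),/\1:/' -e 'ta'
}
```

Each pass converts one comma, so this is not fast on long lines, but it needs nothing beyond plain sed.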
Quote:
Originally Posted by oly_r
I don't believe I can't just use sed to change the file to use : or ; instead of , between the fields outside the quotes.
Well it's awkward to do loops in sed, so here is some awk:
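The awk from this post isn't reproduced here. As a stand-in sketch (not the original poster's code), a character-by-character loop with an in-quotes flag does the job, assuming every quoted field opens and closes on the same line; commas_to_colons is a made-up name.

```shell
# Hypothetical helper: walk each line and convert only the commas seen outside quotes.
commas_to_colons() {
  awk '{
    out = ""; inq = 0                    # inq is 1 while inside a quoted field
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\"") inq = !inq          # every quote toggles the flag
      else if (c == "," && !inq) c = ":" # comma outside quotes becomes a colon
      out = out c
    }
    print out
  }'
}
```

A doubled quote ("") inside a field toggles the flag twice, so it leaves the state unchanged.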
I have tried these fixes, and danielbmartin's worked for all the lines that did have quotes. Grail's response converted some of the 3rd column commas to colons but left the others. So I found that not all the 3rd column entries were enclosed in quotes. Now I'm wondering whether, using danielbmartin's concept of splitting the line, there is a way to cut the LAST 6 fields off of a line. so looking at my samples
I've tried a couple of variants of the following and it is not working. I'm sure others here can point out how stupid this looks, but I'm very frustrated.
Quote:
sed 's/.*,.*,.*,.*,.*,.*,.*$//'
FYI, we are trying to get the people that have made the data available to us to either make sure there isn't comma separated data in the 3rd column or use a different delimiter. Unfortunately they don't seem to care.
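On the sed attempt above: the leading .* is greedy and starts at the beginning of the line, so on any line with at least six commas the pattern matches the whole line and the substitution empties it. Below is a sketch of one way to drop only the last six fields and then convert the two commas that separate the first three. It assumes none of the last six fields contains a comma and neither do the first two; first_three_fields is a made-up name.

```shell
# Hypothetical helper: keep fields 1-3, colon-separated, third field untouched.
first_three_fields() {
  # 1st expression: delete exactly six trailing ",field" groups at the end of the line.
  # 2nd and 3rd: convert the first two remaining commas; any others belong to field 3.
  sed -e 's/\(,[^,]*\)\{6\}$//' -e 's/,/:/' -e 's/,/:/'
}
```

This works whether or not the third field is quoted, since it counts commas from both ends of the line and never looks at the quotes.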