[SOLVED] bash/sed/awk to convert comma's not in quotes in a line with many comma's
ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Here's a gawk script that might do the job for you. Note that this is gawk specific, since it uses a non-posix awk extension to the match() function.
Code:
BEGIN {
# Define C as the "csv" file field delimiter. (Usually a comma, but ...)
C=","
# Define D as the delimiter with which to replace the input delimiter
D=":"
# Define Q as a string containing a single quote (Not really necessary)
Q="\""
}
# Main loop. Note the only $0 (the whole input line) is used.
# I.e., FS is not needed.
{
# Break the input into parts[1]"part[2]"part[3] where
# part[[1] ends with a C, and part[3] starts with a C.
# (Note tha the latter condition precludes the last field being
# the only quoted string in the line.)
np=match($0, "(.*" C ")" Q "([^" Q "]*)" Q "(" C ".*)", part)
# Did we find a paired quote?
if (np) {
# # Yes, we did. Replace any C by a D in the part preceeding
# the first quote and the part following the matching quote. (Note
# that this ignores any quotes following the first matched quote.)
gsub(C,D,part[1])
gsub(C,D,part[3])
Out=part[1] Q part[2] Q part[3]
}
else {
# If here, no matched quotes in the line. Replace all C characters
# the new delimiter.
Out=$0
gsub(C,D,Out)
}
# Testing output
printf("\nNR=%d\n Input: '%s'\nOutput: '%s'\n",NR,$0,Out)
# Output to "out.csv" (Example)
# print Out > "out.csv"
}
The problem comes that there is no standard formatting in the 3rd or subsequent columns. there is NO input validation done or QC on the data. I had believed that the 3rd column did have quotes around it but after a few of these examples were tried i found out differently. there can be 0-* entries in the 3rd column, the most i have actually found is 6. It is about 30k lines due to the way it was done.
All that is guaranteed is that the third column starts after the second comma and there are 6 commas after the 3rd column data. I'm not sure how the import function on the spreadsheet reliable breaks the 3rd column data out of the line but it does appear to.
192.168.1.1,[Authoritative|Recursive],{total mishmash of data here including possible commas, semi-colons, quotes, or even "TBD"},,,,,,
I will be trying the gawk and other options shortly. Unfortunately i don't have access to both the linux (segregated network) and windows (internet enabled) systems at the same time.
I really appreciate the help, I have learned quite a bit.
Now the only headache is from the idiots that used lf/cr in the 3rd column. The entry in the file looks like:
123.456.789.100,Authoritative,”blah.com
blah.biz
blah.gov”,filler,,,,
Combining ideas from www.linuxhowtos.org, LQ member firstfire, and my own fevered imagination, I came up with a way to "clean up" a file which has these unwanted NewLine characters. It relies on the premise that every legitimate line begins with two numerics.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.