Manipulating comma delimited CSV - newbie help with sed etc
ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Manipulating comma delimited CSV - newbie help with sed etc
Hi,
I have a CSV file with the following (example) format:
Code:
123456,MR K L JONES,126 Credibility Street,Wadsworth,Notts,NT3 PQZ
123457,MRS P KHAN,6 Blah Road,Preston,PR1 9QP
123458,MR I M A PRATT,1 Slime Street,Kellythorpe,Woking,Derbyshire,D2 8TP
123459,MR AWKWARD,56 Park Lane,Leeds,LS1 69Q
I need to strip out anything after the second name and before the sirname, and pad where there are missing names or address fields with extra commas. I don't know what the appropriate tools are to use (I imagine sed / awk but have no experience of using them), if it can be done such that it is automated then it would save a lot of effort in manual sorting (these files are 1,000+ lines)!
Any help with this is much appreciated! If there is any PC equivalent method (Windows, DOS) that would be even better as that's the system we have to work with here. I am investigating Excel macros and VBA but have no experience of that either so it's a case of "Help!" :}
For a comma separated record, awk would be a better choice than. You will need to process the second field ($2) taking the last word as the Surname. The first word is the title. I don't know how you want to handle multiple middle initials. If a field is empty, printing out the field and then a comma will do what you want naturally.
A few thousand lines in size is not significant.
There is a "gawk" info file. I installed the gawk source and used the "make pdf" target, because the book it produced from the info file (docbook) source is excellent. There are a number of awk functions that can take a substring of a field and you can assign the surname, middle name, family name and title to variables and then print out the fields in whatever order you want.
Last edited by jschiwal; 09-06-2008 at 03:45 AM.
Reason: fixed small typo
For a comma separated records, awk would be a better choice than. You will need to process the second field ($2) taking the last word as the Surname. The first word is the title. I don't know how you want to handle multiple middle initials. If a field is empty, printing out the field and then a comma will do what you want naturally.
A few thousand lines in size is not significant.
There is a "gawk" info file. I installed the gawk source and used the "make pdf" target, because the book it produced from the info file (docbook) source is excellent. There are a number of awk functions that can take a substring of a field and you can assign the surname, middle name, family name and title to variables and then print out the fields in whatever order you want.
Thanks for the info on gawk, most useful!
As for implementation, I'm not sure how I would grab the last element as the surname and discard the remaining elements between the second name and the surname, perhaps some kind of counter would be required to save title, first name, second name, then move to the last element and save that. Any ideas?
Also, as this needs to be run on a csv file where only the second comma-separated field need be processed, presumably a similar method of moving to the second element can be used, but I'm pretty novice when it comes to scripting so have little clue of how to achieve this I'm afraid.
I could really do with some detailed and well explained examples of this kind of gawk / awk processing, I find the awk man page confusing. Do you know of any good web pages? I will be hunting for some myself after posting this.
Last edited by jonnymorris; 09-05-2008 at 12:56 PM.
I could really do with some detailed and well explained examples of this kind of gawk / awk processing, I find the awk man page confusing. Do you know of any good web pages? I will be hunting for some myself after posting this.
Is there a particular reason you want to do this as a shell script? This parsing would be absolutely trivial in PHP, using the explode command.
I'm probably jumping the gun a bit in stating what tools I might want to use
I do not know any PHP, however if someone wishes to show me how simple it would be then I'm all ears The problem is a bit more complicated than merely splitting a field, it has to be split into Title, Firstname, Secondname, Surname and all values between Secondname and Surname removed. There are also special cases where there is no title, no first or second name, fields containing '&' should probably not be processed but removed to a separate file.
The info file is much better than the manpage. I downloaded the source.
If you download the package source, you can run
Code:
tar xvzf gawk-<version>.tar.gz
cd gawk-<version>/
./configure
make pdf
cp doc/gawk.pdf ~/Documents/
The first edition of the O'Reilly "Sed & Awk" book might be available on the web as well.
Using gawk, look at the split function to decompose the second field into an array. Then use the last element as the family name, and the first element as title, the second as the first name. Between the second and last elements exclusive, I imagine would be the middle name(s).
You can assign the array elements to variables, or use them in your print command directly.
Here is an example taking your posted sample and printing out the title and last names:
You may want to create a file with the gawk program. Then each awk command can be on a line. The third line didn't work and the last line where there wasn't a middle name or names. Like I said, it needs debugging but you get an idea how to decompose the name field in the original to separate fields.
I'm probably jumping the gun a bit in stating what tools I might want to use
I do not know any PHP, however if someone wishes to show me how simple it would be then I'm all ears The problem is a bit more complicated than merely splitting a field, it has to be split into Title, Firstname, Secondname, Surname and all values between Secondname and Surname removed. There are also special cases where there is no title, no first or second name, fields containing '&' should probably not be processed but removed to a separate file.
If you have that kind of problem, you are probably well past what you should be doing in the shell.
You are going to need to set up and test a lot of rules to identify and process the various cases you have set forth here. Good luck with that; it is a very non-trivial exercise.
Thank you very much! Examples are what I needed, without those it's difficult to know what gawk actions I should be using I'm sure I can fiddle with these and learn how they work, hopefully I can sort out most of the problem. Thanks again!
EDIT: I just tried the 'make pdf' part and discovered that I'm supposed to have TeX installed, but this appears to be a 1.15GB download - WHAT THE??!! Just what is in that kind of a distribution?? Erm, I don't have that kind of bandwidth. Guess I'll be looking for that O'Reilly book instead then.
EDIT EDIT: I discovered that there is a cut-down option of the TeX distribution, BasicTex (40MB), now that's more like it.
Last edited by jonnymorris; 09-06-2008 at 04:26 PM.
Here's a slightly more complex example illustrating some gawk features. Note that, rather than the CSV arrays to specified for the name and address strings, I opted to use slash separated values. Those could, obviously, be made into CSV strings by changing the " / " to a ", " where the parsing is done. I just thought that (for example purposes) the slash worked better.
Anyhow, here's my code:
PHP Code:
#!/bin/gawk
BEGIN {
FS="[[:space:]]*,[[:space:]]*";
}
{ RefNum=$1;
# Get the title, name and surname of the record
RawName=$2;
ns=split(RawName,sName," ");
if (ns > 1) {
Title=sName[1];
Name="";
Surname=sName[ns];
if (ns > 2) {
for (i=2; i<ns; ++i) {
Name=Name sName[i];
if (i < ns-1) Name=Name " / "
}
}
else {
Title="";
Surname=RawName;
}
}
# Get the address and postal code of the record
Address="";
for (i=3; i < NF; ++i) {
Address = Address $i;
if (i < NF - 1) Address = Address " / ";
}
PostCode=$NF;
ID[RefNum]=RefNum;
Record[RefNum,"Title"]=Title;
Record[RefNum,"Name"]=Name;
Record[RefNum,"Surname"]=Surname;
Record[RefNum,"Address"]=Address;
Record[RefNum,"PostCode"]=PostCode;
}
# Print the records
END {
for (i in ID) {
print " "
print "Record " i;
print " Name: " Record[i,"Title"] " " Record[i,"Surname"] (Record[i,"Name"]!="" ? ", " Record[i,"Name"] : "");
print " Address: " Record[i,"Address"];
print " Post Code: " Record[i,"PostCode"];
}
}
and the output it produces:
Code:
$ gawk -f parse.awk text
Record 123456
Name: MR JONES, K / L
Address: 126 Credibility Street / Wadsworth / Notts
Post Code: NT3 PQZ
Record 123457
Name: MRS KHAN, P
Address: 6 Blah Road / Preston
Post Code: PR1 9QP
Record 123458
Name: MR PRATT, I / M / A
Address: 1 Slime Street / Kellythorpe / Woking / Derbyshire
Post Code: D2 8TP
Record 123459
Name: MR AWKWARD
Address: 56 Park Lane / Leeds
Post Code: LS1 69Q
You may have no need to actually store the records in memory, so all that "record[x,y]" stuff may be irrelevant. But, again, I was just trying to illustrate some possibilities. (For example, that ID[] array could be used to detect duplicated record numbers: if (RecNum in ID) {error processing})
I've modified the END part to format the output for each record on one line and use "," as the field separator, however I'm getting an error 'unexpected newline or end of string'
Code:
END {
for (i in ID) {
print i "," Record[i,"Title"] "," Record[i,"Name"]!="" ? "," Record[i,"Name"] "," Record[i,"Address"] "," Record[i,"PostCode"]
}
}
...however I'm getting an error 'unexpected newline or end of string'
Code:
END {
for (i in ID) {
print i "," Record[i,"Title"] "," Record[i,"Name"]!="" ? "," Record[i,"Name"] "," Record[i,"Address"] "," Record[i,"PostCode"]
}
}
You use the short if-then-else construct condition?action:action but there's no else-part (..:action). At least you should add a :"" if you don't want to output anything. And last I would add braces around this construct to define start and end.
I'd go with jiml8; if you've got all those exceptions/special cases I'd use a different lang, in my case Perl. Text munging is one of its strengths.
He's also right that you need to carefully define the rules to handle all those special cases before you start.
You use the short if-then-else construct condition?action:action but there's no else-part (..:action). At least you should add a :"" if you don't want to output anything. And last I would add braces around this construct to define start and end.
Jan
Thank you for the suggestions, but no matter what I do, what kind of loop I use, I still get the same error 'unexpected newline or end of string'
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.