LinuxQuestions.org
Share your knowledge at the LQ Wiki.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
Search this Thread
Old 09-05-2008, 05:24 AM   #1
jonnymorris
LQ Newbie
 
Registered: Sep 2008
Posts: 10

Rep: Reputation: 0
Manipulating comma delimited CSV - newbie help with sed etc


Hi,

I have a CSV file with the following (example) format:

Code:
123456,MR K L JONES,126 Credibility Street,Wadsworth,Notts,NT3 PQZ
123457,MRS P KHAN,6 Blah Road,Preston,PR1 9QP
123458,MR I M A PRATT,1 Slime Street,Kellythorpe,Woking,Derbyshire,D2 8TP
123459,MR AWKWARD,56 Park Lane,Leeds,LS1 69Q
This needs to be formatted for each line thus:

RefNumber,Title,Firstname,Secondname,Sirname,Address1,Address2,Address3,Address4,Address5,Postcode

I need to strip out anything after the second name and before the sirname, and pad where there are missing names or address fields with extra commas. I don't know what the appropriate tools are to use (I imagine sed / awk but have no experience of using them), if it can be done such that it is automated then it would save a lot of effort in manual sorting (these files are 1,000+ lines)!

Any help with this is much appreciated! If there is any PC equivalent method (Windows, DOS) that would be even better as that's the system we have to work with here. I am investigating Excel macros and VBA but have no experience of that either so it's a case of "Help!" :}
 
Old 09-05-2008, 05:35 AM   #2
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654
For a comma separated record, awk would be a better choice than. You will need to process the second field ($2) taking the last word as the Surname. The first word is the title. I don't know how you want to handle multiple middle initials. If a field is empty, printing out the field and then a comma will do what you want naturally.

A few thousand lines in size is not significant.

There is a "gawk" info file. I installed the gawk source and used the "make pdf" target, because the book it produced from the info file (docbook) source is excellent. There are a number of awk functions that can take a substring of a field and you can assign the surname, middle name, family name and title to variables and then print out the fields in whatever order you want.

Last edited by jschiwal; 09-06-2008 at 03:45 AM. Reason: fixed small typo
 
Old 09-05-2008, 12:54 PM   #3
jonnymorris
LQ Newbie
 
Registered: Sep 2008
Posts: 10

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by jschiwal View Post
For a comma separated records, awk would be a better choice than. You will need to process the second field ($2) taking the last word as the Surname. The first word is the title. I don't know how you want to handle multiple middle initials. If a field is empty, printing out the field and then a comma will do what you want naturally.

A few thousand lines in size is not significant.

There is a "gawk" info file. I installed the gawk source and used the "make pdf" target, because the book it produced from the info file (docbook) source is excellent. There are a number of awk functions that can take a substring of a field and you can assign the surname, middle name, family name and title to variables and then print out the fields in whatever order you want.
Thanks for the info on gawk, most useful!

As for implementation, I'm not sure how I would grab the last element as the surname and discard the remaining elements between the second name and the surname, perhaps some kind of counter would be required to save title, first name, second name, then move to the last element and save that. Any ideas?

Also, as this needs to be run on a csv file where only the second comma-separated field need be processed, presumably a similar method of moving to the second element can be used, but I'm pretty novice when it comes to scripting so have little clue of how to achieve this I'm afraid.

I could really do with some detailed and well explained examples of this kind of gawk / awk processing, I find the awk man page confusing. Do you know of any good web pages? I will be hunting for some myself after posting this.

Last edited by jonnymorris; 09-05-2008 at 12:56 PM.
 
Old 09-05-2008, 02:32 PM   #4
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728Reputation: 728
Quote:
Originally Posted by jonnymorris View Post

I could really do with some detailed and well explained examples of this kind of gawk / awk processing, I find the awk man page confusing. Do you know of any good web pages? I will be hunting for some myself after posting this.
Really good tutorials here---awk, sed, regexs, and more:
http://www.grymoire.com/Unix/
 
Old 09-05-2008, 03:12 PM   #5
jiml8
Senior Member
 
Registered: Sep 2003
Posts: 3,171

Rep: Reputation: 114Reputation: 114
Is there a particular reason you want to do this as a shell script? This parsing would be absolutely trivial in PHP, using the explode command.
 
Old 09-06-2008, 02:27 AM   #6
jonnymorris
LQ Newbie
 
Registered: Sep 2008
Posts: 10

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by jiml8 View Post
Is there a particular reason you want to do this as a shell script? This parsing would be absolutely trivial in PHP, using the explode command.
I'm probably jumping the gun a bit in stating what tools I might want to use
I do not know any PHP, however if someone wishes to show me how simple it would be then I'm all ears The problem is a bit more complicated than merely splitting a field, it has to be split into Title, Firstname, Secondname, Surname and all values between Secondname and Surname removed. There are also special cases where there is no title, no first or second name, fields containing '&' should probably not be processed but removed to a separate file.
 
Old 09-06-2008, 04:04 AM   #7
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654
The info file is much better than the manpage. I downloaded the source.

If you download the package source, you can run
Code:
tar xvzf gawk-<version>.tar.gz
cd gawk-<version>/
./configure
make pdf
cp doc/gawk.pdf ~/Documents/
The first edition of the O'Reilly "Sed & Awk" book might be available on the web as well.

Using gawk, look at the split function to decompose the second field into an array. Then use the last element as the family name, and the first element as title, the second as the first name. Between the second and last elements exclusive, I imagine would be the middle name(s).
You can assign the array elements to variables, or use them in your print command directly.

Here is an example taking your posted sample and printing out the title and last names:
Code:
awk -F, '{ split($2,name," "); print "title: " name[1] "\tsurname: " name[length(name)] }' demo.txt
title: MR   surname: JONES
title: MRS  surname: KHAN
title: MR   surname: PRATT
title: MR   surname: AWKWARD
Here is an example that needs debugging:
Code:
awk -F, '{ split($2,name," "); title=name[1]; first=name[2]; surname=name[length(name)]; middle=""; for (i=3;i<length(name);i++) { middle=middle name[i] }; print title","first","middle","surname }' demo.txt
MR,K,L,JONES
MRS,P,,KHAN
MR,I,MA,PRATT
MR,AWKWARD,,AWKWARD
You may want to create a file with the gawk program. Then each awk command can be on a line. The third line didn't work and the last line where there wasn't a middle name or names. Like I said, it needs debugging but you get an idea how to decompose the name field in the original to separate fields.

Last edited by jschiwal; 09-06-2008 at 04:47 AM.
 
Old 09-06-2008, 11:54 AM   #8
jiml8
Senior Member
 
Registered: Sep 2003
Posts: 3,171

Rep: Reputation: 114Reputation: 114
Quote:
Originally Posted by jonnymorris View Post
I'm probably jumping the gun a bit in stating what tools I might want to use
I do not know any PHP, however if someone wishes to show me how simple it would be then I'm all ears The problem is a bit more complicated than merely splitting a field, it has to be split into Title, Firstname, Secondname, Surname and all values between Secondname and Surname removed. There are also special cases where there is no title, no first or second name, fields containing '&' should probably not be processed but removed to a separate file.
If you have that kind of problem, you are probably well past what you should be doing in the shell.

You are going to need to set up and test a lot of rules to identify and process the various cases you have set forth here. Good luck with that; it is a very non-trivial exercise.
 
Old 09-06-2008, 02:00 PM   #9
jonnymorris
LQ Newbie
 
Registered: Sep 2008
Posts: 10

Original Poster
Rep: Reputation: 0
Cool

Quote:
Originally Posted by jschiwal View Post
Here is an example taking your posted sample and printing out the title and last names:
Code:
awk -F, '{ split($2,name," "); print "title: " name[1] "\tsurname: " name[length(name)] }' demo.txt
title: MR   surname: JONES
title: MRS  surname: KHAN
title: MR   surname: PRATT
title: MR   surname: AWKWARD
Here is an example that needs debugging:
Code:
awk -F, '{ split($2,name," "); title=name[1]; first=name[2]; surname=name[length(name)]; middle=""; for (i=3;i<length(name);i++) { middle=middle name[i] }; print title","first","middle","surname }' demo.txt
MR,K,L,JONES
MRS,P,,KHAN
MR,I,MA,PRATT
MR,AWKWARD,,AWKWARD
Thank you very much! Examples are what I needed, without those it's difficult to know what gawk actions I should be using I'm sure I can fiddle with these and learn how they work, hopefully I can sort out most of the problem. Thanks again!

EDIT: I just tried the 'make pdf' part and discovered that I'm supposed to have TeX installed, but this appears to be a 1.15GB download - WHAT THE??!! Just what is in that kind of a distribution?? Erm, I don't have that kind of bandwidth. Guess I'll be looking for that O'Reilly book instead then.

EDIT EDIT: I discovered that there is a cut-down option of the TeX distribution, BasicTex (40MB), now that's more like it.

Last edited by jonnymorris; 09-06-2008 at 04:26 PM.
 
Old 09-06-2008, 06:36 PM   #10
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,151

Rep: Reputation: 331Reputation: 331Reputation: 331Reputation: 331
Here's a slightly more complex example illustrating some gawk features. Note that, rather than the CSV arrays to specified for the name and address strings, I opted to use slash separated values. Those could, obviously, be made into CSV strings by changing the " / " to a ", " where the parsing is done. I just thought that (for example purposes) the slash worked better.

Anyhow, here's my code:
PHP Code:
#!/bin/gawk                        
BEGIN {                            
  
FS="[[:space:]]*,[[:space:]]*";  
}                                  
RefNum=$1;                       
# Get the title, name and surname of the record
  
RawName=$2;                                  
  
ns=split(RawName,sName," ");                 
  if (
ns 1) {                                
    
Title=sName[1];                            
    
Name="";                                   
    
Surname=sName[ns];                         
    if (
ns 2) {                              
      for (
i=2i<ns; ++i) {                   
        
Name=Name sName[i];                    
        if (
ns-1Name=Name " / "          
      
}                                        
    }                                          
    else {                                     
      
Title="";                                
      
Surname=RawName;                         
    }                                          
  }                                            
# Get the address and postal code of the record
  
Address="";                                  
  for (
i=3NF; ++i) {                     
    
Address Address $i;                      
    if (
NF 1Address Address " / ";   
  }                                            
  
PostCode=$NF;                                
  
ID[RefNum]=RefNum;                           
  
Record[RefNum,"Title"]=Title;                
  
Record[RefNum,"Name"]=Name;                  
  
Record[RefNum,"Surname"]=Surname;            
  
Record[RefNum,"Address"]=Address;            
  
Record[RefNum,"PostCode"]=PostCode;          
}
# Print the records
END {
  for (
i in ID) {
    print 
" "
    
print "Record " i;
    print 
"   Name:      " Record[i,"Title"" " Record[i,"Surname"] (Record[i,"Name"]!="" ", " Record[i,"Name"] : "");
    print 
"   Address:   " Record[i,"Address"];
    print 
"   Post Code: " Record[i,"PostCode"];
  }

and the output it produces:
Code:
$ gawk -f parse.awk text

Record 123456
   Name:      MR JONES, K / L
   Address:   126 Credibility Street / Wadsworth / Notts
   Post Code: NT3 PQZ

Record 123457
   Name:      MRS KHAN, P
   Address:   6 Blah Road / Preston
   Post Code: PR1 9QP

Record 123458
   Name:      MR PRATT, I / M / A
   Address:   1 Slime Street / Kellythorpe / Woking / Derbyshire
   Post Code: D2 8TP

Record 123459
   Name:       MR AWKWARD
   Address:   56 Park Lane / Leeds
   Post Code: LS1 69Q
You may have no need to actually store the records in memory, so all that "record[x,y]" stuff may be irrelevant. But, again, I was just trying to illustrate some possibilities. (For example, that ID[] array could be used to detect duplicated record numbers: if (RecNum in ID) {error processing})
 
Old 09-17-2008, 09:10 AM   #11
jonnymorris
LQ Newbie
 
Registered: Sep 2008
Posts: 10

Original Poster
Rep: Reputation: 0
Hi,

I've modified the END part to format the output for each record on one line and use "," as the field separator, however I'm getting an error 'unexpected newline or end of string'

Code:
END { 
  for (i in ID) { 
    print i "," Record[i,"Title"] "," Record[i,"Name"]!="" ? "," Record[i,"Name"] "," Record[i,"Address"] "," Record[i,"PostCode"] 
} 
}
Any ideas?
 
Old 09-17-2008, 01:54 PM   #12
jan61
Member
 
Registered: Jun 2008
Posts: 235

Rep: Reputation: 46
Moin,

Quote:
Originally Posted by jonnymorris View Post
...however I'm getting an error 'unexpected newline or end of string'

Code:
END { 
  for (i in ID) { 
    print i "," Record[i,"Title"] "," Record[i,"Name"]!="" ? "," Record[i,"Name"] "," Record[i,"Address"] "," Record[i,"PostCode"] 
} 
}
You use the short if-then-else construct condition?action:action but there's no else-part (..:action). At least you should add a :"" if you don't want to output anything. And last I would add braces around this construct to define start and end.

Jan
 
Old 09-17-2008, 07:15 PM   #13
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,287

Rep: Reputation: 2034Reputation: 2034Reputation: 2034Reputation: 2034Reputation: 2034Reputation: 2034Reputation: 2034Reputation: 2034Reputation: 2034Reputation: 2034Reputation: 2034
I'd go with jiml8; if you've got all those exceptions/special cases I'd use a different lang, in my case Perl. Text munging is one of its strengths.
He's also right that you need to carefully define the rules to handle all those special cases before you start.
 
Old 09-18-2008, 11:04 AM   #14
jonnymorris
LQ Newbie
 
Registered: Sep 2008
Posts: 10

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by jan61 View Post
Moin,

You use the short if-then-else construct condition?action:action but there's no else-part (..:action). At least you should add a :"" if you don't want to output anything. And last I would add braces around this construct to define start and end.

Jan
Thank you for the suggestions, but no matter what I do, what kind of loop I use, I still get the same error 'unexpected newline or end of string'
 
Old 09-18-2008, 12:56 PM   #15
jan61
Member
 
Registered: Jun 2008
Posts: 235

Rep: Reputation: 46
Moin,

I'd assume you have a syntax error at another part of your script. Did you try the script without the END section?

Jan
 
  


Reply

Tags
csv, file, manipulation, text


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Comparing two csv files and write different record in third CSV file irfanb146 Linux - Newbie 3 06-30-2008 09:15 PM
separating a comma delimited line mrobertson Programming 7 07-27-2005 01:56 PM
Sed(?); Appending a comma-delineated file ice_hockey Linux - General 2 05-27-2005 08:42 AM
Comma-Delimited Website Filenames Apocalypse General 1 11-09-2003 09:05 AM
comma delimited file cdragon Programming 5 06-21-2002 07:55 PM


All times are GMT -5. The time now is 06:04 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration