Old 11-21-2008, 12:11 AM   #1
toshibalaptoplinux
LQ Newbie
 
Registered: Nov 2008
Posts: 9

Rep: Reputation: 0
Merging Data Files


I would like to merge, sort, and remove duplicate entries from two data files. They are both CSV files: one is an address book export from Jpilot and the other is an address book export from Thunderbird. Later I would also like to deal with a third file, which would be an Outlook 2000 export.

When finished I will import the clean data back into my applications. I have over 5000 entries in each file, of which a large percentage are duplicates.

My first concern is that there will be discrepancies in the field names, so I guess I would have to sort and match the fields before merging the files?

The second is that the data will contain both Japanese and English character sets.

I have little experience with text editing in Linux at the command line, and all the googling I have done has, quite frankly, left me lost.

It seems that I would have to use some combination of sed/awk, sort -u, and cat, but I am really at a loss about how to do this. cat to merge the files, and then sed/awk and sort -u to sort and remove duplicates?

The most important thing for me is to be able to see what data is deleted. I would rather it not delete anything silently in the background, but instead output a record of the deleted data as well as the clean data. Of course I will back everything up before I start. So my question is: where do I start, and what commands and syntax should I use?
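
From the reading I have done so far, I am guessing the basic shape is something like this, assuming for the moment that a duplicate means two lines that are completely identical (the file names are made up):

Code:
# merge the two exports, sort, then split into clean lines and duplicates
cat jpilot.csv thunderbird.csv | sort > merged-sorted.csv
uniq merged-sorted.csv > clean.csv              # one copy of every distinct line
uniq -d merged-sorted.csv > removed-dupes.csv   # a record of each line that appeared more than once

Is that the right track?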

Thanks
 
Old 11-21-2008, 12:46 AM   #2
kenneho
Member
 
Registered: May 2003
Location: Oslo, Norway
Distribution: Ubuntu, Red Hat Enterprise Linux
Posts: 655

Rep: Reputation: 40
Hi.


Cleaning data like this can be very complex. One thing is the technical issues (which commands to use and so forth), but things such as deciding what constitutes a match between records must also be taken into account:
  • Matching: How do you define a match between records? Do the names need to match exactly, or should, e.g., "John Smith" match "John F. Smith"? How alike must the fields be before you're sure it's a match? How many fields must be alike before you're sure it's a match?
  • Surviving: Which fields should survive from two matched records (should "John F. Smith" or "John Smith" make it into the final record)?

Maybe you've already developed a matching scheme, but if not, you might want to consider the bullets above. This is not the technical answer you were asking for, but I thought it was worth mentioning.
 
Old 11-21-2008, 05:49 PM   #3
PAix
Member
 
Registered: Jul 2007
Location: United Kingdom, W Mids
Distribution: SUSE 11.0 as of Nov 2008
Posts: 195

Rep: Reputation: 40
By definition, you have excluded from helping you anyone who is not familiar with the details of the three export formats you mentioned. I am assuming that they each differ in some significant way; otherwise you wouldn't have needed to mention them explicitly. This might just be why you have had no meaningful reply so far. You can't expect your helpers to do all the running for you.

I would decide upon a convenient, all-embracing common format for the files to be individually written to, with at least a field to indicate which of the three applications provided the data. My favourite tool for this is AWK; read each of the original CSV files (one AWK script for each - clone and modify) and write the data out in the common format. If done right, this should align like fields.
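
To give a flavour, a first pass at one of those per-file scripts might look something like the sketch below; the field positions and file name are invented for illustration and must be adjusted to the real JPilot column order. Note also that splitting on a bare comma mis-handles quoted fields that contain commas, so treat it as a dry run only.

Code:
# jpilot2common.awk - sketch; invoke as:  awk -f jpilot2common.awk jpilot.csv
BEGIN { FS = ","; OFS = "|" }       # "|" as an output delimiter unlikely to occur in the data
{
    gsub(/"/, "")                   # crude removal of the surrounding quotes
    # assumed positions: $3 = last name, $4 = first name, $10 = email
    print "jpilot", $3, $4, $10     # tag each record with its source application
}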

Unique names of individuals might be represented in several different forms, even within the same email client's export file, as kenneho pointed out. The name therefore isn't such a hot candidate for the key to everything.

Once you have the three (or more, if you are very ambitious) files written in the common format, with unique field delimiters of course (a new CSV file), along with as many extra fields as you feel might be helpful during further processing, you can concatenate the files into one large file, checking carefully that the format is truly consistent, with no fielding errors. If it's not correct, you can backtrack and fix the script problems before proceeding. I would expect some fields to have missing elements of data, but this is not likely to be a problem.

The email address field is a prime candidate for your main key, but it might also contain all sorts of additional crud, depending on the whims of the application and the person who originally set up the data.

So you might like to write away both the original field and a cleaned-up version of it.

Rule 1: never throw anything away, in case you need it.

In just one email client you could, for example, have:
Quote:
"Robert Jackson" <jcksnr296@abc.com>
<jcksnr296@abc.com>
"R Jackson" <jcksnr296@abc.com>
With those neatly preserved, you can process the second copy of the field and write away:
Quote:
<jcksnr296@abc.com>
With or without the angle brackets, which you would use as delimiters for the real email address. This might, for example, give you three duplicate records that can be flagged for reporting.
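
As a sketch of that cleanup step, a sed one-liner can keep just the part between the angle brackets (this assumes at most one <...> per field):

Code:
# pull the bare address out of a field like:  "R Jackson" <jcksnr296@abc.com>
echo '"R Jackson" <jcksnr296@abc.com>' | sed 's/.*<\(.*\)>.*/\1/'
# prints: jcksnr296@abc.com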

You might choose to report all the duplicates found, so that you can manually select which ones to promote to your final import file, the remainder being written to the "didn't make it" file.

So now you have this super file of just the records you want, because you removed the duplicate flag on the duplicates you wished to promote.
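
The wanted/unwanted split itself can be as small as this sketch, assuming the cleaned-up address ended up as, say, field 4 of the '|'-delimited common-format file:

Code:
# first record seen for each address goes to the wanted file, later ones to the reject file
awk -F'|' '!seen[$4]++ { print > "wanted.csv"; next }
                       { print > "unwanted.csv" }' combined.csv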

With this bespoke master file, you might now wish to rewrite it to produce import CSV files in the original formats, dropping fields that are no longer required.

Doing it in stages probably increases focus at any given moment in the process, so mistakes are less likely to go unnoticed and make it into the final cut. The probability is also that the whole shebang will only be used a handful of times at most before ending up forgotten in a dark corner of a disk somewhere, having done its job. This sort of stuff can be very satisfying.

If it were to be used anything like regularly, documentation would be a good idea, and you might even choose to tie it all together with a Bash script.

Tools: Bash, AWK, sort, and a few other bits and pieces. The sort is only used to order the data conveniently, with AWK taking care of writing all the data to the wanted and unwanted files, as per Rule 1. Then you know exactly where you are.

Others would do the same job employing sed, perl or other scripting languages, each because that's the way they like to do things.

I hope I have given you sufficient either to make a start on the task, or at least enough to ask the right questions of other, cleverer people than myself.

If you need further assistance, I suggest that you provide short, comprehensive, sanitised (not real data, but fully representative) examples of the three formats you are interested in processing. That way you are providing the running track for the runners and it will be that much easier all round.

Good luck.
 
Old 12-11-2008, 02:24 AM   #4
toshibalaptoplinux
LQ Newbie
 
Registered: Nov 2008
Posts: 9

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by kenneho (see post #2 above)
Thank you very much for your reply.

I have given a lot of thought to what constitutes a match and will come to that hurdle in due time.

My most immediate concern now is that I know there are records that are EXACT matches across all fields and the data that populates them. If I can start by isolating and getting rid of those first, it will be a huge step toward cleaning my data.
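
From what I have picked up so far, I think something like this would handle that first pass while keeping a record of what was dropped (file names made up), but please correct me if I am wrong:

Code:
awk '!seen[$0]++' all.csv > deduped.csv       # keeps the first copy of each identical line, in order
awk 'seen[$0]++'  all.csv > exact-dupes.csv   # every later repeat, kept as an audit trail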
 
Old 12-11-2008, 02:42 AM   #5
toshibalaptoplinux
LQ Newbie
 
Registered: Nov 2008
Posts: 9

Original Poster
Rep: Reputation: 0

Thank you very much for your lengthy reply; it is greatly appreciated. It is a lot to digest and has given me much to think about. I have provided more detail in the body of your reply.

Quote:
Originally Posted by PAix
By definition, you have excluded from helping you anyone who is not familiar with the details of the three export formats you mentioned. I am assuming that they each differ in some significant way; otherwise you wouldn't have needed to mention them explicitly. This might just be why you have had no meaningful reply so far. You can't expect your helpers to do all the running for you.

<Answer>The files are not that radically different. There will be some field discrepancies, but I think I can match them. Matching the data in them is a different story, as you noted. All three files are CSV files exported from Outlook, JPilot, and Thunderbird respectively. I am sorry if I made things more confusing than necessary; I just figured the more details I provided, the more likely I would be to get an answer.</Answer>

I would decide upon a convenient, all-embracing common format for the files to be individually written to, with at least a field to indicate which of the three applications provided the data. My favourite tool for this is AWK; read each of the original CSV files (one AWK script for each - clone and modify) and write the data out in the common format. If done right, this should align like fields.

<Answer>This is where I have to do my homework, and where I was hoping to get help. I am really sed/awk illiterate and wouldn't even know how to begin.</Answer>


Unique names of individuals might be represented in several different forms, even within the same email client's export file, as kenneho pointed out. The name therefore isn't such a hot candidate for the key to everything.

<Answer>Noted</Answer>

Once you have the three (or more, if you are very ambitious) files written in the common format, with unique field delimiters of course (a new CSV file), along with as many extra fields as you feel might be helpful during further processing, you can concatenate the files into one large file, checking carefully that the format is truly consistent, with no fielding errors. If it's not correct, you can backtrack and fix the script problems before proceeding. I would expect some fields to have missing elements of data, but this is not likely to be a problem.

The email address field is a prime candidate for your main key, but it might also contain all sorts of additional crud, depending on the whims of the application and the person who originally set up the data.

So you might like to write away both the original field and a cleaned-up version of it.

Rule 1: never throw anything away, in case you need it.

<Answer>Without a doubt. Backed up and burned to removable media.</Answer>

In just one email client you could, for example, have the same address stored in several forms. With those neatly preserved, you can process the second copy of the field and write away the bare address, with or without the angle brackets, which you would use as delimiters for the real email address. This might, for example, give you three duplicate records that can be flagged for reporting.
<Answer>Noted</Answer>

You might choose to report all the duplicates found, so that you can manually select which ones to promote to your final import file, the remainder being written to the "didn't make it" file.

<Answer>Noted</Answer>

So now you have this super file of just the records you want, because you removed the duplicate flag on the duplicates you wished to promote.

With this bespoke master file, you might now wish to rewrite it to produce import CSV files in the original formats, dropping fields that are no longer required.

<Answer>Noted</Answer>

Doing it in stages probably increases focus at any given moment in the process, so mistakes are less likely to go unnoticed and make it into the final cut. The probability is also that the whole shebang will only be used a handful of times at most before ending up forgotten in a dark corner of a disk somewhere, having done its job. This sort of stuff can be very satisfying.
<Answer>Noted</Answer>

If it were to be used anything like regularly, documentation would be a good idea, and you might even choose to tie it all together with a Bash script.
<Answer>Noted</Answer>

Tools: Bash, AWK, sort, and a few other bits and pieces. The sort is only used to order the data conveniently, with AWK taking care of writing all the data to the wanted and unwanted files, as per Rule 1. Then you know exactly where you are.

<Answer>This is where I will need the most assistance, and I will try to post concise questions.</Answer>

Others would do the same job employing sed, perl or other scripting languages, each because that's the way they like to do things.

I hope I have given you sufficient either to make a start on the task, or at least enough to ask the right questions of other, cleverer people than myself.

If you need further assistance, I suggest that you provide short, comprehensive, sanitised (not real data, but fully representative) examples of the three formats you are interested in processing. That way you are providing the running track for the runners and it will be that much easier all round.
<Answer>Thank you VERY MUCH. It has absolutely given me a starting point and will allow me to post more precise questions.</Answer>

Good luck.
 
Old 12-12-2008, 09:52 AM   #6
PAix
Member
 
Registered: Jul 2007
Location: United Kingdom, W Mids
Distribution: SUSE 11.0 as of Nov 2008
Posts: 195

Rep: Reputation: 40
Hi Tosh',
The all-embracing format isn't anything to be concerned about, of course; it just makes sure that there is a suitable pigeonhole for every bit of information from all your input file formats.
I honestly couldn't describe to you the output formats of Outlook, JPilot or Thunderbird, and I have no interest or motivation to research them - welcome to hard truths.

If you publish a representative two or three lines from each of the CSV files you are interested in (Outlook, JPilot and Thunderbird; change real data to preserve privacy, but please change it realistically) and let me know what the fields are called, I will knock up an Aunt Sally for you to play with. Short samples are inclined to be less than truly representative, causing the imagination to work overtime to anticipate data combinations that may never actually appear in your data - blossoming like a cherry tree.

Your move, I believe. Regards,
 
Old 12-14-2008, 11:32 PM   #7
toshibalaptoplinux
LQ Newbie
 
Registered: Nov 2008
Posts: 9

Original Poster
Rep: Reputation: 0

Quote:
Originally Posted by PAix (see post #6 above)
Once again, thank you for your time and assistance. It is greatly appreciated. I am really trying to learn this stuff as I go along, so I will eventually be a little more self-sufficient.

I have provided the examples that you requested for the Thunderbird and JPilot outputs. I will forgo the Outlook data for now, for simplicity's sake.

There are only a couple of things I would note. First, there will be both English and Japanese (kanji, katakana, hiragana) data populating both data sets, but I am pretty sure it is all UTF-8, so I don't think it presents a problem? If it isn't all UTF-8, I think I can convert it, but I will cross that bridge later if necessary.

Second, there may be multiple phone number formats: xxx-xxx-xxxx, (xxx)xxx-xxxx, xxx-xxxx-xxxx, xxxxxxxxxxx. I think this is something I will have to deal with on my own, but I mention it in case it adds another layer of complexity to my problem.
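
From what I have read, one way to compare the numbers is to strip each one down to bare digits first, e.g. (just a sketch):

Code:
# reduce any of the formats above to bare digits before comparing
echo '(718) 555-1212' | sed 's/[^0-9]//g'
# prints: 7185551212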

Third, there are instances of email data in the "Custom" fields too, so I would like to be able to search those as well as the normal "email" fields, e.g. email@domain.com.
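
I have read that a loose pattern match across the whole line would catch addresses in the Custom fields as well; something like this sketch:

Code:
# find email-shaped strings anywhere on a line, whichever field they are in
grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]+' all.csv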

As I said before, although it would be nice, I am not looking to have someone do all the work for me. I am willing to learn and to try as much as I can myself, but I really am lost in sed/awk. If I can find a way to get the exact duplicates out to begin with, that is a good start for me.

These are the fields for the JPilot output and some sample data:

CSV address version 1.6.0: Category, Private, Last name, First name, Title, Company, Phone 0 label, Phone 0, Phone 1 label, Phone 1, Phone 2 label, Phone 2, Phone 3 label, Phone 3, Phone 4 label, Phone 4, Address, City, State, Zip Code, Country, Custom 1, Custom 2, Custom 3, Custom 4, Note, Show in List


"Unfiled","0","Smith","John","","","Work","212-555-1212","E-mail","email@domain.com","Other","718-555-1212","Other","","E-mail","","","Maui","Hawaii","","United States of America","","","","","","0"

"携帯","0","さとう","たかよ","","","Other","09012123434","E-mail","email@docomo.ne.jp","Fax","","Mobile","","Other","","","","","","","","","","","","1"

"Unfiled","0","Smith","Jane","","","Home","(718) 555-1212","E-mail","email@aol.com","Mobile","917-555-1212","E-mail","email@yahoo.com","Pager","","","","","","","","","","","These are notes in alphanumeric

","0"

These are the fields for the Thunderbird output and some sample data:

First Name,Last Name,Display Name,Nickname,Primary Email,Secondary Email,Work Phone,Home Phone,Fax Number,Pager Number,Mobile Number,Home Address,Home Address 2,Home City,Home State,Home ZipCode,Home Country,Work Address,Work Address 2,Work City,Work State,Work ZipCode,Work Country,Job Title,Department,Organization,Web Page 1,Web Page 2,Birth Year,Birth Month,Birth Day,Custom 1,Custom 2,Custom 3,Custom 4,Notes,

AAA,Office,AAA Branch Office,,Unknown,,(212) 468-2600 IDP's only,,,,,,,,,,,1881 Broadway at 62nd Street,,New York,New York,,United States of America,,,AAA Branch Office,,,,,,,,,,"Mon. - Fri. 8:45 a.m. - 5:30 p.m. Sat. 9:00 a.m. - 5:00 p.m. Branch Offices are closed Sundays and the following holidays: Martin Luther King Day, Memorial Day, Independence Day, Labor Day, Thanksgiving, Christmas and New Year's Day. ",

Wakimo,Chiada,"Wakimo, Chiada",,wemail@yahoo.com,email@yahoo.co.jp,,,,,080-1212-3434,,,,,,,,,,,,Japan,,,,,,,,,,,,,,

美奈,せねき,せねき 美奈,,email@za.cyberhome.ne.jp,email@mac.com,,,,,09012123434,,,,,,,,,,,,,,,,,,,,,,,,,,
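
Going by the Thunderbird header line above, First Name, Last Name, and Primary Email are columns 1, 2, and 5, so I am guessing the per-file mapping you described would start out something like this (a sketch only; the bare comma split will mis-handle quoted fields that contain commas, like the Notes field in the first sample line):

Code:
# thunderbird -> common format, using the column order shown above; skip the header line
awk -F',' 'NR > 1 { print "thunderbird" "|" $2 "|" $1 "|" $5 }' thunderbird.csv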
 
  

