LinuxQuestions.org
Old 07-17-2016, 11:15 AM   #1
Anobodyinok
LQ Newbie
 
Registered: Jul 2016
Posts: 2

Rep: Reputation: Disabled
Post Data Processing / Data Mining?


I have a contacts list with approximately 1400 entries.

What I need to do is determine who is related to, or shares contact info (phone/address) with, whom. In other words, I need to parse these contacts in a way that shows which ones share the same contact details, such as an address or a phone number. I know I could do it "manually" with kaddressbook, but I want something more robust that puts everything in front of me.

My solution could be a standalone application, a script, or a web-based (LAMP) app/script.

I am running Debian and have LAMP set up.

Anyone have any ideas on how to accomplish this?
 
Old 07-17-2016, 11:52 AM   #2
cliffordw
Member
 
Registered: Jan 2012
Location: South Africa
Posts: 509

Rep: Reputation: 203
Hi there, and welcome!

For starters, where is the list now - in kaddressbook, or in some other app/file?

The best solution would depend on the format of the data, and on which scripting/programming tools/languages you are most comfortable with.

One approach might be to export the data to CSV, vCard, or some other text-based format, and then write a script to do the comparison. If this is a one-off exercise, you could probably use tools like grep or cut to extract the fields you want, and sort & uniq to find the duplicates. For a more permanent tool there are probably better options, though, depending on your skills.

Good luck!
 
Old 07-17-2016, 12:17 PM   #3
Anobodyinok
LQ Newbie
 
Registered: Jul 2016
Posts: 2

Original Poster
Rep: Reputation: Disabled
Thank you for your response. Currently I am using kaddressbook and Evolution. I am most comfortable with PHP, but that's about the extent of my programming skills.

I guess what I am hoping for is to be able to list which contacts share a data point, and to analyze those groups, if possible. I am playing with a few LDAP web-based apps, but nothing yet does what I want, at least out of the box.

Could grep be used to find relations between data records? I assumed I would still have to input an initial value to find its match, much like a general "search" algorithm.



Thanks again!
 
Old 07-17-2016, 09:32 PM   #4
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941
You have a slew of options ... including both databases (even SQLite "files"), and the ubiquitous spreadsheet.

Although "1,400 records" is daunting for a human being, it's child's play for a computer. Quite frankly, I'd load the data into a spreadsheet (say, OpenOffice, or, dare I say it, Microsoft Excel ...), and, on additional "notebook pages," begin looking for commonality. You might wish to, for instance, sort the records by a particular column or set of columns and then simply scroll through them, looking for groups of identical or nearly-identical information (since sorting places these records adjacent to one another).
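For anyone who would rather try the same sort-and-scroll idea from a terminal instead of a spreadsheet, here is a minimal sketch. It assumes the contacts were exported to a CSV file with a header row and no quoted commas inside fields; the function name sort_by_column and the name,phone,address layout are hypothetical, not something from the original export.

```shell
# Hypothetical helper: print the CSV rows sorted by one column so that
# records sharing a value in that column end up next to each other.
# Assumes a header row on line 1 and no commas inside quoted fields.
sort_by_column() {
    file=$1
    col=$2
    # Skip the header, then sort on the chosen comma-separated column only.
    tail -n +2 "$file" | sort -t, -k"$col","$col"
}
```

Something like `sort_by_column contacts.csv 2 | less` would then let you scroll through the records grouped by the second column.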

"Keep it simple. Very simple." You don't need to write a LAMP web-site. You probably won't even have to write a script. Your spreadsheet tool already possesses database connectivity, but, with "only" 1,400 records in play, I'm not entirely sure I'd bother.
 
2 members found this post helpful.
Old 07-18-2016, 08:54 AM   #5
cliffordw
Member
 
Registered: Jan 2012
Location: South Africa
Posts: 509

Rep: Reputation: 203
Hi,

Quote:
Originally Posted by Anobodyinok
Could grep be used to find relations of data records? I assumed I would still have to input an initial value to find its match, much like a general "search" algorithm.
Let's look at duplicate phone numbers as an example. By "grep" I meant using it to extract the phone numbers (the first step). grep would be a good tool for this if you export the data to vCard / LDIF file(s). Then use "sort" and "uniq" to find all phone numbers that are duplicated.

Once you have the numbers that are duplicated, you can use the number as pattern for "grep" to find the contacts with that number.

As an example, I have a bunch of contacts in individual vcard files. I would do something like this:

Code:
grep -H '^TEL' *.vcf | cut -d: -f3 | sort | uniq -c | sort -rn   # -H keeps the "file:" prefix even with a single .vcf, so the number is always field 3
This will give me a count (number of occurrences) for each number; where the count is higher than 1, I have a duplicate ;-)
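The second step described above (using each duplicated number as a grep pattern to find the contacts that carry it) could be sketched roughly like this. It assumes one contact per .vcf file in a single directory and that a shared number is written identically in each file; the function name find_shared_numbers is made up for illustration.

```shell
# Hypothetical helper: for every phone number that occurs more than once,
# list the vCard files that contain it.
# Assumes one contact per .vcf file and identically formatted numbers.
# Note: the final grep matches the number anywhere in a file, not only on
# TEL lines, which is good enough for a quick look.
find_shared_numbers() {
    dir=$1
    # Take the text after the last colon of each TEL line, drop the CR that
    # vCard line endings carry, and keep only values seen more than once.
    grep -h '^TEL' "$dir"/*.vcf | tr -d '\r' | sed 's/.*://' | sort | uniq -d |
    while IFS= read -r number; do
        printf '%s: ' "$number"
        # -l prints matching file names only; -F treats the number literally.
        grep -lF -- "$number" "$dir"/*.vcf | sed 's|.*/||' | sort | paste -s -d ' ' -
    done
}
```

Running `find_shared_numbers ~/contacts` would print one line per shared number, followed by the files it appears in.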

If you export to a CSV file, you could use "cut" to extract the relevant field from the file (no grep required), and then do the same "sort | uniq -c | sort -rn" on that output.
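As a rough illustration of the CSV variant, the sketch below wraps the same pipeline in a small function. The file layout (a header row, comma-separated fields, no quoted commas) and the name count_field_duplicates are assumptions for the example; a real export from kaddressbook or Evolution may well need different field numbers or a proper CSV parser.

```shell
# Hypothetical helper: count how often each value of one CSV column occurs
# and print only the values that occur more than once.
# Assumes a header row on line 1 and no commas inside quoted fields.
count_field_duplicates() {
    file=$1
    field=$2
    # Skip the header, extract the column, drop empty values (otherwise all
    # contacts without a phone number would look like "duplicates"),
    # then count and keep only counts greater than 1.
    tail -n +2 "$file" | cut -d, -f"$field" | grep . |
        sort | uniq -c | sort -rn | awk '$1 > 1'
}
```

For example, `count_field_duplicates contacts.csv 2` would report the shared values in the second column, and the same call with a different field number would do the same for addresses.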

I hope that makes more sense now ;-)

If you want to write something in PHP, I would start out by exporting to CSV, and importing that into an SQLite database as sundialsvcs suggested. From there it shouldn't be too difficult to write some code to find duplicates. One advantage of such an approach is that you could also sanitize the data a little in the PHP code (for example, standardize the format of phone numbers).
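The sanitizing idea does not depend on PHP. To stay with the shell tools used earlier in this post, here is a minimal sketch of one possible normalization: stripping everything except digits before comparing. The function name normalize_phone is hypothetical, and this is only one assumption about what "standardize" could mean; it does not reconcile country codes, so "+1 555 123 4567" and "555 123 4567" would still count as different numbers.

```shell
# Hypothetical helper: reduce each phone number on stdin to its digits so
# that "(555) 123-4567" and "555.123.4567" compare as equal.
# Does not handle country codes or extensions.
normalize_phone() {
    # Delete every character that is not a digit or a newline.
    tr -cd '0-9\n'
}
```

It could be dropped into the earlier pipeline just before the sort, for example `cut -d, -f2 contacts.csv | normalize_phone | sort | uniq -c | sort -rn`.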
 
  

