Data Processing / Data Mining?

Anobodyinok · 07-17-2016, 11:15 AM

I have a contacts list with approximately 1400 entries.

What I need to do is determine "who is related to / shares contact info (phone/address) with whom". In other words, I need to parse these contacts in a manner that shows which ones share the same contact details, like an address or a phone number. I know I could do it "manually" with kaddressbook, but I want something more robust that puts everything in front of me.

My solution could be a standalone application, a script, or a web-based (LAMP) app/script.

I am running Debian and have LAMP setup.

Anyone have any ideas on how to accomplish this?

cliffordw · 07-17-2016, 11:52 AM

Hi there, and welcome!

For starters, where is the list now - in kaddressbook, or in some other app/file?

The best solution would depend on the format of the data, and on which scripting/programming tools/languages you are most comfortable.

One approach might be to export the data to CSV/vCard/some other text-based format, and then write a script to do a comparison. If this is a one-off exercise, you could probably use tools like grep or cut to extract the fields you want, and sort & uniq to find the duplicates. For a more permanent tool there are probably better options, though, depending on your skills.

Good luck!

Anobodyinok · 07-17-2016, 12:17 PM

Thank you for your response. Currently I am using kaddressbook and Evolution. I am most comfortable with PHP, but that's about the extent of my programming skills.

I guess what I am hoping for is to be able to list which contacts share a data point, and to analyze those relationships, if possible. I am playing with a few web-based LDAP apps, but have found nothing yet that does what I want, at least out of the box.

Could grep be used to find relations between data records? I assumed I would still have to input an initial value to find its match, much like a general "search" algorithm.

Thanks again!

sundialsvcs · 07-17-2016, 09:32 PM

You have a slew of options ... including both databases (even SQLite "files"), and the ubiquitous spreadsheet.

Although "1,400 records" is daunting for a human being, it's child's play for a computer. Quite frankly, I'd load the data into a spreadsheet (say, OpenOffice, or, dare I say it, Microsoft Excel ...), and, on additional "notebook pages," begin looking for commonality. You might wish to, for instance, sort the records by a particular column or set of columns and then simply scroll through them, looking for groups of identical or nearly-identical information (since sorting places these records adjacent to one another).

"Keep it simple. Very simple." You don't need to write a LAMP web-site.

You probably won't even have to write a script. Your spreadsheet tool already possesses database connectivity, but, with "only" 1,400 records in play, I'm not entirely sure I'd bother.

cliffordw · 07-18-2016, 08:54 AM

Hi,

Quote:

Originally Posted by Anobodyinok

Could grep be used to find relations between data records? I assumed I would still have to input an initial value to find its match, much like a general "search" algorithm.

Let's look at duplicate phone numbers as an example. By using "grep" I meant use it to extract the phone numbers (1st step). grep would be a good tool for this if you export the data to vCard / LDIF file(s). Then use "sort" and "uniq" to find all phone numbers that are duplicated.

Once you have the numbers that are duplicated, you can use the number as pattern for "grep" to find the contacts with that number.

As an example, I have a bunch of contacts in individual vCard files. I would do something like this:

Code:

grep -H '^TEL' *.vcf | cut -d: -f3 | sort | uniq -c | sort -rn

This will give me a count (number of occurrences) for each number; where the count is higher than 1, I have a duplicate ;-)
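
To get from the duplicated numbers back to the contacts that share them (the second step described above), something along these lines might work. It is only a sketch: it assumes one contact per .vcf file and TEL lines of the form TEL;TYPE=...:number, and the function name list_shared_numbers is made up for illustration. It uses grep -h and takes everything after the first colon, so file names never end up in the extracted field.

```shell
#!/bin/sh
# Sketch: for every phone number that appears on more than one TEL line,
# print the number followed by the vCard files that contain it.
# Assumes one contact per .vcf file in the given directory.
list_shared_numbers() {
    dir=${1:-.}
    # -h suppresses file names, so field 2 onwards is the number itself
    grep -h '^TEL' "$dir"/*.vcf |
        cut -d: -f2- |
        tr -d '\r' |
        sort | uniq -d |
        while IFS= read -r number; do
            printf '%s:\n' "$number"
            # -l prints matching file names only; -F takes the number
            # literally. This is a substring match, so a shorter number
            # can also match a longer one that starts with the same digits.
            grep -lF -- ":$number" "$dir"/*.vcf | sed 's/^/  /'
        done
}
```

Calling list_shared_numbers with the directory that holds the exported files should print each shared number with the matching file names indented underneath. A number repeated inside a single contact will also show up, with only one file listed.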

If you export to a CSV file, you could use "cut" to extract the relevant field from the file (no grep required), and then do the same "sort | uniq -c | sort -rn" on that output.
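
For the CSV route, a rough sketch could look like the one below. The file name, the column position and the function name count_shared_field are all placeholders, and it assumes a header row and simple fields: cut does not understand CSV quoting, so fields with embedded commas or quotes would need a real CSV parser.

```shell
#!/bin/sh
# Sketch: count how often each value of one CSV column occurs and show
# only the values that occur more than once.
# Assumes a header row and no commas or quotes inside fields.
count_shared_field() {
    file=$1      # path to the exported CSV (hypothetical, e.g. contacts.csv)
    column=$2    # 1-based column number, e.g. 3 if the phone is third
    tail -n +2 "$file" |        # skip the header row
        cut -d, -f"$column" |
        grep -v '^$' |          # ignore contacts where the field is empty
        sort | uniq -c | sort -rn |
        awk '$1 > 1'            # keep only counts higher than 1
}
```

Run once per field of interest, for example once for the phone column and once for the address column, to see which values are shared.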

I hope that makes more sense now ;-)

If you want to write something in PHP, I would start out by exporting to CSV and importing that into an SQLite database, as sundialsvcs suggested. From there it shouldn't be too difficult to write some code to find duplicates. One advantage of such an approach is that you could also sanitize the data a little in the PHP code (for example, standardize the format of phone numbers).
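
The clean-up idea can be tried from the shell before writing any PHP. The sketch below strips everything except digits before looking for repeats, so that "(555) 123-4567" and "555.123.4567" are treated as the same number. Keeping digits only is an assumption made for illustration: it throws away a leading "+" and does not reconcile numbers written with and without a country code. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: read phone numbers from standard input (one per line), drop
# every character that is not a digit, and print each normalised number
# that occurs more than once.
shared_numbers_normalised() {
    sed 's/[^0-9]//g' |     # "(555) 123-4567" becomes "5551234567"
        grep -v '^$' |      # skip lines that had no digits at all
        sort | uniq -d      # one line per number seen at least twice
}
```

It could be fed from the vCard files with something like grep -h '^TEL' *.vcf | cut -d: -f2- | shared_numbers_normalised.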