I have a folder containing 20,000 files (and there will be more later); each of these files contains a single URL.
What I am trying to do is compute statistics on the number of duplicated URLs (files with the same content) accumulated per day, say from September to October.
For example, if file A, created on Sep 1st, has the same content as file B, created on Sep 15th, then we add 1 to the number of duplications on Sep 15th (not Sep 1st).
Currently I have a way to do it (in Perl), as below:
(1) Read all files inside the folder and print each file's content and modification time into a big file (see the first sketch below this list) in the format:
Code:
[URL] [month] [day]
(2) Sort the file by month, then by day.
(3) Create two hashes: 'date' and 'content'. Then read the URL from each line of the big file into the 'content' hash (the key is the URL, the value is left undef). Whenever a newly read URL is already present in 'content', a duplication is detected, so mark '$date{$month.$day}++' (see the second sketch below).
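Here is a rough sketch of what I mean for step (1). This is only an illustration; the folder name 'urls' and the output file 'big_file.txt' are placeholders, not my real paths:
Code:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

my $dir = 'urls';                       # placeholder folder name
open my $out, '>', 'big_file.txt' or die "Cannot open output: $!";

opendir my $dh, $dir or die "Cannot open $dir: $!";
for my $file (readdir $dh) {
    next unless -f "$dir/$file";        # skip . .. and subdirectories
    open my $fh, '<', "$dir/$file" or die "Cannot read $dir/$file: $!";
    my $url = <$fh>;                    # each file holds a single URL
    close $fh;
    next unless defined $url;
    chomp $url;
    my $mtime = (stat "$dir/$file")[9]; # modification time
    my $month = strftime('%m', localtime $mtime);
    my $day   = strftime('%d', localtime $mtime);
    print {$out} "$url $month $day\n";  # [URL] [month] [day]
}
closedir $dh;
close $out;
For step (2) I can then sort the big file outside Perl, e.g. 'sort -k2,2n -k3,3n big_file.txt > sorted.txt' (since the month/day from strftime are zero-padded, a plain lexicographic sort on those fields would work too).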
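And this is a minimal sketch of the two-hash counting in step (3), reading the sorted file ('sorted.txt' is again a placeholder name):
Code:
#!/usr/bin/perl
use strict;
use warnings;

my (%content, %date);   # %content: URLs seen so far; %date: duplicates per day

open my $in, '<', 'sorted.txt' or die "Cannot open sorted file: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($url, $month, $day) = split ' ', $line;
    if (exists $content{$url}) {
        # URL already seen on an earlier line: count the duplicate
        # against this (later) date
        $date{$month . $day}++;
    }
    else {
        $content{$url} = undef;         # first time we see this URL
    }
}
close $in;

# print the per-day duplicate counts
for my $d (sort keys %date) {
    print "$d: $date{$d}\n";
}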
This algorithm should work, but it may take too long... so I am wondering if there is an easier way to do this besides hashes.
Any ideas are much appreciated.
Thanks,
-Kun