I have a folder containing 20,000 files (and there will be more later); each of these files contains a single URL.
What I am trying to do is compute statistics on the number of duplicated URLs (files with the same content) accumulated per day, say from September to October.
For example, if file A, created on Sep 1st, has the same content as file B, created on Sep 15th, then we add 1 to the number of duplications on Sep 15th (not Sep 1st).
Currently I have a way to do it (in Perl), as below:
(1) Read all files inside the folder and print each file's content and modification time into a big file (see the first sketch below this list) in the format:
Code:
[URL] [month] [day]
(2) Sort the file by month, then by day.
(3) Create two hashes: 'date' and 'content'. Then read the URL from each line of the big file into the 'content' hash (the key is the URL, the value is left undef). Whenever a newly read URL is already present in 'content', a duplication is detected, so mark '$date{$month.$day}++' (see the second sketch below).
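Here is a rough sketch of what I mean for step (1). This is only an illustration; the folder name 'urls' and the output file 'big_file.txt' are placeholders, not my real paths:
Code:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

my $dir = 'urls';                       # placeholder folder name
open my $out, '>', 'big_file.txt' or die "Cannot open output: $!";

opendir my $dh, $dir or die "Cannot open $dir: $!";
for my $file (readdir $dh) {
    next unless -f "$dir/$file";        # skip . .. and subdirectories
    open my $fh, '<', "$dir/$file" or die "Cannot read $dir/$file: $!";
    my $url = <$fh>;                    # each file holds a single URL
    close $fh;
    next unless defined $url;
    chomp $url;
    my $mtime = (stat "$dir/$file")[9]; # modification time
    my $month = strftime('%m', localtime $mtime);
    my $day   = strftime('%d', localtime $mtime);
    print {$out} "$url $month $day\n";  # [URL] [month] [day]
}
closedir $dh;
close $out;
For step (2) I can then sort the big file outside Perl, e.g. 'sort -k2,2n -k3,3n big_file.txt > sorted.txt' (since the month/day from strftime are zero-padded, a plain lexicographic sort on those fields would work too).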
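And this is a minimal sketch of the two-hash counting in step (3), reading the sorted file ('sorted.txt' is again a placeholder name):
Code:
#!/usr/bin/perl
use strict;
use warnings;

my (%content, %date);   # %content: URLs seen so far; %date: duplicates per day

open my $in, '<', 'sorted.txt' or die "Cannot open sorted file: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($url, $month, $day) = split ' ', $line;
    if (exists $content{$url}) {
        # URL already seen on an earlier line: count the duplicate
        # against this (later) date
        $date{$month . $day}++;
    }
    else {
        $content{$url} = undef;         # first time we see this URL
    }
}
close $in;

# print the per-day duplicate counts
for my $d (sort keys %date) {
    print "$d: $date{$d}\n";
}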
This algorithm should work, but it may take too long... so I am wondering if there is an easier way to do this besides hashes.
Any ideas are much appreciated.
Thanks,
-Kun