LinuxQuestions.org
10-23-2009, 08:57 PM   #1
Kunsheng
[Perl] Ideas for counting files with duplicate contents


I have a folder containing 200,000 files (and there will be more later); each file contains a single URL.

What I am trying to do is count the duplicated URLs (files with the same content) accumulated on each day, say from September to October.

For example, if file A, created on Sep 1st, has the same content as file B, created on Sep 15th, then we add 1 to the duplicate count for Sep 15th (not Sep 1st).


Currently I have a way to do it (in Perl), as follows:

(1) Read all the files in the folder and print each file's content and modification time into one big file in the format:
Code:
[URL] [month] [day]
(2) Sort the big file by month, then by day.

(3) Create two hashes: 'date' and 'content'.

Then read the URL from each line of the big file into the hash 'content' (the key is the URL, the value is left undef). Whenever a newly read URL is already in 'content', a duplicate has been detected, so mark '$date{$month.$day}++'.
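
For reference, steps (1) to (3) can be sketched in a few lines. This is only a rough sketch, written in Python rather than Perl so it can be run as-is. It skips the intermediate big file and sorts in memory, and it assumes each file's modification time is the date that matters. The function name duplicates_per_day and its folder argument are made up for illustration.

```python
import os
from collections import Counter
from datetime import date


def duplicates_per_day(folder):
    """Count, per day, the files whose URL already appeared in an older file."""
    records = []
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            # Step 1: each file holds a single URL; keep it with the file's mtime.
            with open(entry.path, encoding="utf-8", errors="replace") as fh:
                url = fh.read().strip()
            records.append((entry.stat().st_mtime, entry.name, url))

    # Step 2: oldest first (the file name only breaks ties between equal mtimes).
    records.sort()

    # Step 3: a set plays the role of the 'content' hash, a Counter the 'date' hash.
    seen = set()
    per_day = Counter()
    for mtime, _name, url in records:
        if url in seen:
            per_day[date.fromtimestamp(mtime)] += 1  # duplicate lands on the later day
        else:
            seen.add(url)
    return per_day
```

Set and dict lookups take roughly constant time on average, so the loop is linear in the number of files. Most of the run time will probably go to opening and reading the files, not to the hashing.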

The algorithm should work but may take too long... so I am wondering if there is an easier way to do this besides hashes.

Any ideas are much appreciated,

Thanks,

-Kun
 
10-24-2009, 02:40 AM   #2
lutusp
Create an associative array (a hash, in Perl terms) in Perl, Ruby, Python, etc. Make the key the URL and the value whatever you care to collect, including an instance of a class able to hold various kinds of data associated with the URL. Duplicates are easy to detect and count because the key (the URL) will already be present in the array after its first occurrence, so you just increment a counter representing the number of occurrences.
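
A minimal sketch of that idea in Python, assuming the URLs and their days have already been read into (url, day) pairs. The UrlInfo class, the tally function and the sample data are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UrlInfo:
    """Hypothetical record holding whatever is collected for one URL."""
    count: int = 0
    days: list = field(default_factory=list)


def tally(pairs):
    """Build {url: UrlInfo} from an iterable of (url, day) pairs."""
    table = {}
    for url, day in pairs:
        # The first occurrence creates the entry; later ones find it already there.
        info = table.setdefault(url, UrlInfo())
        info.count += 1
        info.days.append(day)
    return table


# Made-up sample data: one URL appears twice.
sample = [
    ("http://example.com/a", "09-01"),
    ("http://example.com/b", "09-02"),
    ("http://example.com/a", "09-15"),
]
table = tally(sample)
for url, info in table.items():
    if info.count > 1:
        print(url, info.count, info.days)
```

Any entry whose count ends up above 1 is a duplicated URL, and the days list shows when each copy turned up.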

Any nontrivial scripting language can do this. It's a simple matter of actually doing it.

Quote:
Originally Posted by Kunsheng
I got a folder containing 200,000 files
I hope you are ready to lose all your data. Never accumulate this many files in a single directory. This is what databases are designed for. In fact, if the URLs were in a database right now, your problem would be solved: you would write a few lines of SQL and be done. That's how I recommend you do it -- you could perform all sorts of ingenious manipulations and data searches in an environment designed for this specific kind of task.
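
To give a rough idea of what "a few lines of SQL" might look like, here is a sketch using SQLite through Python's standard sqlite3 module. The table name urls, its columns and the sample rows are assumptions made up for this example. The day column is stored as YYYY-MM-DD text so that plain string comparison follows date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory for the demo; a real setup would use a file
conn.execute(
    "CREATE TABLE urls (path TEXT PRIMARY KEY, url TEXT NOT NULL, day TEXT NOT NULL)"
)
conn.execute("CREATE INDEX urls_by_url ON urls (url)")

# Made-up rows standing in for the one-URL-per-file data.
conn.executemany(
    "INSERT INTO urls VALUES (?, ?, ?)",
    [
        ("a", "http://example.com/x", "2009-09-01"),
        ("b", "http://example.com/x", "2009-09-15"),
        ("c", "http://example.com/y", "2009-09-15"),
        ("d", "http://example.com/x", "2009-09-20"),
        ("e", "http://example.com/y", "2009-09-20"),
    ],
)

# A row counts as a duplicate if the same URL exists in an earlier row
# (earlier day, or same day with a smaller path as an arbitrary tie-break).
DUPLICATES_PER_DAY = """
SELECT f.day, COUNT(*)
FROM urls AS f
WHERE EXISTS (
    SELECT 1 FROM urls AS e
    WHERE e.url = f.url
      AND (e.day < f.day OR (e.day = f.day AND e.path < f.path))
)
GROUP BY f.day
ORDER BY f.day
"""
rows = conn.execute(DUPLICATES_PER_DAY).fetchall()
for day, dupes in rows:
    print(day, dupes)
```

With the index on url, the EXISTS lookup should not need to scan the whole table for every row, and other questions (most repeated URL, first appearance, and so on) become one more query each.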
 
  

