Script to find duplicate files within one or more directories
Hi, has anyone got a script which does more or less the following please:
I have 2 directories with roughly 1500 photos in each. I know that some of the photos are the same even though they have different timestamps and names. To avoid a laborious visual comparison, what I would like to do is run a script (bash or other) which lists the files that are identical. There may be duplicates with differing names within each directory, and/or duplicates between the two directories. Actually, in theory one should be able to pass $1 $2 $3... as directories to compare; it need not be limited to one or two directories.
Also, while I am talking about image files here, I would like the program to be generic enough to compare any type of file in the directories being compared, and I would guess that if it uses an md5 hash signature then the type of file is immaterial (please tell me if I'm talking nonsense, I won't be offended :) ). I was thinking a script could do this by creating an md5 (or other) hash checksum of the files in the directories and then comparing each file against the stored checksums, to produce a list of files which have the same md5 value and hence should be identical. Perhaps someone knows of existing tools within the unix/linux suite, such as the various shells or awk, perl, php, python etc., which I am not aware of.
If someone knows a program or a script I could run under w32 (WXP, say) then that would be useful too, as I can perform the task on either system and then move the files across if necessary. In any case, as I use both environments, it would be useful to know how to do it in both. Any advice appreciated. TIA. |
Here's a quick hack with bash and awk:
Code:
md5sum dir1/* dir2/* | sort | awk '{ if (lastmd5 == $1) print lastfile, $2; lastmd5 = $1; lastfile = $2 }'
Also, note that there's a slight twist to using hashes for this: if the images were not just copied between the directories but re-encoded, they can have different md5sums even though they look the same. If that's the case, you could try GQview, which has a "find duplicates" feature (that I've never tested). |
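Since the original question asked for an arbitrary number of directories searched recursively, the one-liner above can be wrapped in a small function. This is just a sketch of that idea (the name `finddupes` is my own, and it assumes filenames without whitespace, since the md5sum output is split on spaces):

```shell
#!/bin/sh
# finddupes: list files whose MD5 checksums collide, across any number
# of directories, searched recursively. Usage: finddupes dir1 [dir2 ...]
finddupes() {
    find "$@" -type f -exec md5sum {} + | sort |
        awk '{ if (lastmd5 == $1) print lastfile, $2
               lastmd5 = $1; lastfile = $2 }'
}
```

Each output line names a pair of files with the same checksum; a group of three identical files produces two lines.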
Oh well, I'm trying to learn Ruby anyway, and this looks like a nice exercise. :)
Code:
require 'digest/md5' |
Here is a slightly more complicated script, but what it gives you in exchange for the complexity is that it's less resource-intensive than the quick hack above. Note: I wrote this on FreeBSD, so things might be a little off. You may want to check that the -ls option to find returns the size in the 7th field and the filename in the 11th, and modify those values if it doesn't.
This program checks all subdirectories below the points you request. It also only checksums the files which have matching sizes (which will very likely reduce the load tremendously, as you're not hashing every file). Code:
#!/bin/sh |
Fearing that my clumsy and inefficient hack might give Ruby a bad reputation, I'll give you a more elegant version along the lines of my bash hack:
Code:
#!/usr/bin/ruby |
Thanks a lot for your suggestions everyone, I will try them :)
FYI: for those interested in a possible W32 solution, here is a nice small program I found: FINDDUPE: Duplicate file detector and eliminator: http://www.sentex.net/~mwandel/finddupe/ |