
peter88 12-06-2006 01:50 AM

Script to find duplicate files within one or more directories
 
Hi, has anyone got a script which does more or less the following please:

I have 2 directories with roughly 1500 photos in each. I know that some of the photos are the same even though they have different timestamps and names. To avoid a laborious visual comparison, I would like to run a script (bash or other) which lists the files that are identical. There may be duplicates with differing names within each dir and/or duplicates between the 2 dirs. In theory one should be able to pass $1 $2 $3... as directories to compare; it need not be limited to one or two directories. Also, while I am talking about image files here, I would like the program to be generic enough to compare any type of file in the given directories, and I would guess that with an md5 hash signature the type of file is immaterial (please tell me if I'm talking nonsense, I won't be offended :) ).

I was thinking a script could do this by creating an md5 (or other) hash checksum of each file in the directories and then comparing the checksums to produce a list of files with the same md5 value, which should therefore be identical. Perhaps someone knows of existing functionality in the unix/linux suite of tools, such as the various shells or awk, perl, php, python etc., that I am not aware of.
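
To illustrate, something along these lines is roughly what I have in mind (an untested sketch; I'm assuming GNU md5sum and uniq here, where uniq -w32 groups lines whose first 32 characters, i.e. the md5 sum, repeat):
Code:

#!/bin/sh
# Rough idea only: checksum every file in the given directories,
# then group the lines whose md5 sums match.
for dir in "$@"; do
        md5sum "$dir"/*
done | sort | uniq -w32 --all-repeated=separate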

If someone knows a program or a script I could run under w32 (WXP say) then that would be useful too, as I can perform the task on either system and then move the files across if necessary. In any case, as I use both environments, it would be useful to know how to do it in both.

Any advice appreciated.


TIA.

psisquare 12-06-2006 07:33 PM

Here's a quick hack with bash and awk:
Code:

md5sum dir1/* dir2/* | sort | awk '{ if(lastmd5==$1) print lastfile, $2; lastmd5=$1; lastfile=$2; }'
If you need something that also runs on Windows, you can do essentially the same with Ruby, Python or Perl - whichever language you know best.

Also, note that there's a slight twist to using hashes for this: If the images were not just copied between the directories, but re-encoded, they can have different md5sums even though they look the same. If that's the case, you could try GQview, which has a "find duplicates" feature (that I've never tested).

psisquare 12-06-2006 08:10 PM

Oh well, I'm trying to learn Ruby anyway and this looks like a nice exercise. :)

Code:

require 'digest/md5'

# Remember the md5 digest of every file in the first directory.
# (Duplicates within one directory aren't detected by this quick version.)
list = {}

Dir.foreach(ARGV[0]) do |file|
  path = File.join(ARGV[0], file)
  next unless File.file? path
  list[Digest::MD5.digest(File.read(path))] = path
end

# Report each file in the second directory whose digest was
# already seen in the first.
Dir.foreach(ARGV[1]) do |file|
  path = File.join(ARGV[1], file)
  next unless File.file? path
  if (otherpath = list[Digest::MD5.digest(File.read(path))])
      puts path + "\t" + otherpath
  end
end
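
Call it with the two directories as arguments, e.g. (assuming you saved it as dup.rb; the name is arbitrary):
Code:

ruby dup.rb dir1 dir2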


frob23 12-06-2006 08:36 PM

Here is a slightly more complicated script, but what you get in exchange for the complexity is that it's much less intensive than the quick hack above. Note: I wrote this on FreeBSD, so things might be a little off. You may want to check that the -ls option to find prints the size in the 7th field and the filename in the 11th, and modify those values if it doesn't.

This script checks all subdirectories below the paths you pass it. It also only checksums files which have a matching size, which will very likely reduce the load tremendously, as you're not hashing every file.

Code:

#!/bin/sh

#md5prog="md5 -r"
md5prog="md5sum"
old=-1
count=0
files=""

# Checksum a group of same-sized files and print the names of those
# whose sums match, one group per line.
docompare() {
        ${md5prog} ${*} | sort | awk 'BEGIN{
        prevsum=0;
        dup=0;}
{if (prevsum==$1) {
        printf "%s ",previous;
        dup=1;}
else {
        if (dup==1) {
                printf "%s\n\n",previous;
        }
        dup=0;}
prevsum=$1;
previous=$2;}
END{if (dup==1) {
        printf "%s\n\n",previous;
}}'
}

# Accumulate runs of equal-sized files from the sorted list; only a
# size seen more than once is worth checksumming.
mainloop() {
        new=${1}
        if [ ${new} -eq ${old} ]; then
                count=`echo "${count}+1" | bc`;
                files="${files} ${2}";
        else
                if [ ${count} -gt 1 ]; then
                        docompare ${files};
                fi
                count=1;
                files="${2}";
                old=${new};
        fi
}

if [ x"${1}" = "x" ]; then
        echo "Usage: `basename ${0}` {path1} [path2 ...]";
        exit 1;
fi

# Note: filenames containing whitespace will not survive this pipeline.
find ${*} -type f -ls | awk '{print $7 " " $11}' | sort -n | {
        while read LINE
        do
                mainloop $LINE
        done
        # Compare the final size group, which the loop alone would miss.
        if [ ${count} -gt 1 ]; then
                docompare ${files};
        fi
}

Edit: A couple of lines above might look odd to more experienced bash programmers; that's because this script is /bin/sh compatible, so I couldn't use some of the neat little tricks you might know about.
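
Edit 2: If your find is GNU find (as it will be on most Linux systems), you can sidestep the field-position guesswork entirely by using -printf instead of -ls. Something like this should feed the loop the same "size filename" lines (untested here, since I'm on FreeBSD):
Code:

find ${*} -type f -printf '%s %p\n' | sort -n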

psisquare 12-07-2006 10:29 AM

Fearing that my clumsy and inefficient hack might give Ruby a bad reputation, I'll give you a more elegant version along the lines of my bash hack:
Code:

#!/usr/bin/ruby
# Hash every file in both directories, sort by digest, and print
# neighbouring entries whose digests match.
require 'digest/md5'; last=[]
Dir[File.join("{#{ARGV[0]},#{ARGV[1]}}","*")].select{|f| File.file?(f)}.
  map{|f| [Digest::MD5.digest(File.read(f)), f]}.sort_by{|p| p[0]}.
  each{|p| puts last[1]+"\t"+p[1] if p[0]==last[0]; last=p}

After reading frob23's post, I also came up with the following one. It's the fastest solution I've tested so far: it also compares sizes first, does a recursive search of an arbitrary number of directories, and adds some error checking:
Code:

#!/usr/bin/ruby
require 'digest/md5'
(puts "Usage: #{$0} {dir1} [dir2 ...]"; exit) if ARGV.empty?
sizes={}; prev=[]; dup=false

# Recursively bucket every file by its size; only files sharing a
# size can possibly be duplicates, so only those get hashed below.
Dir[File.join("{#{ARGV.join(',')}}","**","*")].select{|f| File.file?(f)}.
        each{|f| if sizes[size=File.size(f)] then sizes[size].push f else sizes[size] = [f] end}

# Within each size bucket, sort by digest and join matching
# neighbours into one tab-separated line.
sizes.each do |size,files|
        next if files.length==1
        files.map{|f| [Digest::MD5.digest(File.read(f)), f]}.sort_by{|p| p[0]}.each do |p|
                if p[0]==prev[0]
                        prev[1] += "\t" + p[1]
                        dup = true
                else
                        puts prev[1] if dup
                        dup = false
                        prev = p
                end
        end
end
puts prev[1] if dup  # flush the last group of duplicates


peter88 12-10-2006 02:37 AM

Thanks a lot for your suggestions everyone, I will try them :)

FYI: for those interested in a possible W32 solution, here is a nice small program I found:

FINDDUPE: Duplicate file detector and eliminator:

http://www.sentex.net/~mwandel/finddupe/

Tortanick 12-10-2006 05:17 AM

Quote:

Originally Posted by frob23
Edit: A couple lines above might look odd to the more experienced Bash programmers because this script is /bin/sh compatible. And I couldn't use some of the neat little tricks you might know about.

Just out of curiosity, why not just say #! /bin/bash rather than /bin/sh and then use the tricks?
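
For example (just to illustrate what I mean; untested), bash would have allowed things like:
Code:

# bash-only arithmetic, instead of count=`echo "${count}+1" | bc`:
(( count++ ))

# bash-only string append, instead of files="${files} ${2}":
files+=" ${2}"

# bash-only test, instead of [ x"${1}" = "x" ]:
[[ -z ${1} ]]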

