
peter88 12-06-2006 01:50 AM

Script to find duplicate files within one or more directories
 
Hi, has anyone got a script which does more or less the following please:

I have 2 directories with roughly 1500 photos in each. I know that some of the photos are the same even though they have different timestamps and names. To avoid a laborious visual comparison, I would like to run a script (bash or other) which lists the files that are identical. There may be duplicates with differing names within each dir and/or duplicates between the 2 dirs. In theory one should be able to pass $1 $2 $3... as directories to compare; it need not be limited to one or two directories. Also, while I am talking about image files here, I would like the program to be generic enough to compare any type of file in the given directories, and I would guess that with an md5 hash signature the type of file is immaterial (please tell me if I'm talking nonsense, I won't be offended :) ).

I was thinking a script could do this by creating an md5 (or other) hash checksum of each file in the directories and then comparing the checksums to produce a list of files with the same md5 value, which should therefore be identical. Perhaps someone knows of existing functionality in the unix/linux suite of tools, such as the various shells or awk, perl, php, python etc., that I am not aware of.
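
To illustrate, something along these lines is roughly what I have in mind (an untested sketch; I'm assuming GNU md5sum and uniq here, where uniq -w32 groups lines whose first 32 characters, i.e. the md5 sum, repeat):
Code:

#!/bin/sh
# Rough idea only: checksum every file in the given directories,
# then group the lines whose md5 sums match.
for dir in "$@"; do
        md5sum "$dir"/*
done | sort | uniq -w32 --all-repeated=separate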

If someone knows a program or a script I could run under w32 (WXP say) then that would be useful too, as I can perform the task on either system and then move the files across if necessary. In any case, as I use both environments, it would be useful to know how to do it in both.

Any advice appreciated.


TIA.

psisquare 12-06-2006 07:33 PM

Here's a quick hack with bash and awk:
Code:

md5sum dir1/* dir2/* | sort | awk '{ if(lastmd5==$1) print lastfile, $2; lastmd5=$1; lastfile=$2; }'
If you need something that also runs on Windows, you can do essentially the same with Ruby, Python or Perl - whichever language you know best.

Also, note that there's a slight twist to using hashes for this: If the images were not just copied between the directories, but re-encoded, they can have different md5sums even though they look the same. If that's the case, you could try GQview, which has a "find duplicates" feature (that I've never tested).

psisquare 12-06-2006 08:10 PM

Oh well, I'm trying to learn Ruby anyway and this looks like a nice exercise. :)

Code:

require 'digest/md5'

# Remember the md5 digest of every file in the first directory.
# (Duplicates within one directory aren't detected by this quick version.)
list = {}

Dir.foreach(ARGV[0]) do |file|
  path = File.join(ARGV[0], file)
  next unless File.file? path
  list[Digest::MD5.digest(File.read(path))] = path
end

# Report each file in the second directory whose digest was
# already seen in the first.
Dir.foreach(ARGV[1]) do |file|
  path = File.join(ARGV[1], file)
  next unless File.file? path
  if (otherpath = list[Digest::MD5.digest(File.read(path))])
      puts path + "\t" + otherpath
  end
end
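
Call it with the two directories as arguments, e.g. (assuming you saved it as dup.rb; the name is arbitrary):
Code:

ruby dup.rb dir1 dir2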


frob23 12-06-2006 08:36 PM

Here is a slightly more complicated script, but what you get in exchange for the complexity is that it's much less intensive than the quick hack above. Note: I wrote this on FreeBSD, so things might be a little off. You may want to check that the -ls option to find prints the size in the 7th field and the filename in the 11th, and modify those values if it doesn't.

This script checks all subdirectories below the paths you pass it. It also only checksums files which have a matching size, which will very likely reduce the load tremendously, as you're not hashing every file.

Code:

#!/bin/sh

#md5prog="md5 -r"
md5prog="md5sum"
old=-1
count=0
files=""

# Checksum a group of same-sized files and print the names of those
# whose sums match, one group per line.
docompare() {
        ${md5prog} ${*} | sort | awk 'BEGIN{
        prevsum=0;
        dup=0;}
{if (prevsum==$1) {
        printf "%s ",previous;
        dup=1;}
else {
        if (dup==1) {
                printf "%s\n\n",previous;
        }
        dup=0;}
prevsum=$1;
previous=$2;}
END{if (dup==1) {
        printf "%s\n\n",previous;
}}'
}

# Accumulate runs of equal-sized files from the sorted list; only a
# size seen more than once is worth checksumming.
mainloop() {
        new=${1}
        if [ ${new} -eq ${old} ]; then
                count=`echo "${count}+1" | bc`;
                files="${files} ${2}";
        else
                if [ ${count} -gt 1 ]; then
                        docompare ${files};
                fi
                count=1;
                files="${2}";
                old=${new};
        fi
}

if [ x"${1}" = "x" ]; then
        echo "Usage: `basename ${0}` {path1} [path2 ...]";
        exit 1;
fi

# Note: filenames containing whitespace will not survive this pipeline.
find ${*} -type f -ls | awk '{print $7 " " $11}' | sort -n | {
        while read LINE
        do
                mainloop $LINE
        done
        # Compare the final size group, which the loop alone would miss.
        if [ ${count} -gt 1 ]; then
                docompare ${files};
        fi
}

Edit: A couple of lines above might look odd to more experienced bash programmers; that's because this script is /bin/sh compatible, so I couldn't use some of the neat little tricks you might know about.
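
Edit 2: If your find is GNU find (as it will be on most Linux systems), you can sidestep the field-position guesswork entirely by using -printf instead of -ls. Something like this should feed the loop the same "size filename" lines (untested here, since I'm on FreeBSD):
Code:

find ${*} -type f -printf '%s %p\n' | sort -n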

psisquare 12-07-2006 10:29 AM

Fearing that my clumsy and inefficient hack might give Ruby a bad reputation, I'll give you a more elegant version along the lines of my bash hack:
Code:

#!/usr/bin/ruby
# Hash every file in both directories, sort by digest, and print
# neighbouring entries whose digests match.
require 'digest/md5'; last=[]
Dir[File.join("{#{ARGV[0]},#{ARGV[1]}}","*")].select{|f| File.file?(f)}.
  map{|f| [Digest::MD5.digest(File.read(f)), f]}.sort_by{|p| p[0]}.
  each{|p| puts last[1]+"\t"+p[1] if p[0]==last[0]; last=p}

After reading frob23's post, I also came up with the following one. It's the fastest solution I've tested so far: it also compares sizes first, does a recursive search of an arbitrary number of directories, and adds some error checking:
Code:

#!/usr/bin/ruby
require 'digest/md5'
(puts "Usage: #{$0} {dir1} [dir2 ...]"; exit) if ARGV.empty?
sizes={}; prev=[]; dup=false

# Recursively bucket every file by its size; only files sharing a
# size can possibly be duplicates, so only those get hashed below.
Dir[File.join("{#{ARGV.join(',')}}","**","*")].select{|f| File.file?(f)}.
        each{|f| if sizes[size=File.size(f)] then sizes[size].push f else sizes[size] = [f] end}

# Within each size bucket, sort by digest and join matching
# neighbours into one tab-separated line.
sizes.each do |size,files|
        next if files.length==1
        files.map{|f| [Digest::MD5.digest(File.read(f)), f]}.sort_by{|p| p[0]}.each do |p|
                if p[0]==prev[0]
                        prev[1] += "\t" + p[1]
                        dup = true
                else
                        puts prev[1] if dup
                        dup = false
                        prev = p
                end
        end
end
puts prev[1] if dup  # flush the last group of duplicates


peter88 12-10-2006 02:37 AM

Thanks a lot for your suggestions everyone, I will try them :)

FYI: for those interested in a possible W32 solution, here is a nice small program I found:

FINDDUPE: Duplicate file detector and eliminator:

http://www.sentex.net/~mwandel/finddupe/

Tortanick 12-10-2006 05:17 AM

Quote:

Originally Posted by frob23
Edit: A couple lines above might look odd to the more experienced Bash programmers because this script is /bin/sh compatible. And I couldn't use some of the neat little tricks you might know about.

Just out of curiosity, why not just say #! /bin/bash rather than /bin/sh and then use the tricks?
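
For example (just to illustrate what I mean; untested), bash would have allowed things like:
Code:

# bash-only arithmetic, instead of count=`echo "${count}+1" | bc`:
(( count++ ))

# bash-only string append, instead of files="${files} ${2}":
files+=" ${2}"

# bash-only test, instead of [ x"${1}" = "x" ]:
[[ -z ${1} ]]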

