LinuxQuestions.org
Old 12-06-2006, 01:50 AM   #1
peter88
LQ Newbie
 
Registered: Nov 2006
Posts: 20

Rep: Reputation: 0
Script to find duplicate files within one or more directories


Hi, has anyone got a script which does more or less the following please:

I have 2 directories with about 1500 photos in each. I know that some of the photos are the same even though they have different timestamps and names. To avoid laboriously comparing them by eye, I would like to run a script (bash or other) which lists the files that are identical. There may be duplicates with differing names within each dir, between the 2 dirs, or both. In theory one should be able to pass $1 $2 $3... as directories to compare; it need not be limited to one or two directories. Also, although I am talking about image files here, I would like the program to be generic enough to compare any type of file in the directories, and I would guess that with an MD5 hash signature the type of file is immaterial (please tell me if I'm talking nonsense, I won't be offended).

I was thinking a script could do this by computing an MD5 or other hash checksum for each file in the directories and then comparing the checksums to list the files which have the same value and hence should be identical. Perhaps someone knows of existing functions in the unix/linux suite of tools, such as the various shells or awk, perl, php, python etc., which I am not aware of.

If someone knows a program or a script I could run under w32 (WXP, say) then that would be useful too, as I can perform the task on either system and then move the files across if necessary. In any case, as I use both environments it would be useful to know how to do it in both.

Any advice appreciated.


TIA.

Last edited by peter88; 12-06-2006 at 01:53 AM.
 
Old 12-06-2006, 07:33 PM   #2
psisquare
Member
 
Registered: Sep 2004
Location: Germany
Distribution: Gentoo
Posts: 164

Rep: Reputation: 31
Here's a quick hack with bash and awk:
Code:
md5sum dir1/* dir2/* | sort | awk '{ file=substr($0,35); if(lastmd5==$1) print lastfile "\t" file; lastmd5=$1; lastfile=file; }'
If you need something that also runs on Windows, you can do essentially the same with Ruby, Python or Perl - whichever language you know best.
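A minimal Ruby sketch of the same idea, assuming any number of directories are passed on the command line and that only their top level is scanned. The helper name `duplicate_groups` is made up for illustration. It groups files by digest instead of sorting them, so three identical files come out as one group rather than two pairs:

```ruby
require 'digest/md5'

# Hypothetical helper: returns arrays of paths whose contents share an MD5 digest.
# Only regular files directly inside each directory are considered.
def duplicate_groups(dirs)
  paths = dirs.flat_map { |dir| Dir.children(dir).map { |name| File.join(dir, name) } }
  files = paths.select { |path| File.file?(path) }

  files.group_by { |path| Digest::MD5.file(path).hexdigest }
       .values
       .select { |group| group.length > 1 }
end

# Prints one tab-separated line per group; prints nothing without arguments.
if __FILE__ == $PROGRAM_NAME
  duplicate_groups(ARGV).each { |group| puts group.join("\t") }
end
```

`Digest::MD5.file` reads the file in binary mode, so the same script should give the same answers on Windows.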

Also, note that there's a slight twist to using hashes for this: If the images were not just copied between the directories, but re-encoded, they can have different md5sums even though they look the same. If that's the case, you could try GQview, which has a "find duplicates" feature (that I've never tested).
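To illustrate the point about re-encoding, here is a small sketch with made-up byte strings standing in for image files (`same_digest?` is a hypothetical helper). A plain copy has the same digest, while any change in the bytes gives a different one, even if the picture looks identical:

```ruby
require 'digest/md5'

# Hypothetical helper: true when two byte strings have the same MD5 digest.
def same_digest?(a, b)
  Digest::MD5.hexdigest(a) == Digest::MD5.hexdigest(b)
end

if __FILE__ == $PROGRAM_NAME
  copy      = "stand-in image bytes"
  reencoded = "stand-in image bytes, re-encoded"

  puts same_digest?(copy, copy.dup)   # a plain copy has the same bytes
  puts same_digest?(copy, reencoded)  # different bytes, so the digests differ
end
```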
 
Old 12-06-2006, 08:10 PM   #3
psisquare
Member
 
Registered: Sep 2004
Location: Germany
Distribution: Gentoo
Posts: 164

Rep: Reputation: 31
Oh well, I'm trying to learn Ruby anyway and this looks like a nice exercise.

Code:
require 'digest/md5'

# Map each MD5 digest found in the first directory to the path that produced it.
list = {}

Dir.foreach(ARGV[0]) do |file|
   path = File.join(ARGV[0], file)
   next unless File.file? path
   list[Digest::MD5.file(path).digest] = path
end

# Report every file in the second directory whose digest was seen in the first.
Dir.foreach(ARGV[1]) do |file|
   path = File.join(ARGV[1], file)
   next unless File.file? path
   if otherpath = list[Digest::MD5.file(path).digest]
      puts path + "\t" + otherpath
   end
end
 
Old 12-06-2006, 08:36 PM   #4
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: Ubuntu, FreeBSD, NetBSD
Posts: 1,449

Rep: Reputation: 47
Here is a slightly more complicated script. In exchange for the complexity, it is much lighter on the machine than the quick hack above. Note: I wrote this on FreeBSD so things might be a little bit off. You may want to check that the -ls option to find returns the size in the 7th field and the filename in the 11th. Modify those values if it doesn't.

This program checks all subdirectories below the points you request. It also only sums the files which have matching sizes (which will very likely reduce the load tremendously as you're not hashing every file).

Code:
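#!/bin/sh
# Lists groups of identical files below the given paths.
# Only files whose sizes match are hashed.
# Filenames containing whitespace are not handled.

#md5prog="md5 -r"     # FreeBSD
md5prog="md5sum"      # Linux
old=-1
count=0
files=""

# Hash the given files and print each run of equal checksums on one line.
docompare() {
        ${md5prog} ${*} | sort | awk 'BEGIN{
        prevsum=0;
        dup=0;}
{if (prevsum==$1) {
        printf "%s ",previous;
        dup=1;}
else {
        if (dup==1) {
                printf "%s\n\n",previous;
        }
        dup=0;}
prevsum=$1;
previous=$2;}
END{if (dup==1) printf "%s\n\n",previous;}'
}

# Collect consecutive files of equal size; hash the group when the size changes.
mainloop() {
        new=${1}
        if [ "${new}" -eq "${old}" ]; then
                count=`echo "${count}+1" | bc`;
                files="${files} ${2}";
        else
                if [ ${count} -gt 1 ]; then
                        docompare ${files};
                fi
                # Start a new group with the current file as its first member.
                count=1;
                files="${2}";
                old=${new};
        fi
}

if [ x"${1}" = "x" ]; then
        echo "Usage: `basename ${0}` {path1} [path2 ...]";
        exit 1;
fi

# The braces keep the loop and the final flush in the same subshell.
find "$@" -type f -ls | awk '{print $7 " " $11}' | sort -n | {
        while read LINE
        do
                mainloop $LINE
        done
        # Flush the last size group, which no size change follows.
        if [ ${count} -gt 1 ]; then
                docompare ${files};
        fi
}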
Edit: A couple lines above might look odd to the more experienced Bash programmers because this script is /bin/sh compatible. And I couldn't use some of the neat little tricks you might know about.
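For comparison, the size-matching step can be done without parsing `find -ls` columns at all. The following is a rough Ruby sketch, assuming the standard `find` library; `size_candidates` is a made-up helper name. It walks the given paths recursively and keeps only the sizes shared by at least two files, which are the only files worth hashing:

```ruby
require 'find'

# Hypothetical helper: groups regular files below `roots` by size and keeps
# only the sizes shared by at least two files.
def size_candidates(roots)
  by_size = Hash.new { |hash, size| hash[size] = [] }
  roots.each do |root|
    Find.find(root) do |path|
      by_size[File.size(path)] << path if File.file?(path)
    end
  end
  by_size.select { |_size, paths| paths.length > 1 }
end

# Prints "size<TAB>path<TAB>path..." per candidate group, smallest size first.
if __FILE__ == $PROGRAM_NAME
  size_candidates(ARGV).sort.each do |size, paths|
    puts "#{size}\t#{paths.join("\t")}"
  end
end
```

Because the paths never pass through a whitespace-splitting step, filenames with spaces are not a problem here.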

Last edited by frob23; 12-06-2006 at 08:42 PM.
 
Old 12-07-2006, 10:29 AM   #5
psisquare
Member
 
Registered: Sep 2004
Location: Germany
Distribution: Gentoo
Posts: 164

Rep: Reputation: 31
Fearing that my clumsy and inefficient hack might give Ruby a bad reputation, I'll give you a more elegant version along the lines of my bash hack:
Code:
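#!/usr/bin/ruby
require 'digest/md5'; last=[]
# Hash every regular file in the two directories, sort by digest, print adjacent matches.
Dir[File.join("{#{ARGV[0]},#{ARGV[1]}}","*")].select{|f| File.file?(f)}.
   map{|f| [Digest::MD5.file(f).digest, f]}.sort_by{|p| p[0]}.
   each{|p| puts last[1]+"\t"+p[1] if p[0]==last[0]; last=p}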
After reading frob23's post, I also came up with the following one. It's the fastest solution I've tested so far: it also compares sizes first, searches an arbitrary number of directories recursively, and adds some error checking:
Code:
#!/usr/bin/ruby
require 'digest/md5'
(puts "Usage: #{$0} {dir1} [dir2 ...]"; exit) if ARGV.empty?
sizes={}; prev=[]; dup=false

# Group every regular file below the given directories by size.
Dir[File.join("{#{ARGV.join(',')}}","**","*")].select{|f| File.file?(f)}.
	each{|f| (sizes[File.size(f)] ||= []).push f}

# Hash only files that share a size; print each run of equal digests on one line.
sizes.each do |size,files|
	next if files.length==1
	files.map{|f| [Digest::MD5.file(f).digest, f]}.sort_by{|p| p[0]}.each do |p|
		if p[0]==prev[0]
			prev[1] += "\t" + p[1]
			dup = true
		else
			puts prev[1] if dup
			dup = false
			prev = p
		end
	end
end
puts prev[1] if dup   # flush the last run, which the loop itself never prints
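Equal MD5 sums mean the files should be identical, not that they are guaranteed to be. If you plan to delete anything, a byte-for-byte check of each reported group is cheap insurance. A small sketch using `FileUtils.compare_file` from Ruby's standard library; the helper name `confirmed_duplicates` is made up:

```ruby
require 'fileutils'

# Hypothetical helper: keeps the first path plus every other path in `group`
# whose bytes are identical to the first path's bytes.
def confirmed_duplicates(group)
  return [] if group.empty?
  first, *rest = group
  [first] + rest.select { |path| FileUtils.compare_file(first, path) }
end

# Treats the command-line arguments as one reported group of paths.
if __FILE__ == $PROGRAM_NAME
  puts confirmed_duplicates(ARGV).join("\t") unless ARGV.empty?
end
```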
 
Old 12-10-2006, 02:37 AM   #6
peter88
LQ Newbie
 
Registered: Nov 2006
Posts: 20

Original Poster
Rep: Reputation: 0
Thanks a lot for your suggestions, everyone. I will try them.

FYI: for those interested in a possible W32 solution, here is a nice small program I found:

FINDDUPE: Duplicate file detector and eliminator:

http://www.sentex.net/~mwandel/finddupe/

Last edited by peter88; 12-10-2006 at 02:41 AM.
 
Old 12-10-2006, 05:17 AM   #7
Tortanick
Member
 
Registered: Jul 2006
Distribution: Debian Testing
Posts: 299

Rep: Reputation: 30
Quote:
Originally Posted by frob23
Edit: A couple lines above might look odd to the more experienced Bash programmers because this script is /bin/sh compatible. And I couldn't use some of the neat little tricks you might know about.
Just out of curiosity, why not just say #! /bin/bash rather than /bin/sh and then use the tricks?
 
  




