Need a certain program to find true duplicate files
First, please let me apologize because I know this question does not really belong in this forum. But quite honestly, I didn't know WHERE to post it, so I'm posting it here, in the sincere hopes that someone will tell me where to go.
OK, I need a program. I am a complete newbie to Linux, and I think it's very possible that the program I need is either 1) already available, or 2) fairly easily accomplished within a Bash (or Perl?) script. But like I said, being a total newbie, I'm really at a complete loss as to where to even begin.
So let me describe what I need, and I ask the good people here to please tell me where I can go to 1) either correctly ask this question, or 2) do the research to find/create a solution myself.
In simplest terms, I need a program that will compare two folders (recursively, if requested by a user-supplied switch) and output either all files that have identical hash values, or all files for which there is no identical hash match. (Which list is output would be determined by a user-supplied switch.)
Several months ago I did a very quick look around (among Windows programs) and found any number of utilities that will do this in a superficial manner based on the filename/date/size/type. But this will not meet my needs. Further, most of the utilities I found require that the matching files be in the same relative path within the compared folders. This means that if two files are actually identical, but reside in different paths within the two folders, these utilities will not present them to the user as being identical.
In my mind I can see the bare/rough outline of how this could be written with a simple script (BASH? PERL?). But it's been 18+ years since I've done any programming (using AREXX on my old Amiga!), and so I wouldn't even know where to begin to figure out how to write this myself. And since this basic idea seems like such a universally needed tool, I'd be surprised if there were not something already written that did this. Sadly though, if a utility like this already exists, I haven't a clue where to find it.
Again, I *KNOW* this forum is *NOT* the correct place to ask this question, and for those of you who are peeved by my posting here, I give you my sincere apology. I just hope that someone will read what I need and be kind enough to point me in the correct direction so that I can either ask the question in the appropriate location, or point me in a solid direction to find a viable solution for my needs.
(If anyone cares, the need for this utility is due to the fact that, at various times over the years, I have copied large collections of files (photos, mp3's, and others) to multiple computers. Over time I moved some of these files around, renamed others, and so each computer became organized differently from each other computer. Now I want to consolidate all these files into one central repository. Unfortunately, this means I have many files that are actually identical, but which may have different names/dates, and/or reside in different relative paths. Now I need a way to list out those files which really are identical, so I can decide which ones to keep/move, and which to delete. Additionally, I need to find all files which are unique and not duplicated so that I can move those where I deem appropriate.)
Thanks so much for your patience with my loooong post, and for pointing me in the correct direction.
Well, that's definitely a programming question, so I wouldn't worry about using this forum.
Sounds like you want to compare every file with every other file and list either matches or non-matches.
In other words:
1. Create a list of all files from some common root. You can arrange this before you start, e.g.
/home/me/files1
/home/me/files2
etc.
2. Then, for each file in the master list, either
2a. list matches, OR
2b. list non-matches.
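Those two steps can be sketched naively in shell (assuming bash): build a master list, then compare every pair byte-for-byte with cmp. This is only a sketch for small trees, since the pairwise comparisons get slow fast, and it uses made-up /tmp paths for illustration:

```shell
# Set up a small demo tree (hypothetical paths, illustration only)
rm -rf /tmp/cmpdemo
mkdir -p /tmp/cmpdemo/files1 /tmp/cmpdemo/files2
echo hello > /tmp/cmpdemo/files1/a.txt
echo hello > /tmp/cmpdemo/files2/b.txt
echo world > /tmp/cmpdemo/files2/c.txt

# 1. master list of all files under the common root
find /tmp/cmpdemo -type f | sort > /tmp/master.list

# 2. compare every pair; the \< ordering test prints each match only once
while read -r a; do
    while read -r b; do
        if [ "$a" \< "$b" ] && cmp -s "$a" "$b"; then
            echo "match: $a == $b"
        fi
    done < /tmp/master.list
done < /tmp/master.list
```

For the demo tree above, only the a.txt/b.txt pair is reported. The checksum-based approaches below avoid the pairwise comparisons entirely.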
Someone has done the bulk of the work for you with a utility called md5deep. One of its features is a “recursive” md5. The following pipeline gives you a wealth of information:
Code:
md5deep -r folder1 folder2 folder3 | sort
If you have GNU uniq, you can filter this further depending on whether you want to find the unique files or the repeated ones, comparing only the first 32 characters of each line (i.e. the checksum itself).
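For example, a minimal sketch using plain md5sum (whose output format, 32 hex digits then the filename, matches md5deep's) on a throwaway test tree. The -w32 option tells GNU uniq to compare only the checksum, -D prints every line of each repeated group, and -u prints only unrepeated lines:

```shell
# Throwaway demo tree (hypothetical paths, illustration only)
rm -rf /tmp/dupdemo
mkdir -p /tmp/dupdemo/a /tmp/dupdemo/b
echo "same content" > /tmp/dupdemo/a/one.txt
echo "same content" > /tmp/dupdemo/b/renamed.txt
echo "unique stuff" > /tmp/dupdemo/a/only_here.txt

# All files whose checksums repeat, i.e. the true duplicates
find /tmp/dupdemo -type f -exec md5sum {} + | sort | uniq -w32 -D

# Only files whose checksum appears exactly once, i.e. the unique files
find /tmp/dupdemo -type f -exec md5sum {} + | sort | uniq -w32 -u
```

With md5deep installed, the find/md5sum stage collapses to md5deep -r /tmp/dupdemo.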
Here's one I knocked up for myself. I've used it OK for housekeeping old disks
and partitions over the years.
Beware: zero-byte files are included (put ! -size 0 in the 'find' to fix that).
Use it like this,
with absolute pathnames; put the master dir first if you want to remove duplicates in the subsequent ones.
Code:
find-dups.pl dir1 ... > dupes.sh # creates a shell script to rm dupes
sh dupes.sh > recover.sh # does the rm and create a script to restore
Code:
#!/usr/bin/perl
# use Env;
# use lib "$HOME/Perl5Lib";
# use hashdump;
$remove = "rm -f";
$options = "";    # extra 'find' options, e.g. "! -size 0" to skip empty files
die "need absolute pathnames\n" unless @ARGV;
die "need *absolute* pathnames: @ARGV\n" if grep !m|^/|, @ARGV;
open IN, "find @ARGV $options -type f -exec md5sum \{} \\; |";

# read "checksum  filename" lines and group the filenames by checksum
sub do_find {
    my $IN = shift;
    while (<$IN>) {
        ($sum, $file) = split " ", $_, 2;
        chomp $file;
        push @{$H{$sum}}, "\Q$file";    # \Q backslash-escapes shell metacharacters
    }
}

# print 'echo cp ...' lines; when dupes.sh runs, these emit the restore script
sub recover {
    @dupe = @_;
    $saved = shift @dupe;
    print "echo cp \Q$saved\E \Q$_\E\n" foreach @dupe;
}

do_find \*IN;
# list separator: every element after the first prints on a new line,
# prefixed with the rm command, so only the first copy survives
$" = "\n$remove\t";
print "\n";
while (($k, $v) = each %H) {
    next unless @$v > 1;    # keep only checksums shared by several files
    @list = @$v;
    print "\n# duplicate:$k\n#\t@list\n";
    recover @list;
}
This program works fine for the same sort of thing.
All the funny quotes you see allow for all sorts of funny
filenames, including spaces and brackets etc.
The advantage of doing it like this, i.e. making a script
to do the work, is that you can give it a good check first.
I also added the get-out clause, so you can easily test it
and restore.
Edit: on SourceForge there is an MP3 compare utility, which I assume checks the music itself, i.e. ignoring tags.
Last edited by bigearsbilly; 06-02-2008 at 06:35 PM.
All three of you gave me excellent ideas and pointers on what direction to go next! Thank you SO MUCH!!!
Chris, your summation of what I want is dead on, but I was a little worried because it's been so many, many years since I've done any coding, and until now I've never even taken a first look at Perl. But your links give me an excellent place to start, and I'll be able to use them as I start doing more and more coding for the misc. small things I want/need to do.
Osor, your suggestion of md5deep was a godsend. After I tried it out, I saw immediately how it could do the bulk of what I need done. At the same time though, I was a bit confused why you indicated that my ignorant attempts at writing a Perl script would result in faster runtime. Guess I'll just have to try both ways to really see why you say that.
bigearsbilly, I'm a bit short on time tonight, so I can't really study what you've written. I've spent the last several minutes trying to grok it, but clearly I'm going to need to actually sit down and read more about Perl syntax and commands to make real sense of it. On the other hand, I see your code as an ideal example to study as I work towards my goal of learning more about Perl.
Check out fdupes as well. It goes a step beyond what you describe in that it does the md5sum check and, if the sums match, a byte-for-byte comparison. Very handy utility.
I think I used fdupes before myself,
but I needed mine to work standalone anywhere (BSD, Solaris, Linux) without bothering with all that CPAN and package nonsense, so I kept it pretty standard.
(I had an old disk on my Solaris box to clean by booting with a Puppy live CD.)
On2ndThought:
It's quite simple (though I am using arcane Perl references, which take a while to get, like C pointers do).
Basically it's this:
Go through every file and take its md5sum.
Use that as a key in the hash (associative array), with the filenames as the data, stored as a list.
Any file with a duplicate sum is appended to the list.
Any list with more than one member must therefore contain duplicates.
(Well, almost: the chances of two different files sharing an md5sum are vanishingly small.)
Simple.
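That group-by-checksum idea can also be sketched without Perl, in a few lines of shell and awk. This is a rough equivalent of the script above, not the script itself; substr($0, 35) skips the 32-character checksum plus the two separator spaces in md5sum's output:

```shell
# Demo tree with one duplicated file (hypothetical paths)
rm -rf /tmp/hashdemo
mkdir -p /tmp/hashdemo
echo "abc" > /tmp/hashdemo/x
echo "abc" > /tmp/hashdemo/y
echo "zzz" > /tmp/hashdemo/z

find /tmp/hashdemo -type f -exec md5sum {} + |
awk '{
    sum  = $1                        # the checksum is the hash key
    file = substr($0, 35)            # the rest of the line is the filename
    files[sum] = files[sum] "\n  " file
    count[sum]++
}
END {
    for (sum in count)
        if (count[sum] > 1)          # lists with more than one member: duplicates
            print "duplicate " sum files[sum]
}'
```

For the demo tree, this prints one "duplicate" group containing x and y, while z is left out.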
The silly \Q stuff makes sure anything that is not a letter, digit or
underscore is quoted; it becomes unreadable but very safe.
Perl is a bit of a pain in the arse because you need to jump through hoops
to get hashes of lists, but there you go.
I quite enjoy it now, though it takes a while to 'get' it.
There's a program called 'hardlink(s?)' which may do some of what you want, or more. If I understand correctly, it is meant to find duplicate files, remove all copies but one, and then create hardlinks to that single copy for the others. It may also have a listing functionality, which would be helpful.
The only problem with a hardlinks approach would be that you can't
create them across different filesystems, i.e. you couldn't
check different disks or even separate partitions.
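For reference, a hard link is just an extra directory entry for the same inode, which is exactly why it can't span filesystems. A quick illustration (GNU stat assumed, file names made up):

```shell
rm -f /tmp/hl_original /tmp/hl_link
echo "some data" > /tmp/hl_original
ln /tmp/hl_original /tmp/hl_link    # hard link: no second copy of the data

# Both names report the same inode number
stat -c %i /tmp/hl_original
stat -c %i /tmp/hl_link

# Removing one name leaves the data reachable through the other
rm /tmp/hl_original
cat /tmp/hl_link
```

Attempting the same ln across two mount points fails with "Invalid cross-device link", which is the limitation described above.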
Quote:
I was a bit confused why you indicated that my ignorant attempts at writing a PERL script would result in faster runtime.
What I meant was that if done right, a Perl solution should be faster because hashes can be used as containers rather than implementing a full “sort” operation (which tends to be expensive).
Btw, I had never heard of fdupes—it seems like the optimal solution.
It's very useful - I used it to sort out a bit of a mess I'd created in quite a large photo library where I'd kept pre-culling backups, etc, but had no idea what was actually unique.