Old 05-31-2008, 07:23 PM   #1
On2ndThought
LQ Newbie
 
Registered: Apr 2007
Posts: 13

Rep: Reputation: 0
Need a certain program to find true duplicate files


First, please let me apologize because I know this question does not really belong in this forum. But quite honestly, I didn't know WHERE to post it, so I'm posting it here, in the sincere hopes that someone will tell me where to go.

OK, I need a program. I am a complete newbie to Linux, and I think it's very possible that the program I need is either 1) already available, or 2) fairly easily accomplished within a Bash (or Perl?) script. But like I said, being a total newbie, I'm really at a complete loss as to where to even begin.

So let me describe what I need, and I ask the good people here to please tell me where I can go to 1) either correctly ask this question, or 2) do the research to find/create a solution myself.

In simplest terms, I need a program that will compare two folders (recursively, if requested by a user-supplied switch), and output either all files whose hashes match, or all files for which there is no matching hash. (Which list is output would be determined by another user-supplied switch.)

Several months ago I did a very quick look around (among Windows programs) and found any number of utilities that will do this in a superficial manner based on the filename/date/size/type. But this will not meet my needs. Further, most of the utilities I found require that the matching files be in the same relative path within the compared folders. This means that if two files are actually identical, but reside in different paths within the two folders, these utilities will not present them to the user as being identical.

In my mind I can see the bare/rough outline of how this could be written with a simple script (Bash? Perl?). But it's been 18+ years since I've done any programming (using ARexx on my old Amiga!), and so I wouldn't even know where to begin to figure out how to write this myself. And since this basic idea seems like such a universally needed tool, I'd be surprised if there were not something already written that did this. Sadly though, if a utility like this already exists, I haven't a clue where to find it.

Again, I *KNOW* this forum is *NOT* the correct place to ask this question, and for those of you who are peeved by my posting here, I give you my sincere apology. I just hope that someone will read what I need and be kind enough to point me in the correct direction, so that I can either ask the question in the appropriate location or find a viable solution for my needs.

(If anyone cares, the need for this utility is due to the fact that, at various times over the years, I have copied large collections of files (photos, mp3's, and others) to multiple computers. Over time I moved some of these files around, renamed others, and so each computer became organized differently from each other computer. Now I want to consolidate all these files into one central repository. Unfortunately, this means I have many files that are actually identical, but which may have different names/dates, and/or reside in different relative paths. Now I need a way to list out those files which really are identical, so I can decide which ones to keep/move, and which to delete. Additionally, I need to find all files which are unique and not duplicated so that I can move those where I deem appropriate.)

Thanks so much for your patience with my loooong post, and for pointing me in the correct direction.
 
Old 05-31-2008, 07:40 PM   #2
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
Well, that's definitely a programming question, so I wouldn't worry about using this forum.


Sounds like you want to compare every file with every other file and list either matches or non-matches.
In other words:
1. Create a list of all files from some common root. You can arrange this before you start, e.g.
/home/me/files1
/home/me/files2
etc.
2. Then, for each file in the master list, either
2a. list matches
OR
2b. list non-matches.

2a makes more sense if you think about it.

I'd definitely use Perl (http://perldoc.perl.org/5.8.8/) and md5sum. See this module: http://search.cpan.org/~gaas/Digest-1.15/Digest.pm
For the recursion see this tutorial: http://perldoc.perl.org/5.8.8/perlopentut.html and look at opendir() / File::Find.
The opendir() technique is more obvious, even though the page recommends the File::Find module/method as possibly a better solution. YMMV
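Something along these lines would do as a starting point (just a rough, untested sketch; it uses the core Digest::MD5 and File::Find modules, and the directories are whatever you pass on the command line):
Code:
#!/usr/bin/perl
# Rough sketch: group files by MD5 digest, then list the groups that match.
use strict;
use warnings;
use File::Find;
use Digest::MD5;

die "usage: $0 dir [dir ...]\n" unless @ARGV;

my %by_sum;    # digest => list of files that hash to it

find(sub {
    return unless -f $_;                    # regular files only
    open my $fh, '<', $_ or return;         # skip unreadable files
    binmode $fh;
    my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    push @{ $by_sum{$sum} }, $File::Find::name;
}, @ARGV);

for my $sum (keys %by_sum) {
    my @files = @{ $by_sum{$sum} };
    next unless @files > 1;    # 2a: matches; for 2b use: next unless @files == 1
    print "# $sum\n";
    print "$_\n" for @files;
    print "\n";
}
You'd run it as, say, perl finddups.pl /home/me/files1 /home/me/files2 (the script name here is just an example).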

HTH
 
Old 05-31-2008, 08:31 PM   #3
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 78
Someone has done the brunt of the work for you with a utility called md5deep. One of its features is a “recursive” md5. The following pipeline gives you a wealth of information:
Code:
md5deep -r folder1 folder2 folder3 | sort
If you have GNU uniq, you can filter this further depending on whether you want to find the unique ones or the repeated ones (up to the first 32 characters). For example, this
Code:
md5deep -r folder1 folder2 folder3 | sort | uniq -w32 --all-repeated=separate | cut -d' ' -f3
gives groups of all files which are the same, with the groups separated by a blank line.

If you want all files which have a unique md5sum, you can do
Code:
md5deep -r folder1 folder2 folder3 | sort | uniq -w32 -u | cut -d' ' -f3
If you do not have access to GNU uniq, or you feel more comfortable with Perl, you can use it as well.
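For instance, a short filter along these lines (an untested sketch; it assumes md5deep's usual output of the digest followed by whitespace and the path) would print the repeated groups:
Code:
#!/usr/bin/perl
# Sketch: read md5deep output on stdin and print groups of files
# sharing a digest, with a blank line between groups.
use strict;
use warnings;

my %group;
while (<STDIN>) {
    chomp;
    my ($sum, $file) = split /\s+/, $_, 2;   # digest, then the path
    push @{ $group{$sum} }, $file;
}
for my $sum (sort keys %group) {
    next unless @{ $group{$sum} } > 1;       # use == 1 instead for the unique ones
    print "$_\n" for @{ $group{$sum} };
    print "\n";
}
You would pipe into it, e.g. md5deep -r folder1 folder2 | perl filter.pl (the script name is just an example).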

P.S.
If you are up to it, the Perl solution Chris suggests will have a much faster runtime (if that’s important to you)

Last edited by osor; 05-31-2008 at 08:51 PM. Reason: added postscript
 
Old 06-02-2008, 04:28 PM   #4
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
Here's one I knocked up for myself. I've used it for housekeeping old disks
and partitions over the years.
Beware: zero-byte files are included (put ! -size 0 in the 'find' to fix).


Use it like this, with absolute pathnames. Put the master dir first if you want to remove duplicates in the subsequent ones.
Code:
find-dups.pl dir1 ...  > dupes.sh  # creates a shell script to rm dupes
sh dupes.sh > recover.sh           # does the rm and create a  script to restore
Code:
#!/usr/bin/perl
# use Env;
# use lib "$HOME/Perl5Lib";
# use hashdump;

$remove  = "rm -f";
$options = "";    # e.g. set to "! -size 0" to skip zero-byte files

die "need absolute pathnames: @ARGV\n" unless @ARGV;
die "need *absolute* pathnames: @ARGV\n" if grep !m|^/|, @ARGV;

# md5sum every regular file under the directories given on the command line
open IN, "find @ARGV $options -type f -exec md5sum \{} \\; |";

sub do_find {

    my $IN = shift;
    while (<$IN>) {

	# each line of md5sum output is "<checksum>  <filename>"
	($sum, $file) = split " ", $_, 2;
	chomp $file;
	# key on the checksum; \Q backslash-escapes the filename for the shell
	push @{$H{$sum}}, "\Q$file";
    }
}

# emit "echo cp kept dupe" lines: when dupes.sh runs, these echo the
# commands that end up in recover.sh, so removed copies can be restored
sub recover {
    @dupe = @_;
    $saved = shift @dupe;

    print "echo cp \Q$saved\E \Q$_\E\n" foreach @dupe;

}

do_find \*IN;
# list separator: every element after the first is preceded by "rm -f<TAB>",
# so the first copy stays on a commented line and the rest get removed
$" = "\n$remove\t";
print "\n";
while (($k, $v) = each %H) {
    next unless @$v > 1;    # only checksums shared by more than one file
    @list = @$v;
    print "\n# duplicate:$k\n#\t@list\n";
    recover @list;

}
This program works fine for this sort of thing.
All the funny quoting you see allows for all sorts of funny
filenames, including spaces and brackets etc.

The advantage of doing it like this, i.e. generating a script
to do the work, is that you can give it a good check first.

I also added the get-out clause, so you can easily test it
and restore.



Edit: on SourceForge there is an mp3 compare utility, which I assume checks the music itself, i.e. ignoring tags.

Last edited by bigearsbilly; 06-02-2008 at 06:35 PM.
 
Old 06-03-2008, 12:40 AM   #5
On2ndThought
LQ Newbie
 
Registered: Apr 2007
Posts: 13

Original Poster
Rep: Reputation: 0
Thank You!!! :)

Gentlemen, THANK YOU!!!

All three of you gave me excellent ideas and pointers on what direction to go next! Thank you SO MUCH!!!

Chris, your summation of what I want is dead on, but I was a little worried because it's been so many years since I've done any coding, and until now I've never even taken a first look at Perl. But your links give me an excellent place to start, and I'll be able to use them as I start doing more and more coding for the misc. small things I want/need to do.

Osor, your suggestion of md5deep was a godsend. After I tried it out, I saw immediately how it could do the bulk of what I need done. At the same time though, I was a bit confused why you indicated that my ignorant attempts at writing a Perl script would result in a faster runtime. Guess I'd just have to try both ways to really see why you say that.

bigearsbilly, I'm a bit short on time tonight, so I can't really study what you've written. I've spent the last several minutes trying to grok it, but clearly I'm going to need to actually sit down and read more about Perl syntax and commands to make real sense of it. On the other hand, I see your code as an ideal example to study as I work towards my goal of learning more about Perl.

So again, I thank each of you SO MUCH!!!
 
Old 06-03-2008, 01:31 AM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244
If your system has the md5sum command, that can also be used. Here's one I did long ago for another purpose, but what you need is the part that does the md5sum comparison.
Code:
awk 'BEGIN{
    ################## Simple File/Dir Synchronizer #################
    ## Author: GhostDog74                                       #####
    ## Date: Now
    #################################################################
    q="\047"
    
    # Modifiable variables.
    source="/source" 
    destination="/destination"
    
    #### Remove all empty source directories #####
    FindEmptyDir = "find "source" -type d -empty -exec rm -rf "q "{}" q " \\;"
    print "Find empty directory command: " FindEmptyDir
    # system(FindEmptyDir) ## uncomment to use
    
    ####  Check all directories first, make directories if not exists #####
    FindSrcDirCmd = "find "source" -type d" 
    while ( FindSrcDirCmd | getline flinesrc ) {   
        olddir=flinesrc                        
        gsub( source,destination,flinesrc)
        tcmd = "test -d "q flinesrc q
        r = system(tcmd)
        if ( r == 1) {            
            create = "mkdir -p "flinesrc
            print "Create directory command: "create
            #system(cmd)            #uncomment to use.
        }else if ( r==0 ){
                print "Directory "flinesrc" exists"
        }
    }
    #####  Then check the files #########
    FindSrcFilesCmd = "find "source" -type f"
    while ( FindSrcFilesCmd | getline flinesrc ) {
        orgfile = flinesrc
        gsub( source,destination,flinesrc)
        tcmd = "test -f "q flinesrc q
        r = system(tcmd)
        if ( r == 1) {          
            print "Moving source file: "orgfile " to destination " flinesrc  
            movefilecmd = "mv "q orgfile q" "q flinesrc q
            print "move command: "  movefilecmd
            #system(movefilecmd)            #uncomment to use.
        }else if ( r==0 ) {
            print "File "flinesrc" exist..."
            
            ### Start to compare files using md5sum###
            md5cmd1 = "md5sum <"orgfile
            md5cmd2 = "md5sum <"flinesrc
            md5cmd1 | getline md5r1
            md5cmd2 | getline md5r2
            if ( md5r1 == md5r2 ) {
                print "File "orgfile "and file "flinesrc" are the same"  
                print "removing original file "orgile
                rmcmd = "rm "q orgfile q
                print rmcmd
                #system(rmcmd) ## uncomment to use
            }
            close(md5cmd1)
            close(md5cmd2)
            close(rmcmd)            
        }
    }
}'
 
Old 06-03-2008, 01:50 AM   #7
billymayday
LQ Guru
 
Registered: Mar 2006
Location: Sydney, Australia
Distribution: Fedora, CentOS, OpenSuse, Slack, Gentoo, Debian, Arch, PCBSD
Posts: 6,678

Rep: Reputation: 122
Check out fdupes as well. It goes a step beyond what you describe, in that it does the md5sum check and, if the sums match, does a byte-by-byte comparison. Very handy utility.
 
Old 06-03-2008, 10:29 AM   #8
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
I think I used fdupes before myself,
but I needed mine to run standalone anywhere (BSD, Solaris, Linux) without bothering with all that CPAN and package nonsense, so I kept it pretty standard.
(I had an old disk on my Solaris box to clean by booting with a Puppy live CD.)


On2ndThought:
it's quite simple (though I am using arcane Perl references, which take a while to get, like C pointers do).

Basically it's this:

Go through every file and take its md5sum.
Use that as a key in the hash (associative array), with the filenames as the data, stored as a list.

Any duplicate sum is appended to the list.
Any list with more than one member must therefore be duplicates.
(Well, almost: the chance of two different files having the same md5sum is very, very small.)
Simple.

The silly \Q stuff makes sure anything that is not a letter, digit or
underscore is quoted; it becomes unreadable but very safe.


Perl is a bit of a pain in the arse because you need to jump through hoops
to get arrays of lists, but there you go.
I quite enjoy it now, but it takes a while to 'get' it.
 
Old 06-03-2008, 11:22 AM   #9
gnashley
Amigo developer
 
Registered: Dec 2003
Location: Germany
Distribution: Slackware
Posts: 4,928

Rep: Reputation: 612
There's a program called 'hardlink(s?)' which may do some of what you want, or more. If I understand correctly, it is meant to find duplicate files, keep a single copy, and replace the others with hardlinks to it. But it may have a listing functionality which would be helpful.
 
Old 06-03-2008, 12:05 PM   #10
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
The only problem with a hardlinks approach would be that you can't
create them across different filesystems, i.e. you couldn't
check different disks or even separate partitions.
 
Old 06-03-2008, 05:38 PM   #11
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 78
Quote:
Originally Posted by On2ndThought View Post
I was a bit confused why you indicated that my ignorant attempts at writing a PERL script would result in faster runtime.
What I meant was that, if done right, a Perl solution should be faster because hashes can be used as containers rather than implementing a full “sort” operation (which tends to be expensive).

Btw, I had never heard of fdupes—it seems like the optimal solution.
 
Old 06-03-2008, 06:26 PM   #12
billymayday
LQ Guru
 
Registered: Mar 2006
Location: Sydney, Australia
Distribution: Fedora, CentOS, OpenSuse, Slack, Gentoo, Debian, Arch, PCBSD
Posts: 6,678

Rep: Reputation: 122
Quote:
Originally Posted by osor View Post
Btw, I had never heard of fdupes—it seems like the optimal solution.
It's very useful - I used it to sort out a bit of a mess I'd created in quite a large photo library where I'd kept pre-culling backups, etc, but had no idea what was actually unique.
 
  


