Need a certain program to find true duplicate files
First, please let me apologize because I know this question does not really belong in this forum. But quite honestly, I didn't know WHERE to post it, so I'm posting it here, in the sincere hopes that someone will tell me where to go.
OK, I need a program. I am a complete newbie to Linux, and I think it's very possible that the program I need is either 1) already available, or 2) fairly easily accomplished within a Bash (or Perl?) script. But like I said, being a total newbie, I'm really at a complete loss as to where to even begin.
So let me describe what I need, and I ask the good people here to please tell me where I can go to 1) either correctly ask this question, or 2) do the research to find/create a solution myself.
In simplest terms, I need a program that will compare two folders (recursively, if requested by a user-supplied switch) and output either all files that have identical hash values, or all files for which there is no identical hash match. (Which list is output would be determined by a user-supplied switch.)
Several months ago I did a very quick look around (among Windows programs) and found any number of utilities that will do this in a superficial manner based on the filename/date/size/type. But this will not meet my needs. Further, most of the utilities I found require that the matching files be in the same relative path within the compared folders. This means that if two files are actually identical, but reside in different paths within the two folders, these utilities will not present them to the user as being identical.
In my mind I can see the bare/rough outline of how this could be written with a simple script (BASH? PERL?). But it's been 18+ years since I've done any programming (using AREXX on my old Amiga!), and so I wouldn't even know where to begin to figure out how to write this myself. And since this basic idea seems like such a universally needed tool, I'd be surprised if there were not something already written that did this. Sadly though, if a utility like this already exists, I haven't a clue where to find it.
Again, I *KNOW* this forum is *NOT* the correct place to ask this question, and for those of you who are peeved by my posting here, I give you my sincere apology. I just hope that someone will read what I need and be kind enough to point me in the correct direction so that I can either ask the question in the appropriate location, or point me in a solid direction to find a viable solution for my needs.
(If anyone cares, the need for this utility is due to the fact that, at various times over the years, I have copied large collections of files (photos, mp3's, and others) to multiple computers. Over time I moved some of these files around, renamed others, and so each computer became organized differently from each other computer. Now I want to consolidate all these files into one central repository. Unfortunately, this means I have many files that are actually identical, but which may have different names/dates, and/or reside in different relative paths. Now I need a way to list out those files which really are identical, so I can decide which ones to keep/move, and which to delete. Additionally, I need to find all files which are unique and not duplicated so that I can move those where I deem appropriate.)
Thanks so much for your patience with my loooong post, and for pointing me in the correct direction.
Well, that's definitely a programming question, so I wouldn't worry about using this forum.
Sounds like you want to compare every file with every other file and list either matches or non-matches.
In other words:
1. Create a list of all files from some common root. You can arrange this before you start, e.g.
/home/me/files1
/home/me/files2
etc.
2. Then, for each file in the master list, either
2a. list matches, OR
2b. list non-matches.
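Those two steps can be sketched naively in shell (assuming bash): build a master list, then compare every pair byte-for-byte with cmp. This is only a sketch for small trees, since the pairwise comparisons get slow fast, and it uses made-up /tmp paths for illustration:

```shell
# Set up a small demo tree (hypothetical paths, illustration only)
rm -rf /tmp/cmpdemo
mkdir -p /tmp/cmpdemo/files1 /tmp/cmpdemo/files2
echo hello > /tmp/cmpdemo/files1/a.txt
echo hello > /tmp/cmpdemo/files2/b.txt
echo world > /tmp/cmpdemo/files2/c.txt

# 1. master list of all files under the common root
find /tmp/cmpdemo -type f | sort > /tmp/master.list

# 2. compare every pair; the \< ordering test prints each match only once
while read -r a; do
    while read -r b; do
        if [ "$a" \< "$b" ] && cmp -s "$a" "$b"; then
            echo "match: $a == $b"
        fi
    done < /tmp/master.list
done < /tmp/master.list
```

For the demo tree above, only the a.txt/b.txt pair is reported. The checksum-based approaches below avoid the pairwise comparisons entirely.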
Someone has done the bulk of the work for you with a utility called md5deep. One of its features is a “recursive” md5. The following pipeline gives you a wealth of information:
Code:
md5deep -r folder1 folder2 folder3 | sort
If you have GNU uniq, you can filter this further depending on whether you want to find the unique files or the repeated ones, comparing only the first 32 characters of each line (i.e. the checksum itself).
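For example, a minimal sketch using plain md5sum (whose output format, 32 hex digits then the filename, matches md5deep's) on a throwaway test tree. The -w32 option tells GNU uniq to compare only the checksum, -D prints every line of each repeated group, and -u prints only unrepeated lines:

```shell
# Throwaway demo tree (hypothetical paths, illustration only)
rm -rf /tmp/dupdemo
mkdir -p /tmp/dupdemo/a /tmp/dupdemo/b
echo "same content" > /tmp/dupdemo/a/one.txt
echo "same content" > /tmp/dupdemo/b/renamed.txt
echo "unique stuff" > /tmp/dupdemo/a/only_here.txt

# All files whose checksums repeat, i.e. the true duplicates
find /tmp/dupdemo -type f -exec md5sum {} + | sort | uniq -w32 -D

# Only files whose checksum appears exactly once, i.e. the unique files
find /tmp/dupdemo -type f -exec md5sum {} + | sort | uniq -w32 -u
```

With md5deep installed, the find/md5sum stage collapses to md5deep -r /tmp/dupdemo.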
Here's one I knocked up for myself. I've used it OK for housekeeping old disks
and partitions over the years.
Beware: zero-byte files are included (put ! -size 0 in the 'find' to fix that).
Use it like this,
with absolute pathnames; put the master dir first if you want to remove duplicates in the subsequent ones.
Code:
find-dups.pl dir1 ... > dupes.sh # creates a shell script to rm dupes
sh dupes.sh > recover.sh # does the rm and create a script to restore
Code:
#!/usr/bin/perl
# use Env;
# use lib "$HOME/Perl5Lib";
# use hashdump;
$remove = "rm -f";
$options = "";    # extra 'find' options, e.g. "! -size 0" to skip empty files
die "need absolute pathnames\n" unless @ARGV;
die "need *absolute* pathnames: @ARGV\n" if grep !m|^/|, @ARGV;
open IN, "find @ARGV $options -type f -exec md5sum \{} \\; |";

# read "checksum  filename" lines and group the filenames by checksum
sub do_find {
    my $IN = shift;
    while (<$IN>) {
        ($sum, $file) = split " ", $_, 2;
        chomp $file;
        push @{$H{$sum}}, "\Q$file";    # \Q backslash-escapes shell metacharacters
    }
}

# print 'echo cp ...' lines; when dupes.sh runs, these emit the restore script
sub recover {
    @dupe = @_;
    $saved = shift @dupe;
    print "echo cp \Q$saved\E \Q$_\E\n" foreach @dupe;
}

do_find \*IN;
# list separator: every element after the first prints on a new line,
# prefixed with the rm command, so only the first copy survives
$" = "\n$remove\t";
print "\n";
while (($k, $v) = each %H) {
    next unless @$v > 1;    # keep only checksums shared by several files
    @list = @$v;
    print "\n# duplicate:$k\n#\t@list\n";
    recover @list;
}
This program works fine for the same sort of thing.
All the funny quotes you see allow for all sorts of funny
filenames, including spaces and brackets etc.
The advantage of doing it like this, i.e. making a script
to do the work, is that you can give it a good check first.
I also added the get-out clause, so you can easily test it
and restore.
Edit: on SourceForge there is an MP3 compare utility, which I assume checks the music itself, i.e. ignoring tags.
Last edited by bigearsbilly; 06-02-2008 at 06:35 PM.
All three of you gave me excellent ideas and pointers on what direction to go next! Thank you SO MUCH!!!
Chris, your summation of what I want is dead on, but I was a little worried because it's been so many, many years since I've done any coding, and until now I've never even taken a first look at Perl. But your links give me an excellent place to start, and I'll be able to use them as I start doing more and more coding for the misc. small things I want/need to do.
Osor, your suggestion of md5deep was a godsend. After I tried it out, I saw immediately how it could do the bulk of what I need done. At the same time though, I was a bit confused why you indicated that my ignorant attempts at writing a Perl script would result in faster runtime. Guess I'll just have to try both ways to really see why you say that.
bigearsbilly, I'm a bit short on time tonight, so I can't really study what you've written. I've spent the last several minutes trying to grok it, but clearly I'm going to need to actually sit down and read more about Perl syntax and commands to make real sense of it. On the other hand, I see your code as an ideal example to study as I work towards my goal of learning more about Perl.
Check out fdupes as well. It goes a step beyond what you describe in that it does the md5sum check and, if the sums match, a byte-for-byte comparison. Very handy utility.
I think I used fdupes before myself,
but I needed mine to work standalone anywhere (BSD, Solaris, Linux) without bothering with all that CPAN and package nonsense, so I kept it pretty standard.
(I had an old disk on my Solaris box to clean by booting with a Puppy live CD.)
On2ndThought:
It's quite simple (though I am using arcane Perl references, which take a while to get, like C pointers do).
Basically it's this:
Go through every file and take its md5sum.
Use that as a key in the hash (associative array), with the filenames as the data, stored as a list.
Any file with a duplicate sum is appended to the list.
Any list with more than one member must therefore contain duplicates.
(Well, almost: the chances of two different files sharing an md5sum are vanishingly small.)
Simple.
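That group-by-checksum idea can also be sketched without Perl, in a few lines of shell and awk. This is a rough equivalent of the script above, not the script itself; substr($0, 35) skips the 32-character checksum plus the two separator spaces in md5sum's output:

```shell
# Demo tree with one duplicated file (hypothetical paths)
rm -rf /tmp/hashdemo
mkdir -p /tmp/hashdemo
echo "abc" > /tmp/hashdemo/x
echo "abc" > /tmp/hashdemo/y
echo "zzz" > /tmp/hashdemo/z

find /tmp/hashdemo -type f -exec md5sum {} + |
awk '{
    sum  = $1                        # the checksum is the hash key
    file = substr($0, 35)            # the rest of the line is the filename
    files[sum] = files[sum] "\n  " file
    count[sum]++
}
END {
    for (sum in count)
        if (count[sum] > 1)          # lists with more than one member: duplicates
            print "duplicate " sum files[sum]
}'
```

For the demo tree, this prints one "duplicate" group containing x and y, while z is left out.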
The silly \Q stuff makes sure anything that is not a letter, digit or
underscore is quoted; it becomes unreadable but very safe.
Perl is a bit of a pain in the arse because you need to jump through hoops
to get hashes of lists, but there you go.
I quite enjoy it now, though it takes a while to 'get' it.
There's a program called 'hardlink(s?)' which may do some of what you want, or more. If I understand correctly, it is meant to find duplicate files, remove all copies but one, and then create hardlinks to that single copy for the others. It may also have a listing functionality, which would be helpful.
The only problem with a hardlinks approach would be that you can't
create them across different filesystems, i.e. you couldn't
check different disks or even separate partitions.
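For reference, a hard link is just an extra directory entry for the same inode, which is exactly why it can't span filesystems. A quick illustration (GNU stat assumed, file names made up):

```shell
rm -f /tmp/hl_original /tmp/hl_link
echo "some data" > /tmp/hl_original
ln /tmp/hl_original /tmp/hl_link    # hard link: no second copy of the data

# Both names report the same inode number
stat -c %i /tmp/hl_original
stat -c %i /tmp/hl_link

# Removing one name leaves the data reachable through the other
rm /tmp/hl_original
cat /tmp/hl_link
```

Attempting the same ln across two mount points fails with "Invalid cross-device link", which is the limitation described above.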
Quote:
I was a bit confused why you indicated that my ignorant attempts at writing a PERL script would result in faster runtime.
What I meant was that if done right, a Perl solution should be faster because hashes can be used as containers rather than implementing a full “sort” operation (which tends to be expensive).
Btw, I had never heard of fdupes—it seems like the optimal solution.
It's very useful - I used it to sort out a bit of a mess I'd created in quite a large photo library where I'd kept pre-culling backups, etc, but had no idea what was actually unique.