Old 03-27-2006, 08:10 PM   #1
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Rep: Reputation: 16
Purge duplicate files in one directory


I seem to run into this problem often. Linux sometimes renames files automatically, so if I had 'Car_pic.png' the duplicate would be named 'Car_pic(1).png'.

I am trying to figure out how to do this once and for all. I believe I need to pipe grep and sort commands together, sending duplicates to null (deleting them) or to another directory as specified. Whether it's documents, logs, images, fonts or whatever, I always seem to end up in duplicate hell; I'm sure most of us do. Duplicates are a nightmare, and any database admin will know what I mean.

However, if the problem is clear, as in my .png example above, and the only difference between two files is an underscore '_', then this should be easy enough to sort out. A more complex command might also check the file size, as one copy may have zero data (corrupt), and we don't want to keep that one!

I would be grateful for any tips.

Last edited by smudge|lala; 10-24-2006 at 03:54 AM.
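For the zero-data "corrupt" copies mentioned above, a minimal sketch (assuming GNU find; not something suggested in the thread) that lists zero-length files before anything gets deleted:

Code:
# list regular files of size zero in the current directory only
find . -maxdepth 1 -type f -size 0
# once the list looks right, the same command can remove them:
# find . -maxdepth 1 -type f -size 0 -delete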
 
Old 03-27-2006, 08:22 PM   #2
Emerson
LQ Sage
 
Registered: Nov 2004
Location: Saint Amant, Acadiana
Distribution: Gentoo ~amd64
Posts: 7,661

Rep: Reputation: Disabled
http://monsterden.net/software/dupefinder

It runs from the console and also has a Qt GUI.
 
Old 03-27-2006, 08:31 PM   #3
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Original Poster
Rep: Reputation: 16
It does work, thank you. If anyone knows a command sequence or script that might do something similar I'd be grateful for any input, as I'm sure bash will do it. Dupefinder did find duplicates in under 10 seconds, but I'm not about to go through and mark 4111 files! Hence wanting to automate it by specifying the desired output.

mv file \*.png file_*.png or something

Last edited by smudge|lala; 03-27-2006 at 08:34 PM.
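One alternative not raised in the thread: if an off-the-shelf console tool is acceptable, fdupes (assuming it is packaged for your distribution) compares file contents and can delete duplicates without marking each one by hand:

Code:
# list sets of duplicate files under the current directory
fdupes -r .
# delete duplicates without prompting, keeping the first file in each set
# (destructive -- run the listing form first)
fdupes -rdN .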
 
Old 03-27-2006, 09:43 PM   #4
Matir
LQ Guru
 
Registered: Nov 2004
Location: San Jose, CA
Distribution: Debian, Arch
Posts: 8,507

Rep: Reputation: 128
This little scripting sequence should do what you need:
Code:
md5sum * | sort | uniq -d -w32 | cut -d' ' -f3 | xargs rm
The md5sum checks that the *contents* of files are the same, rather than the names. sort groups identical checksums together, uniq -d keeps one entry from each group of duplicates, and cut strips the checksum so that the remaining file names are piped to rm.
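One caveat, as a sketch rather than a correction: cut -d' ' -f3 and xargs both trip over file names that contain spaces or quotes, which may be what the errors later in the thread are showing. A more defensive version of the same idea, assuming bash 4+ and GNU md5sum, that moves duplicates into a made-up DUPES/ directory instead of deleting them outright:

Code:
#!/bin/bash
# keep the first file seen for each checksum; move later copies into DUPES/
mkdir -p DUPES
declare -A seen
for f in *; do
    [ -f "$f" ] || continue                  # skip directories (e.g. DUPES itself)
    sum=$(md5sum -- "$f" | cut -d' ' -f1)    # checksum of the file contents
    if [ -n "${seen[$sum]}" ]; then
        mv -- "$f" DUPES/                    # duplicate content: move it aside
    else
        seen[$sum]="$f"                      # first file with this checksum: keep it
    fi
done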
 
Old 03-27-2006, 10:21 PM   #5
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Original Poster
Rep: Reputation: 16
Thanks for that. I do get an error, unfortunately.

rm: cannot remove `Action': No such file or directory

I thought I could edit the command to copy into a new directory rather than remove, and that returned:

Code:
User@localhost $ md5sum * | sort | uniq -d -w32 | cut -d' ' -f3 | xargs cp Unique/
md5sum: Unique: Is a directory
xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option
cp: `Fanatika': specified destination directory does not exist
Try `cp --help' for more information.
I know bash can do this, I just can't figure out how. xargs is a really powerful command!

Last edited by smudge|lala; 03-27-2006 at 10:22 PM.
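My reading of those errors, for what it's worth: xargs appends the file names after cp Unique/, so cp treats Unique/ as a source and the last file name in the list as the destination, and the "unmatched single quote" means one of the names contains a quote character, which xargs treats specially by default. With GNU cp and xargs, a sketch along these lines keeps the directory as the target and turns off the quote handling:

Code:
# -f3- keeps the whole file name even if it contains spaces, -d '\n' makes
# xargs split on newlines only (quotes pass through), and -t names the target;
# 2>/dev/null just hides md5sum's complaint that Unique/ is a directory
md5sum -- * 2>/dev/null | sort | uniq -d -w32 \
    | cut -d' ' -f3- | xargs -d '\n' cp -t Unique/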
 
Old 03-27-2006, 10:35 PM   #6
Matir
LQ Guru
 
Registered: Nov 2004
Location: San Jose, CA
Distribution: Debian, Arch
Posts: 8,507

Rep: Reputation: 128
Try using:
Code:
xargs -i cp {} BACKUPS/
This is much like the find -exec syntax.
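A small aside: in current GNU xargs the -i option is deprecated in favour of -I, which takes the placeholder explicitly, so the same pipeline would be written as below (with the same caveats about spaces in file names):

Code:
# same pipeline, with the non-deprecated -I spelling
md5sum * | sort | uniq -d -w32 | cut -d' ' -f3 | xargs -I {} cp {} BACKUPS/
# for comparison only, the find -exec syntax the reply mentions looks like:
#   find . -name 'something' -exec cp {} BACKUPS/ \;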
 
Old 03-28-2006, 09:30 AM   #7
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Original Poster
Rep: Reputation: 16
Duplicate sort command not working

Thank you for your input, guys, but it still isn't working. The uniq command looks interesting, especially after the md5sum. The command hangs, which makes me think an option/switch hasn't been set correctly, possibly with md5sum *, but I'm not sure. Maybe xargs isn't getting the correct input to proceed?

In considering how to approach such a sort and purge, I suppose the system could take one file and search for a duplicate by md5 anywhere, though the same directory is more likely, then drop one of the two files if a duplicate is found. If this is how the command is trying to work, then where is all the comparison data, all the md5sums, going? Can xargs handle input from 4000 files?

I tried with only 6 files, 3 sets of duplicates. I get:

Code:
md5sum: BACKUPS: Is a directory
cp: cannot stat `XFile': No such file or directory
for each result. This is with the command I issued:

md5sum * | sort | uniq -d -w32 | cut -d' ' -f3 | xargs -i cp {} BACKUPS/

Trying again with md5sum * | sort | uniq -d -w32 | cut -d' ' -f3 | xargs rm I get the same error. I'm only using 6 small files to test, and whether they're binary or text they should still work, seeing as we're using md5, right?
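A guess at what is going wrong, since it never quite gets pinned down in the thread: "BACKUPS: Is a directory" is md5sum complaining that * matched the BACKUPS directory itself, and "cannot stat" suggests the real file name contains a space, so cut -d' ' -f3 only hands cp the first word of it. A whitespace-safe sketch of the same pipeline, assuming GNU find, md5sum, sed, xargs and cp:

Code:
# find -type f lists regular files only, so the BACKUPS directory is skipped;
# -print0 / xargs -0 keep names with spaces intact through to md5sum;
# uniq -d -w32 keeps one line per group of identical checksums, sed strips the
# checksum off the front, and xargs -d '\n' hands cp whole names, quotes and all
find . -maxdepth 1 -type f -print0 \
    | xargs -0 md5sum \
    | sort \
    | uniq -d -w32 \
    | sed 's/^[0-9a-f]\{32\}  //' \
    | xargs -d '\n' cp -t BACKUPS/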
 
Old 03-30-2006, 07:15 PM   #8
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Original Poster
Rep: Reputation: 16
Perhaps if they are named slightly differently, such as all my duplicates having an underscore '_', as in big_cat.png, then surely I can do something like:

cp *.png | grep '_' dupebak/, although this doesn't work...

Any ideas?
 
Old 03-30-2006, 08:08 PM   #9
Matir
LQ Guru
 
Registered: Nov 2004
Location: San Jose, CA
Distribution: Debian, Arch
Posts: 8,507

Rep: Reputation: 128
If the only difference is the underscore, and no original files contain underscores, then you could do:
Code:
mv *_* newdir/
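Two footnotes, since the thread moves on here: the earlier cp *.png | grep '_' dupebak/ attempt can't work because cp writes nothing to standard output for grep to filter (grep filters text, not command arguments), and since the very first post shows originals that already contain an underscore (Car_pic.png), it may be safer to match the '(1)'-style names than *_*. A sketch:

Code:
# move only names containing a "(digit)" marker, e.g. Car_pic(1).png;
# the parentheses have to be quoted or bash reads them as syntax
mkdir -p dupebak
mv -- *'('[0-9]')'* dupebak/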
 
  

