Question about finding duplicate files using a bash script.
I found this script online and modified it. It appears to do what I want but I figured I would check before using.
I'm trying to use sha512sum to check for duplicate files in a directory and to move the duplicates to the "duplicates" directory.
Code:
#!/bin/bash
#
# Usage: ./delete-duplicates.sh [<files...>]
#
declare -A filecksums
# No args, use files in current directory
test 0 -eq $# && set -- *
for file in "$@"
do
# Files only (also no symlinks)
[[ -f "$file" ]] && [[ ! -h "$file" ]] || continue
# Generate the checksum
cksum=$(sha512sum <"$file" | tr ' ' _)
# Have we already got this one?
if [[ -n "${filecksums[$cksum]}" ]] && [[ "${filecksums[$cksum]}" != "$file" ]]
then
echo "Found '$file' is a duplicate of '${filecksums[$cksum]}'" >&2
# echo mv -v "$file" ./duplicates
# echo mv "$file" ./duplicates
echo "$file" | mv "$file" ./duplicates
else
filecksums[$cksum]="$file"
fi
done
Quote:
Code:
# No args, use files in current directory
test 0 -eq $# && set -- *
What does this do? AFAICS it is pointless.
edit: OK, I think I understand: if you pass arguments, use those; otherwise use all files in the current directory. Yes?
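To illustrate (a made-up demo function, not part of the script): when no arguments were passed, `set -- *` replaces the empty positional parameters with every name in the current directory.

```shell
#!/bin/bash
# Sketch of what `test 0 -eq $# && set -- *` does. `demo` is a
# hypothetical stand-in for the script's argument handling.
demo() {
    # No args: replace the positional parameters with the glob of
    # everything in the current directory.
    test 0 -eq $# && set -- *
    printf '%s\n' "$@"   # print each resulting argument on its own line
}
```

Called with arguments, `demo onlythis` prints just `onlythis`; called with none, it prints every entry in the current directory.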
Quote:
Originally Posted by Mithrilhall
Code:
echo "$file" | mv "$file" ./duplicates
The echo+pipe looks completely pointless to me?!
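For reference, a minimal sketch of the fix (the filename here is a made-up stand-in for the loop variable): mv reads nothing from stdin, so the echo and the pipe can simply go.

```shell
#!/bin/bash
# Instead of: echo "$file" | mv "$file" ./duplicates
# mv ignores stdin entirely, so the echo+pipe does nothing useful.
workdir=$(mktemp -d)            # scratch dir so this sketch is self-contained
cd "$workdir" || exit 1
file="example.txt"              # stand-in for the loop variable
touch -- "$file"
mkdir -p ./duplicates           # make sure the target directory exists first
mv -v -- "$file" ./duplicates/  # -v reports the move; -- guards odd names
```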
There are a few places where you can use bash builtins (mostly string manipulation) instead of pipes.
Since you are using bash anyhow, a big and bloated shell, at least make the best use of it.
A pipe may be easier to type than an expansion like "${var%% *}", but it is resource intensive: each pipeline forks a new subshell.
And there's no quality control: what about hidden files and folders? You should also make sure the "duplicates" directory itself is excluded.
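One such builtin replacement (a sketch; `line` stands in for the command substitution in the script): parameter expansion can strip the checksum field without forking cut, awk, or tr.

```shell
#!/bin/bash
# Sketch: keep only the hash from sha512sum's "hash  -" output using a
# builtin parameter expansion instead of a pipe (no extra fork).
line=$(sha512sum </dev/null)   # e.g. "cf83e1...  -"
cksum=${line%% *}              # builtin: drop everything from the first space
echo "${#cksum}"               # prints: 128 (sha512 hex digest length)
```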
BTW, several excellent duplicate finding programs exist: fdupes, fslint...
Since this is a resource intensive task, better to use something coded in C maybe?
Unless you understand it or trust the source, I'd not recommend running a script found online. But that's why you're asking here.
Besides reviewing each line yourself to understand it, there are the methods of partial review, testing, and using set -xv to enable debugging when you run your tests.
By this I mean, review the script to the best of your capabilities, and also do some searching to help improve your understanding. Create a test set of files, some duplicates, some not, and put them in a test location. Run the script with debug enabled, by adding a "set -xv" line right after the script shebang line. Run it, observe the verbose output to evaluate what it's doing. Review what it did with the test files and confirm it has done things correctly.
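The test setup described above can be sketched like this (the paths and filenames are made up for illustration):

```shell
#!/bin/bash
# Sketch: a scratch directory with some duplicate and some unique files,
# ready for running the script under test against it.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'same\n'   > a.txt
printf 'same\n'   > b.txt      # duplicate of a.txt
printf 'unique\n' > c.txt
mkdir -p duplicates
# To trace the script under test, add this right after its shebang line:
#   set -xv    # -v echoes lines as read, -x echoes each expanded command
```

Then run the script in this directory and confirm b.txt (and only b.txt) ends up in duplicates/.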
Another thing is that I feel diff is fundamental enough to accomplish this, for binary and non-binary files alike. But using a checksum program is a good idea too. I'm assuming that different versions of a checksum tool would still generate the same checksum: even though the source code changes, the checksum specification has not, unless different options are used when invoking the checksum application.
Although diff is not exactly used here, some form of comparison is performed with that if-statement. Further, the script seems to be constructed to use the filename arguments given to it. I wonder how it will work with wildcards: a wildcard may expand to directory names, which could cause problems within the script, so it should verify that the item being tested really "is" a file.
Shorter answer is that I wouldn't use this; I'd perform it differently, by using diff, checking that things are files, and also testing it with wildcards.
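A comparison-based check could be sketched like this (cmp rather than diff, since cmp handles binary files and stops at the first differing byte; `are_dupes` is a made-up helper name):

```shell
#!/bin/bash
# Sketch: byte-for-byte comparison instead of checksums. Pairwise
# comparison is quadratic in the number of files, so this only makes
# sense for small sets.
are_dupes() {
    [[ -f $1 && -f $2 ]] || return 1   # only compare regular files
    cmp -s -- "$1" "$2"                # -s: silent; exit 0 iff identical
}
```

Then `are_dupes a.txt b.txt && echo duplicate` reports a match, and directories are rejected up front.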
I have a simple alias that I use to find duplicates. It's not perfect, but very fast, handles spaces and newlines and whatever in filenames.
Code:
alias dupes="find . -type f -print0 | xargs -0 sha1sum | sort | uniq -D -w 32"
It doesn't move files or do anything about them. I am sure it could be added, but doing so is a lot more complicated. Some things to think about:
- When there are duplicate files, should it move or delete them? And then which one of them? Random? And where? Moving it to a folder called duplicates inside the folder being checked can't be a good solution.
- What about errors?
- Newlines and " symbols are allowed in names of folders and files. Handling those is tricky in a shell script.
- Empty files will all be considered duplicates.
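The empty-file point at least has an easy fix: a sketch of the same alias idea, but with zero-length files excluded so they are not all reported as duplicates of each other (the sample files are made up for demonstration).

```shell
#!/bin/bash
workdir=$(mktemp -d)            # scratch directory for the demonstration
cd "$workdir" || exit 1
printf 'same\n' > a.txt
printf 'same\n' > b.txt         # a real duplicate pair
: > empty1; : > empty2          # two empty files that should NOT be reported
# -size +0c keeps only files strictly larger than 0 bytes:
find . -type f -size +0c -print0 | xargs -0 sha1sum | sort | uniq -D -w 40
```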
Maybe some solution exists, but I haven't seen any yet. I've used this alias many times before. In my home directory, it usually reports lots of files that are duplicates, but often for a good reason: .htaccess files and so on. So I never trusted any script to just move or delete them.
Edit:
There's a bug in my alias. I think I've been playing with different checksum algorithms some time ago. With sha1, the length should be 40, not 32:
Code:
alias dupes="find . -type f -print0 | xargs -0 sha1sum | sort | uniq -D -w 40"
If you modify it, I would suggest changing from sha512sum to md5sum. The latter is considered cryptographically weak, but this is not a crypto use: you are not trying to defend against an adversary here. So md5 is sufficient and much faster (or uses less CPU, if your storage device is slow to read).
If 'twas me, I'd be inclined to spend some effort only targeting same-sized files. It would save considerably more CPU, and two files of different sizes cannot be identical, so nothing would be missed; false positives would be about as likely as a hash collision ...
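A sketch of that size-first idea (the `scan` helper is made up; it assumes GNU stat's `-c %s`, so on BSD/macOS you would use `stat -f %z` instead): only files whose size matches at least one other file ever get checksummed.

```shell
#!/bin/bash
# Sketch: group files by size; only hash sizes seen more than once.
declare -A count first
scan() {
    while IFS= read -r -d '' file; do
        size=$(stat -c %s -- "$file")
        if (( ${count[$size]:-0} == 1 )); then
            sha512sum -- "${first[$size]}"   # hash the first file of this size, lazily
        fi
        if (( ${count[$size]:-0} >= 1 )); then
            sha512sum -- "$file"             # hash every later file of this size
        fi
        count[$size]=$(( ${count[$size]:-0} + 1 ))
        : "${first[$size]:=$file}"           # remember the first file per size
    done < <(find "$1" -type f -print0)
}
```

Piping scan's output through `sort | uniq -w 128 -D` (128 hex characters for sha512) would then list the actual duplicates.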
Depends on what counts as a duplicate: e.g. if you have two identical photos but one has extra metadata attached, some people would say that's a duplicate.
Of course, as has been already pointed out, there are existing programs that no doubt can deal with all these sorts of intricacies in a performant way. No need to waste time re-writing bugs that have already been discovered and fixed.
Quote:
Originally Posted by Guttorm
- When there are duplicate files, should it move or delete them? And then which one of them? Random? And where? Moving it to a folder called duplicates inside the folder being checked can't be a good solution.
I have a dupefile script (originally Perl but migrating it to Python) that merely chucks everything into a database: one table for the checksum and path of the first file "owning" a particular checksum, and a second table that holds the checksum and path of duplicates. There's a "unique" constraint on the checksum in the first table: violating that constraint means we've seen the checksum before, so we insert the current checksum and path into the second table.
Rather than deleting the duplicates -- which I figure is going to confuse the heck out of whatever application might be looking for them -- I go through the second table and create symbolic links pointing to the file in the first table that has the same checksum. I unleash this periodically on whole directory trees as I'm cleaning up a gazillion files that have accumulated for many years and been merged from multiple systems over that time. If I tried keeping the checksum list in memory (as the example script seems to do) my system would be thrashing in short order.
You can do most of this "by hand" as well. Collecting all the checksums/paths, extracting the checksums ("cat cksum.lis | cut -f1 -d' '"), sorting and keeping the ones corresponding to duplicate files -- i.e., those with a count >= 2 ("sort cksums | uniq -c | grep -v ' 1 ' | awk '{print $2}'") -- and using that checksum list to find all the files that correspond to each checksum. Keep the first one and move the others to your duplicates directory. Or delete them. Or... create the symbolic links as I've been doing. The scripts to massage the checksum list and deal with the duplicates are not terribly complex.
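For what it's worth, the "by hand" pipeline can be sketched without the useless use of cat, and with `uniq -d` in place of `uniq -c | grep -v ' 1 '`. The sample files are made up, and like the quoted commands this is still fragile for filenames containing spaces or newlines:

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'same\n'  > a.txt
printf 'same\n'  > b.txt                # duplicate of a.txt
printf 'other\n' > c.txt
sums=$(sha512sum ./*)                   # "hash  path" lines, kept in memory
dups=$(awk '{print $1}' <<<"$sums" | sort | uniq -d)  # hashes seen >= 2 times
while read -r hash; do
    [[ -n $hash ]] || continue
    # all files for this hash, minus the first one (the "keeper"):
    grep "^$hash" <<<"$sums" | awk '{print $2}' | tail -n +2
done <<<"$dups"
```

With the three sample files above it prints `./b.txt`, the second copy, which you could then move, delete, or replace with a symbolic link.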