Old 10-25-2020, 12:33 AM   #1
Mithrilhall
Member
 
Registered: Feb 2002
Location: Adamstown, Pitcairn Islands
Distribution: Neon
Posts: 291

Rep: Reputation: 30
Question about finding duplicate files using a bash script.


I found this script online and modified it. It appears to do what I want, but I figured I would check before using it.

I'm trying to use sha512sum to check for duplicate files in a directory and to move the duplicates to the "duplicates" directory.

Code:
#!/bin/bash
#
# Usage:  ./delete-duplicates.sh  [<files...>]
#
declare -A filecksums

# No args, use files in current directory
test 0 -eq $# && set -- *

for file in "$@"
do
    # Files only (also no symlinks)
    [[ -f "$file" ]] && [[ ! -h "$file" ]] || continue

    # Generate the checksum
    cksum=$(sha512sum <"$file" | tr ' ' _)

    # Have we already got this one?
    if [[ -n "${filecksums[$cksum]}" ]] && [[ "${filecksums[$cksum]}" != "$file" ]]
    then
        echo "Found '$file' is a duplicate of '${filecksums[$cksum]}'" >&2
        # echo mv -v "$file" ./duplicates
        # echo mv "$file" ./duplicates
        echo "$file" | mv "$file" ./duplicates
    else
        filecksums[$cksum]="$file"
    fi
done
 
Old 10-25-2020, 05:04 AM   #2
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Quote:
Originally Posted by Mithrilhall
Code:
# No args, use files in current directory
test 0 -eq $# && set -- *
What does this do?
AFAICS it is pointless.
edit: OK, I think I understand: if you pass arguments, use those; otherwise use all files in the current dir. Yes?

Quote:
Originally Posted by Mithrilhall
Code:
echo "$file" | mv "$file" ./duplicates
The echo+pipe looks completely pointless to me?!

There are a few places where you can use bash builtins (string manipulation, mostly) instead of pipes.
Since you are using bash anyhow, a big and bloated shell, at least make the best use of it.
A pipe is easier to type than a "${[@]}"-style expansion, but it is resource intensive: a new subshell each time.
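For example (untested sketch): the hash can be split off with a parameter expansion instead of piping through tr, and the duplicate can be moved directly instead of going through echo:

Code:
# Capture the output and keep only the hash; no extra pipe/subshell needed
cksum=$(sha512sum <"$file")
cksum=${cksum%% *}    # strip everything from the first space onwards

# ...later, in the duplicate branch: move the file directly, the echo adds nothing
mv -v -- "$file" ./duplicates/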

And there's no quality control: what about hidden files and folders? You should also make sure the "duplicates" directory itself is excluded.

BTW, several excellent duplicate-finding programs already exist: fdupes, fslint...
Since this is a resource-intensive task, maybe it's better to use something coded in C?
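For example, with fdupes (from memory, check its man page for the exact flags):

Code:
fdupes -r .     # list sets of identical files under the current directory
fdupes -rd .    # same, but prompt for which copy to keep and delete the rest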

Last edited by ondoho; 10-25-2020 at 05:06 AM.
 
Old 10-26-2020, 08:38 AM   #3
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,883
Blog Entries: 13

Rep: Reputation: 4930
Unless you understand it or trust the source, I'd not recommend running a script found online. But that's why you're asking here.

Besides reviewing each line yourself to understand it, there are the methods of partial review, testing, and using set -xv to enable debugging when you run your tests.

By this I mean: review the script to the best of your ability, and also do some searching to help improve your understanding. Create a test set of files, some duplicates, some not, and put them in a test location. Enable debugging by adding a "set -xv" line right after the script's shebang line. Run it, observe the verbose output to evaluate what it's doing, then review what it did with the test files and confirm it has done things correctly.
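For example, the top of the test copy would simply be:

Code:
#!/bin/bash
set -xv    # -v echoes each line as it is read, -x traces each command as it runs; remove when done testing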

Another thing is that I feel diff is fundamental enough to accomplish this, for binary or non-binary files alike. But using a checksum program is a good idea too. I'm assuming that different versions of a checksum tool would still generate the same checksum, because even though the source code changes, the checksum specification itself does not, unless different options are used when invoking the checksum application.

Although diff is not actually used, some form of comparison is being performed by that if-statement. Further, the script is constructed to use the filename arguments given to it, and I wonder how it will work with wildcards: a wildcard may hand it directory names, and that could cause problems within the script, so it should verify that the item being tested really "is" a file.

The shorter answer is that I wouldn't use this as-is. I'd do it differently: use diff, check that things are files, and also test it with wildcards.
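For the comparison itself, something like this (a rough sketch; $a and $b are just placeholder variables, and cmp stays silent on binary files where diff only reports that they differ):

Code:
# Byte-for-byte check of two candidate files
if cmp -s "$a" "$b"; then
    echo "'$a' and '$b' have identical contents"
fi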
 
Old 10-26-2020, 09:28 AM   #4
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Rep: Reputation: 446
Hi

I have a simple alias that I use to find duplicates. It's not perfect, but it is very fast and it handles spaces, newlines, and whatever else in filenames.

Code:
alias dupes="find . -type f -print0 | xargs -0 sha1sum | sort | uniq -D -w 32"
It doesn't move the files or do anything about them. I'm sure that could be added, but doing so is a lot more complicated. Some things to think about:

- When there are duplicate files, should it move or delete them? And then which of them? A random one? And where to? Moving them to a folder called duplicates inside the folder being checked can't be a good solution.

- What about errors?

- Newlines and " symbols are allowed in names of folders and files. Handling those is tricky in a shell script.

- Empty files will all be considered duplicates (a variant that skips them is sketched below, after the corrected alias).

Maybe a good solution exists, but I haven't seen one yet. I've used this alias many times; in my home directory it usually reports lots of duplicate files, but there's usually a good reason for that, for example .htaccess files and so on. So I have never trusted any script to just move or delete them.

Edit:

There's a bug in my alias. I think I was playing with different checksum algorithms some time ago. With sha1, the length should be 40, not 32:

Code:
alias dupes="find . -type f -print0 | xargs -0 sha1sum | sort | uniq -D -w 40"
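And since empty files all hash to the same value, a variant that skips them (relies on find's -empty test, which GNU and BSD find both have):

Code:
alias dupes="find . -type f ! -empty -print0 | xargs -0 sha1sum | sort | uniq -D -w 40"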

Last edited by Guttorm; 10-26-2020 at 10:11 AM.
 
Old 10-29-2020, 09:55 PM   #5
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,681
Blog Entries: 31

Rep: Reputation: 176
If you modify it, I would suggest changing from sha512sum to md5sum. The latter is considered cryptographically weak, but this is not a crypto use; you are not trying to defend against an adversary here. So md5 is sufficient and much faster (or uses less CPU, if your storage device is slow to read).
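In the posted script that's a one-word change, since md5sum prints its output in the same "hash filename" shape:

Code:
    # Generate the checksum
    cksum=$(md5sum <"$file" | tr ' ' _)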
 
Old 10-29-2020, 11:53 PM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
If 'twas me, I'd be inclined to spend some effort targeting only same-sized files. That would save considerable CPU, and the chances of two different-sized files being the same are vanishingly small. Maybe about the same as a hash collision ...
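Something along these lines, perhaps (rough sketch; assumes GNU find/awk/xargs and filenames without embedded tabs or newlines):

Code:
#!/bin/bash
# Pass 1: list "size<TAB>path" for every regular file.
# Pass 2: hash only the files whose size occurs more than once,
#         then print groups with identical checksums.
tmp=$(mktemp)
find . -type f -printf '%s\t%p\n' > "$tmp"

awk -F'\t' 'NR==FNR {count[$1]++; next} count[$1] > 1 {print $2}' "$tmp" "$tmp" \
    | xargs -r -d '\n' sha1sum \
    | sort | uniq -D -w 40

rm -f "$tmp"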
 
Old 10-30-2020, 07:50 AM   #7
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 3,597

Rep: Reputation: 2545
Quote:
Originally Posted by syg00
If 'twas me, I'd be inclined to spend some effort targeting only same-sized files. That would save considerable CPU, and the chances of two different-sized files being the same are vanishingly small. Maybe about the same as a hash collision ...
It depends on what counts as a duplicate: e.g. if you have two copies of the same photo but one has extra metadata attached, the sizes differ, yet some people would still call them duplicates.

Of course, as has already been pointed out, there are existing programs that can no doubt deal with all these sorts of intricacies in a performant way. No need to waste time re-writing bugs that have already been discovered and fixed.

 
Old 10-30-2020, 08:00 AM   #8
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,120

Rep: Reputation: 4120
Indeed.
 
Old 10-31-2020, 01:52 AM   #9
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,800

Rep: Reputation: 550
Quote:
Originally Posted by Guttorm
- When there are duplicate files, should it move or delete them? And then which of them? A random one? And where to? Moving them to a folder called duplicates inside the folder being checked can't be a good solution.
I have a dupefile script (originally Perl, but I'm migrating it to Python) that merely chucks everything into a database: one table for the checksum and path of the first file "owning" a particular checksum, and a second table that holds the checksum and path of each duplicate. There's a "unique" constraint on the checksum in the first table; violating that constraint means we've seen the checksum before, so the current checksum and path get inserted into the second table instead.

Rather than deleting the duplicates, which I figure would confuse the heck out of whatever application might be looking for them, I go through the second table and create symbolic links pointing to the file in the first table that has the same checksum. I unleash this periodically on whole directory trees as I clean up a gazillion files that have accumulated over many years and been merged from multiple systems. If I tried keeping the checksum list in memory (as the example script seems to do) my system would be thrashing in short order.

You can do most of this "by hand" as well: collect all the checksums/paths, extract the checksums ("cat cksum.lis | cut -f1 -d' '"), sort them and keep the ones corresponding to duplicate files, i.e. those with a count >= 2 ("sort cksums | uniq -c | grep -v ' 1 ' | awk '{print $2}'"), then use that checksum list to find all the files that correspond to each checksum. Keep the first one and move the others to your duplicates directory. Or delete them. Or create the symbolic links as I've been doing. The scripts to massage the checksum list and deal with the duplicates are not terribly complex.
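A rough bash rendering of those steps for the non-recursive case (assumes filenames without newlines; the final listing still needs the keep-one/move-the-rest pass described above):

Code:
sha512sum ./* 2>/dev/null > cksum.lis        # checksum + path for each file here

# Checksums that occur more than once
cut -f1 -d' ' cksum.lis | sort | uniq -d > dupe-cksums.lis

# Every file that belongs to a duplicated checksum
grep -F -f dupe-cksums.lis cksum.lis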

Good luck...
 
  


