Quote:
Originally Posted by cesarsj
The command below saves the list of duplicate files to a file.
find . -type f -exec md5sum '{}' ';' | sort | uniq --all-repeated=separate -w 20 > /home/cesarmsj/duplicate_files.txt
Now, I would like to remove duplicate files found, how could I do this ??
I've done something like this on my systems. The exception is that I'm too paranoid about deleting the duplicates and breaking something that relies on that particular file being in that particular directory, so instead of deleting them I'm replacing them with symbolic links back to the first copy encountered.
I'm using a Perl script to do it (way too big to post here, as it involves stuffing checksums/filepaths into a Pg database and using SQL to pull out the information I need; I run this against several multi-TB disks and the database speeds things up considerably), but from the command line you could take the results of your checksum gathering and save them into a file, say "file.checksums". Sort it if you like. Then extract all the checksums from that file, sort them, run them through uniq to get the number of occurrences, and save that result into another file ("checksums.count"). Scan through that list looking for any checksum that occurs more than once, then grep "file.checksums" for all the records containing that checksum. That's the list of "original + duplicates" you need to work with. In my script, I select the first occurrence as the "master" file; all of the others are files that I'll delete and replace with a symbolic link pointing to the "master". Then just continue looping through the "checksums.count" list.
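Stitched together as a bash sketch, it comes out something like this (assuming "file.checksums" holds standard md5sum output, a 32-character checksum plus two separator characters followed by the path, and that no path contains a newline; this is the shape of the thing, not my actual script):
Code:
#!/bin/bash
# Count how many times each checksum appears.
cut -c1-32 file.checksums | sort | uniq -c > checksums.count

# Any count above 1 means duplicates exist for that checksum.
awk '$1 > 1 { print $2 }' checksums.count |
while read -r sum; do
    master=""
    # Every record with this checksum; the file name starts at column 35.
    grep "^${sum}" file.checksums | cut -c35- |
    while read -r path; do
        if [ -z "${master}" ]; then
            # First occurrence becomes the "master" copy.
            master=$(readlink -f -- "${path}")
        else
            # Replace the duplicate with a symlink to the master.
            ln -sf "${master}" "${path}"
        fi
    done
done
Note the readlink -f: the symlink target has to be an absolute path, or a link created in some other directory will point at nothing.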
It's not a one-liner so some fun with scripting is involved.
While you're writing this, though, I'd include a provision to build up the commands that are going to touch the files and merely display them when some variable, say "DRYRUN", is set to "true". Closely examine the commands being generated and make sure they're not doing something unintended before turning them loose on your filesystem (DRYRUN = false). I.e.,
Code:
CMD=" ... "
if [ "${DRYRUN}" = "true" ]; then   # test the value; a bare [ ${DRYRUN} ] is true even when it holds "false"
    echo "${CMD}"
else
    eval "${CMD}"                   # eval re-parses the string so quoted file names stay intact
fi
I've used a bash function to provide this flexibility.
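For example (a sketch of the idea, not my actual function):
Code:
# Execute a command, or just display it, depending on ${DRYRUN}.
# Passing the command as separate arguments ("$@") avoids the
# word-splitting grief you get when you build it up in a string.
run() {
    if [ "${DRYRUN}" = "true" ]; then
        echo "$@"
    else
        "$@"
    fi
}

DRYRUN=true
run rm "some file with spaces"   # just prints the command instead of running it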
When doing the dry run, pay attention to what happens when you encounter files with spaces in the names. ($DEITY, how I hate 'em.)
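If you want to be extra careful about those, keep everything NUL-delimited so odd file names never get split:
Code:
# -print0 / read -d '' carry the exact file name through the pipe,
# spaces and all.
find . -type f -print0 |
while IFS= read -r -d '' f; do
    md5sum "$f"
done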
Quote:
I would also like to know the total in MBytes of duplicate files found
The simplest way to do this would be, IMHO, to issue a command that reports each file's size just before you delete it, and append the results to a file. Obviously, you'll want to ensure that this file is empty beforehand. You can process its contents separately once the file removals are complete. You might also consider setting up your script to accumulate this data even when DRYRUN is set, so you have an idea of how much space you'll recover before even doing the removals.
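Something along these lines would do it ("sizes.txt" is just an example name; stat -c is the GNU form):
Code:
# Before each removal, record the file's size in bytes.
stat -c '%s' "${path}" >> sizes.txt

# When the dust settles, total it up in MBytes.
awk '{ total += $1 } END { printf "%.1f MB\n", total / 1048576 }' sizes.txt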
Good luck. And remember: Backups are your friend.
When you're all done, you'll want to ask yourself the same question I do:
How the heck did I get all these duplicate files in the first place?