Quote:
Originally Posted by cesarsj
The command below saves the list of duplicate files to a file.
find . -type f -exec md5sum '{}' ';' | sort | uniq --all-repeated=separate -w 20 > /home/cesarmsj/duplicate_files.txt
Now, I would like to remove duplicate files found, how could I do this ??
I've done something like this on my systems. The exception is that I'm too paranoid about deleting the duplicates and breaking something that relies on that particular file being in that particular directory, so instead of deleting them I'm replacing them with symbolic links back to the first copy encountered.
I'm using a Perl script to do it (way too big to post here, as it involves stuffing checksums/filepaths into a Pg database and using SQL to pull out the information I need; I run this against several multi-TB disks and the database speeds things up considerably), but from the command line you could take the results of your checksum gathering and save them into a file, say "file.checksums". Sort it if you like. Then extract all the checksums from that file, sort them, run them through uniq to get the number of occurrences, and save that result into another file ("checksums.count"). Scan through that list looking for any checksum that occurs more than once, then grep "file.checksums" for all the records containing that checksum. That's the list of "original + duplicates" you need to work with. In my script, I select the first occurrence as the "master" file; all of the others are files that I'll delete and replace with a symbolic link pointing to the "master". Then just continue looping through the "checksums.count" list.
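Stitched together as a bash sketch, it comes out something like this (assuming "file.checksums" holds standard md5sum output, a 32-character checksum plus two separator characters followed by the path, and that no path contains a newline; this is the shape of the thing, not my actual script):
Code:
#!/bin/bash
# Count how many times each checksum appears.
cut -c1-32 file.checksums | sort | uniq -c > checksums.count

# Any count above 1 means duplicates exist for that checksum.
awk '$1 > 1 { print $2 }' checksums.count |
while read -r sum; do
    master=""
    # Every record with this checksum; the file name starts at column 35.
    grep "^${sum}" file.checksums | cut -c35- |
    while read -r path; do
        if [ -z "${master}" ]; then
            # First occurrence becomes the "master" copy.
            master=$(readlink -f -- "${path}")
        else
            # Replace the duplicate with a symlink to the master.
            ln -sf "${master}" "${path}"
        fi
    done
done
Note the readlink -f: the symlink target has to be an absolute path, or a link created in some other directory will point at nothing.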
It's not a one-liner so some fun with scripting is involved.
While you're writing this, though, I'd include a provision to build up the commands that are going to touch the files and merely display them when some variable, say "DRYRUN", is set to "true". Closely examine the commands being generated and make sure they're not doing something unintended before turning them loose on your filesystem (DRYRUN = false). I.e.,
Code:
CMD=" ... "
if [ "${DRYRUN}" = "true" ]; then   # test the value; a bare [ ${DRYRUN} ] is true even when it holds "false"
    echo "${CMD}"
else
    eval "${CMD}"                   # eval re-parses the string so quoted file names stay intact
fi
I've used a bash function to provide this flexibility.
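For example (a sketch of the idea, not my actual function):
Code:
# Execute a command, or just display it, depending on ${DRYRUN}.
# Passing the command as separate arguments ("$@") avoids the
# word-splitting grief you get when you build it up in a string.
run() {
    if [ "${DRYRUN}" = "true" ]; then
        echo "$@"
    else
        "$@"
    fi
}

DRYRUN=true
run rm "some file with spaces"   # just prints the command instead of running it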
When doing the dry run, pay attention to what happens when you encounter files with spaces in the names. ($DEITY, how I hate 'em.)
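If you want to be extra careful about those, keep everything NUL-delimited so odd file names never get split:
Code:
# -print0 / read -d '' carry the exact file name through the pipe,
# spaces and all.
find . -type f -print0 |
while IFS= read -r -d '' f; do
    md5sum "$f"
done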
Quote:
I would also like to know the total in MBytes of duplicate files found
The simplest way to do this would be, IMHO, to issue a command that reports each file's size just before you delete it, and append the results to a file. Obviously, you'll want to ensure that this file is empty beforehand. You can process its contents separately once the file removals are complete. You might also consider setting up your script to accumulate this data even when DRYRUN is set, so you have an idea of how much space you'll recover before even doing the removals.
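Something along these lines would do it ("sizes.txt" is just an example name; stat -c is the GNU form):
Code:
# Before each removal, record the file's size in bytes.
stat -c '%s' "${path}" >> sizes.txt

# When the dust settles, total it up in MBytes.
awk '{ total += $1 } END { printf "%.1f MB\n", total / 1048576 }' sizes.txt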
Good luck. And remember: Backups are your friend.
When you're all done, you'll want to ask yourself the same question I do:
How the heck did I get all these duplicate files in the first place?