LinuxQuestions.org
Old 08-17-2010, 09:18 AM   #1
rjo98
Senior Member
 
Registered: Jun 2009
Location: US
Distribution: RHEL, CentOS
Posts: 1,668

Rep: Reputation: 46
How to find duplicate files and delete all except most recent version


I have a directory containing a ton of photos, some of which are duplicates with different names. Is there any way in Linux to find all the duplicates and remove all of them except the most recent version? I know on Windows there are utilities that will do this through a GUI, but I'm using Linux through the CLI only.
 
Old 08-17-2010, 09:53 AM   #2
kilgoretrout
Senior Member
 
Registered: Oct 2003
Posts: 2,474

Rep: Reputation: 189
There are some bash scripts out there that do that. Here's one that I found ages ago called dupimage:

Code:
#!/bin/sh

CWD=`pwd`
SORTING=/tmp/sorting
OUTPUT=/tmp/filesfound
DELETE=/tmp/delete
DUPLICATE_DIR=~/duplicates
COUNT=0
F_COUNT=0
DIR_COUNT=0
EXT_COUNT=1

##################################################################
	if [ ! -d $DUPLICATE_DIR ]; then
		mkdir -p $DUPLICATE_DIR
	fi

##################################################################
# remove any previous output files
	rm -rf $OUTPUT
	rm -rf $SORTING
	rm -rf $DELETE

##################################################################
echo
echo "Duplicate Image Finder"
echo
echo "Press enter for current directory"
echo "Or enter directory path to scan: "
read ANSWER
if [ -z "$ANSWER" ]; then	# -z is portable; == inside [ ] is a bashism under #!/bin/sh
	ANSWER="$CWD"
fi

##################################################################

# find images
find $ANSWER -type f -name '*.[Jj][Pp][Gg]' >> $SORTING

IMAGES_TO_FIND=`cat $SORTING`
	for x in $IMAGES_TO_FIND; do 	# generate a md5sum value and sort each file found and add it to the output file
		COUNT=$(($COUNT + 1 ))
		MD5SUM=`md5sum $x | awk '{print $1}'`
		echo $MD5SUM $x >> $OUTPUT
	done

##################################################################

# find duplicates in output file
cat $OUTPUT | sort | uniq -w 32 -d --all-repeated=separate | sed -n '/^$/{p;h;};/./{x;/./p;}' | awk '{print $2}' >> $DELETE

FILES_TO_DELETE=`cat $DELETE`
	for FILE in $FILES_TO_DELETE; do
		NAME=`basename $FILE`
		F_COUNT=$(($F_COUNT + 1 ))
			if [ ! -e $DUPLICATE_DIR/$NAME ]; then # check to see if the file name exists in the duplicate directory before trying to move
				mv $FILE $DUPLICATE_DIR
			else
				# if file exists strip the file extension so we can rename the file with a -1 to the end
				ORG_NAME=`basename $FILE | cut -d "." -f 1` # get the name and strip off the file extension
				FILE_EXT=`basename $FILE | cut -d "." -f 2` # get the file extension type
				NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
					while [ -e $DUPLICATE_DIR/$NEW_NAME ]; do
						EXT_COUNT=$(($EXT_COUNT + 1 ))
						NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
					done
				mv $FILE $DUPLICATE_DIR/$NEW_NAME
			fi
	done

##################################################################
# remove empty directories if they exist
EMPTY_DIR=`find $ANSWER -depth -type d -empty`
	for EMPTY in $EMPTY_DIR; do
		DIR_COUNT=$(($DIR_COUNT + 1 ))
		rm -rf $EMPTY
	done

echo "Number of Files Checked: $COUNT"
echo "Number of duplicate files deleted/moved: $F_COUNT"
echo "Number of empty directories deleted: $DIR_COUNT "

##################################################################
Edit: Giving credit where credit is due, this is where I found this script along with a discussion:

http://www.linuxquestions.org/questi...-files-519144/

Last edited by kilgoretrout; 08-17-2010 at 10:08 AM.
 
Old 08-17-2010, 09:57 AM   #3
rjo98
Senior Member
 
Registered: Jun 2009
Location: US
Distribution: RHEL, CentOS
Posts: 1,668

Original Poster
Rep: Reputation: 46
Thanks for the script. I have a question though, where in the script does it tell it to keep only the most current version of each duplicate?
 
Old 08-17-2010, 10:20 AM   #4
kilgoretrout
Senior Member
 
Registered: Oct 2003
Posts: 2,474

Rep: Reputation: 189
It doesn't, as far as I can see. It just compares md5sums of all files and removes all files with identical md5sums except for one of them. Actually, it doesn't delete them; it just moves them to a duplicates directory (~/duplicates) that it creates in your home directory. If they have identical md5sums they should be identical files, i.e. the same image, so the time stamp shouldn't matter.
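
For the original question (keep only the most recent copy), one possible approach is to sort each group of identical md5sums by modification time and move everything except the newest file. This is only a minimal sketch and is not from the linked thread: `keep_newest` and its two arguments are made-up names, it assumes GNU find, sort, md5sum and mv, and it does not cope with file names that contain newlines or end in spaces.

```shell
# keep_newest SCAN_DIR DUP_DIR  (hypothetical helper, for illustration only)
# Within each group of files with the same md5sum, leave the newest file in
# place and move the older copies to DUP_DIR for inspection.
# DUP_DIR should live outside SCAN_DIR so moved files are not scanned again.
keep_newest() {
    scan_dir=$1
    dup_dir=$2
    prev=''
    mkdir -p "$dup_dir"
    # GNU find prints "mtime path" for every JPEG (case-insensitive match)
    find "$scan_dir" -type f -iname '*.jpg' -printf '%T@ %p\n' |
    while read -r mtime file; do
        # prepend the checksum so the lines can be grouped by content
        sum=$(md5sum < "$file" | cut -d ' ' -f 1)
        printf '%s %s %s\n' "$sum" "$mtime" "$file"
    done |
    # group by checksum, newest modification time first within each group
    LC_ALL=C sort -k1,1 -k2,2nr |
    while read -r sum mtime file; do
        if [ "$sum" = "$prev" ]; then
            # same content as the newer file above it: move it aside;
            # --backup=numbered avoids overwriting a same-named file
            mv --backup=numbered -- "$file" "$dup_dir"/
        fi
        prev=$sum
    done
}
```

It could be called as `keep_newest /path/to/photos ~/duplicates`. Like the script above, it moves files rather than deleting them, so the duplicates directory can be checked by hand before anything is removed.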

Just rereading the thread that I linked in the Edit portion of my prior post, fotoguy improved his script from the version I posted. Here's the revised script:

Code:
#!/bin/bash

CWD=`pwd`
FILESFOUND=/tmp/filesfound.txt
FILESSIZE=/tmp/filessize.txt
DUPLICATE_SETS_FOUND=/tmp/duplicate_sets_found.txt
DUPLICATES_TO_DELETE=/tmp/duplicates_to_delete.txt
DUPLICATE_DIR=~/duplicates
COUNT=0
F_COUNT=0
DIR_COUNT=0
EXT_COUNT=1

##################################################################
if [ ! -d $DUPLICATE_DIR ]; then
	mkdir -p $DUPLICATE_DIR
fi

##################################################################
# remove any previous output files
rm -rf $FILESSIZE
rm -rf $FILESFOUND
rm -rf $DUPLICATE_SETS_FOUND
rm -rf $DUPLICATES_TO_DELETE

##################################################################
echo
echo "Duplicate Image Finder"
echo
echo "Press enter for current directory"
echo "Or enter directory path to scan: "
read ANSWER
if [ "$ANSWER" == "" ]; then
	ANSWER="$CWD"
fi

##################################################################
# replace spaces in directory names with underscores (${1// /_} needs bash, so call bash explicitly)
find $ANSWER -type d -iname '* *' -exec bash -c 'mv "$1" "${1// /_}"' -- {} \; 2> /dev/null
# replace spaces in file names with underscores
find $ANSWER -type f -iname '* *' -exec bash -c 'mv "$1" "${1// /_}"' -- {} \; 2> /dev/null

##################################################################
# find images
for x in `find $ANSWER -type f -name "*.[Jj][Pp][Gg]"`; do
	COUNT=$(($COUNT + 1 ))
	# GNU stat prints size and name directly; the column positions in ls -l output vary with the date format
	stat -c '%s %n' "$x" >> $FILESFOUND
	stat -c '%s' "$x" >> $FILESSIZE
done

# if no images files are found just exit script
if [ ! -e $FILESFOUND ] || [ ! -e $FILESSIZE ]; then
	echo "No image files found..........exiting"
	exit
fi

# find duplicate sets and remove one entry so as not to remove the original with subsequent duplicates
cat $FILESSIZE | sort | uniq -w 32 -d --all-repeated=separate | uniq > $DUPLICATE_SETS_FOUND
 for f in `cat $DUPLICATE_SETS_FOUND`; do
	# anchor the match so that size 123 does not also match 1234
	grep "^$f " "$FILESFOUND" | awk 'a ~ $1; {a=$1}' | awk '{print $2}' >> $DUPLICATES_TO_DELETE
done

#	if no duplicates are found exit script
if [ ! -e $DUPLICATES_TO_DELETE ]; then
	echo "Number of files scanned: $COUNT"
	echo "No duplicate files found"
	exit
fi

# instead of deleting move to the duplicate directory for inspection, just have to delete manually
for FILE in `cat $DUPLICATES_TO_DELETE`; do
	NAME=`basename $FILE`
	F_COUNT=$(($F_COUNT + 1 ))
			if [ ! -e $DUPLICATE_DIR/$NAME ]; then # check to see if the file name exists in the duplicate directory before trying to move
				mv $FILE $DUPLICATE_DIR
			else
				# if file exists strip the file extension so we can rename the file with a -1 to the end
				ORG_NAME=`basename $FILE | cut -d "." -f 1` # get the name and strip off the file extension
				FILE_EXT=`basename $FILE | cut -d "." -f 2` # get the file extension type
				NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
					while [ -e $DUPLICATE_DIR/$NEW_NAME ]; do
						EXT_COUNT=$(($EXT_COUNT + 1 ))
						NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
					done
				mv $FILE $DUPLICATE_DIR/$NEW_NAME
			fi
done

##################################################################
# remove empty directories if they exist
 EMPTY_DIR=`find $ANSWER -depth -type d -empty`
 	for EMPTY in $EMPTY_DIR; do
 		DIR_COUNT=$(($DIR_COUNT + 1 ))
 		rm -rf $EMPTY
 	done

echo "Number of Files Checked: $COUNT"
echo "Number of duplicate files deleted/moved: $F_COUNT"
echo "Number of empty directories deleted: $DIR_COUNT "

##################################################################
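
Note that the revised script, as posted, groups files by size only and never runs md5sum, so two different images that happen to have the same size would be treated as duplicates. One way around that, sketched here under some assumptions, is to use size as a cheap first filter and then confirm with a checksum. `list_confirmed_dups` is a made-up name, GNU find and md5sum are assumed, and file names containing newlines are not handled. It reads names whole, so spaces do not need to be renamed away first.

```shell
# list_confirmed_dups DIR  (hypothetical helper, for illustration only)
# Print every JPEG under DIR whose size AND md5sum match another file
# that sorts before it; the first file of each group is not printed.
list_confirmed_dups() {
    # step 1: "size path" for every JPEG
    find "$1" -type f -iname '*.jpg' -printf '%s %p\n' |
    # step 2: keep only paths whose size occurs more than once
    awk '{ size = $1; sub(/^[0-9]+ /, "")
           count[size]++; path[size, count[size]] = $0 }
         END { for (s in count)
                   if (count[s] > 1)
                       for (i = 1; i <= count[s]; i++) print path[s, i] }' |
    # step 3: checksum only those candidates
    while IFS= read -r file; do
        printf '%s %s\n' "$(md5sum < "$file" | cut -d ' ' -f 1)" "$file"
    done |
    # step 4: identical checksums become adjacent; print all but the first
    LC_ALL=C sort |
    awk '{ sum = $1; sub(/^[0-9a-f]+ /, "")
           if (sum == prev) print
           prev = sum }'
}
```

The output is one path per line, which could be reviewed first and then fed to a `while IFS= read -r` loop that moves the files, instead of deleting anything straight away.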

Last edited by kilgoretrout; 08-17-2010 at 10:39 AM.
 
Old 08-17-2010, 11:13 AM   #5
rjo98
Senior Member
 
Registered: Jun 2009
Location: US
Distribution: RHEL, CentOS
Posts: 1,668

Original Poster
Rep: Reputation: 46
Thanks
 
  


All times are GMT -5.
