LinuxQuestions.org
Old 01-14-2007, 06:26 AM   #1
fotoguy
Senior Member
 
Registered: Mar 2003
Location: Brisbane Queensland Australia
Distribution: KirraMail Live Email Server
Posts: 1,279

Rep: Reputation: 61
A bash script to find duplicate image files


I had trouble finding any information on how to write a bash script to find duplicate image files in a directory and remove them. I finally managed to put together a script that does this, so I thought others might find it useful too. It's by no means perfect and could be adapted to find other file types as well. Someone may be able to write it better, or just suggest some improvements.

Code:
#!/bin/sh

CWD=`pwd`
SORTING=/tmp/sorting
OUTPUT=/tmp/filesfound
DELETE=/tmp/delete
DUPLICATE_DIR=~/duplicates
COUNT=0
F_COUNT=0
DIR_COUNT=0
EXT_COUNT=1

##################################################################
	if [ ! -d $DUPLICATE_DIR ]; then
		mkdir -p $DUPLICATE_DIR
	fi

##################################################################
# remove any previous output files
	rm -rf $OUTPUT
	rm -rf $SORTING
	rm -rf $DELETE

##################################################################
echo
echo "Duplicate Image Finder"
echo
echo "Press enter for current directory"
echo "Or enter directory path to scan: "
read ANSWER
if [ "$ANSWER" == "" ]; then
	ANSWER="$CWD"
fi

##################################################################

# find images
find $ANSWER -type f -name '*.[Jj][Pp][Gg]' >> $SORTING

IMAGES_TO_FIND=`cat $SORTING`
	for x in $IMAGES_TO_FIND; do 	# generate a md5sum value and sort each file found and add it to the output file
		COUNT=$(($COUNT + 1 ))
		MD5SUM=`md5sum $x | awk '{print $1}'`
		echo $MD5SUM $x >> $OUTPUT
	done

##################################################################

# find duplicates in output file
cat $OUTPUT | sort | uniq -w 32 -d --all-repeated=separate | sed -n '/^$/{p;h;};/./{x;/./p;}' | awk '{print $2}' >> $DELETE

FILES_TO_DELETE=`cat $DELETE`
	for FILE in $FILES_TO_DELETE; do
		NAME=`basename $FILE`
		F_COUNT=$(($F_COUNT + 1 ))
			if [ ! -e $DUPLICATE_DIR/$NAME ]; then # check to see if the file name already exists in the duplicate directory before trying to move
				mv $FILE $DUPLICATE_DIR
			else
				# if file exists strip the file extension so we can rename the file with a -1 to the end
				ORG_NAME=`basename $FILE | cut -d "." -f 1` # get the name and strip off the file extension
				FILE_EXT=`basename $FILE | cut -d "." -f 2` # get the file extension type
				NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
					while [ -e $DUPLICATE_DIR/$NEW_NAME ]; do
						EXT_COUNT=$(($EXT_COUNT + 1 ))
						NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
					done
				mv $FILE $DUPLICATE_DIR/$NEW_NAME
			fi
	done

##################################################################
# remove empty directories if they exist
EMPTY_DIR=`find $ANSWER -depth -type d -empty`
	for EMPTY in $EMPTY_DIR; do
		DIR_COUNT=$(($DIR_COUNT + 1 ))
		rm -rf $EMPTY
	done

echo "Number of Files Checked: $COUNT"
echo "Number of duplicate files deleted/moved: $F_COUNT"
echo "Number of empty directories deleted: $DIR_COUNT "

##################################################################
 
Old 01-14-2007, 11:17 AM   #2
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: Ubuntu, FreeBSD, NetBSD
Posts: 1,449

Rep: Reputation: 47
Note: we've done this before somewhere on this forum (at least I'm pretty sure it was this one)... and we found that if you sort the files by size first... and only do the md5 for files which match sizes, you'll speed up the program by an impressive amount.

md5 calculations are very processor intensive... and if a file has a different size from all the others then you know it's not a duplicate (even if it was you wouldn't match it with an md5 sum if it was incomplete).

This could speed up the script by a factor of 10 or more.
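The size-first idea can be sketched like this. This is a rough illustration rather than a rewrite of the OP's script; it assumes GNU stat, md5sum and xargs, and filenames without spaces:

```shell
# find_dup_jpegs DIR: md5sum only the JPEGs whose size is shared by at
# least one other JPEG under DIR; prints "checksum  path" per candidate.
find_dup_jpegs() {
    sizes=$(mktemp)
    # one "size path" line per file (GNU stat)
    find "$1" -type f -name '*.[Jj][Pp][Gg]' -exec stat -c '%s %n' {} + > "$sizes"
    # sizes that occur more than once, then hash only the files having one
    awk '{print $1}' "$sizes" | sort -n | uniq -d |
        awk 'NR==FNR { dup[$1]; next } $1 in dup { print $2 }' - "$sizes" |
        xargs -r md5sum
    rm -f "$sizes"
}
```

Files with a unique size are never hashed, which is where the speed-up comes from; lines in the output that share a checksum are the actual duplicates.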
 
Old 01-14-2007, 01:27 PM   #3
jlliagre
Moderator
 
Registered: Feb 2004
Location: Outside Paris
Distribution: Solaris10, Solaris 11, Mint, OL
Posts: 9,502

Rep: Reputation: 357Reputation: 357Reputation: 357Reputation: 357
Okay, your script works quite well under Linux, but there are some areas for improvement, especially to make it portable to other environments and to handle some specific situations.

It states it uses /bin/sh, but is really using bash (or POSIX) syntax.

It uses non-portable GNU-specific extensions of "uniq", "sed" and "find".

It doesn't properly handle the situation where no jpeg files are present, nor the case where a space character is found in their names.
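On the spaces-in-filenames point, one portable pattern (POSIX sh, no GNU find extensions) is to let find hand the names straight to a child shell as positional parameters instead of word-splitting a command substitution. A sketch, assuming md5sum is available (it's from GNU coreutils; BSD systems have md5 instead):

```shell
# hash_jpegs DIR: checksum every .jpg/.JPG under DIR, spaces and all.
hash_jpegs() {
    # find passes each batch of filenames as positional parameters to sh,
    # so names containing spaces survive intact (no word splitting)
    find "$1" -type f \( -name '*.jpg' -o -name '*.JPG' \) -exec sh -c '
        for f do md5sum "$f"; done
    ' sh {} +
}
```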
 
Old 01-14-2007, 04:07 PM   #4
uselpa
Senior Member
 
Registered: Oct 2004
Location: Luxemburg
Distribution: Slackware, OS X
Posts: 1,507

Rep: Reputation: 46
I wrote something similar in Python, though it's not specifically for image files. It's a little more complex but it might give you some ideas.

The suggestion about only comparing files of the same size is a real time saver. Also, my script doesn't use MD5 because I'm paranoid (my other OS is OpenBSD :-)) and theoretically different files could yield the same hash value. It knows about soft and hard links which might come in handy.

You can get it here.
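If hash collisions are the worry, same-size candidates can also be confirmed byte-for-byte with cmp, which removes the hash from the picture entirely. A minimal sketch (a hypothetical helper, not taken from the Python script above):

```shell
# list_dupes_of FIRST FILE...: print each FILE that is byte-identical to FIRST
list_dupes_of() {
    first=$1; shift
    for f in "$@"; do
        # cmp -s exits 0 only when the two files have identical contents
        if cmp -s "$first" "$f"; then
            echo "$f"
        fi
    done
}
```

Run it on each group of same-size files; anything it prints is a true duplicate of the first file in the group.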
 
Old 01-14-2007, 07:05 PM   #5
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
Also something I've done in the past, using Perl (to find duplicates in the Open Clipart Library). Seems like a common enough problem that a really solid generic implementation, shipped by default with many distros, would be beneficial. I dunno if there's already a package for it - anyone know a good implementation?
 
Old 01-14-2007, 07:34 PM   #6
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654
You may find this article interesting if you also want to find similar images using ImageMagick: http://www.cit.gu.edu.au/~anthony/gr...gick6/compare/
 
Old 01-14-2007, 09:42 PM   #7
fotoguy
Senior Member
 
Registered: Mar 2003
Location: Brisbane Queensland Australia
Distribution: KirraMail Live Email Server
Posts: 1,279

Original Poster
Rep: Reputation: 61
Quote:
Originally Posted by frob23
Note: we've done this before somewhere on this forum (at least I'm pretty sure it was this one)... and we found that if you sort the files by size first... and only do the md5 for files which match sizes, you'll speed up the program by an impressive amount.

md5 calculations are very processor intensive... and if a file has a different size from all the others then you know it's not a duplicate (even if it was you wouldn't match it with an md5 sum if it was incomplete).

This could speed up the script by a factor of 10 or more.

Ok I would never have thought to use the file size, that definitely makes more sense, since the file sizes are already there so you don't have to do anything extra to generate them. I did notice that generating an md5 value was extremely processor intensive, so anything to reduce the workload is definitely worth implementing. Thanks, that's an awesome tip.


Quote:
Originally Posted by jlliagre
Okay, your script works quite well under Linux, but there are some areas for improvement, especially to make it portable to other environments and to handle some specific situations.

It states it uses /bin/sh, but is really using bash (or POSIX) syntax.

It uses non-portable GNU-specific extensions of "uniq", "sed" and "find".

It doesn't properly handle the situation where no jpeg files are present, nor the case where a space character is found in their names.
Yes, not very portable; unfortunately I only use Linux at the moment, so portability will take a back seat, although I was thinking of having a go at writing a little collection manager in C++ with KDevelop or Qt, in which case sed, grep and find would have to be replaced with something cross-platform compatible. I've also made a note about handling the no-files-found case and spaces in filenames, great, thanks for that.



Quote:
Originally Posted by uselpa
I wrote something similar in Python, though it's not specifically for image files. It's a little more complex but it might give you some ideas.

The suggestion about only comparing files of the same size is a real time saver. Also, my script doesn't use MD5 because I'm paranoid (my other OS is OpenBSD :-)) and theoretically different files could yield the same hash value. It knows about soft and hard links which might come in handy.

You can get it here.
I know nothing about Python but I will definitely take a look at it for some hints, thanks for that.



Well all I can say is thanks to everyone for the feedback and tips, I can go away now and improve on it some more.
 
Old 01-25-2007, 06:47 PM   #8
fotoguy
Senior Member
 
Registered: Mar 2003
Location: Brisbane Queensland Australia
Distribution: KirraMail Live Email Server
Posts: 1,279

Original Poster
Rep: Reputation: 61
Have something a little better now; not perfect, but it seems to do the job.

Code:
#!/bin/bash

CWD=`pwd`
FILESFOUND=/tmp/filesfound.txt
FILESSIZE=/tmp/filessize.txt
DUPLICATE_SETS_FOUND=/tmp/duplicate_sets_found.txt
DUPLICATES_TO_DELETE=/tmp/duplicates_to_delete.txt
DUPLICATE_DIR=~/duplicates
COUNT=0
F_COUNT=0
DIR_COUNT=0
EXT_COUNT=1

##################################################################
if [ ! -d $DUPLICATE_DIR ]; then
	mkdir -p $DUPLICATE_DIR
fi

##################################################################
# remove any previous output files
rm -rf $FILESSIZE
rm -rf $FILESFOUND
rm -rf $DUPLICATE_SETS_FOUND
rm -rf $DUPLICATES_TO_DELETE

##################################################################
echo
echo "Duplicate Image Finder"
echo
echo "Press enter for current directory"
echo "Or enter directory path to scan: "
read ANSWER
if [ "$ANSWER" == "" ]; then
	ANSWER="$CWD"
fi

##################################################################
# rename any directory name with spaces with an underscore
find $ANSWER -type d -iname '* *' -exec sh -c 'mv "$1" "${1// /_}"' -- {} \; 2> /dev/null
# rename any files name with spaces with an underscore
find $ANSWER -type f -iname '* *' -exec sh -c 'mv "$1" "${1// /_}"' -- {} \; 2> /dev/null

##################################################################
# find images
for x in `find $ANSWER -type f -name "*.[Jj][Pp][Gg]"`; do
	COUNT=$(($COUNT + 1 ))
	SIZE=`stat -c %s "$x"`	# size in bytes (GNU stat); avoids parsing ls -l columns
	echo "$SIZE $x" >> $FILESFOUND
	echo "$SIZE" >> $FILESSIZE
done

# if no images files are found just exit script
if [ ! -e $FILESFOUND ] || [ ! -e $FILESSIZE ]; then
	echo "No image files found..........exiting"
	exit
fi

# find duplicated sizes, then checksum only those candidates; keep the first
# copy of each checksum and mark the rest for removal (same size alone does
# not prove two files are identical)
sort -n $FILESSIZE | uniq -d > $DUPLICATE_SETS_FOUND
for f in `cat $DUPLICATE_SETS_FOUND`; do
	grep "^$f " $FILESFOUND | awk '{print $2}' | xargs md5sum | sort |
		awk 'a == $1 {print $2} {a=$1}' >> $DUPLICATES_TO_DELETE
done

#	if no duplicates are found exit script
if [ ! -e $DUPLICATES_TO_DELETE ]; then
	echo "Number of files scanned: $COUNT"
	echo "No duplicate files found"
	exit
fi

# instead of deleting move to the duplicate directory for inspection, just have to delete manually
for FILE in `cat $DUPLICATES_TO_DELETE`; do
	NAME=`basename $FILE`
	F_COUNT=$(($F_COUNT + 1 ))
	if [ ! -e $DUPLICATE_DIR/$NAME ]; then # check to see if the file name already exists in the duplicate directory before trying to move
		mv $FILE $DUPLICATE_DIR
	else
		# if it exists, strip the extension so we can append -1, -2, ... to the name
		ORG_NAME="${NAME%.*}"	# name without the final extension (handles dots in the name)
		FILE_EXT="${NAME##*.}"	# the final extension
		NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
		while [ -e $DUPLICATE_DIR/$NEW_NAME ]; do
			EXT_COUNT=$(($EXT_COUNT + 1 ))
			NEW_NAME="$ORG_NAME-$EXT_COUNT.$FILE_EXT"
		done
		mv $FILE $DUPLICATE_DIR/$NEW_NAME
	fi
done

##################################################################
# remove empty directories if they exist
EMPTY_DIR=`find $ANSWER -depth -type d -empty`
for EMPTY in $EMPTY_DIR; do
	DIR_COUNT=$(($DIR_COUNT + 1 ))
	rm -rf $EMPTY
done

echo "Number of Files Checked: $COUNT"
echo "Number of duplicate files deleted/moved: $F_COUNT"
echo "Number of empty directories deleted: $DIR_COUNT "

##################################################################
 
  

