Old 10-10-2007, 09:09 PM   #1
SlowCoder
Senior Member
 
Registered: Oct 2004
Location: Southeast, U.S.A.
Distribution: Debian based
Posts: 1,250

Rep: Reputation: 164
Finding duplicate files


I managed to restore about 5000 photos from my corrupted hard drive using photorec. This is great news! However, I've ended up with multiple copies of the same photos, sometimes 5 or 6 copies.

I would like to know if there is a command that I can use to locate duplicate copies, then perform some function, such as move only one of the copies to another location?
 
Old 10-10-2007, 10:01 PM   #2
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 65
fdupes will find identical files. findimagedupes will find images which are similar.
 
Old 10-10-2007, 10:11 PM   #3
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 65
Or you could do it manually...

Code:
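tmp=$(mktemp)
# NUL-separated names keep paths with spaces or quotes intact
find . -type f -print0 | xargs -0 md5sum > "$tmp"
# For each checksum that occurs more than once, print the matching files
awk '{ print $1 }' "$tmp" | sort | uniq -d | while read -r f; do
    grep "^$f" "$tmp"
    echo ""
done
rm -f "$tmp"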
...which will list identical files in groups. You can then do what you want with the list.
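One thing you might do with such a list is keep a single copy per checksum. Here is a rough sketch, not a polished tool: copy_one_per_hash is a made-up helper name, it assumes the listing is plain md5sum output (32 hex characters, two separator characters, then the path), and it assumes paths contain no newlines or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy one representative of each distinct checksum.
# $1 = file containing md5sum output, $2 = destination directory.
copy_one_per_hash() {
    local listing=$1 dest=$2 line
    mkdir -p "$dest"
    # awk passes through only the first line seen for each checksum (field 1).
    awk '!seen[$1]++' "$listing" |
    while IFS= read -r line; do
        # Offsets 0-31 hold the hash, 32-33 the separator; the path starts at 34.
        cp "${line:34}" "$dest"/
    done
}
```

Called as copy_one_per_hash "$tmp" /some/destination. Files from different directories that share a basename would overwrite each other in the destination, so this only suits names that are already distinct.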
 
Old 10-10-2007, 10:26 PM   #4
Tischbein
Member
 
Registered: Oct 2006
Distribution: debian
Posts: 124

Rep: Reputation: 15
You can give each file a practically unique ID by computing a hash function on it. Most machines have md5 or md5sum on them:

Test:
~> md5 < myfile
308005f025274f70b92d26cffd0d4185
~> md5sum < myfile
308005f025274f70b92d26cffd0d4185 -

The chance of two photos having the same md5 is quite astronomically small, so you can assume that two photos are the same iff their md5 sums are the same. The rest is just a scripting exercise. This will take a wee while to run on 5000 photos but you can speed it up if you refine it a bit. Or you could just resign yourself to leaving your machine running overnight.
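If you would rather not rely on the checksum alone, a byte-for-byte comparison with cmp can confirm a match. A minimal sketch, assuming md5sum and cmp are installed; same_content is a made-up name for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if both files have the same MD5 sum
# AND the same bytes. Returns 2 if a file cannot be read.
same_content() {
    local a b
    a=$(md5sum < "$1") || return 2
    b=$(md5sum < "$2") || return 2
    # Compare the 32-character hashes first, then confirm with cmp.
    [ "${a:0:32}" = "${b:0:32}" ] && cmp -s "$1" "$2"
}
```

For example: same_content photo1.jpg photo2.jpg && echo "identical".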

Code:
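# Hash every regular file; NUL-separated names survive spaces in paths
find /the/place/I/put/them -type f -print0 |
while IFS= read -r -d '' photo
do
    unique_id="$(md5sum < "$photo")"    # or md5 on BSD/macOS
    echo "${unique_id:0:32} $photo"
done > database    # one redirect for the whole loop, so nothing is overwritten

# Keep one line per distinct checksum (field 1)
sort -u -k1,1 database > a_list_of_unique_photos

# Characters 0-31 are the checksum, 32 is the space; the path starts at 33
while IFS= read -r line
do
   photo="${line:33}"
   cp "$photo" /a/directory/of/my/choice
done < a_list_of_unique_photos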
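Speedups would include e.g. reading only the first 200 bytes from each file using dd. Two different photos can share their first 200 bytes, so treat a match on the prefix as a candidate and confirm it with a full checksum. Depending on what version of md5 you have you can simplify the code as well. Good luck!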
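The dd idea might look something like this. It is only a sketch: partial_md5 is a made-up name, the 200-byte prefix is just the figure mentioned above, and md5sum is assumed to be available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the MD5 of the first 200 bytes of a file.
partial_md5() {
    local sum
    # dd reads a single 200-byte block; its status messages are discarded.
    sum=$(dd if="$1" bs=200 count=1 2>/dev/null | md5sum) || return 1
    # Strip the trailing " -" that md5sum prints when reading from stdin.
    printf '%s\n' "${sum:0:32}"
}
```

Files whose partial checksums differ cannot be identical, so only files that agree here need the full md5sum pass.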
 
Old 10-11-2007, 07:32 AM   #5
SlowCoder
Senior Member
 
Registered: Oct 2004
Location: Southeast, U.S.A.
Distribution: Debian based
Posts: 1,250

Original Poster
Rep: Reputation: 164
Thanks for the replies, guys! I'll give each a try.
 
Old 10-11-2007, 01:36 PM   #6
SlowCoder
Senior Member
 
Registered: Oct 2004
Location: Southeast, U.S.A.
Distribution: Debian based
Posts: 1,250

Original Poster
Rep: Reputation: 164
Ok, I'm definitely making forward progress on this. I've md5'd all of my images into a list, and have sorted.

Now I'm trying to figure out how the -k operation works, or rather what its function is. If I used sort -u -k1,32 would sort use only the first 32 characters of each line to determine uniqueness?

Edit: Ok, I think I figured it out. -kx,y ... x is the field where the key starts (fields are separated by whitespace) and y is the field where it ends, so -k1,1 compares the first field only. Right?
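A quick experiment helps check how -k behaves. In sort's documentation the option is -k POS1[,POS2], where each position is a field number optionally followed by .C for a character offset within that field, so -k1,32 would mean fields 1 through 32, while a character range would be written like -k1.1,1.32. The sketch below uses made-up sample lines, and unique_by_first_field is just an illustrative name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep one line per distinct first field.
unique_by_first_field() { sort -u -k1,1 "$1"; }

demo=$(mktemp)
printf '%s\n' 'aaa second.jpg' 'aaa first.jpg' 'bbb third.jpg' > "$demo"

unique_by_first_field "$demo"   # two lines: one for "aaa", one for "bbb"
sort -u -k1 "$demo"             # three lines: the key runs to end of line
rm -f "$demo"
```

With the end position left off (-k1), the key extends to the end of the line, so the file names take part in the comparison and no lines are dropped.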

P.S. It only took my computer about 1 minute to md5 all 2.3G of images.

Last edited by SlowCoder; 10-11-2007 at 03:10 PM.
 
Old 10-12-2007, 09:25 AM   #7
SlowCoder
Senior Member
 
Registered: Oct 2004
Location: Southeast, U.S.A.
Distribution: Debian based
Posts: 1,250

Original Poster
Rep: Reputation: 164
Yay! I got it all working! Thanks!
 
  

