Finding duplicate files
I managed to restore about 5000 photos from my corrupted hard drive using photorec. This is great news! However, I've ended up with multiple copies of the same photos, sometimes 5 or 6 copies.
I would like to know if there is a command I can use to locate the duplicate copies and then perform some action, such as moving only one of the copies to another location? |
fdupes will find identical files. findimagedupes will find images which are similar.
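For example, a typical fdupes run might look like this (the ~/recovered path is just a placeholder for wherever the restored photos ended up):
Code:
fdupes -r ~/recovered      # list each set of byte-identical files
fdupes -rd ~/recovered     # same, but interactively choose which copy to keep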
|
Or you could do it manually...
Code:
tmp=$(mktemp)
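# A minimal sketch of one way the manual approach might continue (a guess, not
# the poster's original script). It assumes md5sum is installed and that the
# recovered photos live under a hypothetical ~/recovered directory.
find ~/recovered -type f -exec md5sum {} + | sort > "$tmp"
# Any line whose checksum has already appeared is an extra copy of some photo
awk 'seen[$1]++ { $1=""; sub(/^ +/, ""); print }' "$tmp"
rm -f "$tmp"
|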
You can give each file a practically unique ID by computing a hash of it. Most machines have md5 or md5sum on them:
Test:
Code:
~> md5 < myfile
308005f025274f70b92d26cffd0d4185
~> md5sum < myfile
308005f025274f70b92d26cffd0d4185  -
The chance of two photos having the same md5 is quite astronomically small, so you can assume that two photos are the same iff their md5 sums are the same. The rest is just a scripting exercise. This will take a wee while to run on 5000 photos, but you can speed it up if you refine it a bit. Or you could just resign yourself to leaving your machine running overnight.
Code:
for photo in $(find /the/place/I/put/them)
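# A minimal sketch of how the loop body might look (assuming md5sum is
# available; seen.md5 and the existing dupes/ directory are just placeholders):
do
    sum=$(md5sum "$photo" | awk '{print $1}')
    if grep -q "$sum" seen.md5 2>/dev/null; then
        mv "$photo" dupes/            # checksum already seen: move this copy aside
    else
        echo "$sum" >> seen.md5       # first copy with this checksum: keep it
    fi
done
|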
Thanks for the reply, guys! I'll give each a try.
|
Ok, I'm definitely making forward progress on this. I've md5'd all of my images into a list and have sorted it.
Now I'm trying to figure out how the -k option works, or rather what its function is. If I used sort -u -k1,32, would sort use only the first 32 characters of each line to determine uniqueness?
Edit: Ok, I think I figured it out. -kx,y takes a start and an end position, where each position is a field number (as separated by blanks), optionally followed by a character offset within that field (F.C), so -k1,1 compares on just the first field. Right?
P.S. It only took my computer about 1 minute to md5 all 2.3G of images. :)
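A minimal sketch of that step, assuming the list was built with md5sum (so the checksum is the first field, 32 characters wide) and that sums.txt is just a placeholder name:
Code:
find . -type f -exec md5sum {} + | sort > sums.txt   # hash every file, sorted by checksum
uniq -w32 -d sums.txt                                # show one line per duplicated checksum
sort -u -k1,1 sums.txt > unique.txt                  # keep a single entry per checksum
|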
Yay! I got it all working! Thanks!
|