LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - General (https://www.linuxquestions.org/questions/linux-general-1/)
-   -   Finding duplicate files (https://www.linuxquestions.org/questions/linux-general-1/finding-duplicate-files-590920/)

SlowCoder 10-10-2007 08:09 PM

Finding duplicate files
 
I managed to restore about 5000 photos from my corrupted hard drive using photorec. This is great news! However, I've ended up with multiple copies of the same photos, sometimes 5 or 6 copies.

I would like to know if there is a command I can use to locate the duplicate copies and then act on them, such as moving just one copy of each to another location?

matthewg42 10-10-2007 09:01 PM

fdupes will find identical files. findimagedupes will find images which are similar.
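A sketch of how fdupes might be invoked (assuming it is installed; check your package manager):

```shell
# List duplicate groups recursively under the photo directory
fdupes -r /path/to/photos

# Interactively choose one copy per group to keep, deleting the rest
fdupes -rd /path/to/photos
```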

matthewg42 10-10-2007 09:11 PM

Or you could do it manually...

Code:

tmp=$(mktemp)
find . -type f -print0 | xargs -0 md5sum > "$tmp"
awk '{ print $1 }' "$tmp" | sort | uniq -d | while read -r f; do
    grep "^$f" "$tmp"
    echo ""
done
rm -f "$tmp"

...which will list identical files in groups. You can then do what you want with the list.
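For example, to move every duplicate after the first copy into a holding directory, one sketch (assuming GNU tools and filenames without embedded newlines; paths are placeholders):

```shell
# Move all but one copy of each duplicate group into ./dupes/
mkdir -p dupes
find . -path ./dupes -prune -o -type f -print0 | xargs -0 md5sum |
    sort |
    awk 'seen[$1]++ { print }' |   # every line after the first with a given hash
    cut -c35- |                    # strip the 32-char hash and two-space separator
    while IFS= read -r f; do
        mv "$f" dupes/
    done
```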

Tischbein 10-10-2007 09:26 PM

You can give each file a practically unique ID by computing a hash function over it. Most machines have md5 or md5sum on them:

Test:
~> md5 < myfile
308005f025274f70b92d26cffd0d4185
~> md5sum < myfile
308005f025274f70b92d26cffd0d4185 -

The chance of two different photos having the same md5 is astronomically small, so you can assume that two photos are the same iff their md5 sums are the same. The rest is just a scripting exercise. This will take a wee while to run on 5000 photos, but you can speed it up if you refine it a bit. Or you could just resign yourself to leaving your machine running overnight.
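To put a rough number on "astronomically small": by the birthday bound, the chance of any collision among n random 128-bit values is about n(n-1)/2 divided by 2^128. For 5000 photos:

```shell
# Birthday-bound collision estimate for 5000 files and 128-bit hashes
awk 'BEGIN { n = 5000; printf "%.3g\n", n * (n - 1) / 2 / 2^128 }'
# prints 3.67e-32
```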

Code:

: > database
find /the/place/I/put/them -type f | while IFS= read -r photo
do
    unique_id="$(md5sum < "$photo")"
    echo "${unique_id:0:32} $photo" >> database
done


sort -u -k1,1 database > a_list_of_unique_photos


while IFS= read -r line
do
  photo="${line:33}"
  cp "$photo" /a/directory/of/my/choice
done < a_list_of_unique_photos

Speedups would include e.g. reading only the first 200 bytes from each file using dd. Depending on what version of md5 you have, you can simplify the code as well. Good luck!
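A sketch of that prefilter idea (using head rather than dd; the path is a placeholder): files whose first 200 bytes hash differently cannot be identical, so only the groups that match here need a full md5sum pass.

```shell
# Hash only the first 200 bytes of each file as a cheap prefilter
find /the/place/I/put/them -type f | while IFS= read -r photo
do
    prefix_id="$(head -c 200 "$photo" | md5sum)"
    echo "${prefix_id:0:32} $photo"
done > prefix_database
```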

SlowCoder 10-11-2007 06:32 AM

Thanks for the replies, guys! I'll give each a try.

SlowCoder 10-11-2007 12:36 PM

Ok, I'm definitely making forward progress on this. I've md5'd all of my images into a list and have sorted it.

Now I'm trying to figure out how the -k option works, or rather what its function is. If I used sort -u -k1,32, would sort use only the first 32 characters of each line to determine uniqueness?

Edit: Ok, I think I figured it out. -kx,y means sort on fields x through y (fields separated by blanks), so -k1,1 keys on the first field only. Character positions within a field use the x.c form, so the first 32 characters of the first field would be -k1.1,1.32. Right?
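A quick sketch that shows the field semantics (filenames here are made up):

```shell
# -u with -k1,1 keeps one line per distinct first field, so the two
# lines sharing the hash "aaaa" collapse to one.
printf '%s\n' 'aaaa photo2.jpg' 'aaaa photo1.jpg' 'bbbb photo3.jpg' |
    sort -u -k1,1
```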

P.S. It only took my computer about 1 minute to md5 all 2.3G of images. :)

SlowCoder 10-12-2007 08:25 AM

Yay! I got it all working! Thanks!


All times are GMT -5.