LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - General (https://www.linuxquestions.org/questions/linux-general-1/)
-   -   Finding duplicate files (https://www.linuxquestions.org/questions/linux-general-1/finding-duplicate-files-590920/)

SlowCoder 10-10-2007 08:09 PM

Finding duplicate files
 
I managed to restore about 5000 photos from my corrupted hard drive using photorec. This is great news! However, I've ended up with multiple copies of the same photos, sometimes 5 or 6 copies.

I would like to know if there is a command I can use to locate the duplicate copies and then act on them, such as moving just one copy of each to another location?

matthewg42 10-10-2007 09:01 PM

fdupes will find identical files. findimagedupes will find images which are similar.
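A sketch of how fdupes might be invoked (assuming it is installed; check your package manager):

```shell
# List duplicate groups recursively under the photo directory
fdupes -r /path/to/photos

# Interactively choose one copy per group to keep, deleting the rest
fdupes -rd /path/to/photos
```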

matthewg42 10-10-2007 09:11 PM

Or you could do it manually...

Code:

tmp=$(mktemp)
find . -type f -print0 | xargs -0 md5sum > "$tmp"
awk '{ print $1 }' "$tmp" | sort | uniq -d | while read -r f; do
    grep "^$f" "$tmp"
    echo ""
done
rm -f "$tmp"

...which will list identical files in groups. You can then do what you want with the list.
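For example, to move every duplicate after the first copy into a holding directory, one sketch (assuming GNU tools and filenames without embedded newlines; paths are placeholders):

```shell
# Move all but one copy of each duplicate group into ./dupes/
mkdir -p dupes
find . -path ./dupes -prune -o -type f -print0 | xargs -0 md5sum |
    sort |
    awk 'seen[$1]++ { print }' |   # every line after the first with a given hash
    cut -c35- |                    # strip the 32-char hash and two-space separator
    while IFS= read -r f; do
        mv "$f" dupes/
    done
```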

Tischbein 10-10-2007 09:26 PM

You can give each file a practically unique ID by computing a hash function over it. Most machines have md5 or md5sum on them:

Test:
~> md5 < myfile
308005f025274f70b92d26cffd0d4185
~> md5sum < myfile
308005f025274f70b92d26cffd0d4185 -

The chance of two different photos having the same md5 is astronomically small, so you can assume that two photos are the same iff their md5 sums are the same. The rest is just a scripting exercise. This will take a wee while to run on 5000 photos, but you can speed it up if you refine it a bit. Or you could just resign yourself to leaving your machine running overnight.
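To put a rough number on "astronomically small": by the birthday bound, the chance of any collision among n random 128-bit values is about n(n-1)/2 divided by 2^128. For 5000 photos:

```shell
# Birthday-bound collision estimate for 5000 files and 128-bit hashes
awk 'BEGIN { n = 5000; printf "%.3g\n", n * (n - 1) / 2 / 2^128 }'
# prints 3.67e-32
```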

Code:

: > database
find /the/place/I/put/them -type f | while IFS= read -r photo
do
    unique_id="$(md5sum < "$photo")"
    echo "${unique_id:0:32} $photo" >> database
done


sort -u -k1,1 database > a_list_of_unique_photos


while IFS= read -r line
do
  photo="${line:33}"
  cp "$photo" /a/directory/of/my/choice
done < a_list_of_unique_photos

Speedups would include e.g. reading only the first 200 bytes from each file using dd. Depending on what version of md5 you have, you can simplify the code as well. Good luck!
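A sketch of that prefilter idea (using head rather than dd; the path is a placeholder): files whose first 200 bytes hash differently cannot be identical, so only the groups that match here need a full md5sum pass.

```shell
# Hash only the first 200 bytes of each file as a cheap prefilter
find /the/place/I/put/them -type f | while IFS= read -r photo
do
    prefix_id="$(head -c 200 "$photo" | md5sum)"
    echo "${prefix_id:0:32} $photo"
done > prefix_database
```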

SlowCoder 10-11-2007 06:32 AM

Thanks for the replies, guys! I'll give each a try.

SlowCoder 10-11-2007 12:36 PM

Ok, I'm definitely making forward progress on this. I've md5'd all of my images into a list and have sorted it.

Now I'm trying to figure out how the -k option works, or rather what its function is. If I used sort -u -k1,32, would sort use only the first 32 characters of each line to determine uniqueness?

Edit: Ok, I think I figured it out. -kx,y means sort on fields x through y (fields separated by blanks), so -k1,1 keys on the first field only. Character positions within a field use the x.c form, so the first 32 characters of the first field would be -k1.1,1.32. Right?
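A quick sketch that shows the field semantics (filenames here are made up):

```shell
# -u with -k1,1 keeps one line per distinct first field, so the two
# lines sharing the hash "aaaa" collapse to one.
printf '%s\n' 'aaaa photo2.jpg' 'aaaa photo1.jpg' 'bbbb photo3.jpg' |
    sort -u -k1,1
```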

P.S. It only took my computer about 1 minute to md5 all 2.3G of images. :)

SlowCoder 10-12-2007 08:25 AM

Yay! I got it all working! Thanks!


All times are GMT -5.