LinuxQuestions.org
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Old 09-16-2009, 04:36 AM   #1
Tekken
Member
 
Registered: Jun 2009
Posts: 48

Rep: Reputation: 15
Script to find duplicate files


Hi,

Can anyone help me with a script that can be used to find the duplicate files in a directory? I have a directory called dir1 which has some sub-directories, and there are some .i and .o files in them. There are some duplicate files across the different directories. I want to identify them and copy the files to another directory, leaving the duplicates in dir1.
 
Old 09-16-2009, 05:06 AM   #2
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,514
Blog Entries: 27

Rep: Reputation: 1174
How can you identify the duplicates? Is the name enough or do you also need to check the size and/or checksum?

I do not understand "There are some duplicate files in the different directories. I want to identify them and copy the files to other directory leaving the duplicates in the same dir1." When you identify a file in a sub-directory of dir1 which is a duplicate of a file in dir1, which "other" directory do you want to copy it to, and is it OK to change from having two duplicates to having three?
 
Old 09-16-2009, 10:27 AM   #3
Tekken
Member
 
Registered: Jun 2009
Posts: 48

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by catkin
How can you identify the duplicates? Is the name enough or do you also need to check the size and/or checksum?

I do not understand "There are some duplicate files in the different directories. I want to identify them and copy the files to other directory leaving the duplicates in the same dir1." When you identify a file in a sub-directory of dir1 which is a duplicate of a file in dir1, which "other" directory do you want to copy it to, and is it OK to change from having two duplicates to having three?
Hi Catkin,

Thanks for your reply.

Identifying duplicate files by name will help me in the first place. I have a directory with many sub-directories, under which I have *.i and *.o files. Now I want to identify the duplicate files and do the following:

1. If files have the same name, display the locations of all files with that name and store the information in a file.

2. After that, check the contents of the files; if the contents are the same, delete one copy and retain the other, so I don't have a duplicate copy with the same content.

Hope you understand what I am looking for. Could you please help me here?

Thanks,
Tekken.

Last edited by Tekken; 09-16-2009 at 10:29 AM.
 
Old 09-16-2009, 11:04 AM   #4
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,514
Blog Entries: 27

Rep: Reputation: 1174
An idea for a brute-force approach (fine if it is not run often on a large number of files); not tested:
Code:
find dir1 -type f \( -name '*.i' -o -name '*.o' \) -print0 | while IFS= read -r -d '' filename1  # Note 1
do
    basename=<stuff> # Remove the path from $filename1, leaving only the basename
    count=0
    find dir1 -mindepth 2 -type f -name "$basename" -print0 | while IFS= read -r -d '' filename2
    do
        echo "'$filename1' duplicate found at '$filename2'" >> output.txt
        let count++   
    done
    if [[ $count -eq 1 ]]; then
        <do file comparing, moving or deleting here>
    fi
done
Notes:
  1. Using robust method described here
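Name matches alone will not catch identical files that were renamed, so a content-based sketch is worth having alongside the name-based loop above. The following is not from catkin's post; the dir1/a, dir1/b sample tree is hypothetical and is created here only so the pipeline has something to scan. It hashes every .i/.o file and prints the groups whose checksum repeats, i.e. byte-identical files (`uniq -w32` compares only the 32-character md5 prefix of each line):

```shell
# Hypothetical sample tree so the pipeline has input.
mkdir -p dir1/a dir1/b
printf 'same\n'  > dir1/a/x.i
printf 'same\n'  > dir1/b/x.i
printf 'other\n' > dir1/b/y.o

# Hash every .i/.o file, sort by hash, and keep only lines whose
# 32-char md5 prefix repeats; groups are separated by blank lines.
dupes=$(find dir1 -type f \( -name '*.i' -o -name '*.o' \) -exec md5sum {} + |
        sort | uniq -w32 --all-repeated=separate)
echo "$dupes"

rm -rf dir1   # clean up the demo tree
```

Each blank-line-separated group in the output is one set of files with identical content, regardless of name.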
 
Old 09-17-2009, 03:41 AM   #5
Tekken
Member
 
Registered: Jun 2009
Posts: 48

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by catkin
An idea for a brute-force approach (fine if it is not run often on a large number of files); not tested:
Code:
find dir1 -type f \( -name '*.i' -o -name '*.o' \) -print0 | while IFS= read -r -d '' filename1  # Note 1
do
    basename=<stuff> # Remove the path from $filename1, leaving only the basename
    count=0
    find dir1 -mindepth 2 -type f -name "$basename" -print0 | while IFS= read -r -d '' filename2
    do
        echo "'$filename1' duplicate found at '$filename2'" >> output.txt
        let count++   
    done
    if [[ $count -eq 1 ]]; then
        <do file comparing, moving or deleting here>
    fi
done
Notes:
  1. Using robust method described here
Thanks for the script, catkin, but to be frank I am new to Linux and I don't know what this script does.

What should I specify at basename=<stuff> and in the if...then construct?
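For reference, here is one way the placeholders in catkin's sketch could be filled in. This is an illustrative completion, not catkin's tested code: the `find_dupes` function name and the output.txt file are assumptions, `${filename1##*/}` strips the path for the `basename=<stuff>` line, the inner loop is fed from process substitution instead of a pipe (a pipe would run the loop in a subshell, so variables set there would not survive), and `cmp -s` does the content comparison for the if/then part:

```shell
#!/bin/bash
# Illustrative completion of the sketch: for each .i/.o file under $1,
# log other files sharing its basename and note when contents match.
find_dupes() {
    local dir=$1 filename1 filename2 basename
    find "$dir" -type f \( -name '*.i' -o -name '*.o' \) -print0 |
    while IFS= read -r -d '' filename1; do
        basename=${filename1##*/}     # strip the path, keep only the file name
        # Process substitution (not a pipe) keeps this loop in the current shell.
        while IFS= read -r -d '' filename2; do
            [[ $filename2 == "$filename1" ]] && continue   # skip the file itself
            echo "'$filename1' duplicate name found at '$filename2'" >> output.txt
            if cmp -s "$filename1" "$filename2"; then
                echo "'$filename1' matches content of '$filename2'" >> output.txt
                # rm "$filename2"   # uncomment to delete the extra copy
            fi
        done < <(find "$dir" -type f -name "$basename" -print0)
    done
}
```

Calling `find_dupes dir1` would then produce output.txt with one line per same-name pair and an extra line for pairs whose contents also match.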
 
Old 09-18-2009, 12:46 AM   #6
Kenhelm
Member
 
Registered: Mar 2008
Location: N. W. England
Distribution: Mandriva
Posts: 328

Rep: Reputation: 140
This creates a list of all the duplicate filenames found in a directory.
It's an improved version of the code explained in this thread:
http://www.linuxquestions.org/questi...d.php?t=752129
Code:
find /path/to/dir -type f |
rev | sort | sed -nr ':a N;/^([^/]*\/).*\n\1/p;D;ba' | uniq | rev
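The same grouping can also be expressed with awk, which may be easier to follow than the rev/sed trick (an alternative sketch, not from Kenhelm's post; the demo/ tree is hypothetical and is created only to give the pipeline input): split each path on `/`, collect paths by their last field (the basename), and print the groups whose basename occurs more than once.

```shell
# Hypothetical sample tree.
mkdir -p demo/a demo/b
touch demo/a/dup.i demo/b/dup.i demo/b/only.o

# Group full paths by basename ($NF after -F/); print recurring groups.
dupes=$(find demo -type f | awk -F/ '
    { paths[$NF] = paths[$NF] $0 "\n"; count[$NF]++ }
    END { for (b in count) if (count[b] > 1) printf "%s", paths[b] }')
echo "$dupes"

rm -rf demo   # clean up the demo tree
```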
 
Old 03-30-2013, 11:29 AM   #7
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1946
There's a small application called fdupes that does exactly what the OP describes.

http://code.google.com/p/fdupes/
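Typical fdupes usage looks like this (assuming the fdupes package is installed; the dir1 sample tree is hypothetical and is created here only so the command has something to scan). `-r` recurses into sub-directories and `-d` interactively prompts which copy of each set to keep:

```shell
# Hypothetical sample tree with one duplicate pair.
mkdir -p dir1/a dir1/b
echo same > dir1/a/f.i
echo same > dir1/b/f.i

sets=""
if command -v fdupes >/dev/null 2>&1; then
    sets=$(fdupes -r dir1)    # each blank-line-separated group is one duplicate set
    echo "$sets"
    # fdupes -rd dir1         # -d: interactively delete extra copies
fi

rm -rf dir1   # clean up the demo tree
```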

Last edited by colucix; 03-30-2013 at 11:57 AM. Reason: Removed parts no longer needed (addressing spam).
 