LinuxQuestions.org


xzased 07-24-2010 12:14 AM

Find duplicate files by name
 
Hi, we have a huge number of duplicate files in a folder and I would like some pointers on writing a bash script to create a list of the duplicate files. I've seen examples that check the md5 sum of files... but I don't need that, the file name is enough. Can someone please help me?

Telengard 07-24-2010 12:29 AM

Quote:

Originally Posted by xzased (Post 4043637)
Hi, we have a huge number of duplicate files in a folder and I would like some pointers on writing a bash script to create a list of the duplicate files ... the file name is enough.

Umm, it should not be possible to have two files with the same name in the same folder.

xzased 07-24-2010 12:59 AM

LOL! True, Mr. Telengard. I meant in subfolders. I have the directory /storage, which holds about 10 subfolders, each of which holds around 3 more subfolders with 300+ files in each. Messy, I know. So the duplicates are between subfolders.

Telengard 07-24-2010 01:06 AM

Okay, this has had only very minimal testing so use at your own discretion. As always, it is your responsibility to evaluate this code's suitability for your purposes.

Code:

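#!/bin/bash

# List every regular file below the current directory, skipping the list itself.
find . -type f ! -name names.lst > names.lst

while IFS= read -r name
do
    bn="$( basename "$name" )"
    # -F matches fixed strings, so dots in names are not regex wildcards.
    # "/$bn" still matches longer names that start with $bn (a.txt vs a.txt.bak).
    name2="$( grep -F -- "/$bn" names.lst | grep -Fxv -- "$name" )"
    if [ -n "$name2" ]
    then
        echo "$name"
        echo "$name2"
    fi
done < names.lst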

I bet 3 Internets that someone else will have a much more elegant solution for you by tomorrow.
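
For anyone landing here later, a more compact variant is sketched below. It is only a sketch: it assumes GNU find (for -printf) and file names without tabs or newlines, and list_dup_names is a made-up name, not part of any tool. It compares base names exactly rather than by grep substring, and prints each path only once.

```shell
#!/bin/bash
# Sketch only: list_dup_names is a hypothetical helper name.
# Assumes GNU find (-printf) and no tabs or newlines in file names.
list_dup_names() {
    local root="${1:-.}"
    # Emit "basename<TAB>path" per file, then sort so equal names are adjacent.
    find "$root" -type f -printf '%f\t%p\n' |
        LC_ALL=C sort |
        awk -F'\t' '
            # Same base name as the previous line: print the held first path
            # once, then this path. The "" forces a string comparison.
            ($1 "") == (prev "") {
                if (held != "") { print held; held = "" }
                print $2
                next
            }
            # New base name: hold its path until a second one shows up.
            { prev = $1; held = $2 }
        '
}
```

With the layout described earlier in the thread it could be called as list_dup_names /storage; with no argument it searches the current directory.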

xzased 07-24-2010 01:35 AM

Wow, thanks sir. Your help is appreciated.

gabolander 10-19-2012 04:54 AM

Hi Telengard, your script is good, but there's a small problem: every duplicated filename is displayed twice. (Obviously: if a file's basename is duplicated in the list, each copy matches the other when "grep" runs on the same list.)
Starting from your script (thanks! ;) ), I made a few small changes so that each duplicated pair is displayed only once. My variant also shows the file sizes.
Hoping this helps somebody else, I'm pasting the code below:

Code:

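#!/bin/bash

# List every regular file, leaving out this script's own work files.
find . -type f ! -name names.lst ! -name names.out > names.lst

# Start with an empty output file.
> names.out

TAB=$'\t'
while IFS= read -r name
do
    bn="$( basename "$name" )"
    # -F treats names as fixed strings; "/$bn" keeps the match on a path component.
    name2="$( grep -F -- "/$bn" names.lst | grep -Fxv -- "$name" )"
    if [ -n "$name2" ]
    then
        # Skip names already written out as part of an earlier group.
        if ! grep -qF -- "${name}${TAB}" names.out; then
            # name2 may hold several paths, one per line; stat each one separately.
            printf '%s\n' "$name" "$name2" | while IFS= read -r f
            do
                printf '%s\t%s\n' "$f" "$( stat --format=%s "$f" )" >> names.out
            done
        fi
    fi
done < names.lst

cat names.out
rm -f names.lst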


nt4boy 11-30-2012 08:03 AM

Duplicated file
 
Guys,

Telengard's script is definitely the business, found just what I was looking for, but I am afraid I cannot get gabolander's revision to run.

My CentOS 5.8 seems to have an issue with grep in this line:

if !(grep -q "^${name}${TAB}" names.out); then

At least, I am getting a repeated grep error, and since the syntax from the original script is unchanged, I surmise that it's the line I've pasted.

I'd appreciate some advice on this please.

Thanks

gabolander 11-30-2012 08:30 AM

This is really very weird. I just tested my script on an updated CentOS 5.8:

Code:

[root@srv-rti /tmp/test]# lsb_release -a
LSB Version:    :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: CentOS
Description:    CentOS release 5.8 (Final)
Release:        5.8
Codename:      Final

[root@srv-rti /tmp/test]# uname -a
Linux srv-rti.comune.rimini.it 2.6.18-308.16.1.el5 #1 SMP Tue Oct 2 22:01:43 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

And the line with "if !(grep -q ...)" is a standard way of grepping for something in Bash scripts ....

Are you sure you put "#!/bin/bash" on the first line of the script? (I wouldn't want it to run with standard sh, which may not support the same syntax as Bash's extensions ... )

This is the result of running my script on two dup files in two subdirectories:

Code:

[root@srv-rti /tmp/test]# find_dups
./a/hobbitclient.cfg    1612
./b/hobbitclient.cfg    1612
./a/prova      423
./b/prova      423


Cheers,
G.

nt4boy 11-30-2012 08:57 AM

Duplications
 
This is pasted from my script; I expect you will instantly point out my foolishness!

#!/bin/bash

find -type f > names.lst

> names.out

TAB='echo -ne "\t"'
while read name
do
bn="$( basename "$name" )"
name2="$( grep "$bn" names.lst | grep -v "$name" )"
if [ "$name2" != "" ]
then
if !(grep -q "^${name}${TAB}" names.out); then
size='stat --format=%s "$name"'
size2='stat --format=%s "$name2"'
echo -e "$name${TAB}$size" >> names.out
echo -e "$name2${TAB}$size" >> names.out
fi
fi
done < names.lst

cat names.out
# rm -f names.lst
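
Comparing this paste with the earlier script, one visible difference is that the backticks have become single quotes (TAB='...', size='...', size2='...'). If the file on disk looked the same, those variables would hold the command text rather than its output. Whether that explains the grep error is not confirmed anywhere in the thread. A minimal sketch of the difference, with variable names picked only for illustration:

```shell
#!/bin/bash
# Single quotes store the command text itself; nothing is executed.
literal='echo -ne "\t"'
# Command substitution runs the command and stores its output (here, one tab).
tab="$(printf '\t')"

# Print how many characters each variable ended up holding.
printf 'single-quoted length: %s\n' "${#literal}"
printf 'substituted length:   %s\n' "${#tab}"
```

The first variable holds the whole command string, the second a single tab character.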

nt4boy 11-30-2012 09:27 AM

Duplication
 
Right, I've a better result now.
I cut and pasted the original code into Windows Notepad, got into a state with the line breaks and learned about dos2unix, but anyway, I have now got it into Linux precisely, and it does run.

Sorry to mess you about.

However, when I scan the terminal window I see messages that it was unable to stat such and such a file, that no such folder exists, and that the file name is too long, but maybe that's correct if there is no duplicate?

So, if there are no duplicates and I've a series of subfolders with thousands of files, it would be even better if the lines where there are no duplicates were not written to the output file.

The duplicates do, however, end up with their stats on the following lines.

Thanks
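
The stat messages may come from name2 holding more than one path: the grep pipeline returns every match, one per line, and stat "$name2" then treats the whole multi-line string as a single file name. This is a guess from reading the script, not something tested on the poster's data. A minimal sketch of sizing each path separately, assuming GNU stat and a hypothetical helper name print_sizes:

```shell
#!/bin/bash
# Sketch only: print_sizes is a hypothetical helper name; assumes GNU stat.
# Prints "path<TAB>size" for every newline-separated path in the first argument.
print_sizes() {
    local f
    while IFS= read -r f; do
        # Ignore empty lines, e.g. when the argument is an empty string.
        [ -n "$f" ] || continue
        printf '%s\t%s\n' "$f" "$(stat --format=%s "$f")"
    done <<< "$1"
}
```

Inside the loop of the script above it could be called as print_sizes "$name2" in place of the single stat call.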

nt4boy 12-05-2012 06:31 AM

Finding Duplicates
 
All,

Just to round this off: I failed to get this to work on my setup, so I looked around some more, and in the end http://www.perlmonks.org/?node_id=855401 provided me with exactly what I needed.

I had to edit the code from the first post following the later posters' suggestions, but here is what worked for me:

#!/usr/bin/perl
use strict;
use warnings;

use File::Compare;
use File::Find;
use File::Spec;

#If you want to set a base_directory, you can do so here.

my $base_directory;

print "What directory? ";
my $directory = <STDIN>;
chomp $directory;

my %files;
sub files_wanted {
    my $raw_file = $File::Find::name;
    if ( -f ) {
        my ($volume,$directories,$file) = File::Spec->splitpath($raw_file);
        #update from a prior suggestion.
        # "-s _" reuses the stat from "-f", so it also works when $directory is a relative path.
        my $file_size = -s _;
        push @{$files{"$file ($file_size bytes)"}}, $raw_file;
    }
}
#If you set a base directory above, you will need to change

find(\&files_wanted,$directory);

# Append the report to dupes.txt in the current directory.
open (my $out, '>>', 'dupes.txt') or die "Cannot open dupes.txt: $!";

#This section searches the hash for any file with 2 or more files which share the same filename.ext and size.
#After that, it compares all of the files with those attributes to determine if they share the same contents.
#It will print the list of files with the same filename and size and will tell you which ones share the same
#contents.

for my $file (sort keys %files) {
    if (@{$files{$file}} > 1) {
        my $amount = @{$files{$file}};
        print {$out} "$file\t\t$amount\n";
        for my $location1 (@{$files{$file}}) {
            print {$out} "\t$location1\n";
            for my $location2 (@{$files{$file}}) {
                unless ($location1 eq $location2) {
                    if (compare($location1,$location2) == 0) {
                        print {$out} "\t\tExact copy: $location2\n";
                    }
                }
            }
        }
        print {$out} "\n";
    }
}
close ($out);


All times are GMT -5.