LinuxQuestions.org
Old 07-24-2010, 12:14 AM   #1
xzased
LQ Newbie
 
Registered: Jan 2008
Posts: 20

Rep: Reputation: 0
Find duplicate files by name


Hi, we have a huge number of duplicate files in a folder and I would like some pointers on writing a bash script to create a list of the duplicate files. I've seen examples that check the md5 sum of files, but I don't need that; the file name is enough. Can someone please help me?
 
Old 07-24-2010, 12:29 AM   #2
Telengard
Member
 
Registered: Apr 2007
Location: USA
Distribution: Kubuntu 8.04
Posts: 579
Blog Entries: 8

Rep: Reputation: 147
Quote:
Originally Posted by xzased View Post
Hi, we have a huge number of duplicate files in a folder and I would like some pointers on writing a bash script to create a list of the duplicate files ... the file name is enough.
Umm, it should not be possible to have two files with the same name in the same folder.
 
Old 07-24-2010, 12:59 AM   #3
xzased
LQ Newbie
 
Registered: Jan 2008
Posts: 20

Original Poster
Rep: Reputation: 0
LOL! True, Mr. Telengard. I meant in subfolders. The directory /storage holds about 10 subfolders, each of which holds around 3 more subfolders with 300+ files in each. Messy, I know. So the duplicates are spread across subfolders.
 
Old 07-24-2010, 01:06 AM   #4
Telengard
Member
 
Registered: Apr 2007
Location: USA
Distribution: Kubuntu 8.04
Posts: 579
Blog Entries: 8

Rep: Reputation: 147
Okay, this has had only very minimal testing so use at your own discretion. As always, it is your responsibility to evaluate this code's suitability for your purposes.

Code:
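#!/bin/bash
# For every file under the current directory, print its path followed by
# the paths of all other files that share its base name.

find . -type f > names.lst

while IFS= read -r name
do
    bn="$( basename "$name" )"
    # Match the last path component exactly; grep "$bn" would also hit
    # substrings and read characters such as "." as regex wildcards.
    name2="$( awk -F/ -v bn="$bn" -v self="$name" '($NF "") == bn && $0 != self' names.lst )"
    if [ "$name2" != "" ]
    then
        echo "$name"
        echo "$name2"
    fi
done < names.lst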
I bet 3 Internets that someone else will have a much more elegant solution for you by tomorrow.
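
For anyone after the more compact solution predicted above, here is a minimal one-pass sketch rather than a definitive script. It assumes GNU find (for -printf) and file names without tabs or newlines; list_dup_names is a made-up function name, not something from this thread. It prints the full path of every file whose base name occurs more than once under the given directory, each path only once.

```shell
#!/bin/bash
# Sketch only: list_dup_names is a hypothetical helper name.
# Assumes GNU find (-printf) and file names without tabs or newlines.
list_dup_names() {
    # Emit "basename<TAB>path" for every regular file, sort so that equal
    # names end up on adjacent lines, then print only groups of two or more.
    find "${1:-.}" -type f -printf '%f\t%p\n' | LC_ALL=C sort |
    awk -F'\t' '
        { key = $1 "" }   # concatenation forces a string, so "1" and "01" differ
        key != prev { if (n > 1) printf "%s", group; group = ""; n = 0; prev = key }
        { group = group $2 "\n"; n++ }
        END { if (n > 1) printf "%s", group }
    '
}
```

Calling it as list_dup_names /storage would walk the whole tree once instead of searching the file list again for every file, which should matter with thousands of files.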
 
Old 07-24-2010, 01:35 AM   #5
xzased
LQ Newbie
 
Registered: Jan 2008
Posts: 20

Original Poster
Rep: Reputation: 0
Wow, thanks sir. Your help is appreciated.
 
Old 10-19-2012, 04:54 AM   #6
gabolander
LQ Newbie
 
Registered: Sep 2008
Posts: 7

Rep: Reputation: 0
Hi Telengard, your script is good, but there's a small problem: every duplicated file name is displayed twice. (Obviously, since each copy appears in the list, each one finds the other when the list is searched.)
Starting from your script (thanks!), I made a few small changes so that each set of duplicates is displayed only once. My variant also shows the file sizes.
In the hope that it helps somebody else, I paste the code below:

Code:
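#!/bin/bash
# Variant of Telengard's script: each group of same-named files is
# written only once, with the size in bytes after a tab.

find . -type f > names.lst

> names.out

TAB=`echo -ne "\t"`
while IFS= read -r name
do
    bn="$( basename "$name" )"
    # Other paths whose last component is exactly $bn (possibly several lines).
    name2="$( awk -F/ -v bn="$bn" -v self="$name" '($NF "") == bn && $0 != self' names.lst )"
    if [ "$name2" != "" ]
    then
        # Skip files already written as part of an earlier group.
        if !(grep -q "^${name}${TAB}" names.out); then
            # stat each path separately: $name2 can hold more than one
            # path, and every file gets its own size.
            printf '%s\n' "$name" "$name2" | while IFS= read -r dup
            do
                size="$( stat --format=%s "$dup" )"
                printf '%s\t%s\n' "$dup" "$size" >> names.out
            done
        fi
    fi
done < names.lst

cat names.out
rm -f names.lst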

Last edited by gabolander; 10-19-2012 at 04:55 AM. Reason: Changed email notification type
 
Old 11-30-2012, 08:03 AM   #7
nt4boy
LQ Newbie
 
Registered: Nov 2012
Posts: 4

Rep: Reputation: Disabled
Duplicated file

Guys,

Telengard's script is definitely the business, found just what I was looking for, but I am afraid I cannot get gabolander's revision to run.

My CentOS 5.8 seems to have an issue with grep in this line:

if !(grep -q "^${name}${TAB}" names.out); then

At least, I am getting a repeated grep error, and since the syntax taken from the original script is unchanged, I surmise that it's the line I've pasted.

I'd appreciate some advice on this please.

Thanks
 
Old 11-30-2012, 08:30 AM   #8
gabolander
LQ Newbie
 
Registered: Sep 2008
Posts: 7

Rep: Reputation: 0
This is really weird. I just tested my script on an updated CentOS 5.8:

Code:
[root@srv-rti /tmp/test]# lsb_release -a
LSB Version:    :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: CentOS
Description:    CentOS release 5.8 (Final)
Release:        5.8
Codename:       Final

[root@srv-rti /tmp/test]# uname -a
Linux srv-rti.comune.rimini.it 2.6.18-308.16.1.el5 #1 SMP Tue Oct 2 22:01:43 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
And the line with "if !(grep -q ...)" is a standard way of using grep in Bash scripts.

Are you sure you have "#!/bin/bash" as the first line of the script? (I wouldn't want it to run under standard sh, which may not support the same syntax as Bash and its extensions.)

This is the result of running my script on two dup files in two subdirectories:

Code:
[root@srv-rti /tmp/test]# find_dups 
./a/hobbitclient.cfg    1612
./b/hobbitclient.cfg    1612
./a/prova       423
./b/prova       423

Cheers,
G.

Last edited by gabolander; 11-30-2012 at 08:36 AM.
 
Old 11-30-2012, 08:57 AM   #9
nt4boy
LQ Newbie
 
Registered: Nov 2012
Posts: 4

Rep: Reputation: Disabled
Duplications

This is pasted from my script, I do expect you instantly to point at my foolishness!

#!/bin/bash

find -type f > names.lst

> names.out

TAB='echo -ne "\t"'
while read name
do
bn="$( basename "$name" )"
name2="$( grep "$bn" names.lst | grep -v "$name" )"
if [ "$name2" != "" ]
then
if !(grep -q "^${name}${TAB}" names.out); then
size='stat --format=%s "$name"'
size2='stat --format=%s "$name2"'
echo -e "$name${TAB}$size" >> names.out
echo -e "$name2${TAB}$size" >> names.out
fi
fi
done < names.lst

cat names.out
# rm -f names.lst
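
One thing that stands out in the paste above: the TAB, size and size2 assignments are wrapped in single quotes. Single quotes store the command text literally, whereas command substitution (backticks or $(...)) stores the command's output. A small sketch of the difference, with variable names chosen only for illustration:

```shell
#!/bin/bash
# Illustration only: "literal" and "tab" are made-up variable names.

# Single quotes: the text of the command is stored, nothing runs.
literal='printf "\t"'

# Command substitution: the command runs and its output (one tab) is stored.
tab="$(printf '\t')"

echo "literal holds ${#literal} characters"
echo "tab holds ${#tab} character"
```

Here the first variable holds the eleven characters of the command text, while the second holds a single tab. With the quoted form in the script, the size column would contain the text of the stat command rather than a number.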
 
Old 11-30-2012, 09:27 AM   #10
nt4boy
LQ Newbie
 
Registered: Nov 2012
Posts: 4

Rep: Reputation: Disabled
Duplication

Right, I've a better result now.
I cut and pasted the original code into Windows Notepad, got in a state with line breaks and learned about dos2unix, but anyway, I've now got it into Linux precisely, and it does run.

Sorry to mess you about.

However, while I scan the terminal window I see messages saying that it was unable to stat such-and-such a file, that no such folder exists, and that the file name is too long. Maybe that's correct if there is no duplicate?

So, if there are no duplicates and I have a series of subfolders with thousands of files, it would be even better if the lines where there are no duplicates were not written to the output file.

The duplicates do, however, end up with their stats on the following lines.

Thanks

Last edited by nt4boy; 11-30-2012 at 09:33 AM. Reason: Typos
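
The line-break trouble described above is typical of a script that has passed through a Windows editor: each line then ends in a carriage return plus a newline, and bash treats the carriage return as part of the command. dos2unix fixes that; where it is not installed, tr can do the same job. A minimal sketch, with has_cr and strip_cr as hypothetical helper names:

```shell
#!/bin/bash
# has_cr and strip_cr are hypothetical helper names used for illustration.

# Succeeds if the file contains at least one carriage return.
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# Writes a copy of file $1 to file $2 with every carriage return removed.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}
```

For example, has_cr find_dups && strip_cr find_dups find_dups.unix would leave a cleaned copy next to the original (the file names here are only placeholders).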
 
Old 12-05-2012, 06:31 AM   #11
nt4boy
LQ Newbie
 
Registered: Nov 2012
Posts: 4

Rep: Reputation: Disabled
Finding Duplicates

All,

Just to round this off: I failed to get this to work on my setup, so I looked around some more, and in the end http://www.perlmonks.org/?node_id=855401 provided me with exactly what I needed.

I had to edit the code from the first post, following the later posters' suggestions, but here is what worked for me:

#!/usr/bin/perl
use strict;
use warnings;

use File::Compare;
use File::Find;
use File::Spec;

#If you want to set a base_directory, you can do so here.

my $base_directory;

print "What directory? ";
my $directory = <STDIN>;
chomp $directory;

my %files;
sub files_wanted {
    my $raw_file = $File::Find::name;
    if ( -f ) {
        my ($volume,$directories,$file) = File::Spec->splitpath($raw_file);
        #update from a prior suggestion.
        #"_" reuses the stat from -f above; find() has changed into the
        #file's directory, so a relative $raw_file would not resolve here.
        my $file_size = -s _;
        push @{$files{"$file ($file_size bytes)"}}, $raw_file;
    }
}
#If you set a base directory above, you will need to change

find(\&files_wanted,$directory);

#Append the report to dupes.txt in the current directory.
open(my $out, '>>', 'dupes.txt') or die "Cannot open dupes.txt: $!";

#This section searches the hash for any file with 2 or more files which share the same filename.ext and size.
#After that, it compares all of the files with those attributes to determine if they share the same contents.
#It will print the list of files with the same filename and size and will tell you which ones share the same
#contents.

for my $file (sort keys %files) {
    if (@{$files{$file}} > 1) {
        my $amount = @{$files{$file}};
        print {$out} "$file\t\t$amount\n";
        for my $location1 (@{$files{$file}}) {
            print {$out} "\t$location1\n";
            for my $location2 (@{$files{$file}}) {
                unless ($location1 eq $location2) {
                    if (compare($location1,$location2) == 0) {
                        print {$out} "\t\tExact copy: $location2\n";
                    }
                }
            }
        }
        print {$out} "\n";
    }
}
close($out);
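
For readers who would rather stay in bash, the first half of the Perl script (grouping by file name and size) can be approximated with find, sort and awk. This is a rough sketch, not a port: it assumes GNU find and file names without tabs or newlines, and it skips the content comparison that File::Compare performs. list_dup_name_size is a hypothetical name.

```shell
#!/bin/bash
# list_dup_name_size is a hypothetical helper name.
# Groups regular files by "name (size bytes)", the same key the Perl
# script uses, and prints only the groups with two or more members.
list_dup_name_size() {
    find "${1:-.}" -type f -printf '%f (%s bytes)\t%p\n' | LC_ALL=C sort |
    awk -F'\t' '
        # Print the finished group: key, member count, then one path per line.
        function emit() { if (n > 1) printf "%s\t\t%d\n%s\n", prev, n, group }
        $1 != prev { emit(); group = ""; n = 0; prev = $1 }
        { group = group "\t" $2 "\n"; n++ }
        END { emit() }
    '
}
```

Something like list_dup_name_size /storage >> dupes.txt would produce a report laid out like the Perl one, minus the "Exact copy" lines; running cmp on the members of each group could fill that gap.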
 
  

