LinuxQuestions.org
Old 03-29-2006, 03:27 AM   #1
mike_savoie
LQ Newbie
 
Registered: Jul 2005
Location: PEI, Canada
Distribution: Slackware 11.0
Posts: 20

Rep: Reputation: 0
Software to find duplicate files


Does anyone know of reliable software for finding duplicate files? I'm not concerned with duplicate file names; I'd like to find duplicate content.

I'm trying to clean up a music directory, and I know I've got a number of repeated songs.

I was thinking of writing a script to md5sum all files in a directory, putting the results in a database, then sorting out dupes that way. Would this work?
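Roughly along these lines, maybe (untested rough idea; ~/music is just a placeholder for my music directory):

Code:
# hash everything, sort by hash, then list the hashes that occur more than once
find ~/music -type f -exec md5sum {} + | sort > hashes.txt
cut -d' ' -f1 hashes.txt | uniq -d > dupe-hashes.txt
grep -F -f dupe-hashes.txt hashes.txt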

Any suggestions are appreciated.
 
Old 03-29-2006, 06:06 AM   #2
unSpawn
Moderator
 
Registered: May 2001
Posts: 26,534
Blog Entries: 51

Rep: Reputation: 2603
Quote:
Does anyone know of reliable software for finding duplicate files?
I'm pretty sure Freshmeat and Sourceforge would have some.

Quote:
I was thinking of writing a script to md5sum all files in a directory, putting the results in a database then sorting out dupes that way. Would this work?
Yes, partially: say you have the same album but compressed differently. That will generate different sums. You'll still need the metadata, from a CLI tool like mp3info, to compare. Checking MD5 sums could be the first, non-interactive iteration, and using the meta information the final, interactive one, unless your regex-fu is kinda elite :-]
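Something along these lines could be a starting point (untested sketch; ~/music is a placeholder, and it assumes your mp3info build supports the -p print-format option):

Code:
#!/bin/bash
# pass 1 (non-interactive): group byte-identical files by MD5 sum
find ~/music -type f -iname '*.mp3' -print0 |
xargs -0 md5sum | sort | uniq -w 32 -d --all-repeated=separate > exact-dupes.txt

# pass 2 (interactive): dump artist/title tags so you can eyeball
# re-encodes of the same song that pass 1 cannot catch
find ~/music -type f -iname '*.mp3' -print0 |
while IFS= read -r -d '' f; do
    printf '%s\t%s\n' "$(mp3info -p '%a - %t' "$f" 2>/dev/null)" "$f"
done | sort > tags-by-artist-title.txt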
 
Old 03-27-2009, 08:29 PM   #3
chrysler
LQ Newbie
 
Registered: Mar 2009
Location: Taiwan
Distribution: ubuntu
Posts: 3

Rep: Reputation: 0
Partial answer for finding duplicate files

I had the same problem, so I used Google to search for answers. Fortunately, there are many different solutions to this problem. Here is one that should do what you want.

-------------------------------------
#!/bin/bash
OUTF=~/rem-duplicates.sh
echo "#!/bin/sh" > $OUTF
# hash every file, group identical hashes (blank line between groups),
# and write a commented-out "rm" line for each duplicate into $OUTF
find /dataStorage/ -type f -print0 |
xargs -0 -n1 md5sum |
sort | uniq -w 32 -d --all-repeated=separate |
sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF
-------------------------------------

However, the previous solution is very inefficient because it hashes every file on your data storage (hard disk). There is no need to compare files whose sizes differ, so I improved on it: first sort the file list by file size, then only hash the files that share the same size.

-------------------------------------
#!/bin/bash
OUTF=~/RMduplicates.sh
echo "#!/bin/sh" > $OUTF
# list every file over 1 KiB with its size, zero-padded so that
# "uniq -w 12" compares only the size field
find "$1" -type f -size +1k |
while read i; do printf "%012d %s\n" "$(stat -c%s "$i")" "$i"; done |
sort -nr | uniq -w 12 -d --all-repeated |
cut -f2- -d" " |
# hash only the files that share their size with at least one other file
while read i; do md5sum -b "$i"; done |
# sort by hash so identical files end up next to each other
sort | uniq -w 32 -d --all-repeated=separate |
# strip the hash; assumes "$1" is an absolute path
cut -f2- -d"/" |
# write a commented-out rm line per duplicate, blank lines between groups
while read i;
do
test -z "$i" && echo "" || echo "#rm \"/$i\"";
done >> $OUTF
echo "exit 0;" >> $OUTF
chmod a+x $OUTF
--------------------------------------

I used the previous scheme many times and noticed it was still inefficient when many files have the same size, because every one of them gets hashed in full. So I made the following improvement: walk the size-sorted list and compare consecutive same-size files directly with cmp, which stops at the first differing byte.

--------------------------------------
#!/bin/bash
# Tested under Ubuntu 8.10

echo $(date +%Y%m%d%H%M%S)   # print a start timestamp
Xcounter=0
Xflag=0
OUTF=~/duplicatesRM2.sh

echo "#!/bin/sh" > $OUTF
# list "size path" for every file over 42 KiB, zero-padding the size (awk)
# so that "uniq -w 12" compares only the size field, then keep just the
# files whose size occurs more than once
find "$1" -type f -size +42k -printf "%s %p\n" |
awk '{ printf "%012d %s\n", $1, substr($0, length($1) + 2) }' |
sort -nr | uniq -w 12 -d --all-repeated | cut -f2- -d" " |
while read i
do
    # compare each file with the previous one in the size-sorted list
    Xsize1=$Xsize2
    Xsize2=$(stat -c%s "$i")
    XfilepathNname1="$XfilepathNname2"
    XfilepathNname2="$i"
    if [ "$Xsize1" == "$Xsize2" ]; then
        # same size: compare the contents byte by byte
        if cmp --silent "$XfilepathNname1" "$XfilepathNname2"; then
            if [ "$Xflag" == "0" ]; then
                echo "#rm \"$XfilepathNname1\""
            fi
            echo "#rm \"$XfilepathNname2\""
            Xflag=$(($Xflag + 1))
        else
            continue
        fi
    else
        # new size group: blank line separates the previous group
        if [ "$Xflag" != "0" ]; then
            echo "" >> $OUTF
        fi
        Xcounter=$(($Xcounter + 1))
        Xflag=0
    fi
done >> $OUTF
echo "exit 0;" >> $OUTF
chmod a+x $OUTF
-----------------------------------
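To run it, save the script under any name (say finddups3.sh), pass it the directory to scan, then review ~/duplicatesRM2.sh and uncomment the rm lines you actually want before executing it:

--------------------------------------
bash finddups3.sh /dataStorage
# edit ~/duplicatesRM2.sh, uncomment the files to delete, then:
sh ~/duplicatesRM2.sh
--------------------------------------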

The last script is very efficient; in my experience it is much faster than NoClone, FSlint, and the other duplicate file finders I have used. However, one small problem remains: hard links. If two hard links point to the same file, the script reports them as duplicates. Fortunately, I don't use hard links in my data storage. Could someone improve this?
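One possible fix (just a sketch I have not tested, assuming GNU find's %D and %i directives) would be to keep only one path per device:inode pair before the size step, so additional hard links to the same file are dropped up front:

--------------------------------------
# one path per device:inode pair, then "size path" lines as before
find "$1" -type f -size +42k -printf "%D:%i %s %p\n" |
sort -u -k1,1 |
cut -f2- -d" "
--------------------------------------

and feed that into the rest of the pipeline in place of the original find.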
 
Old 03-28-2009, 04:31 AM   #4
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1266
There are plenty of programs here:
http://freshmeat.net/search?q=duplicate&submit=Search

Here's one specifically for music:
http://freshmeat.net/projects/fdmf

I personally use fdupes:
http://freshmeat.net/projects/fdupes
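A typical fdupes run looks something like this (the directory is a placeholder; check `man fdupes` for the options your version supports):

Code:
fdupes -r ~/music      # list groups of identical files, recursively
fdupes -rS ~/music     # same, but also print the size of each group
fdupes -rd ~/music     # prompt for which copy to keep and delete the rest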
 
Old 03-28-2009, 05:48 AM   #5
JulianTosh
Member
 
Registered: Sep 2007
Location: Las Vegas, NV
Distribution: Fedora / CentOS
Posts: 674
Blog Entries: 3

Rep: Reputation: 90
I know there's lots of stuff out there already, but I was bored.

Save this as findDups.sh
Code:
#!/bin/bash

if [ -z "$1" ]; then
	echo "  Error: Gimme dir!"
	echo
	exit 1
fi

searchPath="$1"

# hash all regular files found
find "$searchPath" -type f -exec sha1sum {} \; > fileList.txt

# sort, keep only repeated hashes, and strip out the filename
sort fileList.txt | uniq --check-chars=40 --repeated | sed 's/\s.*//g' > fileListHashDups.txt

# for each duplicate hash, grep and cut the filenames
> Duplicates.txt
exec < fileListHashDups.txt
while read line
do
  echo ========================================
  grep "$line" fileList.txt | cut -c43- | tee -a Duplicates.txt
done
echo

rm -f fileList.txt fileListHashDups.txt
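Then make it executable and point it at the directory you want to scan (the path here is just an example); duplicate groups are printed and also collected in Duplicates.txt:

Code:
chmod +x findDups.sh
./findDups.sh ~/Music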
 
Old 07-17-2010, 03:04 PM   #6
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
Old thread dug up and attracting "spam-flies". Closed
 
  

