Does anyone know of reliable software for finding duplicate files? I'm not concerned with duplicate file names; I'd like to find duplicate content.
I'm trying to clean up a music directory, and I know I've got a number of repeated songs.
I was thinking of writing a script to md5sum all the files in a directory, put the results in a database, and then sort out the dupes that way. Would this work?
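Something like this minimal sketch is what I had in mind — GNU coreutils assumed, with sort and uniq standing in for the database (~/music is just a placeholder):
--------------------------------------
#!/bin/bash
# Checksum everything, sort so identical digests become adjacent, then
# print only the groups whose first 32 characters (the md5 digest) repeat.
find ~/music -type f -exec md5sum {} + |
sort | uniq -w 32 --all-repeated=separate
--------------------------------------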
Does anyone know of reliable software for finding duplicate files?
I'm pretty sure Freshmeat and Sourceforge would have some.
I was thinking of writing a script to md5sum all the files in a directory, put the results in a database, and then sort out the dupes that way. Would this work?
Yes, partially: say you have the same album but compressed differently. That will generate different sums. You'll still need the metadata, e.g. from a CLI tool like mp3info, to compare. Checking the MD5 sum could be the first, non-interactive iteration, and using the meta information the final, interactive one, unless your regex-fu is kinda elite :-]
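For the metadata pass, a rough sketch along those lines, assuming mp3info's -p format strings (%a = artist, %t = title); it groups tracks whose tags match even when the files differ byte for byte (~/music is again just a placeholder):
--------------------------------------
#!/bin/bash
# List MP3s whose "artist - title" tags repeat; one candidate group per run.
find ~/music -type f -name '*.mp3' -print0 |
while IFS= read -r -d '' f; do
    printf '%s\t%s\n' "$(mp3info -p '%a - %t' "$f")" "$f"
done |
sort |
awk -F'\t' '$1 == prev { if (!shown) print prevline; print; shown = 1; next }
            { prev = $1; prevline = $0; shown = 0 }'
--------------------------------------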
I had the same problem too, so I used Google to search for answers. Fortunately, there are many different solutions; here is how I arrived at the one below.
However, the first solution I found was very inefficient because it hashes every file on your data storage (hard disk). Comparing files whose sizes differ is unnecessary, so I improved on it: first sort the file list by file size, then compare only the files that have the same size.
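A minimal sketch of that size-first pass (GNU find, sort, and uniq assumed):
--------------------------------------
#!/bin/bash
# Collect sizes that occur more than once, then checksum only those files.
find . -type f -printf '%s\n' | sort -n | uniq -d |
while read -r size; do
    find . -type f -size "${size}c" -exec md5sum {} +   # "c" = exact bytes
done |
sort | uniq -w 32 --all-repeated=separate
--------------------------------------
Note that the second find rescans the tree once per repeated size, which is itself wasteful on large trees; it only illustrates the idea.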
I used that scheme many times, and I noticed it was still inefficient when many files had the same size. So I made the following improvement.
--------------------------------------
#!/bin/bash
# Tested under Ubuntu 8.10.
# Writes a review script full of commented-out "rm" commands: inspect it,
# uncomment the deletions you really want, then run it.
OUTF=${2:-rm_duplicates.sh}    # output script: second argument, or this default
Xflag=0                        # duplicates seen in the current size group
Xcounter=0                     # size groups examined so far
echo "#!/bin/sh" > "$OUTF"
# Print "<size> <path>" for every file over 42k, sort by size (descending),
# and keep only the entries whose size field (first 9 characters) repeats.
find "$1" -type f -size +42k -printf "%s " -print |
sort -nr | uniq -w 9 -d --all-repeated | cut -f2- -d" " |
while read -r i; do
    Xsize1=$Xsize2
    Xsize2=$(stat -c%s "$i")
    XfilepathNname1="$XfilepathNname2"
    XfilepathNname2="$i"
    if [ "$Xsize1" = "$Xsize2" ]; then
        # Same size: compare the contents byte by byte.
        if cmp --silent "$XfilepathNname1" "$XfilepathNname2"; then
            if [ "$Xflag" = "0" ]; then
                echo "#rm \"$XfilepathNname1\""
            fi
            echo "#rm \"$XfilepathNname2\""
            Xflag=$(($Xflag + 1))
        fi
    else
        # New size group: blank line separates it from the previous one.
        if [ "$Xflag" != "0" ]; then
            echo ""
        fi
        Xcounter=$(($Xcounter + 1))
        Xflag=0
    fi
done >> "$OUTF"
echo "exit 0;" >> "$OUTF"
chmod a+x "$OUTF"
-----------------------------------
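Assuming you save it as finddup.sh and keep the OUTF handling above, usage would look something like this (the file names are just examples):
--------------------------------------
$ ./finddup.sh ~/music dupes.sh   # scan ~/music, write candidates to dupes.sh
$ less dupes.sh                   # review; uncomment the rm lines you want
$ ./dupes.sh                      # run the (uncommented) deletions
--------------------------------------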
The last script is very efficient; it should be much more efficient than NoClone, FSlint, and the many other duplicate file finders I have used. However, one small problem remains: if your disk contains hard links, the script will report two hard links that point to the same file as duplicates. Fortunately, I don't use hard links in my data storage. Could someone improve on this?
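A possible fix, as a sketch assuming GNU stat: check the device and inode numbers before running cmp, since two hard links to one file share both and are not real duplicates:
--------------------------------------
#!/bin/bash
# same_file PATH1 PATH2: succeeds when the two paths are hard links to one
# file, i.e. they share a device and an inode number (GNU stat assumed).
same_file() {
    [ "$(stat -c '%d:%i' "$1")" = "$(stat -c '%d:%i' "$2")" ]
}

# In the script above, duplicates would then be detected with:
#   if ! same_file "$XfilepathNname1" "$XfilepathNname2" &&
#      cmp --silent "$XfilepathNname1" "$XfilepathNname2"; then ...
--------------------------------------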