Old 03-27-2006, 08:10 PM   #1
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Rep: Reputation: 16
Purge duplicate files in one directory


I seem to run into this problem often. Linux sometimes renames files automatically, so if I had 'Car_pic.png' the duplicate would be named 'Car_pic(1).png'.

I am trying to figure out how to do this once and for all. I believe I need to pipe grep and sort commands together, sending duplicates to null (deleting them) or to another directory as specified. Whether it's documents, logs, images, fonts or whatever, I always seem to end up in duplicate hell; I'm sure most of us do. Duplicates are a nightmare, and any database admin will know what I mean.

However, if the problem is clear, as in my .png example above, and the only difference between two files is an underscore '_', then this should be easy enough to sort out. A more complex command might also check the file size, as one copy may have zero data (corrupt), and we don't want to keep that one!

I would be grateful for any tips.

Last edited by smudge|lala; 10-24-2006 at 03:54 AM.
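For the zero-data "corrupt" copies mentioned above, a minimal sketch (assuming GNU find; not something suggested in the thread) that lists zero-length files before anything gets deleted:

Code:
# list regular files of size zero in the current directory only
find . -maxdepth 1 -type f -size 0
# once the list looks right, the same command can remove them:
# find . -maxdepth 1 -type f -size 0 -delete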
 
Old 03-27-2006, 08:22 PM   #2
Emerson
LQ Sage
 
Registered: Nov 2004
Location: Saint Amant, Acadiana
Distribution: Gentoo ~amd64
Posts: 7,661

Rep: Reputation: Disabled
http://monsterden.net/software/dupefinder

It runs from the console and also has a Qt GUI.
 
Old 03-27-2006, 08:31 PM   #3
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Original Poster
Rep: Reputation: 16
It does work, thank you. If anyone knows a command sequence or script that might do something similar I'd be grateful for any input, as I'm sure bash will do it. Dupefinder did find duplicates in under 10 seconds, but I'm not about to go through and mark 4111 files! Hence wanting to automate it by specifying the desired output.

mv file \*.png file_*.png or something

Last edited by smudge|lala; 03-27-2006 at 08:34 PM.
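One alternative not raised in the thread: if an off-the-shelf console tool is acceptable, fdupes (assuming it is packaged for your distribution) compares file contents and can delete duplicates without marking each one by hand:

Code:
# list sets of duplicate files under the current directory
fdupes -r .
# delete duplicates without prompting, keeping the first file in each set
# (destructive -- run the listing form first)
fdupes -rdN .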
 
Old 03-27-2006, 09:43 PM   #4
Matir
LQ Guru
 
Registered: Nov 2004
Location: San Jose, CA
Distribution: Debian, Arch
Posts: 8,507

Rep: Reputation: 128
This little scripting sequence should do what you need:
Code:
md5sum * | sort | uniq -d -w32 | cut -d' ' -f3 | xargs rm
The md5sum checks that the *contents* of files are the same, rather than the names. sort groups identical checksums together, uniq -d keeps one entry from each group of duplicates, and cut strips the checksum so that the remaining file names are piped to rm.
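One caveat, as a sketch rather than a correction: cut -d' ' -f3 and xargs both trip over file names that contain spaces or quotes, which may be what the errors later in the thread are showing. A more defensive version of the same idea, assuming bash 4+ and GNU md5sum, that moves duplicates into a made-up DUPES/ directory instead of deleting them outright:

Code:
#!/bin/bash
# keep the first file seen for each checksum; move later copies into DUPES/
mkdir -p DUPES
declare -A seen
for f in *; do
    [ -f "$f" ] || continue                  # skip directories (e.g. DUPES itself)
    sum=$(md5sum -- "$f" | cut -d' ' -f1)    # checksum of the file contents
    if [ -n "${seen[$sum]}" ]; then
        mv -- "$f" DUPES/                    # duplicate content: move it aside
    else
        seen[$sum]="$f"                      # first file with this checksum: keep it
    fi
done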
 
Old 03-27-2006, 10:21 PM   #5
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Original Poster
Rep: Reputation: 16
Thanks for that. I do get an error, unfortunately.

rm: cannot remove `Action': No such file or directory

I thought I could edit the command to copy into a new directory rather than remove, and that returned:

Code:
User@localhost $ md5sum * | sort | uniq -d -w32 | cut -d' ' -f3 | xargs cp Unique/
md5sum: Unique: Is a directory
xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option
cp: `Fanatika': specified destination directory does not exist
Try `cp --help' for more information.
I know bash can do this, I just can't figure out how. xargs is a really powerful command!

Last edited by smudge|lala; 03-27-2006 at 10:22 PM.
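My reading of those errors, for what it's worth: xargs appends the file names after cp Unique/, so cp treats Unique/ as a source and the last file name in the list as the destination, and the "unmatched single quote" means one of the names contains a quote character, which xargs treats specially by default. With GNU cp and xargs, a sketch along these lines keeps the directory as the target and turns off the quote handling:

Code:
# -f3- keeps the whole file name even if it contains spaces, -d '\n' makes
# xargs split on newlines only (quotes pass through), and -t names the target;
# 2>/dev/null just hides md5sum's complaint that Unique/ is a directory
md5sum -- * 2>/dev/null | sort | uniq -d -w32 \
    | cut -d' ' -f3- | xargs -d '\n' cp -t Unique/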
 
Old 03-27-2006, 10:35 PM   #6
Matir
LQ Guru
 
Registered: Nov 2004
Location: San Jose, CA
Distribution: Debian, Arch
Posts: 8,507

Rep: Reputation: 128
Try using:
Code:
xargs -i cp {} BACKUPS/
This is much like the find -exec syntax.
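A small aside: in current GNU xargs the -i option is deprecated in favour of -I, which takes the placeholder explicitly, so the same pipeline would be written as below (with the same caveats about spaces in file names):

Code:
# same pipeline, with the non-deprecated -I spelling
md5sum * | sort | uniq -d -w32 | cut -d' ' -f3 | xargs -I {} cp {} BACKUPS/
# for comparison only, the find -exec syntax the reply mentions looks like:
#   find . -name 'something' -exec cp {} BACKUPS/ \;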
 
Old 03-28-2006, 09:30 AM   #7
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Original Poster
Rep: Reputation: 16
Duplicate sort command not working

Thank you for your input, guys, but it still isn't working. The uniq command looks interesting, especially after the md5sum. The command hangs, which makes me think an option/switch hasn't been set correctly, possibly with md5sum *, but I'm not sure. Maybe xargs isn't getting the correct input to proceed?

In considering how to approach such a sort and purge, I suppose the system could take one file and search for a duplicate by md5 anywhere, though the same directory is more likely, then drop one of the two files if a duplicate is found. If this is how the command is trying to work, then where is all the comparison data, all the md5sums, going? Can xargs handle input from 4000 files?

I tried with only 6 files, 3 sets of duplicates. I get:

Code:
md5sum: BACKUPS: Is a directory
cp: cannot stat `XFile': No such file or directory
for each result. This is with the command I issued:

md5sum * | sort | uniq -d -w32 | cut -d' ' -f3 | xargs -i cp {} BACKUPS/

Trying again with md5sum * | sort | uniq -d -w32 | cut -d' ' -f3 | xargs rm I get the same error. I'm only using 6 small files to test, and whether they're binary or text they should still work, seeing as we're using md5, right?
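A guess at what is going wrong, since it never quite gets pinned down in the thread: "BACKUPS: Is a directory" is md5sum complaining that * matched the BACKUPS directory itself, and "cannot stat" suggests the real file name contains a space, so cut -d' ' -f3 only hands cp the first word of it. A whitespace-safe sketch of the same pipeline, assuming GNU find, md5sum, sed, xargs and cp:

Code:
# find -type f lists regular files only, so the BACKUPS directory is skipped;
# -print0 / xargs -0 keep names with spaces intact through to md5sum;
# uniq -d -w32 keeps one line per group of identical checksums, sed strips the
# checksum off the front, and xargs -d '\n' hands cp whole names, quotes and all
find . -maxdepth 1 -type f -print0 \
    | xargs -0 md5sum \
    | sort \
    | uniq -d -w32 \
    | sed 's/^[0-9a-f]\{32\}  //' \
    | xargs -d '\n' cp -t BACKUPS/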
 
Old 03-30-2006, 07:15 PM   #8
smudge|lala
Member
 
Registered: Jan 2004
Location: New Zealand
Distribution: Mint | Sabayon
Posts: 160

Original Poster
Rep: Reputation: 16
Perhaps if they are named slightly differently, such as all my duplicates having an underscore '_', as in big_cat.png, then surely I can do something like:

cp *.png | grep '_' dupebak/, although this doesn't work...

Any ideas?
 
Old 03-30-2006, 08:08 PM   #9
Matir
LQ Guru
 
Registered: Nov 2004
Location: San Jose, CA
Distribution: Debian, Arch
Posts: 8,507

Rep: Reputation: 128
If the only difference is the underscore, and no original files contain underscores, then you could do:
Code:
mv *_* newdir/
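Two footnotes, since the thread moves on here: the earlier cp *.png | grep '_' dupebak/ attempt can't work because cp writes nothing to standard output for grep to filter (grep filters text, not command arguments), and since the very first post shows originals that already contain an underscore (Car_pic.png), it may be safer to match the '(1)'-style names than *_*. A sketch:

Code:
# move only names containing a "(digit)" marker, e.g. Car_pic(1).png;
# the parentheses have to be quoted or bash reads them as syntax
mkdir -p dupebak
mv -- *'('[0-9]')'* dupebak/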
 
  

