LinuxQuestions.org
Old 10-10-2009, 01:25 AM   #1
Tag234
LQ Newbie
 
Registered: Oct 2009
Posts: 3

Rep: Reputation: 0
Simple Shell Script? Deleting Duplicate Files...


Hello,

I'm very new to shell scripting and am currently trying to write a simple script which should be able to delete all duplicate files within a specified directory. The teacher for our class advised us to use the cmp command in order to compare files... However, I don't quite understand how I can pipe file names into cmp for use.

I've read around online and looked in as many places as I could find, and I haven't been able to find a good way to do this. Currently, the closest I've come is using:

find "$@" -type f -print0 | xargs -0 md5sum | sort -u | uniq -w32 -d --all-repeated=separate | cut -c35-

to list all duplicate files within a directory, but after I've done this, how am I supposed to load those into a for statement (or something...) to delete all duplicates except one?

For the record, the prompt states that files alphabetically higher should be the files that survive the delete, with names starting with a dot taking an even higher precedence. For example:

If c, .c and c2 are all duplicates, .c should remain.

Any help is greatly appreciated. I really just need some small piece of information in order to get myself on track to finish this.
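
For illustration, here is one possible reading of the survival rule above. It is only a sketch, and it assumes that "alphabetically higher" means "earlier in plain byte order", which is what sort gives under LC_ALL=C. In that order a leading dot comes before every letter and digit, and a name comes before any longer name that starts with it. The function name pick_survivor is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name that should survive from a group of
# duplicate file names given as arguments. Assumes names contain no newlines.
pick_survivor() {
    # LC_ALL=C makes sort compare raw bytes, so "." (0x2E) sorts before
    # digits and letters, and "c" sorts before "c2".
    printf '%s\n' "$@" | LC_ALL=C sort | head -n 1
}

pick_survivor c .c c2   # prints .c
```

Note that in byte order uppercase letters also sort before lowercase ones, which may or may not match what the assignment intends.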
 
Old 10-10-2009, 02:46 AM   #2
lutusp
Member
 
Registered: Sep 2009
Distribution: Fedora
Posts: 835

Rep: Reputation: 102
The problem you are tackling is not at all easy - not at all. First, you need to compare every file with every other file -- there is no way to know in advance which files might be identical in content, for example by being adjacent in a sort.

To solve this problem, you have to create two loops -- an inner and outer loop, both scanning the same file names. Inside the inner loop, you have to compare any two files, except you have to guard against comparing a file with itself. Then, having found two identical files, you enter them into a separate list for later deletion.
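
As a small illustration of the comparison step: cmp does not read file names from a pipe. It takes the two names as arguments and reports the result through its exit status (0 when the contents are identical), and the -s option suppresses its messages. The sketch below wraps that in a helper; the name same_content and the scratch files are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the two paths are different names
# whose contents are byte-for-byte identical.
same_content() {
    [ "$1" != "$2" ] && cmp -s "$1" "$2"
}

# Small demonstration in a scratch directory created with mktemp.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/a"
printf 'hello\n' > "$tmp/b"
printf 'world\n' > "$tmp/c"

if same_content "$tmp/a" "$tmp/b"; then echo "a and b are identical"; fi
if ! same_content "$tmp/a" "$tmp/c"; then echo "a and c differ"; fi
if ! same_content "$tmp/a" "$tmp/a"; then echo "a is not compared with itself"; fi

rm -r "$tmp"
```

Because the result is an exit status, the call can sit directly in an if or while condition inside the loops.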

Take this advice -- drop any idea of using a one-line approach like you show in your post -- there is no way that will work. Solve this using a formal method, and be willing to write more than one line of code.

1. Create a file list containing all the file names in the target directory.

2. Double-loop the file list, comparing files as you go, remembering to avoid comparing a file with itself.

3. Make a separate list of identical files -- do not delete them right away because this will break your scanning loops.

4. After the double-loop scan is complete, scan the delete list and delete the appropriate file in each pair.

5. Done.

Remember, you don't prove how smart you are by creating an incomprehensible single-line program. Instead, you show how smart you are by writing a program that works, that can be understood, and that can be repurposed for another task later (is "reusable").

Here is a rough outline of the script you need to write -- I emphasize this is just the beginning, and much remains to be done:


Code:
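spath="/source/path"
suffix_arg='\.txt$'   # regular expression selecting the file names to scan

# Step 1: write the matching file names to a temporary list.
find "$spath" -type f | grep -P "$suffix_arg" > temp.txt

declare -a list

# Read the list into an array. IFS= and -r keep spaces and backslashes intact.
while IFS= read -r path
do
   list[${#list[*]}]="$path"
done < temp.txt

# Step 2: compare every file with every other file.
for ((y = 0;y < ${#list[*]};y++))
do
   for ((x = 0;x < ${#list[*]};x++))
   do
      a="${list[$y]}"
      b="${list[$x]}"
      # Skip self-comparison. The names are quoted so paths with spaces work,
      # and -s keeps cmp silent so only its exit status is used.
      if [ "$a" != "$b" ] && cmp -s "$a" "$b"
      then
         echo "File $a is identical to file $b"
      fi
   done
done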
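
The outline stops at step 2. As a rough sketch of how steps 3 and 4 might be filled in, the function below marks duplicates in a separate array instead of deleting inside the loops, then prints the marked names. It rests on several assumptions not stated in the thread: the file list is sorted in byte order (LC_ALL=C) so that the first name of each identical group is the one to keep, file names contain no newlines, and the names list_duplicates and skip are invented for this example.

```shell
#!/usr/bin/env bash
# Hypothetical function: print the duplicates under directory $1 that would
# be removed, keeping the first name of each identical group in byte order.
list_duplicates() {
    local -a list=() skip=()
    local path y x

    # Step 1: sorted file list, so the preferred survivor comes first.
    while IFS= read -r path; do
        list+=("$path")
    done < <(find "$1" -type f | LC_ALL=C sort)

    # Steps 2 and 3: compare each file only with the later ones, and
    # remember the later name instead of deleting it inside the loops.
    for ((y = 0; y < ${#list[@]}; y++)); do
        [ -n "${skip[y]:-}" ] && continue        # already marked
        for ((x = y + 1; x < ${#list[@]}; x++)); do
            [ -n "${skip[x]:-}" ] && continue
            if cmp -s "${list[y]}" "${list[x]}"; then
                skip[x]=1                        # mark, do not delete yet
            fi
        done
    done

    # Print the marked names in list order.
    for x in "${!skip[@]}"; do
        printf '%s\n' "${list[x]}"
    done
}

# Step 4 as a dry run on a scratch directory: print instead of deleting.
demo=$(mktemp -d)
printf 'same\n'  > "$demo/c"
printf 'same\n'  > "$demo/.c"
printf 'same\n'  > "$demo/c2"
printf 'other\n' > "$demo/d"
list_duplicates "$demo" | while IFS= read -r f; do
    echo "would remove: $f"      # swap echo for: rm -- "$f"
done
rm -r "$demo"
```

Starting the inner loop at y + 1 means each pair is compared once and a file is never compared with itself. Keeping the deletion as a separate pass over the printed list makes it easy to inspect the result before anything is removed.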

Last edited by lutusp; 10-10-2009 at 05:53 AM.
 
Old 10-10-2009, 03:38 AM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,838

Rep: Reputation: 1822
Lots of good advice there, but the example would be pretty "deep" for a person who admits to being "very new to shell scripting".
 
Old 10-10-2009, 04:34 AM   #4
Tag234
LQ Newbie
 
Registered: Oct 2009
Posts: 3

Original Poster
Rep: Reputation: 0
Thank you for the help. I'm going to work with what you've given me and see if I can do it.
 
Old 10-10-2009, 04:44 AM   #5
Addison0
LQ Newbie
 
Registered: Jul 2009
Posts: 5

Rep: Reputation: 0
Quote:
Originally Posted by syg00
Lots of good advice there, but the example would be pretty "deep" for a person who admits to being "very new to shell scripting".
You are right.
 
Old 10-10-2009, 04:45 AM   #6
lutusp
Member
 
Registered: Sep 2009
Distribution: Fedora
Posts: 835

Rep: Reputation: 102
Quote:
Originally Posted by syg00
Lots of good advice there, but the example would be pretty "deep" for a person who admits to being "very new to shell scripting".
Yes, but he stated the problem, and there's no easy way to solve it. It isn't as though we can reduce the complexity of the problem to suit the audience -- if that were true, people would build fusion reactors in their home workshops using pliers and toothpicks.

Also, as a student he might benefit from seeing a solution that doesn't try to obfuscate the method by putting it all on one line (as is so often the case here).
 
Old 10-10-2009, 05:49 AM   #7
Tag234
LQ Newbie
 
Registered: Oct 2009
Posts: 3

Original Poster
Rep: Reputation: 0
I got the script working exactly how I wanted it to work. Thank you for the help.

BTW, this was a lot more helpful than anything I could find searching Google. From what I read, most people did try to condense this into one line, and seeing it spread out (and similar to C++, which I am more familiar with) really helped. Thank you.
 
  

