LinuxQuestions.org
LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
Programming: This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 10-29-2015, 06:26 AM   #1
wroom
Member
 
Registered: Dec 2009
Location: Sweden
Posts: 159

Rep: Reputation: 31
Question bash find prune duplicates in directory tree


Is the following bash command a safe method of pruning regular files from the directory tree "/olddir" when they exist in both "/olddir" and "/freshdir" with identical, unchanged contents in both trees?
Could this be done in a better way?

Code:
pushd /olddir
find . -type f -exec test -f /freshdir/'{}' \; -exec cmp -s '{}' /freshdir/'{}' \; -exec ls '{}' \; -exec rm '{}' \;
popd
"/freshdir" contains fresh, up-to-date contents in a directory tree of regular files.
"/olddir" contains older versions of the same contents in a directory tree of regular files.

Environment is OpenSUSE 11.4 to OpenSUSE 13.2 and similar mainstream Linux distributions. Everything executed with root privileges.

What I want to do is quite the opposite of:
Code:
rsync -avc --delete /freshdir/ /olddir/
I want to keep only the files in "/olddir" that differ from "/freshdir".


Another question:
Is there a simple way to also check for differences in file meta, like privileges, ownership and file times?
(Similar to the tests on meta changes done by rsync).

Let's say we have pruned out duplicate regular files from "/olddir" - is there a better way to prune out all empty directories from the "/olddir" tree than the following crude method?
Code:
find /olddir -type d -exec rmdir '{}' \;
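With GNU findutils, find can do the bottom-up pass itself: `-depth` (implied by `-delete`) visits children before parents, so directories that become empty during the run are also removed. A sketch, assuming GNU find:

```shell
# Remove all empty directories under /olddir, deepest first, so a
# directory whose only content was empty subdirectories goes too.
# -mindepth 1 keeps /olddir itself even if it ends up empty.
find /olddir -mindepth 1 -depth -type d -empty -delete
```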
The command "test" can check whether one file is older or newer than the other, but not whether two files have the same datetime, and it does not test for differences in other meta. It would be nice if "test" could check that two files have the same meta, like ctime, mtime, ownership and privileges.
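As a workaround, GNU stat can serialise the metadata of interest into one string, which turns the comparison into a single test. A sketch; the field list (%Y mtime, %u/%g owner, %a mode) is one possible choice and can be extended:

```shell
#!/bin/sh
# same_meta FILE1 FILE2 -> true when mtime, uid, gid and mode match.
# Uses GNU stat format sequences; add fields (e.g. %Z for ctime)
# as needed.
same_meta() {
    [ "$(stat -c '%Y %u %g %a' -- "$1")" = \
      "$(stat -c '%Y %u %g %a' -- "$2")" ]
}
# Example (hypothetical paths):
#   same_meta /olddir/filc2 /freshdir/filc2 && echo "same meta"
```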
 
Old 10-29-2015, 12:51 PM   #2
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Rep: Reputation: 447
I would use rdiff-backup

http://www.nongnu.org/rdiff-backup/

It would take a bit more disk space, but you can get back any version of a file you deleted or changed days ago, including permissions, ACLs and everything.

I have in cron.daily:

rdiff-backup /freshdir /olddir
rdiff-backup --remove-older-than 3W /olddir
 
Old 10-29-2015, 02:04 PM   #3
wroom
Member
 
Registered: Dec 2009
Location: Sweden
Posts: 159

Original Poster
Rep: Reputation: 31
Yes. rdiff-backup is a very good backup application that can efficiently store incremental/full backups.

But how can it be used to prune away everything that is duplicate in one directory tree from another directory tree, leaving only what is unique in that directory tree?
 
Old 11-03-2015, 12:16 PM   #4
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Rep: Reputation: 447
I don't think it can - I just think rdiff-backup would probably be a better backup solution. There's a lot to think about and many things can go wrong.

Your script will be slow when there are lots of files. Every one of those -exec will start a new shell for every file. I don't think you can get rsync to do what you want, but you could use a dry run and get a list of files that a regular rsync would skip.

Code:
rsync -a -vv --dry-run fresh old
It writes " is uptodate" for every file it would skip, so these are the ones you want to delete? To filter out those, you could do something like this:

Code:
rsync -a -vv --dry-run fresh old | egrep ' is uptodate$' | sed 's/ is uptodate//g'
 
Old 11-04-2015, 06:01 AM   #5
wroom
Member
 
Registered: Dec 2009
Location: Sweden
Posts: 159

Original Poster
Rep: Reputation: 31
Quote:
Originally Posted by Guttorm View Post
Your script will be slow when there are lots of files. Every one of those -exec will start a new shell for every file.
One of the really great benefits of Unix/Linux is that the cost of spawning/killing a shell to run the same command over and over again is very small. It will basically just copy the parent shell and put in the command from cached pages.


Quote:
Originally Posted by Guttorm View Post
I don't think you can get rsync to do what you want, but you could use a dry-run and get a list of files that a regular rsync would skip.

Code:
rsync -a -vv --dry-run fresh old
It writes " is uptodate" for every file it would skip, so these are the ones you want to delete? To filter out those, you could do something like this:

Code:
rsync -a -vv --dry-run fresh old | egrep ' is uptodate$' | sed 's/ is uptodate//g'
Yes, that is an interesting approach, although rsync will probably not lend itself easily to doing this.

Let's make a test case.
Code:
mkdir freshdir
mkdir olddir
mkdir freshdir/sub1
touch freshdir/sub1/fila1
touch freshdir/sub1/fila2
dd if=/dev/zero bs=1k count=1k of=freshdir/filb1
dd if=/dev/zero bs=1k count=1k of=freshdir/filb2
dd if=/dev/zero bs=1k count=2k of=freshdir/filc1
dd if=/dev/zero bs=1k count=3k of=freshdir/filc2
chmod 644 freshdir/filc1
rsync -av --delete freshdir/ olddir/
Now we have the directories freshdir and olddir which have exactly the same contents.

So we continue by making some changes.
Code:
# Change meta of freshdir/sub1/fila2
touch freshdir/sub1/fila2
# change contents of freshdir/filb1 and freshdir/filb2
dd if=/dev/zero bs=1k count=2k of=freshdir/filb1
dd if=/dev/zero bs=1k count=2k of=freshdir/filb2
# force meta (time) of freshdir/filb2 and olddir/filb2 to be the same
touch -t 201511041121.00 olddir/filb2
touch -t 201511041121.00 freshdir/filb2
# Change meta (privileges) of freshdir/filc1
chmod 664 freshdir/filc1
# Create two new equal files freshdir/fild1 and olddir/fild1
touch -t 199901011111.11 freshdir/fild1
touch -t 199901011111.11 olddir/fild1
# Create two new files freshdir/fild2 and olddir/fild2 with same contents but different meta (time)
touch -t 199901011111.11 freshdir/fild2
touch -t 199901011111.10 olddir/fild2
Now freshdir differs from olddir, in contents and/or meta.

Directories compare like the following:
Code:
sub1/fila1 : Same
sub1/fila2 : different time meta
filb1      : Different time meta/contents
filb2      : Same meta. Different contents
filc1      : Different privs meta
filc2      : Same
fild1      : Same
fild2      : Different time meta
So we let rsync make a dry run, and look at its output.
Code:
rsync -avvn --delete freshdir/ olddir/
sending incremental file list
delta-transmission disabled for local transfer or --whole-file
filc1
filc2 is uptodate
fild1 is uptodate
sub1/fila1 is uptodate
./
filb1
filb2
fild2
sub1/fila2
total: matches=0  hash_hits=0  false_alarms=0 data=0

sent 236 bytes  received 191 bytes  854.00 bytes/sec
total size is 9,437,184  speedup is 22,101.13 (DRY RUN)
It will mark the files sub1/fila1, filc2 and fild1 as being the same in meta and contents.

Now we run rsync and apply your filter:
Code:
rsync -avvn --delete freshdir/ olddir/ | egrep ' is uptodate$' | sed 's/ is uptodate//g'
filc2
fild1
sub1/fila1
It gives a list of subpath/file entries for files that are the same in meta and contents.

The list could then be used with something like:
Code:
rsync -avvn --delete freshdir/ olddir/ | egrep ' is uptodate$' | sed 's/ is uptodate//g' | tee thefiles
pushd olddir && cat ../thefiles | xargs rm ; popd
It works like a charm for producing a duplicate-file kill list, and the kill list can be inspected before we start deleting the files.
Thank you for the idea.

But rsync will of course output lots of other info.
So your approach is quite doable, but is it safe?
 
Old 11-04-2015, 06:08 AM   #6
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Rep: Reputation: 447
I think it will break on some strange filenames. For example, if you have a file called "something is uptodate" (it ends with " is uptodate") - or if you have newlines in filenames. Is it possible you have such files?

Edit:
Maybe just add some --exclude options to the rsync? Then files with a newline in the name or ending with " is uptodate" will simply be excluded from the list.

Last edited by Guttorm; 11-04-2015 at 06:15 AM.
 
Old 11-04-2015, 07:00 AM   #7
wroom
Member
 
Registered: Dec 2009
Location: Sweden
Posts: 159

Original Poster
Rep: Reputation: 31
Quote:
Originally Posted by Guttorm View Post
I think it will break on some strange filenames...
Yes, it will break on filenames like "this directory is uptodate".

And we always have the hassle of characters in filenames that break stuff, like injecting text into the shell's stdin via the tty. There are some really nasty "shell bombs".

Good reading for anyone curious about what not to do in a bash shell:
http://mywiki.wooledge.org/BashPitfalls
 
Old 11-05-2015, 10:09 AM   #8
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Rep: Reputation: 447
Yes, handling strange filenames is difficult. The rule that a filename can use any character except / and ASCII zero is weird indeed.

I just tried this:

touch "a
b
c
d
e
f"

But rsync -vv reported the file as "\#012a\#012b\#012c\#012d\#012e\#012f", so they have thought about it. And ls reported it as "?a?b?c?d?e?f".

In your case, I think the xargs will never run anything but "rm". So it means files ending with " is uptodate" will get deleted, and filenames containing newlines and other weird characters will never be deleted - the rm will probably report "file not found".
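The failure mode is easy to reproduce with coreutils alone; a small sketch in a throwaway directory:

```shell
#!/bin/sh
# Show that a line-based delete pipeline cannot remove a file whose
# name contains a newline: xargs sees two bogus names ("a" and "b"),
# rm fails on both, and the original file survives.
d=$(mktemp -d); cd "$d"
touch 'a
b'
printf '%s\n' 'a
b' | xargs rm -- 2>/dev/null
[ -e 'a
b' ] && echo "file with newline survived"
cd / && rm -rf "$d"
```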

But maybe you should post your final script? There are some experts on this in this forum.
 
Old 11-06-2015, 06:22 AM   #9
wroom
Member
 
Registered: Dec 2009
Location: Sweden
Posts: 159

Original Poster
Rep: Reputation: 31
Quote:
Originally Posted by Guttorm View Post
...But maybe you should post your final script? There are some experts on this in this forum.
What I posted first in this thread is still my final script:
Code:
pushd /olddir
find . -type f -exec test -f /freshdir/'{}' \; -exec cmp -s '{}' /freshdir/'{}' \; -exec ls '{}' \; -exec rm '{}' \;
popd
You are the only expert who has found this thread interesting so far.
I am very grateful for your input.

Using rsync purely as a "condition statement" is interesting. But at the same time it uses rsync in a way it was not intended, and parsing rsync's log output may be insecure?

It seems this exclusion mechanism I am looking for has not been thought of much. Not in 'find', nor in the diff utils, 'test', or even 'rsync'.

Looking at 'test', the options are '-ef', '-nt' and '-ot', for checking if:
Code:
-ef  FILE1 and FILE2 have the same device and inode numbers
-nt  FILE1 is newer (modification date) than FILE2
-ot  FILE1 is older than FILE2
One could possibly make a:
Code:
test ! '{}' -nt /freshdir/'{}' -a ! '{}' -ot /freshdir/'{}'
...to check whether the files have the same 'mtime'. But what about testing the rest of the meta - ownership and privileges?
It seems to have been forgotten.
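The double-negative mtime check above can be verified with two scratch files touched to the same timestamp; a sketch:

```shell
#!/bin/sh
# Verify the "not newer and not older" trick: two files touched to
# the same timestamp compare as neither -nt nor -ot, which means
# their mtimes are equal.
a=$(mktemp); b=$(mktemp)
touch -t 199901011111.11 "$a"
touch -t 199901011111.11 "$b"
if [ ! "$a" -nt "$b" ] && [ ! "$a" -ot "$b" ]; then
    echo "same mtime"
fi
rm -f "$a" "$b"
```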
 
Tags
bash scripting, diff, find command, prune