Comparing two directories

fergus · 04-03-2012, 04:36 AM

Looking through Forum and sub-Forum titles, this seems to be the only place that it's all right to ask about syntax. Is this OK? Lots of moderators can get very snippy so please let me know, and if necessary re-direct me to another forum or another site ...
I'm interested in comparing two directories dir1 and dir2 both of size about 600G (each backs up an entire resource, in fact). Essentially I need to do
diff -rq /dir1 /dir2
but diff compares file contents byte-by-byte and I really don't want, or need, to do this too often. Most of the time all I need to do is check presence/ absence/ location. That is, I just need the output from
diff -rq /dir1 /dir2 | grep "^Only in"
without the forensic effort implied by diff. Some variation on find suggests itself. I have iterated to
comm -3 <(find /dir1 | sort | sed 's/^.....//g') <(find /dir2 | sort | sed 's/^.....//g')
where the ..... just gets rid of the leading /dir?. Actually this provides too much information, recounting not just unmatching directories but also, superfluously, all the files they contain. (But find -type d is too skimpy a search, because it will not identify unmatching files!) I want to abbreviate the output from this command (I dunno, kind of "zip" it?) so that it concisely provides just the minimal information that diff provides by "Only in ..".
I can't believe this is an original request, and I am quite surprised that diff does not provide it through some kind of option. Googling suggests dircmp, but that seems largely unavailable and, where it is available, highly variable in design and output. I'm sure the script above, with tinkering to achieve the concatenation required, is the answer.
Anybody out there seen this, done this?
Thank you!
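
The find/sort/comm comparison described in this post can also be written without the sed step, by running find from inside each directory so that both listings are already relative. What follows is only a minimal sketch, assuming a POSIX shell and that mktemp is available; the function names list_tree and only_in are made up for illustration. Like the original command, it still lists every file under an unmatched directory.

```shell
#!/bin/sh
# list_tree DIR: print every path under DIR, relative to DIR, sorted bytewise.
list_tree() {
    (cd "$1" && find . | LC_ALL=C sort)
}

# only_in A B: print paths present in only one tree.
# Paths only in A start in column 1; paths only in B are prefixed by a tab.
only_in() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    list_tree "$1" > "$tmp_a"
    list_tree "$2" > "$tmp_b"
    # comm -3 suppresses the lines common to both listings
    LC_ALL=C comm -3 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Example call (paths taken from the post):
#   only_in /dir1 /dir2
```

Only file names are compared here, never file contents, so the cost is that of two directory walks.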

pan64 · 04-03-2012, 06:03 AM

do ls -lR in the two directories and compare the result.

fergus · 04-03-2012, 06:56 AM

Thank you, but this isn't going to fly. ls -lR gives output like
-rw-r--r-- 1 fergus hdd-7 636 Mar 31 07:43 geese
-rw-r--r-- 1 fergus hdd-7 539 Mar 31 07:43 hansen
-rw-r--r-- 1 fergus hdd-7 570 Mar 31 07:43 hundal
so any changes in timestamp between dir1 and dir2 will blur the comparison. Also, filenames are listed independently of the directory that contains them. You might as well have said "do find in the two directories and compare the result", which (a) provides better location-specific output; (b) is not blurred by timestamp data; and (c) is essentially what I am doing, but need to improve. Thanks all the same.

pan64 · 04-03-2012, 07:02 AM

Code:

cd <dir>; find . -type f

will give you a simple filelist, so you can compare those lists (maybe you need to sort them first)

fergus · 04-03-2012, 07:17 AM

Thanks again. But your suggestion "find/ sort/ compare ... a simple filelist" is exactly what's going on in my original post. The problem is that the output from the comparison is too verbose, e.g. listing the entire contents of unmatched subdirectories rather than simply stating the fact of the unmatched directory, which is all one needs to know. My post is not about how to perform a comparison; it's about how to do it quickly, cleanly and completely, but also concisely. Thanks all the same.

pan64 · 04-03-2012, 07:47 AM

ok, maybe this time:

Code:

cd <dir>; find . -type d -exec <script> {} \;

the script should look like this:

Code:

#!/bin/sh
# Print the directory name, then its entries on the same line, space-separated.
printf '%s ' "$1"
ls -1 "$1" | awk '{ a = a $0 " " } END { print a }'

this will generate one line for every dir, and now you can compare the output of the two find dir by dir.

I think you need to execute find to get the list of all the dirs, and then you need to generate a line of text for each of these dirs to be able to compare the contents as you like.
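
The one-line-per-directory idea can also be tried without a separate script file. This is only a rough sketch under stated assumptions: dir_summary is a made-up name, a POSIX shell is assumed, and file names containing newlines are not handled.

```shell
#!/bin/sh
# dir_summary DIR: for every directory under DIR, print one line of the form
#   <relative dir>: <entry> <entry> ...
dir_summary() {
    (cd "$1" && find . -type d | LC_ALL=C sort | while IFS= read -r d; do
        printf '%s:' "$d"
        # -A includes dot files; LC_ALL=C keeps the entry order stable
        LC_ALL=C ls -1A "$d" | awk '{ printf " %s", $0 } END { print "" }'
    done)
}

# Example call (paths taken from the thread); save each summary and diff them:
#   dir_summary /dir1 > sum1; dir_summary /dir2 > sum2; diff sum1 sum2
```

A directory whose summary line differs has gained or lost a direct entry, though a wholly unmatched directory still contributes one line per subdirectory.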

colucix · 04-03-2012, 08:14 AM

Another option is rsync (with the --dry-run option) and it should be fast:

Code:

rsync --dry-run -O -av --ignore-existing /dir1/* /dir2 | sed -n '2,/^$/s/^/Only in dir1: /p'
rsync --dry-run -O -av --ignore-existing /dir2/* /dir1 | sed -n '2,/^$/s/^/Only in dir2: /p'

This should give output similar to the "Only in" lines of the diff command. However, you have to run the command twice, swapping the two directories, in order to retrieve files only in dir1 and then files only in dir2. The sed command skips the first line of the rsync output, removes the statistics in the last two lines, and adds the proper "Only in" string. The only caveat is the blank line between the last file name and the statistics: it serves as the end marker for removing the unwanted lines from the rsync output, but it is printed too and should be removed afterwards. Just to give you an idea!

Edit: a simple sed command can remove the last unwanted line:

Code:

rsync --dry-run -O -av --ignore-existing /dir1/* /dir2 | sed -n '2,/^$/s/^/Only in dir1: /p' | sed '$d'
rsync --dry-run -O -av --ignore-existing /dir2/* /dir1 | sed -n '2,/^$/s/^/Only in dir2: /p' | sed '$d'
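
The two sed stages above can be folded into a single filter. The sketch below swaps sed for awk so that the label can be passed as a parameter; only_in_label is a hypothetical name, and it assumes the rsync output has the layout the sed commands above rely on (one header line, the file list, a blank line, then the statistics).

```shell
#!/bin/sh
# only_in_label NAME: read rsync dry-run output on stdin, skip the header
# line, stop at the first blank line, and prefix everything in between.
only_in_label() {
    awk -v name="$1" '
        NR == 1 { next }    # header line printed before the file list
        /^$/    { exit }    # blank line that precedes the statistics
                { print "Only in " name ": " $0 }
    '
}

# Example call (command taken from the post above):
#   rsync --dry-run -O -av --ignore-existing /dir1/* /dir2 | only_in_label dir1
```

Because awk stops at the blank line without printing it, no trailing line is left to delete.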

fergus · 04-03-2012, 10:56 AM

Thank you. This looked really convenient. But for any non-matching folder under dir1 or dir2, where the folder name is all one needs to know, rsync still provides, in addition to identifying it, a complete listing of all files and subdirectories contained in it. Again, superfluous information (and potentially many hundreds or thousands of lines of it)! I have played with all possible switches .. I think .. without managing to suppress this.
Driving me nuts ... I am wondering whether another approach would be to pipe the sorted information to a text editor, that might have the facility to in some sense "recognise headings" by their shape, and suppress all "section contents" thereunder. So for example in the listing
/dir1/a
/dir1/a/b/c/file1
/dir1/a/file1
/dir1/f/file2
/dir1/f/g/file3
the 2nd and 3rd lines would be suppressed.
(In which case one might as well revert to the output from comparing the two lists from "find" and edit that output in the same way.)
Thank you.
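
The "suppress everything under a listed heading" idea does not need a text editor; a small awk filter can do it. This is a minimal sketch, not a tested tool: collapse_nested is a made-up name, and it assumes the input is sorted so that a directory appears before anything inside it.

```shell
#!/bin/sh
# collapse_nested: read sorted paths on stdin and drop every path that has
# an ancestor already present in the listing.
collapse_nested() {
    awk '{
        path = $0
        hidden = 0
        # Strip one trailing path component at a time and look each ancestor up
        while (sub(/\/[^\/]*$/, "", path) && path != "") {
            if (path in seen) { hidden = 1; break }
        }
        if (!hidden) print
        seen[$0] = 1
    }'
}

# Example with the listing from the post above:
#   printf '%s\n' /dir1/a /dir1/a/b/c/file1 /dir1/a/file1 \
#       /dir1/f/file2 /dir1/f/g/file3 | collapse_nested
# prints /dir1/a, /dir1/f/file2 and /dir1/f/g/file3
```

The output of a comm -3 comparison of two sorted find listings could be piped through the same filter. Lines in the tab-indented column keep their tab when ancestors are derived, so the two sides are collapsed independently.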