Linux - Software
This forum is for Software issues. Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
11-11-2008, 01:41 PM  #1
LQ Newbie
Registered: Mar 2006
Location: Dubai, United Arab Emirates
Distribution: Suse 10.0
Posts: 14
A tool to manage a number of almost-the-same hard disks?
Before I start writing something from scratch, I'm wondering if any of you know of a program/script to help manage large sets of files that are almost exactly the same.
For example, say I have 20 machines that start out with identical hard disk images. Then over time they diverge, and I'd like to compare them and pull out all the differences. The tool should also work on a sub-directory, so that I can compare, for example, the /home/dan directory across all 20 machines.
I'm tempted to start writing some scripts for this. I would recurse through the directory structure, take MD5 hashes of all the files, and record everything else I want to know (a directory listing with file permissions, etc.). I would store this "fingerprint" of the hard disk in a text file. Then I could use other scripts to compare two or more fingerprints (from two or more hard disks). Naturally, I want the ability to exclude certain directories and files from the fingerprint (/proc, etc.).
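A rough sketch of the fingerprint step I have in mind (the exclusion list and names are just illustrative, not final):

```shell
# fingerprint: print "md5  path" for every regular file under $1,
# pruning pseudo-filesystems like /proc (the exclusion list is illustrative).
fingerprint() {
  root="$1"
  find "$root" \
      \( -path "$root/proc" -o -path "$root/sys" -o -path "$root/dev" \) -prune \
      -o -type f -print0 |
    xargs -0 md5sum 2>/dev/null |
    sort -k 2                      # sort by path so two fingerprints line up
}

# e.g.  fingerprint /mnt/disk1 > disk1.fp
```

Sorting by path means two fingerprint files from different disks can be compared line by line later.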
I would greatly appreciate any ideas, advice and criticism at this point.
11-11-2008, 01:57 PM  #2
Senior Member
Registered: Oct 2007
Location: Brighton, UK
Distribution: Ubuntu Hardy, Ubuntu Jaunty, Eeebuntu, Debian, SME-Server
Posts: 1,213
What might help you out here is a program like rsync, though it's definitely not straightforward. What I have started doing of late is to create an image of the ideal install using PartImage, then re-install that on machines across the network, again using PartImage.
rsync will not tell you the differences per se, but if you look in the code for it, you might get some inspiration.
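That said, a checksum-based dry run gets you part of the way: it will at least itemize which files differ between two trees without copying anything. A sketch, with illustrative paths:

```shell
# compare_trees: dry-run rsync report of differences between two trees.
#   -r recurse, -n dry run (copy nothing), -c compare by checksum rather
#   than size/mtime, -i itemize each difference found, --delete also
#   list files that exist only in the second tree.
compare_trees() {
  rsync -rnci --delete "$1/" "$2/"
}

# e.g.  compare_trees /mnt/disk1/home/dan /mnt/disk2/home/dan
```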
11-11-2008, 02:58 PM  #3
Senior Member
Registered: Jun 2003
Location: California
Distribution: Slackware
Posts: 1,181
Were I attempting to do this, I'd do exactly what you are thinking, with the hashes. Some creative 'find' invocations can exclude any directories you like, and piping the output to whatever file you want isn't difficult. The 'challenge' is in writing the code to compare the files.
I'm thinking each line in the hash file would be "/path/to/file hash" for easy parsing. Split each line on its delimiter and you can compare the two files quickly using Perl or whatever language you prefer. Push the discrepancies into one array, and push any files that exist in only one listing into another. Then again, I'm a fan of making my output nicely organized.
Or, for that matter, couldn't you just use 'diff' on the hash files?
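For instance, a small awk sketch could do that classification over two md5sum-style listings (the file names are hypothetical, and it assumes paths contain no whitespace):

```shell
# classify: compare two "md5  path" listings and report, per path,
# whether the hash CHANGED or the path exists in ONLY one listing.
classify() {
  awk '
    NR == FNR { h1[$2] = $1; next }           # first file: remember hash per path
    {
      if (!($2 in h1))       print "ONLY-2",  $2
      else if (h1[$2] != $1) print "CHANGED", $2
      delete h1[$2]
    }
    END { for (p in h1) print "ONLY-1", p }   # whatever remains was never seen
  ' "$1" "$2"
}

# e.g.  classify disk1.fp disk2.fp
```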
11-15-2008, 12:39 AM  #4
LQ Newbie (Original Poster)
Registered: Mar 2006
Location: Dubai, United Arab Emirates
Distribution: Suse 10.0
Posts: 14
Thanks, both of you.
irishbitte: I will indeed have a look at rsync for some inspiration, thanks. In fact, rsync can probably do most of what I need. One advantage of my approach is that you could build a "fingerprint" file for a hard disk, and then disconnect the hard disk and just use that file for comparison purposes. As obscure as it may sound, that's a very interesting feature to me right now.
Poetics: agreed, one challenge is in presenting this info in a useful format. Thankfully this is a separate problem from building the hash file. Thanks for the reassurance - I will continue on this path.