The power of unix find -exec : Lessons learned
I often copy/backup large directory trees to a backup NAS using rsync.
To verify the contents of the copied directory tree I use 'find' together with 'md5sum', 'sha1sum', et cetera on both the source and the copy, then sort the two outputs and finally compare them with 'diff'. I find that to be the fastest way to verify file contents for copied directory trees. Rsync does not verify file contents by default; it looks at modtime and file size. One can instruct rsync to use --checksum, but I'd rather use MD5, often in combination with a SHA-1 sum.

Today I learnt some valuable lessons. (See end of post.)

What I did was to first back up the files like:
Code:
rsync -av --delete /home /mnt/nas1/backup/
Then I verify the copied data contents (to rule out faults from bad drives, memory errors, etc.) with, for example, md5sum, by running the following command on the source:
Code:
pushd / && find home -exec md5sum '{}' \; | tee ~/bck.source.home.md5 ; popd
Then the same on the copy (possibly run locally on the NAS server for speedier operation):
Code:
pushd /mnt/nas1/backup && find home -exec md5sum '{}' \; | tee ~/bck.dest.home.md5 ; popd
Finally, sort the two lists and diff them:
Code:
sort < ~/bck.source.home.md5 > ~/bck.source.home.s.md5 & \
sort < ~/bck.dest.home.md5 > ~/bck.dest.home.s.md5 ; wait
diff ~/bck.source.home.s.md5 ~/bck.dest.home.s.md5

I fired off the backup script and then went for a long break. Coffee & beer & some nice company... I could of course do the verified copy in numerous other ways, but the above is a rather fast and secure way to do it. Well... That's what I thought.

As it happens, today I had a full directory tree copy of another Linux system's root drive somewhere in the source tree. (I had put it there some months ago to debug some C code handling stat on devices.) When I ran the find/md5sum command combo to get MD5 checksums, it got into that copy of a root disk, and into the dev directory. Ooops! :p

When getting back from the coffee break I found myself locked out of the system. Screen locked and no reaction to any keys pressed on the keyboard. I tried the usual: new keyboard batteries, reselect on the KVM switch, moving around the kbd/mouse USB connector for the KVM switch, et cetera. No luck! I logged in via SSH from another terminal and did a "ps ax".
There I found the md5sum process (PID 26831) hijacking the console keyboard through a device file for another system, in a dev directory well outside the /dev of the running system. (growl...) :mad: I swiftly issued "kill -9 26831" and immediately got back control of the console keyboard.

Then I logged in from the screen saver, and the first thing I saw was the console log window, full of things like:
Code:
Jun 21 23:55:42 somepc kernel: w83877f_wdt: WDT driver for W83877F initialised. timeout=30 sec (nowayout=0)
After a quick look again at "ps ax" I did a "kill -9 12396" to terminate the find process sifting away in my device special files through the copy of another system's /dev directory.

Now I have to do some tidying up, to make sure nothing bad happened to the system. The fact that md5sum only did some file reads says I should be fine. But the fact that sifting through the device special files triggered modprobes gives me an eerie feeling.

So the lessons I learnt are:
1 - Don't just write "find . -exec abcd..." but tell find to disregard everything that is not a regular file, with something like "find . -type f -exec abcd...". And I'm not quite sure it is safe even with the -type option. Is there a better way to do this? (See the sketch below.)
2 - The /dev directory also contains a lot of regular files.
3 - Device special files can appear anywhere in the unix file system.
4 - Device special files do not have to be in the /dev directory, nor do they need to come from the same type/version of unix/linux system, to be active in the system. It makes one think: what if a hacker somehow manages to upload a device special file to a server?
5 - When in an emergency, THINK before you act hastily on the first issue you "find". ;)
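A sketch of what a more defensive hashing pass might look like, using only standard GNU find options: -type f skips device nodes, sockets, fifos and symlinks, and -xdev keeps find from descending into other mounted filesystems. Whether that covers every corner case is exactly the open question above.
Code:
# hash only regular files, and do not cross filesystem boundaries
pushd / && find home -xdev -type f -exec md5sum '{}' \; | tee ~/bck.source.home.md5 ; popd
|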
Hi there,
Very valid observations! Besides device files, you can also run into trouble with other file types: unix domain sockets, etc. You probably also don't want to run md5sum unnecessarily on symbolic links. The "-type f" test is the correct solution, as it will ignore all of these other file types.

Just one suggestion regarding the use of "-exec": this runs the md5sum process once for every file. It is a lot more CPU efficient to spawn fewer processes and ask each of those to process more than one file. This can be achieved with the help of the xargs command. Your command:
Code:
find home -exec md5sum '{}' \;
can instead be written as:
Code:
find home -print | xargs md5sum
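One caveat with the xargs form: plain -print breaks on file names that contain spaces or newlines. A whitespace-safe sketch, assuming GNU find and xargs (both support NUL-separated names):
Code:
# NUL-separate the names so xargs handles spaces, quotes and newlines in file names
find home -type f -print0 | xargs -0 md5sum
Clifford |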
Code:
find home -exec md5sum '{}' +
The "+" form makes find pass many file names to each md5sum invocation, much like xargs does, so it avoids starting one process per file.
|
Thank you both for the input.
I have used the "find home -print | xargs md5sum" construct before, but found that when sifting through large amounts of data it is often better to use the "find home -exec md5sum '{}' \;" construct. Example:
Code:
find haystack -iname '*.cxx' -exec grep -i needle '{}' \;
The grep will be executed instantly each time a '*.cxx' file is found, and grep runs on one file at a time without memory consumption issues. When I have found what I want, I can just terminate the command, even before "find" has sifted through all the files of the haystack. If I were to use:
Code:
find haystack -iname '*.cxx' -print | xargs grep -i needle
then grep does not start until xargs has collected a whole batch of file names, so the first matches show up much later. And if I were to use a single "grep -r" command on the whole "haystack", I would with a large probability get a memory hog eating up all memory and swap before dying off without outputting the slightest match of the needle.

Also, the following construct is very useful for getting both md5 and sha1 sums of a large number of files:
Code:
find haystack -type f -exec sh -c 'md5sum "$1" >> haystack.md5 ; sha1sum "$1" >> haystack.sha1' sh '{}' \;
(The redirections have to happen inside a small sh -c wrapper; written directly after -exec they are taken by the outer shell, and both hash lists end up appended to the same file.)

The difference is whether you get one file path argument at a time, or all file path arguments on a single command line. The drawback seems to be that it is difficult to get any stats on the many short-lived processes invoked for small files; they do not show up when using "top". If one wants to evaluate the resource usage of the invoked process, then it is better to use "xargs" or the "-exec ... +" construct. It depends on what you want.
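If the main worry with xargs is that one huge batch delays the first output, a possible middle ground (a sketch, assuming GNU find and xargs) is to cap the batch size, so the pipeline still streams fairly early while spawning far fewer processes than one per file:
Code:
# NUL-separated names; at most 64 files per grep invocation.
# /dev/null is an extra dummy argument so grep always prints the matching file name.
find haystack -iname '*.cxx' -type f -print0 | xargs -0 -n 64 grep -i needle /dev/null
|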
Even easier would be to just use a tool that 0) generates hashes, 1) descends into subdirectories, 2) allows you to select which file types to hash and 3) can compare hashes, showing only differences: md5deep (+SHA1, SHA256, TIGER, Whirlpool). In two lines:
Code:
md5deep -o f -r /home > /tmp/home.md5 && scp /tmp/home.md5 user@NAS:/tmp/
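The second line, run on the NAS, would presumably use md5deep's matching mode against the copied tree. A sketch only, with two assumptions: the copy lives under /srv/backup/home on the NAS (a placeholder path), and md5deep's negative-matching option -x behaves as in recent versions, printing only files whose hashes are not in the given list:
Code:
# on the NAS: report any file in the copy whose hash is not in the transferred list
md5deep -o f -r -x /tmp/home.md5 /srv/backup/home
|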
And md5deep is normally used for finding similarities between files, rather than exact matches. As a side note, I am tinkering with a disk forensics tool, and I think md5deep may help to get better guesses when repairing damaged files, by finding similar files/clusters and setting a narrower bound for guessing the correct contents when you have hashes of the files. It could for instance find the wanted data in an "empty" cluster on an NTFS disk because the damaged file used that cluster before defragmentation. That would work for both bad sectors and "bit rot" from bad DRAM. But that is off topic.

In this case I am interested in a fast, portable, standardized way to match file contents, with an exact match on (mainly) md5 and sha1 sums. It is the nearest thing to a binary diff of the files, but executes much faster, and with enough reliability to capture any transfer/storage errors.
Why not just run rsync a second time with identical parameters immediately after the first run, but adding the -c option (the option for "checksum") and use a script to verify that no additional files were copied? Depending on what you are copying, there could of course be some file differences in two sequential rsync runs, but you could probably account for that in your analysis of the second rsync run. I think that the "verbose" option will list the specific files that are copied, won't it? It's been a while since I set up an rsync and I don't remember.
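Something along these lines might do it; a sketch using standard rsync options (-n makes it a dry run, -c forces checksum comparison, -i itemizes the differences found):
Code:
# dry run: list files whose checksums differ between source and backup, without copying anything
rsync -avcni --delete /home /mnt/nas1/backup/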
|
Yes, the rsync option -c (long option --checksum) will compare the contents of source/destination files using a simple checksum. Which is sometimes quite enough.
But let's say we had earlier copied a directory tree using "rsync -av --delete source destination/", and later we want to reapply the rsync because some files on the source have changed. If we then use "rsync -av -c --delete source destination/" we will have to wait for all of source and destination to be checksummed again at the same time as we update the destination. That takes some time. Reapplying "rsync -av --delete source destination/" where only a few of the files have been changed is stunningly fast; it could be copying only 100 kB instead of several terabytes of data.

And if I have a file of md5 (or sha1...) hashes, then it is a simple task to recheck only the source or only the destination again at any time, or to grep for the hash of a particular file and check only that file (see the example below). If only one of the files has changed since the copy started, then it is easy to recheck that particular file and patch the hash files. And md5 is a stronger hash than simple checksumming or even crc32.

With the hash file I can also detect whether something has changed from what it was when I made the hashes. Not only comparing for equality, but hashing for integrity. It could be bit rot; it could also be a virus manipulating every instance of a certain file it can find on the system. And the hash file is valid for both source and destination: even if the source happens to be deleted and the destination copy spends some years on a bookshelf, it can easily be checked for integrity with the hash file.

And yes, the rsync option -v will be verbose about the files it needs to update or delete. You can make a quick check for changes in modtime and file size by adding the option -n to rsync: the command "rsync -avn --delete source destination/" will report on deletions and/or changes in modtime or file size without doing any copying.
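For example, rechecking a single file against the saved hash list can be done with md5sum's own check mode; a small sketch using a made-up file name (grep picks out that file's line from the hash list, and md5sum -c re-reads the file and compares):
Code:
# recheck one file (here the hypothetical home/user/somefile) against the saved hash list
pushd / && grep ' home/user/somefile$' ~/bck.source.home.md5 | md5sum -c - ; popd
|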