Old 06-21-2013, 07:21 PM   #1
wroom
Member
 
Registered: Dec 2009
Location: Sweden
Posts: 159

Rep: Reputation: 31
The power of unix find -exec: Lessons learned


I often copy/backup large directory trees to a backup NAS using rsync.

To verify the contents of the copied directory tree I use 'find' together with 'md5sum', 'sha1sum' et cetera on both the source and the copy, then sort the two outputs and finally compare them with 'diff'. I find that to be the fastest way to verify file contents for copied directory trees.

Rsync does not verify file contents by default; it only looks at modification time and file size. One can instruct rsync to use --checksum, but I would rather use MD5, often in combination with a SHA-1 sum.
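For reference, making rsync itself compare file contents would look like the sketch below (reusing the backup paths shown further down; --checksum makes rsync hash both sides instead of trusting size and modtime):
Code:
# content comparison instead of the default size+mtime quick check
rsync -av --checksum --delete /home /mnt/nas1/backup/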


Today I learnt some valuable lessons. (See the end of this post.)

What I did was first back up the files like this:
Code:
rsync -av --delete /home /mnt/nas1/backup/
Which makes a copy of the /home directory tree on a volume at the NAS server.

Then I verify the contents of the copied data (to rule out faults from bad drives, memory errors and so on) with, for example, md5sum, by running the following command on the source:
Code:
pushd / && find home -exec md5sum '{}' \; | tee ~/bck.source.home.md5 ; popd
and at the same time running the following command on the copied data (possibly run locally on the NAS server for speedier operation):
Code:
pushd /mnt/nas1/backup && find home -exec md5sum '{}' \; | tee ~/bck.dest.home.md5 ; popd
Then I sort the md5sum report files and compare them with diff:
Code:
sort < ~/bck.source.home.md5 > ~/bck.source.home.s.md5 & \
 sort < ~/bck.dest.home.md5 > ~/bck.dest.home.s.md5 & wait ; \
 diff ~/bck.source.home.s.md5 ~/bck.dest.home.s.md5 && echo "SUCCESS"
Either it ends by printing "SUCCESS" on the console, or it displays the filenames and MD5 sums that differ.
I fired off the backup script and then went for a long break. Coffee & beer & some nice company...


I could of course do the verified copy in numerous other ways, but the above is a rather fast and secure way to do this.

Well... That's what I thought. As it happens, today I had a full directory tree copy of another linux system's root drive somewhere in the source tree.
(I had put it there some months ago to debug some C code handling stat() on devices.)

When I ran the find/md5sum command combo to get MD5 checksums, it got into that copy of a root disk, and into its dev directory. Oops!

When getting back from the coffee break I found myself locked out of the system. The screen was locked and there was no reaction to any keys pressed on the keyboard. I tried the usual: new keyboard batteries, reselecting on the KVM switch, moving the kbd/mouse USB connector for the KVM switch around, et cetera. No luck!

Logged in via SSH from another terminal and did a "ps ax".
And found:
Code:
.
.
.
12112 pts/17   Ss     0:00 /bin/bash
12394 pts/17   S+     0:00 /bin/sh ./backtree.sh
12396 pts/17   S+     0:07 find -exec md5sum '{}' ;
12466 pts/6    Ss+    0:00 /bin/bash
13818 pts/8    Ss     0:00 /bin/bash
16154 pts/16   Ss     0:00 /bin/bash
16855 ?        S<     0:00 [bond0]
16999 pts/9    Ss     0:00 /bin/bash
17180 ?        S      0:00 /usr/bin/knotify4
20205 ?        S<     0:00 [bond0]
20792 ?        S<     0:00 [bond0]
21530 ?        S<     0:00 [bond0]
23529 ?        S<     0:00 [bond0]
23823 ?        S<     0:00 [bond0]
24907 ?        S<     0:00 [bond0]
25075 ?        S<     0:00 [bond0]
25408 ?        S<     0:00 [bond0]
26831 pts/17   S+     0:00 md5sum ./somepath/sdf2/lib/udev/devices/console
27126 ?        S<     2:02 [bond0]
30187 ?        S      0:06 [pdflush]
30222 ?        S      0:01 [pdflush]
(Sensitive data edited out of the listing above.)

Swiftly I issued the command "kill -9 26831" to kill the md5sum process that was hijacking the console keyboard through a device file belonging to another system, in a dev directory well outside the /dev of the running system. (Growl...)

I immediately got back control of the console keyboard. I then logged in from the screen saver, and the first thing I saw was the console log window, full of things like:
Code:
Jun 21 23:55:42 somepc kernel: w83877f_wdt: WDT driver for W83877F initialised. timeout=30 sec (nowayout=0)
Jun 21 23:55:42 somepc modprobe: WARNING: Error inserting i6300esb (/lib/modules/2.6.25.20-0.7-default/kernel/drivers/watchdog/i6300esb.ko): No such device
Jun 21 23:55:42 somepc kernel: sc520_wdt: cannot register miscdev on minor=130 (err=-16)
Jun 21 23:55:42 somepc modprobe: WARNING: Error inserting sc520_wdt (/lib/modules/2.6.25.20-0.7-default/kernel/drivers/watchdog/sc520_wdt.ko): Device or resource busy
Jun 21 23:55:42 somepc kernel: machzwd: MachZ ZF-Logic Watchdog driver initializing.
Jun 21 23:55:42 somepc kernel: machzwd: no ZF-Logic found
Jun 21 23:55:42 somepc modprobe: WARNING: Error inserting machzwd (/lib/modules/2.6.25.20-0.7-default/kernel/drivers/watchdog/machzwd.ko): No such device
Jun 21 23:55:42 somepc kernel: WDT driver for Acquire single board computer initialising.
Jun 21 23:55:42 somepc kernel: acquirewdt: I/O address 0x0043 already in use
Jun 21 23:55:42 somepc kernel: acquirewdt: probe of acquirewdt failed with error -5
.
.
.
OUCH!

After another quick look at "ps ax" I did a "kill -9 12396" to terminate the find process that was sifting through my device special files via the copy of another system's /dev directory.

Now I have to do some tidying up, to make sure nothing bad happened to the system. The fact that md5sum only did file reads says I should be fine.
But the fact that sifting through the device special files triggered modprobes gives me an eerie feeling.


So the lessons I learnt are:
1 - Don't just write "find . -exec abcd..."; tell find to disregard everything that is not a regular file with something like "find . -type f -exec abcd..." (see the example command after this list).
And I'm not quite sure it is safe even with the -type option. Is there a better way to do this?

2 - The /dev directory also contains a lot of regular files.

3 - Device special files can really appear anywhere in the unix file system.

4 - Device special files do not have to be in the /dev directory, nor do they need to come from the same type/version of unix/linux system, to be active in the system.
It makes one think: what if a hacker somehow manages to upload a device special file to a server?

5 - When in an emergency, THINK before you act hastily on the first issue you "find".
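For reference, a guarded version of the command from the beginning of this post might look like the sketch below; the only change is the -type f test, which restricts the hashing to regular files:
Code:
pushd / && find home -type f -exec md5sum '{}' \; | tee ~/bck.source.home.md5 ; popd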
 
Old 06-22-2013, 02:11 AM   #2
cliffordw
Member
 
Registered: Jan 2012
Location: South Africa
Posts: 509

Rep: Reputation: 203
Hi there,

Very valid observations! Besides device files, you can also run into trouble with other file types (unix domain sockets, etc.). You probably also don't want to run md5sum unnecessarily on symbolic links. The "-type f" test is the correct solution, as it will ignore all of these other file types.

Just one suggestion regarding the use of "-exec": this runs the md5sum process once for every file. It is a lot more CPU-efficient to spawn fewer processes and ask each of them to process more than one file. This can be achieved with the help of the xargs command. Your command:
Code:
find home -exec md5sum '{}' \;
then becomes:
Code:
find home -print | xargs md5sum
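If the tree can contain filenames with spaces or other unusual characters, a null-delimited variant of the same idea is safer (a sketch, keeping the -type f test suggested above):
Code:
find home -type f -print0 | xargs -0 md5sum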
Regards,

Clifford
 
Old 06-22-2013, 03:26 AM   #3
fl0
Member
 
Registered: May 2010
Location: Germany
Distribution: Slackware
Posts: 105

Rep: Reputation: 34
Quote:
find home -print | xargs md5sum
or, instead of invoking another program, use
Code:
find home -exec md5sum '{}' +
Quote:
-exec command {} +
This variant of the -exec action runs the specified command on the selected files, but the command line is built by appending each selected file name at the end; the total number of invocations of the command will be much less than the number of matched files. The command line is built in much the same way that xargs builds its command lines. Only one instance of `{}' is allowed within the command. The command is executed in the starting directory.
 
Old 06-22-2013, 05:59 AM   #4
wroom
Member
 
Registered: Dec 2009
Location: Sweden
Posts: 159

Original Poster
Rep: Reputation: 31
Thank you both for the input.

I have used the "find home -print | xargs md5sum" construct before, but found that when sifting through large amounts of data it is often better to use the construct "find home -exec md5sum '{}' \;".

Example:
Code:
find haystack -iname '*.cxx' -exec grep -i needle '{}' \;
The above command goes through the potentially huge "haystack", picks out the '*.cxx' files one at a time and greps each of them for "needle".
The grep is executed as soon as a '*.cxx' file is found, and it runs on one file at a time without memory consumption issues. When I have found what I want, I can just terminate the command, even before "find" has sifted through all the files in the haystack.

If I were to use:
Code:
find haystack -iname '*.cxx' -print | xargs grep -i needle
then I would have to wait for "find" to sift through a large part of "haystack" before seeing any output from grep, and grep would grab a lot of memory for its heap, which it does not return before it ends.

If I were to use a single "grep -r" command on the whole "haystack", I would quite likely get a memory hog eating up all memory and swap before dying off without outputting the slightest match of needle.


The following construct is also very useful for getting both MD5 and SHA-1 sums of a large number of files:
Code:
# per-file redirections need a shell wrapper; a bare '>> file' here would be applied by the calling shell to find itself
find haystack -type f -exec sh -c 'md5sum "$1" >> haystack.md5; sha1sum "$1" >> haystack.sha1' sh '{}' \;
The invocations of md5sum and sha1sum above are synchronous per file, so they share the file cache and run through about twice as fast as running the two sums one after the other over the whole tree.

The difference is whether the command gets one file path argument at a time, or many file path arguments on a single command line.

The drawback seems to be that it is difficult to get any statistics on the many short-lived processes invoked on small files; they do not show up when using "top". If one wants to evaluate the resource usage of the invoked process, it is better to use "xargs" or the "-exec ... +" construct.

It depends on what you want.
 
Old 06-22-2013, 06:27 AM   #5
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
*Even easier would be to just use a tool that 0) generates hashes, 1) descends into subdirectories, 2) allows you to select what file types to hash and 3) can compare hashes showing only differences: md5deep (+SHA1, SHA256, TIGER, Whirlpool). In two lines:
Code:
md5deep -o f -r /home > /tmp/home.md5 && scp /tmp/home.md5 user@NAS:/tmp/
# ssh user@NAS
md5deep -o f -r /mnt/nas1/backup/home -x /tmp/home.md5
 
Old 06-22-2013, 08:23 AM   #6
wroom
Member
 
Registered: Dec 2009
Location: Sweden
Posts: 159

Original Poster
Rep: Reputation: 31
Quote:
Originally Posted by unSpawn View Post
*Even easier would be to just use a tool that 0) generates hashes, 1) descends into subdirectories, 2) allows you to select what file types to hash and 3) can compare hashes showing only differences: md5deep (+SHA1, SHA256, TIGER, Whirlpool). In two lines:
Code:
md5deep -o f -r /home > /tmp/home.md5 && scp /tmp/home.md5 user@NAS:/tmp/
# ssh user@NAS
md5deep -o f -r /mnt/nas1/backup/home -x /tmp/home.md5
Yes. But this particular system does not have md5deep installed. Otherwise it could be used.

And md5deep is normally used for finding similarities between files, rather than exact matches.
As a side note, I am tinkering with a disk forensics tool, and I think md5deep may help it make better guesses when repairing damaged files, by finding similar files/clusters and setting a narrower bound for guessing the correct contents when you have hashes of the files. It could, for instance, find the wanted data in an "empty" cluster on an NTFS disk because the damaged file used that cluster before defragmentation. This would work for both bad sectors and "bit rot" through bad DRAM. But that is off topic.
In this case I am interested in a fast, portable, standardized way to match file contents with an exact match on (mainly) MD5 and SHA-1 sums. It is the nearest thing to a binary diff of the files, but it executes much faster and is reliable enough to catch any transfer/storage errors.
 
Old 06-22-2013, 08:47 AM   #7
haertig
Senior Member
 
Registered: Nov 2004
Distribution: Debian, Ubuntu, LinuxMint, Slackware, SysrescueCD, Raspbian, Arch
Posts: 2,331

Rep: Reputation: 357
Why not just run rsync a second time with identical parameters immediately after the first run, but adding the -c option (the "checksum" option), and use a script to verify that no additional files were copied? Depending on what you are copying, there could of course be some file differences between two sequential rsync runs, but you could probably account for that in your analysis of the second rsync run. I think the "verbose" option will list the specific files that are copied, won't it? It's been a while since I set up an rsync and I don't remember.
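A sketch of one way to do this, reusing the paths from the first post; adding -n on top of -c turns the second pass into a dry run, so it only lists files whose contents differ instead of copying them, and -i itemizes the differences:
Code:
# dry-run checksum pass: lists any file whose content differs, copies nothing
rsync -avcni --delete /home /mnt/nas1/backup/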
 
Old 06-22-2013, 09:33 AM   #8
wroom
Member
 
Registered: Dec 2009
Location: Sweden
Posts: 159

Original Poster
Rep: Reputation: 31
Yes, the rsync option -c, or the long option --checksum, will compare the contents of source/destination files using a checksum, which is sometimes quite enough.

But let's say we had earlier copied a directory tree using "rsync -av --delete source destination/", and later we want to reapply the rsync because some files on the source have changed. If we then use "rsync -av -c --delete source destination/", we will have to wait for all of the source and the destination to be checksummed again at the same time as we update the destination. That takes some time.

Reapplying "rsync -av --delete source destination/" where only a few of the files have changed is stunningly fast.
It could be copying only 100 kB instead of several terabytes of data.

And if I have a file of MD5 (or SHA-1...) hashes, then it is a simple task to recheck only the source or the destination again at any time, or to grep for the hash of a particular file and check only that file. If only one of the files has changed since the copy started, it is easy to recheck that particular file and patch the hash files.
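For example (a sketch, assuming the hash file produced in the first post and run from the same starting directory; the single-file path is just a placeholder):
Code:
# recheck the whole tree against the stored hashes, printing only failures
pushd / && md5sum --quiet -c ~/bck.source.home.md5 ; popd
# recheck a single file (example path)
grep 'home/user/somefile' ~/bck.source.home.md5 | ( cd / && md5sum -c - )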

And MD5 is a stronger hash than simple checksumming or even CRC32.
With the hash file I can also detect whether something has changed since I made the hashes: not only comparing for equality, but hashing for integrity. It could be bit rot; it could also be a virus manipulating every instance of a certain file it can find on the system.

And the hash file is valid for both the source and the destination. Even if the source happens to be deleted and the destination copy spends some years on a bookshelf, it can easily be checked for integrity against the hash file.

And yes, the rsync option -v will be verbose about the files it needs to update or delete.
You can make a quick check for changes in modtime and file size by adding the option -n to rsync.
The command "rsync -avn --delete source destination/" will report deletions and/or changes in modtime or file size without doing any copying.
 
Old 06-22-2013, 10:18 AM   #9
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
Quote:
Originally Posted by wroom View Post
Yes. But this particular system does not have md5deep installed. Otherwise it could be used.
Sure there's always a catch ;-p


Quote:
Originally Posted by wroom View Post
And md5deep is normally used for finding similarities between files, instead of exact matches.
Depends on your definition of "normally"... I often use piece-wise mode and negation.


Quote:
Originally Posted by wroom View Post
As a side note, i am tinkering with a disk forensic tool, and i think md5deep may be a help to get better guesses on repairing damaged files by finding similar files/clusters and setting a more narrow bound for guessing correct contents when you got hashes on the files. It could for instance find the wanted data in an "empty" cluster on an NTFS disk because the damaged file used that cluster before defragmentation. Would work for both bad sectors and "bit rot" through bad DRAM. But that is offtopic.
See piece-wise mode and choose a small enough nibble size.
 
  

